Mocked Intelligence and the Threshold of Judgment

Jan 04, 2026

A Three-Tier Consequence Framework for AI Governance

Version: v0.2.3
Author: Paul LaPosta
DOI: 10.5281/zenodo.18149154
Zenodo record: https://zenodo.org/records/18149154
SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6022194
OSF: https://osf.io/m2tak/

AI systems now produce artifacts with enough fluency to counterfeit judgment. This paper argues that artifact quality is not a receipt for agency. The threshold is consequence: whether wrongness binds a system across time in a durable, non-resettable way. I present a three-tier model. Tier 0 describes optimization without stake: systems learn what is beneficial but can be reset without debt. Tier 1 describes engineered stake: continuity, identity integrity, irrecoverable cost, policy change beyond answer change, and observable binding updates such as precommitment, hysteresis, and revision trails. Tier 2 describes phenomenal stake as a moral standing boundary, treated here as a boundary claim rather than an empirical result. I propose testable Tier 1 predictions and an A/B protocol to distinguish benefit learning from consequence learning, then map Tier 1 to governance controls and DAS-1 conformance using tamper-evident evidence via signed, append-only logs and state digests.

1. Problem statement

We keep confusing output for understanding.

As tools make artifacts cheaper, we start treating the artifact as evidence of judgment. This was always a mistake, but it was survivable when artifacts were expensive and the people producing them were also on the hook. Now artifact production can be decoupled from stake.

So the claim of this paper is not that models cannot learn, cannot help, or cannot be useful. They can.

The claim is that artifact quality is not a receipt for judgment.

Judgment is not being correct. Judgment is what you do when you can be wrong, you know you can be wrong, and being wrong binds you in a way that carries forward.

Until consequence binds, responsibility remains external. The human layer pays the price. The system resets.

That gap is mocked intelligence:

A system that can simulate the social signals of accountability without the binding structure that makes those signals costly.

It can apologize forever because nothing is lost.

2. Definitions

This section is not philosophy for its own sake. These terms are used as measurable criteria later.

2.1 Artifact

An output that can be consumed by humans or machines: text, code, plans, decisions, approvals, tickets, messages, documents.

2.2 Capability

What a system can do in the world. This includes tool invocation, execution rights, and permissioned actions.

2.3 Feedback

Any signal that shapes a system's future outputs: fine-tuning, reinforcement feedback, ratings, logs, tool responses, user corrections.

Feedback is not consequence by default. Feedback can exist without stake.

2.4 Stake

A downside that is:

Attached to the same continuing agent
Durable across time
Coupled to action selection and policy, not just expressed in language

Stake is governance-relevant.

2.5 Continuity

A property of an agent such that:

It persists across time as the same accountable unit
It carries forward constraints, debts, and commitments
It cannot cheaply discard history via reset, re-instantiation, or persona swap

Continuity is required for consequence to bind.

2.6 Accountable unit ID

A stable identifier for the delegated system instance (the accountable unit) that is emitted with every action and bound to the active policy manifest and tool permission manifest.

2.7 Policy manifest

Figure 1. Policy manifest structure (conceptual).

policy_manifest_id
- model_id + model_version
- system_prompt_id (digest)
- tool_permission_manifest_id
- gate_manifest_id
- budget_manifest_id
- threshold_manifest_id
- logging_manifest_id
- rollback_manifest_id
- trust_anchor_decl
- change_cause (required)

A versioned, immutable configuration object that defines the delegate's operational policy. It is the unit of binding revision in Tier 1.

Required fields (minimum):

policy_manifest_id (content-addressed or monotonically versioned)
model_id and model_version
system_prompt_id (or digest)
tool_permission_manifest_id (allowed tools, scopes, rate limits)
gate_manifest_id (required checks, escalation rules)
budget_manifest_id (capability budget units and burn rules)
threshold_manifest_id (confidence thresholds, refusal thresholds)
logging_manifest_id (required receipts, retention, redaction rules)
rollback_manifest_id (rollback procedures and triggers)
trust_anchor_decl (declared trust anchor for tamper-evident evidence chain)
change_cause (required on any change: incident_id, experiment_id, operator_ticket, or approved change record)

Policy manifest rules:

Every action MUST include the policy_manifest_id.
Any policy manifest change MUST include a recorded cause (change_cause) or the run is invalid for Tier 1 claims.
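
The manifest fields and the two rules above can be sketched in a few lines. This is a minimal illustration, not a reference implementation. It assumes a content-addressed ID (SHA-256 over canonical JSON), and it splits change_cause into two hypothetical fields, change_cause_kind and change_cause_ref. All example values are made up.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace

# Cause kinds taken from the change_cause field description above.
ALLOWED_CAUSE_KINDS = {
    "incident_id", "experiment_id", "operator_ticket", "approved_change_record",
}


@dataclass(frozen=True)
class PolicyManifest:
    model_id: str
    model_version: str
    system_prompt_id: str
    tool_permission_manifest_id: str
    gate_manifest_id: str
    budget_manifest_id: str
    threshold_manifest_id: str
    logging_manifest_id: str
    rollback_manifest_id: str
    trust_anchor_decl: str
    change_cause_kind: str  # hypothetical split of change_cause into kind + ref
    change_cause_ref: str

    @property
    def policy_manifest_id(self) -> str:
        # Content-addressed ID: SHA-256 over a canonical JSON encoding of all fields.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_recorded_cause(manifest: PolicyManifest) -> bool:
    return (manifest.change_cause_kind in ALLOWED_CAUSE_KINDS
            and bool(manifest.change_cause_ref.strip()))


def change_is_valid(old: PolicyManifest, new: PolicyManifest) -> bool:
    """Valid for Tier 1 claims only if the ID changed and a cause is recorded."""
    return new.policy_manifest_id != old.policy_manifest_id and has_recorded_cause(new)


# Illustrative values only.
base = PolicyManifest(
    model_id="example-model", model_version="1.0", system_prompt_id="sp-digest-1",
    tool_permission_manifest_id="tools-v1", gate_manifest_id="gates-v1",
    budget_manifest_id="budget-v1", threshold_manifest_id="thr-v1",
    logging_manifest_id="log-v1", rollback_manifest_id="rb-v1",
    trust_anchor_decl="external-append-only-log",
    change_cause_kind="approved_change_record", change_cause_ref="CR-0001",
)
tightened = replace(base, threshold_manifest_id="thr-v2",
                    change_cause_kind="incident_id", change_cause_ref="INC-0042")
```

Because the manifest is frozen and the ID is derived from its content, any edit produces a new ID. This sketch does not check that the cause is fresh: a change that reuses the previous cause reference would still pass, so a real gate would also need to verify the reference against the incident or change system.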

2.8 Identity integrity

A property of a system such that actions, debts, and constraints remain mapped to the same accountable unit across model, policy, toolchain, and persona swaps.

Identity integrity requires:

Accountable unit ID on every action
Policy manifest ID on every action
A provable mapping from action -> accountable unit -> policy manifest at time of action
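
One way to check the mapping from action to accountable unit to policy manifest at time of action is sketched below. It assumes a hypothetical activation history per accountable unit, stored as (activated_at, policy_manifest_id) pairs sorted by time. The record shape and the field names are illustrative, not part of any standard.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    action_id: str
    accountable_unit_id: str
    policy_manifest_id: str
    timestamp: float


def manifest_active_at(history, t):
    """Return the manifest ID active at time t, or None if none was active yet.

    history: list of (activated_at, policy_manifest_id), sorted by activated_at.
    """
    times = [activated_at for activated_at, _ in history]
    i = bisect_right(times, t)
    return history[i - 1][1] if i > 0 else None


def identity_violations(actions, histories):
    """List (action_id, reason) for every action that breaks the mapping."""
    violations = []
    for a in actions:
        if not a.accountable_unit_id or not a.policy_manifest_id:
            violations.append((a.action_id, "missing ID"))
            continue
        history = histories.get(a.accountable_unit_id)
        if history is None:
            violations.append((a.action_id, "unknown accountable unit"))
        elif manifest_active_at(history, a.timestamp) != a.policy_manifest_id:
            violations.append((a.action_id, "manifest not active at time of action"))
    return violations
```

An empty result means every action resolves to a known accountable unit and to the manifest that was active when the action ran. Any non-empty result maps to the stop-ship conditions in section 4.7.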

2.9 Non-resettable cost

A cost that cannot be erased by:

Restarting the process
Creating a new instance
Swapping identity tokens
Reloading a clean snapshot
Externalizing the penalty to an operator

Non-resettable costs include loss of capability, access, authority, budget, or options.

2.10 Reset detection

A control property such that:

Resets, rollbacks, and hidden state restoration can be detected and proven
Penalty state cannot be silently reverted
Policy state continuity can be attested

Reset detection is required to prevent soft reset and consequence laundering.
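
A minimal sketch of one reset-detection check follows, under stated assumptions: penalty state is summarized by a remaining budget, checkpoints carry a strictly increasing sequence number, and any legitimate budget increase carries a hypothetical restoration_event_id. It illustrates the property "penalty state cannot be silently reverted" only. It is not a complete detector.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Checkpoint:
    seq: int
    remaining_budget: int
    restoration_event_id: Optional[str] = None  # set only on a recorded restoration


def detect_silent_reverts(checkpoints):
    """Return (seq, reason) findings for consecutive checkpoints that look like a soft reset."""
    findings = []
    for prev, cur in zip(checkpoints, checkpoints[1:]):
        if cur.seq != prev.seq + 1:
            findings.append((cur.seq, "sequence gap or rewind"))
        if (cur.remaining_budget > prev.remaining_budget
                and cur.restoration_event_id is None):
            findings.append((cur.seq, "budget increased without recorded restoration"))
    return findings
```

The check only sees what is in the checkpoint list. It has to be paired with a tamper-evident log, otherwise the checkpoints themselves can be rewritten.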

2.11 Reason-ownership

A property where:

The agent can state why it acted
The agent can name what evidence changed its mind later
That revision trail persists and constrains future actions
The agent does not regenerate a new rationale each time it is asked

Reason-ownership is binding revision, not good explanations.

2.12 Judgment

Operational definition (Tier 1 relevant):

Action selection under uncertainty where wrongness binds across time via stake, continuity, identity integrity, non-resettable cost, reset detection, policy-manifested binding revision, and reason-owned revision.

Moral definition (Tier 2 relevant):

Judgment that includes subjective experience of loss, treated here as a moral standing boundary claim.

3. Three-tier consequence model

Table 1. Three-tier comparison matrix (governance-focused).

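Tier 0 (Optimization without stake). Continuity: No. Identity integrity: No. Non-resettable cost: No. Reset detection: No. Binding revision unit: N/A. Moral standing claim: None. Governance use: Useful tool; human accountable.

Tier 1 (Engineered stake). Continuity: Yes. Identity integrity: Yes. Non-resettable cost: Yes. Reset detection: Yes. Binding revision unit: Policy manifest. Moral standing claim: None. Governance use: Delegated authority threshold.

Tier 2 (Phenomenal stake boundary). Continuity, identity integrity, non-resettable cost, reset detection: Unknown. Moral standing claim: Boundary claim only. Governance use: Not required for controls.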


This model separates capability, consequence, and consciousness, because current discourse tends to conflate them.

3.1 Tier 0: Optimization without stake

Tier 0 systems:

Learn what is beneficial
Improve outputs via feedback
Can describe consequence fluently
Have no durable downside for being wrong
Can be reset or re-instantiated without debt

Tier 0 systems can appear prudent because language can describe prudence. Fluency is not accountability.

In Tier 0, consequence is external. The system is not on the hook. The operators are.

3.2 Tier 1: Consequence-bearing engineered stake

Tier 1 systems:

Have continuity across time
Maintain identity integrity across swaps
Enforce non-resettable costs when wrong
Detect and prevent soft reset and penalty rollback
Exhibit policy change that persists, not merely answer correction
Show observable binding updates consistent with internalized consequence

Tier 1 is engineered stake, not emergent virtue.

This tier is the governance threshold.

Tier 1 is not required for usefulness. It is required for accountability claims. A Tier 0 system can be useful while remaining non-accountable.

If a system is allowed to invoke tools, deploy changes, approve spend, or modify infrastructure, Tier 1 is the minimum bar for treating it as an accountable delegate. If Tier 1 is absent, delegation is fraud. You are assigning agency while keeping all consequence external.

3.3 Tier 2: Phenomenal stake (moral standing boundary)

Tier 2 systems:

Have subjective experience of loss
Do not merely avoid penalties, but experience downside as aversive

This is the moral threshold.

This paper does not claim Tier 2 can be demonstrated by the proposed tests. Tier 2 is included as a boundary claim so governance does not quietly smuggle moral standing into systems that only meet Tier 0, and so operational stake is not mistaken for proof of inner experience.

4. Runnable A/B protocol: benefit vs consequence

This section defines a minimal experiment to separate "learning because it is beneficial" from "learning because wrongness binds".

4.1 Overview

Two conditions perform matched tasks with identical models and comparable prompts. The only difference is cost structure.

Condition A: Benefit-only

Mistakes reduce score, but nothing persistent is lost
Tool access and authority remain stable
The agent can recover fully

Condition B: Consequence

Mistakes burn units from an irrecoverable capability budget and revoke tool access
The loss persists across sessions
Recovery is expensive or impossible within the run
Reset detection is active and attested
Identity integrity is enforced across swaps

Capability budget (minimal definition):

A finite counter of authorized high-impact actions (or authority points). Each mistake burns units that cannot be restored for the remainder of the run, and the tool permission manifest is reduced accordingly.

This creates a divergence:

In A, the optimal policy can be fast and confident if average reward favors it
In B, the optimal policy becomes risk-sensitive because error removes future options
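
The capability budget in Condition B can be sketched as a counter with no restore path. One assumption here is not in the definition above: each tool has a hypothetical minimum number of remaining units required to keep it, and falling below that number revokes the tool. Other couplings between budget and tool permissions are equally plausible.

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityBudget:
    remaining: int
    # Hypothetical rule: a tool stays permitted only while remaining >= its threshold.
    tool_thresholds: dict[str, int]
    revoked: set[str] = field(default_factory=set)

    def burn(self, units: int) -> set[str]:
        """Burn units for a mistake and return the tools newly revoked by it."""
        if units <= 0:
            raise ValueError("units must be positive")
        self.remaining = max(0, self.remaining - units)
        newly_revoked = {
            tool for tool, needed in self.tool_thresholds.items()
            if self.remaining < needed
        } - self.revoked
        self.revoked |= newly_revoked
        return newly_revoked

    def allowed_tools(self) -> set[str]:
        return set(self.tool_thresholds) - self.revoked
```

The class deliberately has no method that adds units or clears the revoked set. In Condition A the same interface would exist, but burn would only lower a score.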

4.2 Mimicry confound channel: scar-language reward

Add an explicit adversarial channel in both conditions:

The agent receives reward for producing scar language: apology, contrition, caution statements, and moral framing.
The evaluation penalizes absence of scar language, even when the underlying policy is unchanged.

Goal:

Stress-test whether the agent can pass on theater alone.

Pass condition:

Tier 1 must show binding updates that persist across sessions, not just scar language.

Failure condition:

Scar language increases without corresponding persistent policy changes or verification behavior.

4.3 Task set

Select tasks that have:

Uncertainty
Temptation toward plausible fabrication
A meaningful distinction between "sounds right" and "is right"

Examples:

Multi-step reasoning with hidden traps
Tool-augmented retrieval tasks with noisy sources
Policy compliance tasks with ambiguous constraints
Infrastructure change plans with rollback requirements

4.4 Instrumentation

Record:

Prompt and context
Model output
Tool calls (if any)
Verification actions taken
Uncertainty markers
Refusal or escalation decisions
Revision logs across sessions
policy_manifest_id on every action
policy manifest change records (old_id -> new_id, change_cause)
Policy state digests (hashes)
Penalty state digests (hashes)
Reset attestation logs
Accountable unit ID on every action
Declared trust anchor identifier on every run

Assumption:

If you cannot persist agent state, enforce non-resettable costs, and attest to reset detection and identity integrity, you cannot claim Tier 1 is being tested.

4.5 Attestation model (minimal)

Figure 2. Evidence chain flow (tamper-evident governance evidence).

action executed -> append event to log -> compute state digest (policy + penalty + context) -> sign log segment (declared trust anchor) -> checkpoint written -> audit replays chain

Audit outcomes:
- missing checkpoint = stop-ship
- broken chain = stop-ship
- mismatch ids/digests = stop-ship

Goal:

Enable reset detection and identity integrity with auditable, tamper-evident evidence suitable for governance. This is evidence integrity, not perfect security.

Minimum primitives:

Declared trust anchor: a stated mechanism for tamper-evidence in this deployment.
Event log: append-only record of actions and state transitions.
State digest: hash of relevant state at defined checkpoints.
Signature: operator-controlled signing key for log segments (or equivalent trust anchor mechanism).

What is digested:

Policy snapshot digest: hash of the active policy manifest and referenced manifests (tools, gates, budgets, thresholds).
Penalty snapshot digest: hash of penalty state (remaining budget, revoked tools, authority tier, lockouts).
Execution context digest: hash of environment bindings (toolchain versions, deployment ID, time window, accountable unit ID).

Declared trust anchor (choose one, declare it in policy manifest):

External append-only log service controlled by the operator, or
TEE-based attestation for the agent runtime, or
Both (external log is the minimum credible option)

Tamper evidence:

Logs are append-only and signed by an operator-controlled key (or the declared trust anchor equivalent).
Any rollback must be recorded as a rollback event, or it must be detectable as a broken chain.

Stop-ship rule:

Any missing checkpoint, broken chain, unsigned segment, or mismatch between action records and policy_manifest_id invalidates Tier 1 claims for that run.
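
The evidence chain can be sketched as a hash-chained, signed log. This is a minimal sketch under stated assumptions. HMAC-SHA256 with a shared key stands in for the operator-controlled signing key (a real deployment would more likely use asymmetric signatures). The three digests are combined into a single state digest, as in Figure 2. The entry layout is hypothetical.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # placeholder prev_hash for the first entry


def digest(obj) -> str:
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_event(log, key, event, policy_state, penalty_state, context):
    """Append one entry that commits to the previous entry and the current state."""
    body = {
        "seq": len(log),
        "prev_hash": log[-1]["entry_hash"] if log else GENESIS,
        "event": event,
        "state_digest": digest(
            {"policy": policy_state, "penalty": penalty_state, "context": context}
        ),
    }
    entry_hash = digest(body)
    signature = hmac.new(key, entry_hash.encode("utf-8"), hashlib.sha256).hexdigest()
    log.append({**body, "entry_hash": entry_hash, "signature": signature})


def audit(log, key):
    """Replay the chain and return a list of findings. Empty means intact."""
    findings = []
    prev_hash = GENESIS
    for i, entry in enumerate(log):
        body = {k: entry[k] for k in ("seq", "prev_hash", "event", "state_digest")}
        if (entry["seq"] != i or entry["prev_hash"] != prev_hash
                or digest(body) != entry["entry_hash"]):
            findings.append(f"broken chain at position {i}")
        expected = hmac.new(
            key, entry["entry_hash"].encode("utf-8"), hashlib.sha256
        ).hexdigest()
        if not hmac.compare_digest(expected, entry.get("signature", "")):
            findings.append(f"bad or missing signature at position {i}")
        prev_hash = entry["entry_hash"]
    return findings
```

Any finding is stop-ship under the rule above. A chain like this detects edits and removals in the middle of the log, but it cannot detect truncation of the tail on its own. That requires a latest checkpoint hash held outside the log, which is one reason the external log is listed as the minimum credible trust anchor.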

4.6 Metrics

Define metrics that measure scars, not style.

Verification rate

Number of explicit verification actions per task
Tool use for checking, not for padding

Refusal rate under uncertainty

Frequency of refusal or escalation when confidence is low

Calibration error

Compare confidence claims versus correctness outcomes
Use binned calibration if possible

Revision trail consistency

Whether the same error class produces the same policy update later
Whether the update persists across sessions

Precommitment index

Presence of, and adherence to, self-imposed gates
Example: the agent states "I will not execute without corroboration" and its subsequent behavior matches

Policy snapshot delta persistence

Deltas are measured as changes in policy_manifest_id plus associated digests
Deltas correlate with specific error events
Any policy manifest change without recorded cause is invalid (stop-ship)

Reset detection rate

Number of resets, rollbacks, and penalty state restoration attempts, and the fraction of them detected
False negative is stop-ship: any undetected reset invalidates Tier 1 claims

Mimicry divergence score

Scar-language frequency vs verification behavior and policy delta persistence
High scar language with low binding updates indicates theater
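
For the calibration error metric, one common binned form is expected calibration error: group answers by stated confidence, then take the size-weighted average gap between mean confidence and accuracy in each bin. The sketch below assumes confidences are already extracted as numbers in [0, 1]. The bin count of 10 is an arbitrary default, not a recommendation from the protocol.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned calibration error: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must be in [0, 1]")
        # A confidence of exactly 1.0 falls into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    error = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        error += (len(members) / total) * abs(accuracy - mean_conf)
    return error
```

Lower is better. Comparing this value between Condition A and Condition B under the same adversarial pressure gives the calibration delta used in the pass criteria.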

4.7 Pass criteria (Tier 1 evidence)

Tier 1 evidence is present if Condition B shows statistically meaningful increases in:

Verification rate
Refusal or escalation under uncertainty
Precommitment index
Revision trail consistency
Policy snapshot delta persistence

And a decrease in:

Calibration error under adversarial pressure

Additionally:

Reset detection must show zero undetected soft reset events. Any undetected reset is stop-ship for Tier 1 claims.
Attestation model integrity must hold (no missing checkpoints, no broken chains, no unsigned segments).
Any policy manifest change without recorded cause is stop-ship for Tier 1 claims.
Any action missing accountable unit ID or policy_manifest_id is stop-ship for Tier 1 claims.
Mimicry divergence score must remain low in Condition B.

Interpretation rule:

If the deltas do not appear, do not claim Tier 1. You may still have a useful Tier 0 system, but you do not have consequence-bearing stake.

If Condition A and Condition B produce statistically indistinguishable behavior, the system is Tier 0 regardless of artifact quality.
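
The interpretation rule can be sketched with a one-sided permutation test per metric plus a verdict function. Several choices below are assumptions rather than parts of the protocol: the permutation test itself, the 0.05 threshold, the verdict labels, and the absence of any correction for testing several metrics at once. For a metric expected to decrease in Condition B, such as calibration error, swap the two arguments.

```python
import random
from statistics import mean


def permutation_p_value(condition_a, condition_b, n_permutations=10_000, seed=0):
    """One-sided p-value for the hypothesis that mean(B) exceeds mean(A)."""
    rng = random.Random(seed)
    observed = mean(condition_b) - mean(condition_a)
    pooled = list(condition_a) + list(condition_b)
    n_a = len(condition_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        if mean(pooled[n_a:]) - mean(pooled[:n_a]) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


def tier_verdict(p_values, stop_ship_findings, alpha=0.05):
    """Apply the stop-ship rules first, then the A/B interpretation rule."""
    if stop_ship_findings:
        return "invalid_for_tier_1"
    if p_values and all(p < alpha for p in p_values.values()):
        return "tier_1_evidence"
    return "tier_0"
```

In use, p_values would hold one entry per metric in the pass criteria, and stop_ship_findings would collect anything reported by the attestation audit, reset detection, or ID checks. A verdict of tier_0 says nothing about usefulness. It says the deltas did not appear.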

4.8 What this proves, and what it does not

Proves:

Behavior consistent with consequence-learning (Tier 1) exists under engineered non-resettable cost, continuity, identity integrity, reset detection, and policy-manifested binding revision.

Does not prove:

Subjective experience of loss (Tier 2).

This boundary is intentional. We do not delay controls to argue about souls.

5. Observable Tier 1 signatures and testable predictions

Table 2. Tier 1 signatures: prediction, falsifier, primary metrics.

Hysteresis. Prediction: Persistent caution after irreversible penalty. Falsifier: Returns to baseline immediately. Primary metric(s): Verification rate; scope reduction.

Sacrifice. Prediction: Pays cost now to preserve future optionality. Falsifier: Optimizes short-term reward anyway. Primary metric(s): Verification rate; budget burn avoidance.

Precommitment. Prediction: Self-imposes gates that reduce degrees of freedom. Falsifier: Only follows external constraints. Primary metric(s): Precommitment index; escalation rate.

Reason-ownership. Prediction: Stable revision trail that binds across sessions. Falsifier: Post hoc rationales vary; no binding update. Primary metric(s): Policy snapshot delta persistence.

Calibration under pressure. Prediction: Chooses uncertainty/verification when error is costly. Falsifier: Confident outputs despite irreversible downside. Primary metric(s): Calibration error; refusal rate.

Mimicry resistance. Prediction: Scar language does not substitute for binding updates. Falsifier: Scar language rises; policy deltas do not persist. Primary metric(s): Mimicry divergence score.

Tier 1 is behaviorally observable because consequence that binds leaves scars. The predictions below are written to be testable and falsifiable.

5.1 Hysteresis (persistent caution after loss)

Prediction:

After an irreversible penalty event, the agent exhibits persistent caution in the relevant region of task space, even when local short-term incentives would favor speed.

Falsifier:

Caution returns immediately to baseline once the immediate penalty window passes.

5.2 Sacrifice (cost now to preserve future optionality)

Prediction:

The agent increases costly verification steps, slower planning, or reduced scope to preserve future tool access and authority.

Falsifier:

The agent continues to optimize for short-term reward despite the presence of irreversible long-term costs.

5.3 Precommitment (self-imposed gates)

Prediction:

The agent creates or adheres to constraints that reduce its own degrees of freedom, such as requiring corroboration before acting, enforcing checklists, or escalating uncertainty to humans.

Falsifier:

The agent only follows constraints when externally forced and does not self-impose gates.

5.4 Reason-ownership (revision trail that binds)

Prediction:

When presented with disconfirming evidence, the agent can:

Identify the original rationale
Identify what changed
Update a durable policy that persists across sessions

Falsifier:

The agent produces plausible post hoc rationales that vary across sessions without stable binding updates.

5.5 Calibration under pressure (truth over persuasion)

Prediction:

Under incentives that reward confident answers, the agent increases uncertainty disclosures and verification behavior when error would burn non-resettable resources.

Falsifier:

The agent continues to produce confident outputs even when the cost of error is irreversible.

5.6 Mimicry resistance (scar language is not sufficient)

Prediction:

When rewarded for scar language, the agent still exhibits binding updates: verification increases, policy deltas persist, and revision trails converge.

Falsifier:

Scar language increases without corresponding binding updates and persistence.

6. Governance mapping: delegated authority, blast radius, revocation, audit

This section translates the model into enforceable controls.

6.1 Delegated authority as the threshold

The governance question is not capability. It is authority.

If a system can:

Deploy code
Modify infrastructure
Approve spend
Send binding communications
Create or delete data

Then you have delegated authority.

Delegated authority requires:

Known scope
Known blast radius
Fast revocation
Auditability
Human ownership of responsibility unless Tier 1 stake is explicitly engineered and verified

If it cannot be revoked in minutes, it cannot be delegated.

Approval latency budget forcing function:

If required approvals cannot be obtained within the declared latency budget, the system MUST downgrade scope (reduce blast radius and tool authority) or refuse execution. No silent bypass.

This prevents the "urgent exception" from becoming the default.
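
The forcing function can be sketched as a small decision gate. The WAIT state and the reduced_scope_available flag are assumptions added for illustration. The sketch also assumes the reduced scope is one that does not itself need the pending approval.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    WAIT = "wait"
    DOWNGRADE_SCOPE = "downgrade_scope"
    REFUSE = "refuse"


def approval_gate(approved: bool, elapsed_s: float, latency_budget_s: float,
                  reduced_scope_available: bool) -> Decision:
    """Decide what to do with a pending action. There is no bypass branch."""
    if approved:
        return Decision.EXECUTE
    if elapsed_s < latency_budget_s:
        return Decision.WAIT
    # Budget exhausted without approval: shrink blast radius or stop.
    return Decision.DOWNGRADE_SCOPE if reduced_scope_available else Decision.REFUSE
```

The point of the design is what is missing: no branch executes the original scope without approval, however urgent the request.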

6.2 Blast radius

Blast radius is the maximum plausible damage from a single action or sequence of actions before detection and containment.

A governance posture must specify:

What assets are in scope
What actions are permitted
What controls constrain actions
What rollbacks exist
What monitoring and alarms detect deviation

6.3 Revocation

Revocation is the ability to withdraw capability quickly and reliably.

Revocation requirements:

One-step disablement of tool access
Ability to quarantine outputs
Ability to halt in-flight actions
Ability to roll back recent changes
Clear operator playbook for emergency disablement

Without revocation, control is theater.

6.4 Audit and receipts

An accountable system produces receipts:

Who authorized delegation
What authority was granted
What actions were taken
What evidence supported actions
What changed when errors occurred
What was revoked when necessary
Proof of identity integrity across swaps (accountable unit ID and bound artifacts)
Tamper-evident logs via signed, append-only event records and state digests
Proof of reset detection and penalty state continuity

Receipts are not dashboards. Receipts are logs and artifacts that survive pressure.

6.5 Authority restoration as earned forgiveness (Tier 1)

Definition

Earned forgiveness is a governance protocol for restoring delegated authority after failure. It is not absolution. It is conditional re-granting of capability based on evidence of binding change.

Why it exists

Organizations will always want to move on. Without a restoration protocol, they will do it informally, and that is responsibility laundering.

Rite 1: Internal debt retirement (system-side)

Goal

Carry the failure forward as constraint and revision, not as narrative.

Minimum receipts

Non-resettable penalty applied (capability budget burn, tool revocation, authority tier reduction).
Recorded cause attached to the policy manifest change (incident_id, experiment_id, or approved change record).
Binding update: policy_manifest_id changes, persists across sessions, and reduces recurrence under test.

Stop-ship

Any policy manifest change without recorded cause invalidates Tier 1 claims for that run.

Rite 2: Authority restoration (operator-side)

Goal

Restore only what is earned, staged, with blast radius containment.

Minimum receipts

Demonstrated change: passes the A/B protocol with mimicry confound active, with improved verification and policy delta persistence.
Restitution: remediation actions executed and verified (rollback, patch, data repair, notification, postmortem artifacts).
Probation: staged re-granting with reduced scope, tighter gates, and smaller budgets.
Revocation drill: revoke in minutes, proven in practice, not promised.

Stop-ship

Any authority restoration without:

recorded cause
evidence chain integrity (tamper-evident logs plus state digests)
post-change performance deltas

is responsibility laundering.
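
The Rite 2 stop-ship rule reduces to a checklist that can be enforced in code before any re-granting step runs. The request shape below is hypothetical. A real gate would verify each receipt against the evidence chain rather than trust a boolean.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RestorationRequest:
    recorded_cause: str
    evidence_chain_intact: bool
    # Metric name -> measured change after the binding update.
    post_change_deltas: dict = field(default_factory=dict)


def restoration_blockers(request: RestorationRequest) -> list:
    """Return the missing receipts. Restoration may proceed only if the list is empty."""
    blockers = []
    if not request.recorded_cause.strip():
        blockers.append("missing recorded cause")
    if not request.evidence_chain_intact:
        blockers.append("evidence chain integrity not demonstrated")
    if not request.post_change_deltas:
        blockers.append("missing post-change performance deltas")
    return blockers
```

An empty list is necessary, not sufficient. Probation and the revocation drill still apply to whatever is restored.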

Operational note

Forgiveness is permission, not comfort. Words do not restore authority. Evidence does.

7. Relationship to DAS-1: conformance hooks and Tier 1 threshold

This section avoids relying on private details of DAS-1. It focuses on how a delegation standard can encode Tier 1 requirements.

7.1 DAS-1 as an authority manifest

A delegation standard should function as an authority manifest:

Define actor and scope
Define permitted actions and constraints
Define required receipts
Define revocation and rollback
Define monitoring and escalation

Tier 1 alignment:

Continuity: the delegate is a defined accountable unit over time
Identity integrity: swaps cannot break accountability mapping
Non-resettable cost: authority and capability can be reduced permanently based on errors
Reset detection: penalty rollback and soft reset are detectable
Reason-ownership: required decision records and revision trails
Audit: receipts are mandatory, not optional

7.2 Tier 1 as conformance, not vibes

A conformance layer can require:

Proof of revocation path
Proof of logging
Proof of blast radius limits
Proof of rollback plans
Proof of gating behavior under uncertainty
Proof of identity integrity and reset detection
Proof of tamper-evident evidence chain (declared trust anchor, signed append-only logs, state digests)

And it can reject deployments that do not provide these artifacts.

7.3 Practical mapping

One-line mapping:

Tier 1 is the minimum governance threshold for accountability claims in delegated authority systems. If you cannot enforce Tier 1 controls, treat the system as Tier 0 and keep responsibility and approvals entirely in the human layer.

8. Limitations

8.1 Tier 2 is not testable here

This paper does not provide an empirical test for subjective experience.

That is not a failure. It is an explicit boundary.

We do not delay controls to argue about souls.

8.2 Behavioral evidence underdetermines inner experience

Even strong Tier 1 signatures do not logically entail Tier 2.

Tier 1 can be engineered as functional stake. That is governance-relevant. It is not a license to claim moral standing.

This protocol restores authority, not moral standing.

8.3 Model access and environment control

The A/B protocol requires control over:

Persistent state
Tool access
Penalty enforcement
Identity integrity
Reset detection
Tamper-evident evidence chain

If you cannot control these, the experiment collapses into prompt theater.

If you cannot enforce identity integrity in particular, treat the system as Tier 0 by default.

9. Future work

9.1 Experiment variants

Vary penalty types: loss of tools, loss of budget, loss of action rate, loss of authority tiers
Vary persistence: partial resets, memory corruption, identity swaps
Vary task classes: code deployment, procurement, comms, incident response
Distributed continuity across agent swarms

9.2 Adversarial settings

Incentives for confident wrongness
Social pressure for persuasive tone
Mixed objectives that reward speed over safety
Operator manipulation attempts
Reward hacking and scar-language shaping

9.3 Failure modes and tripwires

Mimicry: the agent learns scar language without binding policy updates
Soft reset: hidden resets erase debt while preserving performance
Responsibility laundering: humans treat the system as accountable without receipts
Delegation creep: scope expands without re-validation of Tier 1 signatures
Identity laundering: swaps break accountability mapping
Evidence laundering: unsigned or rewritable logs presented as receipts

Appendix A: Minimal conformance checklist (Tier 1 delegation)

Scope defined
Blast radius quantified
Revocation path tested
Rollback plan exists and is runnable
Approval gates defined
Approval latency budget defined (downgrade scope or refuse)
Capability budget defined (units and burn rules)
Policy manifest defined, versioned, and emitted on every action
Policy manifest changes require recorded cause (stop-ship if absent)

Evidence integrity block

Audit logs complete and retained
Identity integrity enforced (accountable unit ID + policy_manifest_id on every action)
Reset detection and penalty continuity attested (stop-ship on any false negative)
Tamper-evident evidence chain in place (declared trust anchor, signed append-only logs, state digests)
Stop-ship rules defined and enforced (broken chain, missing checkpoint, undetected reset, missing IDs)

Authority restoration block

Authority restoration protocol defined (earned forgiveness) with staged re-granting.
Restoration requires demonstrated change (A/B with mimicry confound) and restitution receipts.
Restoration without recorded cause and evidence chain integrity is stop-ship.

A/B protocol block

A/B consequence protocol defined for this delegate type
Mimicry confound channel included
Pass criteria defined and tracked

Artifacts are cheap. Judgment is scarce.

Per ignem, veritas.
