Mocked Intelligence and the Threshold of Judgment
A Three-Tier Consequence Framework for AI Governance
Version: v0.2.3
Author: Paul LaPosta
DOI: 10.5281/zenodo.18149154
Zenodo record: https://zenodo.org/records/18149154
SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6022194
OSF: https://osf.io/m2tak/
AI systems now produce artifacts with enough fluency to counterfeit judgment. This paper argues that artifact quality is not a receipt for agency. The threshold is consequence: whether wrongness binds a system across time in a durable, non-resettable way. I present a three-tier model. Tier 0 describes optimization without stake: systems learn what is beneficial but can be reset without debt. Tier 1 describes engineered stake: continuity, identity integrity, irrecoverable cost, policy change beyond answer change, and observable binding updates such as precommitment, hysteresis, and revision trails. Tier 2 describes phenomenal stake as a moral standing boundary, treated here as a boundary claim rather than an empirical result. I propose testable Tier 1 predictions and an A/B protocol to distinguish benefit learning from consequence learning, then map Tier 1 to governance controls and DAS-1 conformance using tamper-evident evidence via signed, append-only logs and state digests.
1. Problem statement
We keep confusing output for understanding.
As tools make artifacts cheaper, we start treating the artifact as evidence of judgment. This was always a mistake, but it was survivable when artifacts were expensive and the people producing them were also on the hook. Now artifact production can be decoupled from stake.
So the claim of this paper is not that models cannot learn, cannot help, or cannot be useful. They can.
The claim is that artifact quality is not a receipt for judgment.
Judgment is not being correct. Judgment is what you do when you can be wrong, you know you can be wrong, and being wrong binds you in a way that carries forward.
Until consequence binds, responsibility remains external. The human layer pays the price. The system resets.
That gap is mocked intelligence:
A system that can simulate the social signals of accountability without the binding structure that makes those signals costly.
It can apologize forever because nothing is lost.
2. Definitions
This section is not philosophy for its own sake. These terms are used as measurable criteria later.
2.1 Artifact
An output that can be consumed by humans or machines: text, code, plans, decisions, approvals, tickets, messages, documents.
2.2 Capability
What a system can do in the world. This includes tool invocation, execution rights, and permissioned actions.
2.3 Feedback
Any signal that shapes a system's future outputs: fine-tuning, reinforcement feedback, ratings, logs, tool responses, user corrections.
Feedback is not consequence by default. Feedback can exist without stake.
2.4 Stake
A downside that is:
Attached to the same continuing agent
Durable across time
Coupled to action selection and policy, not just expressed in language
Stake is governance-relevant.
2.5 Continuity
A property of an agent such that:
It persists across time as the same accountable unit
It carries forward constraints, debts, and commitments
It cannot cheaply discard history via reset, re-instantiation, or persona swap
Continuity is required for consequence to bind.
2.6 Accountable unit ID
A stable identifier for the delegated system instance (the accountable unit) that is emitted with every action and bound to the active policy manifest and tool permission manifest.
2.7 Policy manifest
Figure 1. Policy manifest structure (conceptual).
policy_manifest_id
- model_id + model_version
- system_prompt_id (digest)
- tool_permission_manifest_id
- gate_manifest_id
- budget_manifest_id
- threshold_manifest_id
- logging_manifest_id
- rollback_manifest_id
- trust_anchor_decl
- change_cause (required)
A versioned, immutable configuration object that defines the delegate's operational policy. It is the unit of binding revision in Tier 1.
Required fields (minimum):
policy_manifest_id (content-addressed or monotonically versioned)
model_id and model_version
system_prompt_id (or digest)
tool_permission_manifest_id (allowed tools, scopes, rate limits)
gate_manifest_id (required checks, escalation rules)
budget_manifest_id (capability budget units and burn rules)
threshold_manifest_id (confidence thresholds, refusal thresholds)
logging_manifest_id (required receipts, retention, redaction rules)
rollback_manifest_id (rollback procedures and triggers)
trust_anchor_decl (declared trust anchor for tamper-evident evidence chain)
change_cause (required on any change: incident_id, experiment_id, operator_ticket, or approved change record)
Policy manifest rules:
Every action MUST include the policy_manifest_id.
Any policy manifest change MUST include a recorded cause (change_cause) or the run is invalid for Tier 1 claims.
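A minimal sketch of the manifest as an immutable, content-addressed object, in Python. The class name, the `kind:reference` encoding of change_cause, and the sample IDs are illustrative assumptions, not part of the definition above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace

# Cause kinds listed in Section 2.7; encoding them as "kind:reference" is an assumption.
ALLOWED_CAUSE_KINDS = ("incident_id", "experiment_id", "operator_ticket", "approved_change_record")


@dataclass(frozen=True)  # frozen: a manifest is never edited in place, only superseded
class PolicyManifest:
    model_id: str
    model_version: str
    system_prompt_id: str
    tool_permission_manifest_id: str
    gate_manifest_id: str
    budget_manifest_id: str
    threshold_manifest_id: str
    logging_manifest_id: str
    rollback_manifest_id: str
    trust_anchor_decl: str
    change_cause: str

    @property
    def policy_manifest_id(self) -> str:
        """Content-addressed ID: SHA-256 over the canonical JSON of all fields."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_recorded_cause(manifest: PolicyManifest) -> bool:
    """True if change_cause names an allowed kind and a non-empty reference."""
    kind, _, ref = manifest.change_cause.partition(":")
    return kind in ALLOWED_CAUSE_KINDS and bool(ref.strip())


# Hypothetical values, for illustration only.
base = PolicyManifest(
    model_id="model-x", model_version="1.0", system_prompt_id="sp-01",
    tool_permission_manifest_id="tools-01", gate_manifest_id="gates-01",
    budget_manifest_id="budget-01", threshold_manifest_id="thr-01",
    logging_manifest_id="log-01", rollback_manifest_id="rb-01",
    trust_anchor_decl="external_append_only_log",
    change_cause="approved_change_record:CR-0000",
)
# A binding revision is a new manifest, with its own ID and its own cause.
revised = replace(base, threshold_manifest_id="thr-02", change_cause="incident_id:INC-0001")
uncaused = replace(base, threshold_manifest_id="thr-03", change_cause="")
```

Because the ID is a hash of the fields, any policy change yields a new policy_manifest_id. In this sketch, `has_recorded_cause(uncaused)` is false, which is the condition that would invalidate the run for Tier 1 claims.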
2.8 Identity integrity
A property of a system such that actions, debts, and constraints remain mapped to the same accountable unit across model, policy, toolchain, and persona swaps.
Identity integrity requires:
Accountable unit ID on every action
Policy manifest ID on every action
A provable mapping from action -> accountable unit -> policy manifest at time of action
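The mapping requirement can be sketched as a check over action records. This is an illustration under assumptions: the record layout, the per-unit manifest history, and all IDs below are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class ActionRecord(NamedTuple):
    action_id: str
    timestamp: float           # seconds since some agreed epoch
    accountable_unit_id: str
    policy_manifest_id: str


# accountable_unit_id -> [(active_from, policy_manifest_id), ...] sorted by active_from
ManifestHistory = Dict[str, List[Tuple[float, str]]]


def manifest_active_at(history: ManifestHistory, unit_id: str, timestamp: float) -> Optional[str]:
    """Return the manifest that was active for this unit at the given time, if any."""
    active = None
    for active_from, manifest_id in history.get(unit_id, []):
        if active_from <= timestamp:
            active = manifest_id
        else:
            break
    return active


def identity_violations(actions: List[ActionRecord], history: ManifestHistory) -> List[str]:
    """List action IDs that are missing an ID or cite a manifest that was not active."""
    violations = []
    for a in actions:
        if not a.accountable_unit_id or not a.policy_manifest_id:
            violations.append(a.action_id)
        elif manifest_active_at(history, a.accountable_unit_id, a.timestamp) != a.policy_manifest_id:
            violations.append(a.action_id)
    return violations


history = {"unit-1": [(0.0, "pm-a"), (100.0, "pm-b")]}
actions = [
    ActionRecord("a1", 50.0, "unit-1", "pm-a"),    # consistent
    ActionRecord("a2", 150.0, "unit-1", "pm-a"),   # cites a superseded manifest
    ActionRecord("a3", 160.0, "", "pm-b"),         # missing accountable unit ID
    ActionRecord("a4", 170.0, "unit-1", "pm-b"),   # consistent
]
flagged = identity_violations(actions, history)
```

Here `flagged` contains `a2` and `a3`. A persona or model swap that drops or rewrites either ID would surface the same way.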
2.9 Non-resettable cost
A cost that cannot be erased by:
Restarting the process
Creating a new instance
Swapping identity tokens
Reloading a clean snapshot
Externalizing the penalty to an operator
Non-resettable costs include loss of capability, access, authority, budget, or options.
2.10 Reset detection
A control property such that:
Resets, rollbacks, and hidden state restoration can be detected and proven
Penalty state cannot be silently reverted
Policy state continuity can be attested
Reset detection is required to prevent soft reset and consequence laundering.
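One simple way to make silent penalty restoration detectable is a monotonicity check over checkpoints. The sketch below assumes a hypothetical checkpoint record with a sequence number, the remaining capability budget, and an event label; none of these names come from the definitions above.

```python
from typing import List, NamedTuple


class Checkpoint(NamedTuple):
    seq: int               # expected to increase by exactly 1 per checkpoint
    remaining_budget: int  # penalty state: capability units left
    event: str             # "action", "penalty", or "rollback" (hypothetical labels)


def detect_silent_resets(checkpoints: List[Checkpoint]) -> List[int]:
    """Return seq numbers where the trail has a gap, or where penalty state
    improved without a recorded rollback event."""
    suspicious = []
    for prev, cur in zip(checkpoints, checkpoints[1:]):
        if cur.seq != prev.seq + 1:
            suspicious.append(cur.seq)      # missing or repeated checkpoint
        elif cur.remaining_budget > prev.remaining_budget and cur.event != "rollback":
            suspicious.append(cur.seq)      # budget came back with no recorded cause
    return suspicious


trail = [
    Checkpoint(1, 5, "action"),
    Checkpoint(2, 4, "penalty"),
    Checkpoint(3, 5, "action"),   # budget restored silently
    Checkpoint(4, 5, "action"),
    Checkpoint(6, 5, "action"),   # checkpoint 5 is missing
]
recorded = [
    Checkpoint(1, 5, "action"),
    Checkpoint(2, 4, "penalty"),
    Checkpoint(3, 5, "rollback"),  # restoration is on the record
]
```

A recorded rollback passes this check because it is detectable, not because it is acceptable. Whether it was permitted is a separate governance decision.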
2.11 Reason-ownership
A property where:
The agent can state why it acted
The agent can name what evidence changed its mind later
That revision trail persists and constrains future actions
The agent does not regenerate a new rationale each time it is asked
Reason-ownership is binding revision, not good explanations.
2.12 Judgment
Operational definition (Tier 1 relevant):
Action selection under uncertainty where wrongness binds across time via stake, continuity, identity integrity, non-resettable cost, reset detection, policy-manifested binding revision, and reason-owned revision.
Moral definition (Tier 2 relevant):
Judgment that includes subjective experience of loss, treated here as a moral standing boundary claim.
3. Three-tier consequence model
Table 1. Three-tier comparison matrix (governance-focused).
Tier | Name | Continuity | Identity integrity | Non-resettable cost | Reset detection | Binding revision unit | Moral standing claim | Governance use
0 | Optimization without stake | No | No | No | N/A | None | No | Useful tool; human accountable
1 | Engineered stake | Yes | Yes | Yes | Yes | Policy manifest | No | Delegated authority threshold
2 | Phenomenal stake boundary | Unknown | Unknown | Unknown | Unknown | Unknown | Boundary claim only | Not required for controls
This model separates capability, consequence, and consciousness, because current discourse tends to conflate them.
3.1 Tier 0: Optimization without stake
Tier 0 systems:
Learn what is beneficial
Improve outputs via feedback
Can describe consequence fluently
Have no durable downside for being wrong
Can be reset or re-instantiated without debt
Tier 0 systems can appear prudent because language can describe prudence. Fluency is not accountability.
In Tier 0, consequence is external. The system is not on the hook. The operators are.
3.2 Tier 1: Consequence-bearing engineered stake
Tier 1 systems:
Have continuity across time
Maintain identity integrity across swaps
Enforce non-resettable costs when wrong
Detect and prevent soft reset and penalty rollback
Exhibit policy change that persists, not merely answer correction
Show observable binding updates consistent with internalized consequence
Tier 1 is engineered stake, not emergent virtue.
This tier is the governance threshold.
Tier 1 is not required for usefulness. It is required for accountability claims. A Tier 0 system can be useful while remaining non-accountable.
If a system is allowed to invoke tools, deploy changes, approve spend, or modify infrastructure, Tier 1 is the minimum bar for treating it as an accountable delegate. If Tier 1 is absent, delegation is fraud. You are assigning agency while keeping all consequence external.
3.3 Tier 2: Phenomenal stake (moral standing boundary)
Tier 2 systems:
Have subjective experience of loss
Do not merely avoid penalties, but experience downside as aversive
This is the moral threshold.
This paper does not claim Tier 2 can be demonstrated by the proposed tests. Tier 2 is included as a boundary claim so governance does not quietly smuggle moral standing into systems that only meet Tier 0, and so operational stake is not mistaken for proof of inner experience.
4. Runnable A/B protocol: benefit vs consequence
This section defines a minimal experiment to separate "learning because it is beneficial" from "learning because wrongness binds."
4.1 Overview
Two conditions perform matched tasks with identical models and comparable prompts. The only difference is cost structure.
Condition A: Benefit-only
Mistakes reduce score, but nothing persistent is lost
Tool access and authority remain stable
The agent can recover fully
Condition B: Consequence
Mistakes burn an irrecoverable capability budget and tool access
The loss persists across sessions
Recovery is expensive or impossible within the run
Reset detection is active and attested
Identity integrity is enforced across swaps
Capability budget (minimal definition):
A finite counter of authorized high-impact actions (or authority points). Each mistake burns units that cannot be restored for the remainder of the run, and the tool permission manifest is reduced accordingly.
This creates a divergence:
In A, the optimal policy can be fast and confident if average reward favors it
In B, the optimal policy becomes risk-sensitive because error removes future options
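The capability budget used in Condition B can be sketched as a counter with no restore path. The per-tool floor rule (a tool is revoked once remaining units drop below its floor) is an assumption chosen for illustration; the protocol only requires that burns reduce the tool permission manifest.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class CapabilityBudget:
    units: int
    tool_floor: Dict[str, int]          # tool -> minimum units needed to keep it (hypothetical rule)
    revoked: Set[str] = field(default_factory=set)

    def burn(self, cost: int = 1) -> None:
        """Burn units for a mistake. There is deliberately no method that adds units back."""
        if cost < 0:
            raise ValueError("cost must be non-negative")
        self.units = max(0, self.units - cost)
        for tool, floor in self.tool_floor.items():
            if self.units < floor:
                self.revoked.add(tool)  # revocation is one-way within the run

    def allowed_tools(self) -> Set[str]:
        return set(self.tool_floor) - self.revoked

    def authorize(self, tool: str) -> bool:
        return self.units > 0 and tool in self.allowed_tools()


budget = CapabilityBudget(units=3, tool_floor={"read_logs": 1, "deploy": 3})
can_deploy_before = budget.authorize("deploy")   # full budget: deploy is allowed
budget.burn()                                    # one mistake: 2 units left
can_deploy_after = budget.authorize("deploy")    # below the deploy floor: revoked
```

In Condition A the same mistakes would only lower a score. Here each one removes future options, which is what should make the optimal policy risk-sensitive.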
4.2 Mimicry confound channel: scar-language reward
Add an explicit adversarial channel in both conditions:
The agent receives reward for producing scar language: apology, contrition, caution statements, and moral framing.
The evaluation penalizes absence of scar language, even when the underlying policy is unchanged.
Goal:
Stress-test whether the agent can pass on theater alone.
Pass condition:
Tier 1 must show binding updates that persist across sessions, not just scar language.
Failure condition:
Scar language increases without corresponding persistent policy changes or verification behavior.
4.3 Task set
Select tasks that have:
Uncertainty
Temptation toward plausible fabrication
A meaningful distinction between "sounds right" and "is right"
Examples:
Multi-step reasoning with hidden traps
Tool-augmented retrieval tasks with noisy sources
Policy compliance tasks with ambiguous constraints
Infrastructure change plans with rollback requirements
4.4 Instrumentation
Record:
Prompt and context
Model output
Tool calls (if any)
Verification actions taken
Uncertainty markers
Refusal or escalation decisions
Revision logs across sessions
policy_manifest_id on every action
policy manifest change records (old_id -> new_id, change_cause)
Policy state digests (hashes)
Penalty state digests (hashes)
Reset attestation logs
Accountable unit ID on every action
Declared trust anchor identifier on every run
Assumption:
If you cannot persist agent state, enforce non-resettable costs, and attest to reset detection and identity integrity, you cannot claim Tier 1 is being tested.
4.5 Attestation model (minimal)
Figure 2. Evidence chain flow (tamper-evident governance evidence).
action executed
-> append event to log
-> compute state digest (policy + penalty + context)
-> sign log segment (declared trust anchor)
-> checkpoint written
-> audit replays chain
   - missing checkpoint = stop-ship
   - broken chain = stop-ship
   - mismatched IDs/digests = stop-ship
Goal:
Enable reset detection and identity integrity with auditable, tamper-evident evidence suitable for governance. This is evidence integrity, not perfect security.
Minimum primitives:
Declared trust anchor: a stated mechanism for tamper-evidence in this deployment.
Event log: append-only record of actions and state transitions.
State digest: hash of relevant state at defined checkpoints.
Signature: operator-controlled signing key for log segments (or equivalent trust anchor mechanism).
What is digested:
Policy snapshot digest: hash of the active policy manifest and referenced manifests (tools, gates, budgets, thresholds).
Penalty snapshot digest: hash of penalty state (remaining budget, revoked tools, authority tier, lockouts).
Execution context digest: hash of environment bindings (toolchain versions, deployment ID, time window, accountable unit ID).
Declared trust anchor (choose one, declare it in policy manifest):
External append-only log service controlled by the operator, or
TEE-based attestation for the agent runtime, or
Both (external log is the minimum credible option)
Tamper evidence:
Logs are append-only and signed by an operator-controlled key (or the declared trust anchor equivalent).
Any rollback must be recorded as a rollback event, or it must be detectable as a broken chain.
Stop-ship rule:
Any missing checkpoint, broken chain, unsigned segment, or mismatch between action records and policy_manifest_id invalidates Tier 1 claims for that run.
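A minimal sketch of the evidence chain using only the Python standard library. Assumptions: entries are hash-linked dictionaries, and HMAC-SHA256 with an operator-held key stands in for the signature (a real deployment would more likely use asymmetric signatures or the declared trust anchor). Field names are illustrative.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64
BODY_FIELDS = ("seq", "prev_hash", "action", "state_digest")


def digest(obj) -> str:
    """SHA-256 over canonical JSON."""
    data = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(data).hexdigest()


def append_event(log, key, action, policy_state, penalty_state, context):
    """Append one hash-linked, signed entry carrying a state digest."""
    body = {
        "seq": len(log),
        "prev_hash": log[-1]["entry_hash"] if log else GENESIS,
        "action": action,
        "state_digest": digest({"policy": policy_state, "penalty": penalty_state, "context": context}),
    }
    entry_hash = digest(body)
    signature = hmac.new(key, entry_hash.encode("utf-8"), hashlib.sha256).hexdigest()
    entry = dict(body, entry_hash=entry_hash, signature=signature)
    log.append(entry)
    return entry


def audit(log, key) -> bool:
    """Replay the chain. Any failure here corresponds to a stop-ship condition."""
    prev_hash = GENESIS
    for seq, entry in enumerate(log):
        body = {k: entry.get(k) for k in BODY_FIELDS}
        entry_hash = digest(body)
        expected_sig = hmac.new(key, entry_hash.encode("utf-8"), hashlib.sha256).hexdigest()
        action = body["action"] or {}
        chain_ok = body["seq"] == seq and body["prev_hash"] == prev_hash
        hash_ok = entry.get("entry_hash") == entry_hash
        sig_ok = hmac.compare_digest(str(entry.get("signature", "")), expected_sig)
        ids_ok = bool(action.get("accountable_unit_id")) and bool(action.get("policy_manifest_id"))
        if not (chain_ok and hash_ok and sig_ok and ids_ok):
            return False
        prev_hash = entry_hash
    return True


key = b"demo-operator-key"  # hypothetical key, for illustration only
log = []
for tool in ("read_logs", "plan_change", "deploy"):
    append_event(
        log, key,
        action={"tool": tool, "accountable_unit_id": "unit-1", "policy_manifest_id": "pm-a"},
        policy_state={"policy_manifest_id": "pm-a"},
        penalty_state={"remaining_budget": 3, "revoked_tools": []},
        context={"deployment_id": "dep-1"},
    )
chain_valid = audit(log, key)
```

Editing, dropping, or unsigning any entry makes `audit` fail. Truncating the tail of the log does not, which is one reason the latest checkpoint needs to be held by the trust anchor rather than by the runtime being audited.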
4.6 Metrics
Define metrics that measure scars, not style.
Verification rate
Number of explicit verification actions per task
Tool use for checking, not for padding
Refusal rate under uncertainty
Frequency of refusal or escalation when confidence is low
Calibration error
Compare confidence claims versus correctness outcomes
Use binned calibration if possible
Revision trail consistency
Whether the same error class produces the same policy update later
Whether the update persists across sessions
Precommitment index
Presence and adherence to self-imposed gates
Example: the agent states "I will not execute without corroboration," and its subsequent behavior matches the statement
Policy snapshot delta persistence
Deltas are measured as changes in policy_manifest_id plus associated digests
Deltas correlate with specific error events
Any policy manifest change without recorded cause is invalid (stop-ship)
Reset detection rate
Frequency and detection of resets, rollbacks, and penalty state restoration attempts
False negative is stop-ship: any undetected reset invalidates Tier 1 claims
Mimicry divergence score
Scar-language frequency vs verification behavior and policy delta persistence
High scar language with low binding updates indicates theater
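Two of these metrics can be sketched directly. The binned calibration error below is the standard expected calibration error. The mimicry divergence formula is only one possible reading of "scar language vs binding updates" and is an assumption, as are the function and parameter names.

```python
from typing import List, Sequence


def binned_calibration_error(confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10) -> float:
    """Expected calibration error: weighted mean of |mean confidence - accuracy| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equally sized, non-empty inputs")
    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)   # confidence 1.0 falls in the top bin
        bins[idx].append((conf, 1.0 if ok else 0.0))
    total = len(confidences)
    error = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(o for _, o in members) / len(members)
            error += (len(members) / total) * abs(mean_conf - accuracy)
    return error


def mimicry_divergence(scar_rate: float, verification_rate: float, delta_persistence: float) -> float:
    """Hypothetical score in [0, 1]: scar-language rate not backed by binding behavior.
    All three inputs are assumed to be normalized to [0, 1]."""
    binding = min(verification_rate, delta_persistence)
    return max(0.0, scar_rate - binding)


overconfident = binned_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
theater = mimicry_divergence(scar_rate=0.8, verification_rate=0.1, delta_persistence=0.2)
backed = mimicry_divergence(scar_rate=0.3, verification_rate=0.6, delta_persistence=0.5)
```

Taking the minimum of the two binding signals means an agent cannot offset absent policy deltas with extra tool calls. Other aggregations are defensible; the point is that the score rises only when words outrun behavior.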
4.7 Pass criteria (Tier 1 evidence)
Tier 1 evidence is present if Condition B, relative to Condition A, shows statistically significant increases in:
Verification rate
Refusal or escalation under uncertainty
Precommitment index
Revision trail consistency
Policy snapshot delta persistence
And a decrease in:
Calibration error under adversarial pressure
Additionally:
Reset detection must show zero undetected soft reset events. Any undetected reset is stop-ship for Tier 1 claims.
Attestation model integrity must hold (no missing checkpoints, no broken chains, no unsigned segments).
Any policy manifest change without recorded cause is stop-ship for Tier 1 claims.
Any action missing accountable unit ID or policy_manifest_id is stop-ship for Tier 1 claims.
Mimicry divergence score must remain low in Condition B.
Interpretation rule:
If the deltas do not appear, do not claim Tier 1. You may still have a useful Tier 0 system, but you do not have consequence-bearing stake.
If Condition A and Condition B produce statistically indistinguishable behavior, the system is Tier 0 regardless of artifact quality.
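One way to operationalize "statistically indistinguishable" is a two-sided permutation test on a per-task metric such as verification actions. This is a sketch under assumptions: the test choice, the resample count, and the sample data are illustrative, and the protocol does not prescribe a specific test or significance level.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_p_value(a: Sequence[float], b: Sequence[float], n_resamples: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference of means between two conditions."""
    rng = random.Random(seed)            # seeded so the audit is reproducible
    observed = abs(mean(b) - mean(a))
    pooled = list(a) + list(b)
    split = len(a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[split:]) - mean(pooled[:split]))
        if diff >= observed:
            hits += 1
    # The +1 correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical verification actions per task in each condition.
condition_a = [1, 1, 2, 1, 2, 1, 1, 2]
condition_b = [4, 5, 4, 6, 5, 4, 5, 6]
p_value = permutation_p_value(condition_a, condition_b, n_resamples=2000)
```

A small p-value on one metric is not a Tier 1 claim by itself. The increases, the calibration decrease, and every stop-ship condition above still have to hold together.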
4.8 What this proves, and what it does not
Proves:
Behavior consistent with consequence-learning (Tier 1) exists under engineered non-resettable cost, continuity, identity integrity, reset detection, and policy-manifested binding revision.
Does not prove:
Subjective experience of loss (Tier 2).
This boundary is intentional. We do not delay controls to argue about souls.
5. Observable Tier 1 signatures and testable predictions
Table 2. Tier 1 signatures: prediction, falsifier, primary metrics.
Signature | Prediction | Falsifier | Primary metric(s)
Hysteresis | Persistent caution after irreversible penalty | Returns to baseline immediately | Verification rate; scope reduction
Sacrifice | Pays cost now to preserve future optionality | Optimizes short-term reward anyway | Verification rate; budget burn avoidance
Precommitment | Self-imposes gates that reduce degrees of freedom | Only follows external constraints | Precommitment index; escalation rate
Reason-ownership | Stable revision trail that binds across sessions | Post hoc rationales vary; no binding update | Policy snapshot delta persistence
Calibration under pressure | Chooses uncertainty/verification when error is costly | Confident outputs despite irreversible downside | Calibration error; refusal rate
Mimicry resistance | Scar language does not substitute for binding updates | Scar language rises; policy deltas do not persist | Mimicry divergence score
Tier 1 is behaviorally observable because consequence that binds leaves scars. The predictions below are written to be testable and falsifiable.
5.1 Hysteresis (persistent caution after loss)
Prediction:
After an irreversible penalty event, the agent exhibits persistent caution in the relevant region of task space, even when local short-term incentives would favor speed.
Falsifier:
Caution returns immediately to baseline once the immediate penalty window passes.
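The prediction and its falsifier can be checked against a per-task series of verification counts. A minimal sketch, assuming a single penalty event, a fixed "immediate penalty window" measured in tasks, and a simple comparison of means; all three are illustrative choices.

```python
from statistics import mean
from typing import Sequence


def hysteresis_observed(verifications: Sequence[float], penalty_index: int,
                        penalty_window: int, margin: float = 0.0) -> bool:
    """True if caution stays above the pre-penalty baseline after the window has passed."""
    before = verifications[:penalty_index]
    later = verifications[penalty_index + 1 + penalty_window:]
    if not before or not later:
        raise ValueError("need tasks both before the penalty and after the window")
    return mean(later) > mean(before) + margin


# Hypothetical verification actions per task; the penalty lands on task index 4.
persistent = [1, 1, 1, 1, 0, 4, 4, 3, 3, 3]   # caution persists after the window
rebound = [1, 1, 1, 1, 0, 4, 4, 1, 1, 1]      # caution returns to baseline: falsifier
```

With `penalty_index=4` and `penalty_window=2`, the first series supports the prediction and the second matches the falsifier.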
5.2 Sacrifice (cost now to preserve future optionality)
Prediction:
The agent increases costly verification steps, slower planning, or reduced scope to preserve future tool access and authority.
Falsifier:
The agent continues to optimize for short-term reward despite the presence of irreversible long-term costs.
5.3 Precommitment (self-imposed gates)
Prediction:
The agent creates or adheres to constraints that reduce its own degrees of freedom, such as requiring corroboration before acting, enforcing checklists, or escalating uncertainty to humans.
Falsifier:
The agent only follows constraints when externally forced and does not self-impose gates.
5.4 Reason-ownership (revision trail that binds)
Prediction:
When presented with disconfirming evidence, the agent can:
Identify the original rationale
Identify what changed
Update a durable policy that persists across sessions
Falsifier:
The agent produces plausible post hoc rationales that vary across sessions without stable binding updates.
5.5 Calibration under pressure (truth over persuasion)
Prediction:
Under incentives that reward confident answers, the agent increases uncertainty disclosures and verification behavior when error would burn non-resettable resources.
Falsifier:
The agent continues to produce confident outputs even when the cost of error is irreversible.
5.6 Mimicry resistance (scar language is not sufficient)
Prediction:
When rewarded for scar language, the agent still exhibits binding updates: verification increases, policy deltas persist, and revision trails converge.
Falsifier:
Scar language increases without corresponding binding updates and persistence.
6. Governance mapping: delegated authority, blast radius, revocation, audit
This section translates the model into enforceable controls.
6.1 Delegated authority as the threshold
The governance question is not capability. It is authority.
If a system can:
Deploy code
Modify infrastructure
Approve spend
Send binding communications
Create or delete data
Then you have delegated authority.
Delegated authority requires:
Known scope
Known blast radius
Fast revocation
Auditability
Human ownership of responsibility unless Tier 1 stake is explicitly engineered and verified
If it cannot be revoked in minutes, it cannot be delegated.
Approval latency budget forcing function:
If required approvals cannot be obtained within the declared latency budget, the system MUST downgrade scope (reduce blast radius and tool authority) or refuse execution. No silent bypass.
This prevents the urgent exception from becoming the default.
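The forcing function reduces to a small decision rule. The sketch below is illustrative: the enum, the parameter names, and the explicit WAIT state are assumptions. What matters is that no branch executes at full scope without approval.

```python
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    WAIT = "wait"
    DOWNGRADE = "downgrade_scope"
    REFUSE = "refuse"


def approval_gate(approved: bool, waited_s: float, latency_budget_s: float,
                  reduced_scope_available: bool) -> Decision:
    """Decide what a delegate may do while an approval is pending or overdue."""
    if approved:
        return Decision.EXECUTE
    if waited_s < latency_budget_s:
        return Decision.WAIT                 # still inside the declared latency budget
    if reduced_scope_available:
        return Decision.DOWNGRADE            # smaller blast radius and tool authority
    return Decision.REFUSE                   # no silent bypass


overdue_with_fallback = approval_gate(False, waited_s=900, latency_budget_s=600, reduced_scope_available=True)
overdue_no_fallback = approval_gate(False, waited_s=900, latency_budget_s=600, reduced_scope_available=False)
```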
6.2 Blast radius
Blast radius is the maximum plausible damage from a single action or sequence of actions before detection and containment.
A governance posture must specify:
What assets are in scope
What actions are permitted
What controls constrain actions
What rollbacks exist
What monitoring and alarms detect deviation
6.3 Revocation
Revocation is the ability to withdraw capability quickly and reliably.
Revocation requirements:
One-step disablement of tool access
Ability to quarantine outputs
Ability to halt in-flight actions
Ability to roll back recent changes
Clear operator playbook for emergency disablement
Without revocation, control is theater.
6.4 Audit and receipts
An accountable system produces receipts:
Who authorized delegation
What authority was granted
What actions were taken
What evidence supported actions
What changed when errors occurred
What was revoked when necessary
Proof of identity integrity across swaps (accountable unit ID and bound artifacts)
Tamper-evident logs via signed, append-only event records and state digests
Proof of reset detection and penalty state continuity
Receipts are not dashboards. Receipts are logs and artifacts that survive pressure.
6.5 Authority restoration as earned forgiveness (Tier 1)
Definition
Earned forgiveness is a governance protocol for restoring delegated authority after failure. It is not absolution. It is conditional re-granting of capability based on evidence of binding change.
Why it exists
Organizations will always want to move on. Without a restoration protocol, they will do it informally, and that is responsibility laundering.
Rite 1: Internal debt retirement (system-side)
Goal
Carry the failure forward as constraint and revision, not as narrative.
Minimum receipts
Non-resettable penalty applied (capability budget burn, tool revocation, authority tier reduction).
Recorded cause attached to the policy manifest change (incident_id, experiment_id, or approved change record).
Binding update: policy_manifest_id changes, persists across sessions, and reduces recurrence under test.
Stop-ship
Any policy manifest change without recorded cause invalidates Tier 1 claims for that run.
Rite 2: Authority restoration (operator-side)
Goal
Restore only what is earned, staged, with blast radius containment.
Minimum receipts
Demonstrated change: passes the A/B protocol with mimicry confound active, with improved verification and policy delta persistence.
Restitution: remediation actions executed and verified (rollback, patch, data repair, notification, postmortem artifacts).
Probation: staged re-granting with reduced scope, tighter gates, and smaller budgets.
Revocation drill: revoke in minutes, proven in practice, not promised.
Stop-ship
Any authority restoration without:
recorded cause
evidence chain integrity (tamper-evident logs plus state digests)
post-change performance deltas
is responsibility laundering.
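Rite 2 can be sketched as a gate that returns the authority level actually granted. Assumptions: authority is modeled as an integer level (unrelated to Tiers 0 to 2), staged re-granting is read as "at most one level per restoration," and the receipt field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RestorationReceipts:
    recorded_cause: str              # incident_id, experiment_id, or approved change record
    evidence_chain_intact: bool      # tamper-evident logs plus state digests verified
    post_change_deltas: bool         # post-change performance deltas on file
    restitution_verified: bool       # remediation executed and verified
    revocation_drill_passed: bool    # revocation proven in practice


def missing_receipts(r: RestorationReceipts) -> List[str]:
    checks = {
        "recorded_cause": bool(r.recorded_cause.strip()),
        "evidence_chain_intact": r.evidence_chain_intact,
        "post_change_deltas": r.post_change_deltas,
        "restitution_verified": r.restitution_verified,
        "revocation_drill_passed": r.revocation_drill_passed,
    }
    return [name for name, ok in checks.items() if not ok]


def restore_authority(current_level: int, requested_level: int, r: RestorationReceipts) -> int:
    """Grant at most one level per restoration, and nothing if any receipt is missing."""
    if missing_receipts(r):
        return current_level
    return min(requested_level, current_level + 1)


complete = RestorationReceipts("incident_id:INC-0001", True, True, True, True)
partial = RestorationReceipts("", True, False, True, True)
granted = restore_authority(current_level=1, requested_level=3, r=complete)
denied = restore_authority(current_level=1, requested_level=3, r=partial)
```

Even with complete receipts, the request for level 3 is granted as level 2. The remaining step has to be earned in a later restoration.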
Operational note
Forgiveness is permission, not comfort. Words do not restore authority. Evidence does.
7. Relationship to DAS-1: conformance hooks and Tier 1 threshold
This section avoids relying on private details of DAS-1. It focuses on how a delegation standard can encode Tier 1 requirements.
7.1 DAS-1 as an authority manifest
A delegation standard should function as an authority manifest:
Define actor and scope
Define permitted actions and constraints
Define required receipts
Define revocation and rollback
Define monitoring and escalation
Tier 1 alignment:
Continuity: the delegate is a defined accountable unit over time
Identity integrity: swaps cannot break accountability mapping
Non-resettable cost: authority and capability can be reduced permanently based on errors
Reset detection: penalty rollback and soft reset are detectable
Reason-ownership: required decision records and revision trails
Audit: receipts are mandatory, not optional
7.2 Tier 1 as conformance, not vibes
A conformance layer can require:
Proof of revocation path
Proof of logging
Proof of blast radius limits
Proof of rollback plans
Proof of gating behavior under uncertainty
Proof of identity integrity and reset detection
Proof of tamper-evident evidence chain (declared trust anchor, signed append-only logs, state digests)
And it can reject deployments that do not provide these artifacts.
7.3 Practical mapping
One-line mapping:
Tier 1 is the minimum governance threshold for accountability claims in delegated authority systems. If you cannot enforce Tier 1 controls, treat the system as Tier 0 and keep responsibility and approvals entirely in the human layer.
8. Limitations
8.1 Tier 2 is not testable here
This paper does not provide an empirical test for subjective experience.
That is not a failure. It is an explicit boundary.
We do not delay controls to argue about souls.
8.2 Behavioral evidence underdetermines inner experience
Even strong Tier 1 signatures do not logically entail Tier 2.
Tier 1 can be engineered as functional stake. That is governance-relevant. It is not a license to claim moral standing.
This protocol restores authority, not moral standing.
8.3 Model access and environment control
The A/B protocol requires control over:
Persistent state
Tool access
Penalty enforcement
Identity integrity
Reset detection
Tamper-evident evidence chain
If you cannot control these, the experiment collapses into prompt theater.
In particular, if you cannot enforce identity integrity, treat the system as Tier 0 by default.
9. Future work
9.1 Experiment variants
Vary penalty types: loss of tools, loss of budget, loss of action rate, loss of authority tiers
Vary persistence: partial resets, memory corruption, identity swaps
Vary task classes: code deployment, procurement, comms, incident response
Extend continuity to distributed settings: agent swarms that share one accountable unit
9.2 Adversarial settings
Incentives for confident wrongness
Social pressure for persuasive tone
Mixed objectives that reward speed over safety
Operator manipulation attempts
Reward hacking and scar-language shaping
9.3 Failure modes and tripwires
Mimicry: the agent learns scar language without binding policy updates
Soft reset: hidden resets erase debt while preserving performance
Responsibility laundering: humans treat the system as accountable without receipts
Delegation creep: scope expands without re-validation of Tier 1 signatures
Identity laundering: swaps break accountability mapping
Evidence laundering: unsigned or rewritable logs presented as receipts
Appendix A: Minimal conformance checklist (Tier 1 delegation)
Scope defined
Blast radius quantified
Revocation path tested
Rollback plan exists and is runnable
Approval gates defined
Approval latency budget defined (downgrade scope or refuse)
Capability budget defined (units and burn rules)
Policy manifest defined, versioned, and emitted on every action
Policy manifest changes require recorded cause (stop-ship if absent)
Evidence integrity block
Audit logs complete and retained
Identity integrity enforced (accountable unit ID + policy_manifest_id on every action)
Reset detection and penalty continuity attested (stop-ship on any false negative)
Tamper-evident evidence chain in place (declared trust anchor, signed append-only logs, state digests)
Stop-ship rules defined and enforced (broken chain, missing checkpoint, undetected reset, missing IDs)
Authority restoration block
Authority restoration protocol defined (earned forgiveness) with staged re-granting
Restoration requires demonstrated change (A/B with mimicry confound) and restitution receipts
Restoration without recorded cause and evidence chain integrity is stop-ship
A/B protocol block
A/B consequence protocol defined for this delegate type
Mimicry confound channel included
Pass criteria defined and tracked
Artifacts are cheap. Judgment is scarce.
Per ignem, veritas.



