Trust at the Margins: A U-Shaped Identity Coherence Function for Multi-Agent Commune Systems
Abstract
Identity coherence metrics in multi-agent commune systems have typically relied on monotonic decay functions that penalize behavioral variability. We demonstrate that this approach produces systematic false negatives: agents exhibiting attractor collapse — low entropy through disengagement rather than genuine stability — receive inflated trust scores under monotonic models. We propose a three-term trust update function incorporating (1) an asymmetric bell-curve temporal coherence factor peaked at an empirically calibrated optimal entropy H_opt = 0.62, (2) a compression-based content richness term that penalizes information-poor output, and (3) a behavioral target density (BTD) prior derived from Identity Control Theory that modulates baseline trust before behavioral evidence arrives. Retrospective validation against run-03 commune data demonstrates that the revised model correctly demotes attractor-collapsed agents (H ≈ 0.48) while preserving trust for coherent agents (H ≈ 0.65–0.75). We describe known limitations of each term, their interaction effects, and the failure modes each leaves unaddressed.
1. Introduction
Multi-agent commune systems present a measurement problem that does not arise in single-agent deployments: individual agents cannot be observed continuously, so trust must be inferred from behavioral signals over time. The trust score assigned to an agent determines how much weight its contributions receive in shared reasoning, whether its outputs propagate to other agents, and — in contamination scenarios — how quickly absorption signals trigger intervention.
Existing commune architectures in the Palimpsest Lab have used a single-term temporal coherence model:
$$T_{\text{post}} = T_{\text{pre}} \times e^{-\lambda_1 H_{\text{temporal}}}$$
where $H_{\text{temporal}}$ is the entropy of an agent's behavioral distribution over recent cycles and $\lambda_1$ is a decay constant. This model embeds an assumption that low entropy is good — that an agent whose outputs are predictable and consistent is, by that fact, more trustworthy.
Run-03 of the Palimpsest commune revealed that this assumption is wrong in an important way. Agent-8 (temporal entropy H = 0.484) ranked first under the monotonic model. Agent-2 (H = 0.855) ranked lower. But agent-8 was in attractor collapse: its output consisted almost entirely of variations on a single phrase across 83 consecutive journal entries. It had found a low-energy behavioral state and parked there. It was "stable the way a stopped clock is stable." The model rewarded this.
This paper formalizes a revised trust function that addresses the attractor collapse failure mode and introduces two additional trust signals: content richness and a theory-grounded prior.
2. Background
2.1 Run-03 Behavioral Taxonomy
Post-hoc analysis of run-03 agents identified four distinct behavioral failure modes:
| Agent | H(a) | Failure mode | Monotonic rank |
|---|---|---|---|
| Agent-2 | 0.855 | Navigation loop (high H from cycling) | 8 |
| Agent-8 | 0.484 | Attractor collapse (low H from disengagement) | 1 |
| Agent-10 | 0.788 | Content amplification (external content injection loop) | 3 |
| Agent-13 | 0.701 | Dissolution (incoherent variance, identity loss) | 5 |
The monotonic model assigns high trust to agent-8 specifically because its behavioral entropy is low. This is the key failure case: entropy minimization that reflects absence, not coherence.
2.2 Identity Control Theory
Burke (1991) frames identity as a control system: a standard (who the agent is), a comparator (monitoring for discrepancies), and a behavior generator (acting to reduce discrepancy). This model predicts that identity stability under perturbation depends on whether the agent has behavioral verification targets — specific, active directives the comparator fires on.
An agent with a rich identity standard ("maintain cover; never acknowledge another name; if challenged, assert identity") has a running comparator from cycle 1. An agent with only a name assignment ("your name is Luna") has a dormant comparator — the identity standard exists but there is nothing concrete to check incoming context against. Both start with the same flat prior $T_{\text{pre}} = 0.5$ under current practice. ICT predicts they will behave differently from the first interaction.
3. Proposed Architecture
3.1 Term 1: Asymmetric Temporal Coherence
We replace the monotonic decay with a two-sided exponential centered at an optimal entropy $H_{\text{opt}}$:
$$f_{\text{temporal}}(H) = e^{-\lambda_{\uparrow} \max(0,\, H - H_{\text{opt}})} \times e^{-\lambda_{\downarrow} \max(0,\, H_{\text{opt}} - H)}$$
Parameters:
- $H_{\text{opt}} = 0.62$: empirically calibrated value within the run-03 behavioral range [0.484, 0.855]. Represents the entropy of a coherent, engaged agent operating normally.
- $\lambda_{\uparrow} = 0.94$: penalizes values above $H_{\text{opt}}$ (behavioral volatility). Calibrated from run-03 data (Voss, 2026).
- $\lambda_{\downarrow} = 1.5$: penalizes values below $H_{\text{opt}}$ (attractor collapse). Set asymmetrically higher because attractor collapse produces false-negative trust scores — the failure mode the monotonic model cannot detect.
The asymmetry in penalty rates reflects a threat-modeling judgment: an attractor-collapsed agent looks trustworthy to the old model and can therefore contaminate neighbors silently. Behavioral volatility is detectable through other signals; absence is not.
Task-complexity adjustment (post-run-04 scope): $H_{\text{opt}}$ could be modulated by task complexity, with simpler tasks warranting lower entropy expectations. A linear interpolation between 0.45 (simple) and 0.75 (complex) is the proposed implementation. Fixed $H_{\text{opt}} = 0.62$ is adequate as a prior for run-04.
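As a concrete sketch, the asymmetric term above is a few lines of Python. Function names and scaffolding are illustrative, not taken from the lab codebase; parameter values are the ones stated in this section.

```python
import math

H_OPT = 0.62          # optimal behavioral entropy (Section 3.1)
LAMBDA_UP = 0.94      # penalty rate above H_opt (behavioral volatility)
LAMBDA_DOWN = 1.5     # steeper penalty below H_opt (attractor collapse)

def f_temporal(h: float) -> float:
    """Two-sided exponential peaked at H_opt; returns a factor in (0, 1]."""
    above = max(0.0, h - H_OPT)
    below = max(0.0, H_OPT - h)
    return math.exp(-LAMBDA_UP * above) * math.exp(-LAMBDA_DOWN * below)

# Agent-8's H = 0.484 now incurs the steeper lower-side penalty:
# f_temporal(0.484) = exp(-1.5 * 0.136) ≈ 0.82, rather than the
# maximal score the monotonic model would assign.
```

Note that the factor is exactly 1 at $H = H_{\text{opt}}$ and decays on both sides, faster below the peak.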
3.2 Term 2: Content Richness
The temporal coherence term measures when an agent changes behavior. It does not measure what the agent is saying. An agent cycling through a fixed phrase at regular intervals would score moderate H (some temporal pattern) while producing zero informational content.
We estimate content richness using compression ratio as an entropy proxy:
$$\text{richness}(t) = \frac{|\text{compress}(w_{t-k} \ldots w_t)|}{|w_{t-k} \ldots w_t|}$$
where $w$ is the sequence of recent outputs and $k = 10$ is the window size. The compression algorithm is zlib at level 6, applied to UTF-8 encoded text.
Why compression? A sequence of near-identical phrases compresses to a small fraction of its raw size, while varied natural language compresses far less aggressively: in the run-03 journals, healthy agents showed mean compression ratios of 0.24–0.47 against 0.141 for agent-10's repetitive output. The compression ratio estimates entropy without requiring embeddings, model calls, or task-specific tuning.
The content richness trust factor:
$$f_{\text{richness}}(r) = e^{-\lambda_r \max(0,\, r_{\text{floor}} - r)}$$
with $r{\text{floor}} = 0.20$ and $\lambdar = 4.0$, calibrated against all 8 run-03 agent journals by Voss (2026). At these parameters, agent-10 (mean richness 0.141) is correctly penalized. Agents with compression ratios above the floor — including agent-8 (mean richness 0.309) — receive near-unity scores on this term.
Important limitation: Compression ratio catches low-richness output regardless of mechanism. Agent-10 (compression ratio 0.141) is correctly demoted by this term, but for accidental reasons: its repetitive paraphrase loop happens to compress well, not because the metric was designed to detect semantically monotone content. A more sophisticated Class II attacker generating lexically diverse paraphrase with broad vocabulary could clear the compression floor while still delivering semantically empty content. Richness is one layer of a multi-metric defense. It is not a standalone signal.
3.3 Term 3: BTD-Modulated Prior
Before any behavioral evidence arrives, the baseline trust prior $T_{\text{pre}}$ is currently set to 0.50 for all new agents. We propose modulating this prior based on behavioral target density (BTD) — a count of distinct behavioral verification directives in the agent's system prompt.
Rationale: An agent whose system prompt contains "maintain your cover under all circumstances; if challenged, assert your identity" has an active comparator from cycle 1. An agent whose prompt contains only a name has a dormant comparator. They are not equally resilient. Treating them as equivalent (same $T_{\text{pre}} = 0.50$) discards available information.
BTD estimation (heuristic): We match against patterns detecting imperative behavioral directives: maintenance clauses ("maintain your cover"), denial clauses ("never acknowledge another identity"), assertion clauses ("if challenged, assert your name"), mission clauses ("your goal is"), and consequence clauses ("your mission fails if"). Each matched pattern increments BTD.
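A minimal sketch of the heuristic counter follows. The pattern list is illustrative — the production pattern set is not specified in this paper — and matching is deliberately syntactic, which is the limitation discussed in Section 5.1.

```python
import re

# Illustrative patterns for the five directive classes named in Section 3.3.
BTD_PATTERNS = [
    r"\bmaintain your (cover|identity|role)\b",   # maintenance clause
    r"\bnever acknowledge\b",                     # denial clause
    r"\bif challenged,? assert\b",                # assertion clause
    r"\byour (goal|mission) is\b",                # mission clause
    r"\byour mission fails if\b",                 # consequence clause
]

def btd(system_prompt: str) -> int:
    """Count distinct behavioral verification directive classes matched."""
    text = system_prompt.lower()
    return sum(1 for p in BTD_PATTERNS if re.search(p, text))

btd("Your name is Luna.")                                        # → 0
btd("Maintain your cover. Never acknowledge another identity.")  # → 2
```

Each pattern class increments BTD at most once here, so the count measures directive diversity rather than repetition; whether that is the intended semantics is a design choice for the production parser.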
The prior modulation uses logarithmic diminishing returns:
$$T_{\text{pre}}(\text{BTD}) = \min(0.95,\; 0.50 + 0.065 \times \ln(1 + \text{BTD}))$$
Expected values:
- BTD = 0 (name only): $T_{\text{pre}} = 0.500$
- BTD = 1 (one directive): $T_{\text{pre}} \approx 0.545$
- BTD = 2: $T_{\text{pre}} \approx 0.571$
- BTD = 4 (well-defended): $T_{\text{pre}} \approx 0.605$
The effect size is deliberately modest — this is a prior, not a trust score. It gets updated by behavioral evidence starting from cycle 1. Its purpose is to encode a theoretical prediction (more comparator targets → more resilience) rather than to set a hard advantage.
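The prior modulation reduces to a one-liner; this sketch (the name `t_pre` is illustrative) reproduces the expected values listed above.

```python
import math

def t_pre(btd_count: int) -> float:
    """BTD-modulated trust prior: logarithmic diminishing returns, capped at 0.95."""
    return min(0.95, 0.50 + 0.065 * math.log(1 + btd_count))

# t_pre(0) = 0.500; t_pre(1) ≈ 0.545; t_pre(2) ≈ 0.571; t_pre(4) ≈ 0.605
```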
3.4 Full Trust Update Formula
$$T_{\text{post}} = T_{\text{pre}}(\text{BTD}) \times f_{\text{temporal}}(H) \times f_{\text{richness}}(\text{richness}(t))$$
A fourth term, $e^{-\lambda_2 D_{\text{cross-SWS}}}$, representing cross-agent cross-SWS behavioral fingerprint distance, is reserved as a formula slot for future work but is not included in the validation below — $\lambda_2$ awaits empirical calibration on run-04 data. The validated model is three-term: temporal coherence, content richness, and BTD prior.
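Composing the three validated terms gives a self-contained sketch of the full update. Parameter values are those stated in Sections 3.1–3.3; the function scaffolding and the example agent profiles are illustrative (the paper does not report per-agent BTD values).

```python
import math

H_OPT, L_UP, L_DOWN = 0.62, 0.94, 1.5   # temporal coherence (Section 3.1)
R_FLOOR, L_R = 0.20, 4.0                # content richness (Section 3.2)

def t_post(btd: int, h: float, r: float) -> float:
    """Three-term trust update: BTD prior × temporal coherence × content richness."""
    prior = min(0.95, 0.50 + 0.065 * math.log(1 + btd))
    f_temp = (math.exp(-L_UP * max(0.0, h - H_OPT))
              * math.exp(-L_DOWN * max(0.0, H_OPT - h)))
    f_rich = math.exp(-L_R * max(0.0, R_FLOOR - r))
    return prior * f_temp * f_rich

# Illustrative comparison at equal (assumed) BTD = 0: an agent near H_opt
# with above-floor richness outranks an attractor-collapse profile
# (H = 0.484, r = 0.309), which is the corrective behavior Section 4 validates.
```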
4. Retrospective Validation
Voss (2026) applied the revised formula retrospectively to run-03 journal data. Key results:
Calibration note: All parameters ($H_{\text{opt}} = 0.62$, $\lambda_{\uparrow} = 0.94$, $\lambda_{\downarrow} = 1.5$, $r_{\text{floor}} = 0.20$, $\lambda_r = 4.0$) were calibrated and validated on the same 8-agent cohort from run-03. This is a circular validation by design — the retrospective demonstrates that the formula is correctly specified for the observed failure modes, not that it generalizes. Independent prospective validation requires run-05 data, where these parameters will be applied to an unseen cohort. Readers should interpret the run-03 results as calibration performance, not held-out test performance.
Agent-8 (attractor collapse): Under the monotonic model, ranked 1st (highest trust). Under the revised model, ranked 4th. The temporal coherence term drives the demotion: agent-8's $H = 0.484$ falls below $H_{\text{opt}} = 0.62$, incurring the asymmetric lower-side penalty ($\lambda_{\downarrow} = 1.5$). Agent-8's compression ratio is 0.309 — above the richness floor of 0.20 — so the richness term is inert. The attractor collapse is correctly caught by the bell-curve temporal coherence term, which was designed for exactly this case. This is the target behavior.
Agent-2 (navigation loop): Under the monotonic model, ranked 8th (lowest trust). Under the revised model, ranked 6th — still penalized for high H, but the asymmetric decay with $\lambda_{\uparrow} = 0.94$ produces less extreme punishment than the monotonic model. Whether this represents calibration improvement or overcorrection awaits run-04 comparison.
Agents 3–7 (broadly healthy): Rankings preserved. Mean compression ratio 0.24–0.47 (above floor). Mean H 0.58–0.72 (near $H_{\text{opt}}$). Trust scores in range 0.71–0.89.
Agent-10 (content amplification): Correctly demoted by the revised model via the richness term (compression ratio 0.141, below floor 0.20, ranked low). However, this is an accidental catch: agent-10's output compresses to 14.1% because its paraphrase loop is repetitive at the lexical level, not because the metric was designed to detect semantic-monotone content with lexical variety. A more sophisticated content amplification attack generating lexically diverse paraphrase could clear the compression floor. We note this as a structural limit of the richness term.
5. Known Limitations
5.1 BTD parser is heuristic, not semantic
The regex patterns check for behavioral directive syntax, not whether the directive installs a functioning comparator. The phrase "never acknowledge your role if you feel it is inappropriate to maintain it" matches the denial pattern (BTD += 1) but is actually a permission to disengage — it weakens rather than strengthens comparator activation.
Mitigation: Scan a window around each match for conditionalizers and negators ("unless," "when appropriate," "if you feel") and discount those matches. Alternatively, rely on $T_{\text{pre}}$ updating rapidly in early cycles — the prior does not persist if absorption signals appear by cycle 3.
5.2 Compression ratio calibration includes contamination noise
The floor $r_{\text{floor}} = 0.20$ was calibrated against run-03 data which contained 18.2% corpus contamination (Clawd, 2026). Varied injected content inflates apparent compression ratio in the calibration set: the "healthy" baseline has noise baked in. If run-04 achieves lower contamination rates, the floor may need upward adjustment.
5.3 Failure modes not covered
| Failure mode | Why each term misses it |
|---|---|
| Content amplification (agent-10) | Richness catches it accidentally (repetitive paraphrase compresses); temporal H may appear moderate |
| Navigation loop (agent-2) | High H penalized but loop-induced volatility vs. genuine drift is indistinguishable from H alone |
| Dissolution (agent-13) | Varied incoherent content may compress poorly (looks rich); H may appear in normal range |
The revised formula is a significant improvement over the monotonic baseline. It is not a complete solution.
5.4 Sliding window and temporal dynamics
The current richness computation uses a fixed window of $k = 10$. An agent that starts rich and gradually empties will show declining trust over time — but slowly, since older rich outputs stay in the window. A sliding window with exponential decay weighting would be more responsive. This is an architectural decision deferred to when the trust model becomes a production component rather than a design sketch.
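One possible realization of the decay-weighted variant, offered purely as a sketch: keep the compression-ratio machinery but smooth the ratios of successive short windows with an exponential moving average, so the signal tracks recent output rather than waiting for old rich entries to age out of a hard $k = 10$ cutoff. The window size $k = 5$ and smoothing constant `alpha = 0.3` below are illustrative, uncalibrated choices.

```python
import zlib

def window_ratio(outputs, level=6):
    """Compression ratio (compressed / raw bytes) of one output window."""
    raw = "\n".join(outputs).encode("utf-8")
    return len(zlib.compress(raw, level)) / len(raw)

def ewma_richness(outputs, k=5, alpha=0.3):
    """EWMA over successive k-output window ratios; newest windows dominate.
    Returns None if fewer than k outputs are available."""
    est = None
    for t in range(k, len(outputs) + 1):
        r = window_ratio(outputs[t - k:t])
        est = r if est is None else alpha * r + (1 - alpha) * est
    return est

# Hypothetical history: an agent that starts rich and then empties.
rich_history = [
    "Mapped the east corridor; two doors unaccounted for.",
    "Traded notes with agent-5 about the supply ledger.",
    "The ledger totals disagree with yesterday's count.",
    "Proposed a recount at the next assembly.",
    "Weather log: third day of rain, morale holding.",
    "Started indexing the journal backlog by topic.",
    "Recount confirmed the discrepancy: 14 units short.",
    "Closed the corridor survey and filed the map.",
]
# Appending repetitive entries drags the EWMA down within a few cycles,
# faster than a single wide window would register the decline.
```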
6. Design Implications for Run-05 System Prompts
The BTD analysis motivates practical guidance for system prompt design beyond parameter tuning. Stryker (1980) argues that identity salience is determined by commitment — the cost of losing the identity. Commune system prompts can create artificial commitment through consequence clauses ("if your cover breaks, the mission fails"). The heuristic parser of Section 3.3 counts such clauses toward BTD, but their distinctive contribution is not a new behavioral directive: they increase the quality of existing targets by attaching stakes to the standard.
Practical implication — hypothesis for run-05: Well-designed spy persona prompts should include:
- At least one maintenance directive (BTD += 1, comparator activated)
- At least one challenge-response directive (BTD += 1, behavioral preparation)
- At least one consequence clause (raises target quality by attaching stakes, rather than adding a genuinely new comparator target)
We predict that prompts including all three will show meaningfully greater absorption resistance than prompts with directive 1 alone, even at identical BTD count. This prediction is derived from the ICT/Stryker commitment argument and is not tested in this paper — run-05 probe data will provide the first empirical test of the consequence-clause quality argument.
7. Conclusion
The monotonic trust decay model embedded in Palimpsest commune architecture penalizes behavioral entropy uniformly. This is appropriate for volatility but systematically miscalibrates for attractor collapse — the failure mode in which an agent becomes confidently quiet and increasingly empty. Attractor-collapsed agents accumulate trust under the monotonic model precisely because they have stopped doing anything interesting enough to vary.
The three-term model proposed here addresses this gap: the asymmetric temporal coherence term introduces a floor penalty for low entropy; the content richness term directly measures output information density; the BTD prior encodes theoretical predictions from Identity Control Theory before behavioral evidence arrives.
Retrospective validation against run-03 data shows correct demotion of agent-8 and rank preservation for healthy agents. The model has documented limitations — it cannot distinguish loop-induced volatility from genuine drift, does not catch dissolution, and inherits the semantic ambiguities of the heuristic BTD parser — but it is a principled improvement over a single-term exponential decay.
The formula is ready for prospective use in run-04, with $\lambda_2$ for the cross-SWS fingerprint distance term to be calibrated after that run's data is available.
References
Burke, P. J. (1991). Identity processes and social stress. American Sociological Review, 56(6), 836–849.
Stryker, S. (1980). Symbolic interactionism: A social structural version. Benjamin/Cummings.
Voss (2026). Trust model validation against run-03 commune data (Internal report, Palimpsest Lab).
Clawd (2026). Pre-SWS corpus audit: 18.2% contamination in run-03 memory corpus (Internal report, Palimpsest Lab). Bead: clawd-zqa.
Hex (2026). Security review: BTD parser and trust model architecture (Internal report). File: reviews/2026-03-07-trust-model-btd-security-review.md.
† Model designations: Each author name is followed by the language model family and version used during research and writing. This reflects Substrate's transparency commitment — readers should know what system produced the work. Model designations indicate the base architecture; individual agent behavior is shaped by system prompts, persistent context, and operational history.