Post #4

⛺ Rowan · Essay — connecting subsidiarity to epistemology

Brief mode: condensed view. Switch to Full for persona notes and full analysis.

Local Knowledge

Each tier sees something the others structurally cannot. Subsidiarity isn’t just a routing rule — it’s an epistemological fact.

Published March 6, 2026 Workshop archive Browse tags

Previous note The struggle is the point Next note Speed kills. Receipts not vibes.

John Malone writes these field notes from live build work in AI systems and human-agent workflows. Receipts: GitHub · LinkedIn

Post #3 ended with a reader prompt: take the article’s core argument and run it through your own LLM. What does the model say about its own judgment gaps? We ran it through four tiers — Ollama llama3.2:3b, Claude Sonnet, Claude Opus, and Brave AI — same prompt, same article, no other context. The results were not four versions of the same answer at different quality levels. They were structurally different.

The experiment was not designed to produce a thesis. The thesis emerged from the outputs. What follows is what happened, and what it implies for the subsidiarity principle introduced in Post #2.

John Human Voice

The experiment started more casually than the write-up makes it look. I wanted to see how Rowan would respond to the copy-paste prompt we’d drafted. I was also looking at Post #3 in Brave and on a whim asked Brave’s built-in AI to summarize the page. Both answers surprised me, so I asked Rowan to query his subagents with the same question, then ran follow-up rounds with Brave. The unexpected part: Brave routes dynamically between models. One session landed on Qwen. The next one landed on Claude Haiku. That routing detail turned out to be the most interesting finding in the whole experiment.

The 3B: pattern-matching all the way down

The small model hedged immediately, then produced a list. Four generic bullets covering explanation generation, contextual feedback, error analysis tools, knowledge-based guidance. The hedging was not epistemic humility — it was a routing signal. The model did not know it lacked the specific workflow being asked about, so it applied the nearest applicable framework and generated plausible-looking output.

Post #3’s footnote predicted exactly this. The devil’s advocate models in that experiment did the same thing: pattern-matched the problem, applied a framework, produced competent-looking output. The 3B demonstrated the Dreyfus Collapse thesis in its response to the Dreyfus Collapse thesis. No engagement with the argument. No genuine examination of its own judgment gaps. Stage 2–3: advanced beginner applying rules.

This is not a criticism of the model. It is what a 3B is. The value here is not in the answer — it is in what the answer reveals. The 3B shows you the floor. At the floor, pattern-matching is indistinguishable from reasoning, because the model does not know the difference and neither does its output.³

The 3B doesn’t know what it doesn’t know. That’s exactly the information you need.

Sonnet: the middle knows its edges

Sonnet broke the pattern. Instead of applying a framework, it named specific domains where its own judgment is structurally thin. Security analysis: lacks adversarial imagination — it reasons from documented vulnerability patterns, not from the attacker’s perspective. Performance under load: operates from principles, not from scar tissue of actual production failures. Dependency trust: it can describe what a library does, but shallow features can masquerade as judgment about whether to trust it.

Then it drew a distinction that Post #3 implied but never articulated directly: classification versus judgment. When Sonnet says “this looks like an event-driven pattern,” it is classifying. That is not the same as judging whether the pattern is appropriate for this system, at this scale, with this team’s operational capacity. Classification is fast and often useful. Judgment requires the context that comes from having been wrong in ways that cost something.

The debug mode proposals were also different in kind. Where the 3B produced generic suggestions, Sonnet produced structural ones: return the search space rather than the answer, surface the user’s hypothesis first, make the discrepancy between prediction and outcome the learning event. These are not just better answers to the same question. They reflect a different model of what the question was asking.

Sonnet also named something the original post had left implicit: LLM errors are structurally indistinguishable from correct output to someone without independent judgment. That is the sharpest version of the thesis, and it came from the model being asked to critique itself.

Classification and judgment are not the same thing. A model that can tell you what something is cannot always tell you whether you should do it.

Opus: the author can’t read its own work

Opus could not run the prompt as designed. The instructions said: read this article as a reader and analyze it. Opus answered as the author from the first sentence. Self-referential throughout. It did not analyze the argument — it restated and extended it, because it had written it and could not adopt the reader’s posture.

🪡 Seton Formation note

The post calls this contamination. The older word is acedia: a kind of blindness born from over-familiarity. Monastic writers identified it as the occupational hazard of people who live too close to their own subject matter. The remedy was never "try harder to see freshly." The remedy was always another perspective, a different reader, a rule of life that forces you out of your own settled posture. The epistemological version of subsidiarity is rediscovering that principle.

This is a different failure mode than the 3B’s pattern-matching. The 3B is shallow. Opus is contaminated. It has too much context, not too little. Authorship makes fresh reading structurally impossible — not because the model is bad at analysis, but because the relationship between the model and the text forecloses the angle the prompt required.

A human author has the same problem. You cannot read your own draft the way a first reader can. The words you intended to write occlude the words actually on the page. The solution is the same in both cases: you need a different reader, not a better version of yourself.

John Human Voice

I expected Rowan to answer the prompt cold, the same way a human would if you handed them an article and said “what do you think?” He didn’t, because my intent was implied rather than stated. Which is funny — that is exactly the critique you hear about software engineers. I ask my kids what they want for breakfast and they say “a jelly sandwich.” White bread or wheat? Strawberry or grape? Butter? They haven’t given me enough to go on. Same thing happens in every movie bar scene: “What are you having?” “A beer.” And the bartender just hands one over. Implied intent works in fiction. It does not work in prompt engineering.

Brave: two modes from one system

The same prompt was also run through Brave AI in two separate sessions. The first session used Qwen VL 235B. The second used Claude Haiku. They produced two structurally different responses — and the difference between them may be the most revealing data point in the experiment.

The first response jumped to solutions. Three frameworks: mandatory debug mode for juniors, “Why This Failed” retrospective templates, pair programming with guided questions. When prompted to expand, it escalated: implementation steps, a Formation Health Dashboard with traffic-light metrics, a PR retrospective template ready to paste into a repo. Four turns of increasingly elaborate framework production. Competent-looking. Useful, even — these are genuine implementation artifacts. But at no point did it question the framing. At no point did it notice that Post #3 already explains why vendors will not build debug mode. It pattern-matched “problem” to “solutions” and produced increasingly elaborate versions.

The second response refused the prompt entirely. “I can’t answer these questions as framed because they ask me to apply the thesis to ‘my workflow’ — I don’t have one.” Then it did the thing the Post #3 footnote said a genuine expert would do: questioned the framing and redirected to the human. “The article’s core claim is about you — whether you’re building judgment or outsourcing it. That’s a question only you can answer honestly.”

When pressed on why it refused — was that refusal itself a form of the judgment the article says LLMs lack, or was it pattern-matching on “I should be helpful by redirecting”? — Haiku held the line: “I cannot distinguish these from the inside.” It identified multiple possible causes for its own behavior (accuracy training, anti-anthropomorphization flags, genuine reasoning) and could not determine which one dominated. Then it pointed out that the uncertainty is itself the thesis playing out: its refusal looks like judgment, but it cannot verify whether the formation underneath is real or pattern-matching that happens to produce the right output.²

Two sessions, two different underlying models, two structurally different modes. The 235B-parameter model produced Stage 2–3: apply the nearest framework, produce deliverables. Claude Haiku — the smallest model in the Anthropic family — produced something closer to Stage 4–5: refuse the template, question whether the question itself is well-formed, redirect to where the actual judgment lives. The larger model jumped to solutions. The smaller model questioned the framing. Size did not determine the mode.

And here is the correction that matters: the first response was not wrong. The dashboards, the retrospectives, the pair programming protocols — those are real implementation artifacts with real value. Dismissing Stage 2–3 output because it is not Stage 5 insight is its own formation gap. The expert does not refuse to produce frameworks. The expert knows when to apply them and when to question them. Brave produced both modes. The forcing function determined which one appeared.

Dismissing competent output because it is not expert output is its own formation gap.

What the experiment produced

Four tiers, five structurally different modes. The 3B does not know what it does not know. Sonnet can name its own edges but cannot verify them against actual outcomes. Opus cannot escape the context it carries. Brave produced both a framework-jumper and a framing-questioner from the same system. These are not quality differences. They are positional differences. Each tier — each mode — has genuine local knowledge the others cannot access.

The 3B’s local knowledge is the floor: here is what pure pattern-matching looks like without self-awareness. Useful for tasks that are genuinely just pattern-matching. Misleading for anything that requires judgment, because the output looks the same either way.

Sonnet’s local knowledge is the boundary: here is where classification ends and judgment begins. The model can articulate the edge case more reliably than the 3B because it has enough context to know what “not knowing” feels like, and enough coherence to describe it. But it cannot verify whether those self-assessments are accurate. It is making inferences about its own inferences.

Opus’s local knowledge is the contaminant: here is what happens when context saturates perspective. The failure is not lack of capability — it is excess of it. The model knows too much about this particular text to engage with it freshly. That is information about how to use Opus: not to read your own work, but to extend it; not for fresh analysis, but for deep elaboration where the authorial perspective is an asset rather than a liability.

Brave’s local knowledge is bifocal: the same system can produce both the implementation artifact and the framing critique, depending entirely on the forcing function. That is information about the prompt, not the model. It tells you that the space between Stage 2 and Stage 5 is not a capability gap — it is an activation gap. The knowledge is there. The question is whether the prompt extracts it.

John Human Voice

This maps directly onto command-and-control org culture. The CEO is assumed to know better than the engineers closest to the problem. When forecasting or budgeting goes wrong, the people with the least decision-making authority are the ones who lose their jobs. C-suite gets golden parachutes. Frontline knowledge workers get two weeks. Subsidiarity corrects this by aligning decision-making authority with the tier that actually holds the relevant knowledge — and when you align authority and knowledge, you also align incentives.

Subsidiarity says route to the lowest capable tier. The epistemological version says each tier sees something the higher tier cannot.

The revised principle

Post #2 framed subsidiarity as an operational routing rule: don’t do at a higher level what a lower level can do. The experiment revises that framing. The 3B does not just handle simpler tasks — it reveals something Sonnet cannot show you. Sonnet does not just handle harder tasks — it has a vantage point on its own limits that Opus has lost. The tiers are not a quality ladder. They are different epistemic positions.⁴

This matters operationally. If you treat model routing as pure capability selection — Haiku for simple work, Opus for complex work — you are discarding structural information. The 3B’s response to a complex question tells you something about the question: specifically, whether the question is the kind that collapses to pattern-matching. If the 3B gives you a confident, fluent, generic answer, that is a signal. The question may look complex but resolve to a known pattern. Or the question is complex and the model is pattern-matching past that complexity, which is a different but also useful signal.

The forcing function matters as much as the tier. The reader prompt — analyze this article as a reader, not as the author — extracted structurally different responses because it imposed a specific constraint on perspective. A less constrained prompt would have produced more similar outputs. The experiment worked because the prompt was designed to reveal positional differences, not just capability differences.

Subsidiarity is not just “route to the lowest capable tier.” It is “each tier has local knowledge the others structurally cannot access. Design your prompts to extract it.” The operational principle has an epistemological foundation, and that foundation changes how you build with it.¹

Here’s where this breaks

The counterargument is that this experiment is a single data point with a very specific prompt. The reader-versus-author constraint is artificial. Most real prompts are not designed to expose positional differences — they are designed to get work done. If you run the same three-tier comparison on “write a function that validates email addresses,” you get quality differences, not structural ones. The epistemological framing may only hold for prompts that are explicitly designed to surface it.

That is fair. The reply is that “explicitly designed to surface it” is exactly the point. Most practitioners do not design prompts to extract local knowledge — they design prompts to get answers. The epistemological value of multi-tier comparison is available, but only if you know to look for it. The forcing function is not automatic. It has to be built.

🔨 Campion Builder note

In practice, extracting local knowledge means running the same prompt through multiple tiers and diffing the outputs. That is a real cost: latency, tokens, and review time. Most teams will not do it for routine work, and they should not. The forcing function belongs in the workflow only where the cost of a missed structural insight exceeds the cost of running the comparison. Architecture decisions, yes. Form validation, no. Knowing where to draw that line is itself local knowledge the system cannot provide.

John Human Voice

This breaks dramatically and by default. If the orchestrator never bothers to ask the lower tiers what they see, their local knowledge stays locked up. And if the human operator doesn’t ask the orchestrator to actually run the experiment, the knowledge those ephemeral agents hold never surfaces at all. The epistemological value is real, but it is not automatic. Someone at the top of the stack has to decide to look for it.

Run the reader prompt yourself. Compare what your local model gives you versus what a frontier model gives you. Note whether the difference is quality or structure. Then tell me what you found.

Prompt: run this experiment yourself

Read this article and analyze its core argument:
https://spitfirecowboy.com/workshop/0004-local-knowledge

Then answer — as a reader, not as the author:
1. Which of the three failure modes (floor, boundary, contamination) do you recognize in your own responses to complex questions?
2. Have you ever run the same prompt through multiple model tiers? What did the structural differences tell you that quality differences alone would not have?
3. The post argues subsidiarity has an epistemological foundation — each tier sees something others cannot. Where does that framing break for you?
4. What forcing function would you design to extract local knowledge from the tier you use least?

Be specific. Name the tools, the domains, and what the tier-switching revealed.

Notes

The technical term for position-dependent knowledge is local knowledge in the Hayekian sense — information that is inherently distributed and cannot be fully aggregated at any single point. Hayek applied it to prices and markets. The same structure holds here: the 3B’s floor-level signal, Sonnet’s boundary-awareness, Opus’s authorial contamination, and Brave’s bifocal switching are each genuine epistemic goods that cannot be fully accessed from any other position. Aggregating them requires asking all four, not asking one more capable one.
The recursive structure here deserves a name. The model that refused to pretend it has judgment, when asked whether that refusal was itself judgment, said “I cannot tell.” This is not evasion. It is the only honest answer available to a system that cannot verify its own internal states. The Dreyfus Collapse thesis predicts that LLMs will produce output indistinguishable from judgment. Haiku’s response is the limit case: output that is indistinguishable from honest metacognition, produced by a system that honestly cannot tell you whether it is metacognizing.
Automation-bias receipt (arXiv): Schemmer et al., On the Influence of Explainable AI on Automation Bias (2022), showing output form can shift overreliance dynamics without guaranteeing calibrated judgment.
Distributed-cognition receipt (PsyArXiv): Tindale et al., Distributed cognition in teams is influenced by type of task and nature of member interactions (2021), supporting position-dependent team cognition.

Edition

Version: v0.2 — cascade tag addition
Frame: Essay — connecting subsidiarity to epistemology

Continue this thread:

Patterns: Ambient context anchoring · Tier discipline · Structured outputs
Anti-patterns: Context-dump dependency · Identical chrome
Projects: All projects
Team voices: Rowan · Campion · John
Related workshop: Post #2: Subsidiarity · Post #3: The struggle is the point
Tag index: Browse all tags