
From Mechanical Sympathy to Mechanical Enforcement

14 min read

Jackie Stewart used to talk about mechanical sympathy - the idea that a great driver feels the machine. You sense the tyres degrading before the telemetry confirms it. You know how much kerb you can ride before the suspension geometry goes wrong. The car doesn't tell you its limits. You discover them, and you respect them, and if you get it wrong you end up in the barrier.

For decades, this is what separated champions from everyone else. Not raw speed - understanding.

I built a governance framework for AI agents called CAF (Claude Agent Framework), and for the first few versions, it worked the same way. The agents understood the rules. They knew which phase they were in, which files they were allowed to touch, when they were supposed to stop and verify. It was all written down in careful prompts and detailed instructions.

This is the story of why that stopped being enough, and what I replaced it with.


The Pit Lane Problem

Here is something that actually happened in Formula 1: drivers used to manage their own speed in the pit lane. There was a limit - 80km/h, later 60km/h in some series - and you were expected to stay under it. You'd come screaming in at 300km/h, hit the brakes, and try to scrub off enough speed before the pit entry line.

They were professional racing drivers. They knew the rules. They'd practised thousands of times.

And they still got it wrong. Because when you're running on adrenaline, your tyres are going off, your engineer is talking in your ear, and there are 19 other cars trying to pass you - remembering to check your speed display is not the thing at the top of your mental stack.

The solution wasn't better training. It wasn't a strongly worded briefing note. It was an electronic pit lane limiter: a button on the steering wheel that mechanically caps the car at the speed limit. You press it and the car will not go faster, regardless of what you do with the throttle.

The problem went away overnight. Not because the drivers got better at following rules - because the car made it impossible to break them.


Nine Hooks and a Lot of Trust

When I first wrote about CAF in February, the framework had 9 hooks - Python scripts that fire on agent lifecycle events and either allow or block the action. Phase enforcement. Artifact checks. Deployment gates. The basics.

The other 90% of governance was trust-based. The agents had detailed prompts explaining what they could and couldn't do. The backend agent knew it shouldn't touch frontend code. The BA agent knew it should write an API contracts section for full-stack projects. The verify agent knew it should check BDD test results.

They knew all of this in the same way a racing driver knows the pit lane speed limit. And they got it wrong for exactly the same reasons: context pressure, competing priorities, and the fundamental reality that knowing a rule and mechanically following it are not the same thing.

Here's a real example. The backend agent is implementing a domain service. It needs a timestamp. The correct pattern is to inject a TimePort - an abstraction that makes the code deterministic and testable. The agent knows this. It's in the coding standards. It's in the hexagonal architecture rules.

But the agent is 40 turns into a session, it's juggling 6 files, and the fastest way to get the test passing is:

```python
# This works. The tests pass. The code review catches it
# (maybe) three sessions later.
created_at = datetime.now(UTC)
```

Instead of:

```python
# This is correct. It's testable. It's deterministic.
# But it requires wiring up a port and the agent is tired.
created_at = self.time_provider.now()
```

Both pass unit tests. The first one introduces flaky behaviour that surfaces weeks later in an unrelated test run. By then, nobody remembers who wrote it or why.

This is the pit lane problem. Not incompetence - cognitive load.


Thirty-Five Hooks and No Trust at All

Two weeks later, CAF has 35 hooks. The hook count nearly quadrupled. And I want to be clear about something: I didn't set out to write 26 new governance scripts. Every single one of them exists because a specific failure happened, someone said "the agent should have known better," and I said "the agent will never need to know, because the hook won't let it happen."

Here's what the enforcement surface looks like now:

[Diagram: the CAF enforcement surface - hooks across the agent lifecycle, with blocking gates in red]

Every red node is a blocking gate. The agent doesn't get asked "are you sure?" - it gets stopped. The distinction matters. An advisory warning is a pit lane speed sign. A blocking hook is a pit lane limiter.


Track Limits: The Determinism Checker

Formula 1 has electronic track limits now. Run wide at Turn 4, the system detects it automatically, and the lap time gets deleted. No steward judgment needed. No debate about whether the driver gained an advantage. The white line is the white line.

The determinism checker works the same way. It fires every time a Python file is written or edited, and it scans for patterns that introduce non-determinism:

```python
import re

FORBIDDEN_PATTERNS = [
    re.compile(r"datetime\.now\s*\("),
    re.compile(r"datetime\.utcnow\s*\("),
    re.compile(r"uuid4\s*\("),
    re.compile(r"(?<!import\s)random\.\w+\s*\("),
]
```

Four patterns. That's it. But the enforcement is path-aware:

```python
# Domain/core code: BLOCK. No exceptions.
# The agent must use port injection.
if is_domain_path(file_path):
    print(json.dumps({"result": "block", "reason": reason}))
    sys.exit(2)

# Adapter/service code: WARN. You probably want a port,
# but sometimes adapters genuinely need the system clock.
else:
    print(json.dumps({"result": "warn", "message": message}))
    sys.exit(0)
```

The key design decision: domain code is sacred. If you're in src/domain/ or src/core/, the hook doesn't care why you need datetime.now(). You don't get to use it. The hexagonal architecture boundary is enforced the same way track limits are - automatically, every time, no discretion.
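The path check itself can be a few lines. A sketch - the exact roots CAF treats as domain code are an assumption on my part:

```python
DOMAIN_ROOTS = ("src/domain/", "src/core/")  # assumed "sacred" roots


def is_domain_path(file_path: str) -> bool:
    """True if the file sits under a hexagonal-architecture domain root."""
    # Normalise separators so Windows-style paths match too
    normalised = file_path.replace("\\", "/")
    return any(normalised.startswith(root) for root in DOMAIN_ROOTS)
```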

Adapters get a warning instead of a block because sometimes the adapter is the boundary between your pure domain and the messy real world. A warning says "I noticed, justify this." A block says "no."


Parc Ferme: Exclusive Agent Permissions

After qualifying in Formula 1, the car goes into parc ferme. You cannot change it. The configuration is locked. This prevents teams from qualifying with one setup and racing with another - it forces commitment.

CAF has the same concept for agents. When the backend agent is active, it literally cannot write to frontend/src/. When the frontend agent is active, it cannot modify Python files under src/. The docs agent owns docs/site/ exclusively.

```python
EXCLUSIVE_PATHS = {
    "back":  {"allowed": ["src/"], "blocked": ["frontend/", "docs/site/"]},
    "front": {"allowed": ["frontend/"], "blocked": ["src/", "docs/site/"]},
    "docs":  {"allowed": ["docs/site/"], "blocked": ["src/", "frontend/"]},
}
```
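A write hook consuming that table is then a straightforward prefix check. A sketch, with the table repeated so it runs standalone and the function name assumed:

```python
EXCLUSIVE_PATHS = {
    "back":  {"allowed": ["src/"], "blocked": ["frontend/", "docs/site/"]},
    "front": {"allowed": ["frontend/"], "blocked": ["src/", "docs/site/"]},
    "docs":  {"allowed": ["docs/site/"], "blocked": ["src/", "frontend/"]},
}


def write_allowed(agent: str, file_path: str) -> bool:
    """Deny-first: blocked prefixes win, then the path must be claimed."""
    rules = EXCLUSIVE_PATHS.get(agent)
    if rules is None:
        return True  # agents without exclusive claims are unrestricted
    if any(file_path.startswith(prefix) for prefix in rules["blocked"]):
        return False
    return any(file_path.startswith(prefix) for prefix in rules["allowed"])
```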

This isn't about trust. The backend agent is perfectly capable of writing valid TypeScript. The problem is coordination: if two agents can modify the same files, you get merge conflicts, overwritten work, and the kind of subtle integration bugs that only surface when you deploy both sides together.

Parc ferme doesn't exist because teams can't be trusted with a spanner. It exists because the system is simpler when configuration is locked.


The Halo: BDD as a Lifecycle Gate

When the halo was introduced to Formula 1 in 2018, the pushback was extraordinary. It looked ugly. It added weight. It obscured the driver's view. Real racing drivers didn't need a titanium bar protecting their head.

Then Romain Grosjean's car split in half and caught fire at 220km/h in Bahrain, and he walked away because the halo held.

BDD (Behaviour-Driven Development) is my halo. Not because it's controversial - because the resistance to making it mandatory follows the same pattern. "We don't need formal feature files, we have good test coverage." "Gherkin is overhead for small projects." "We'll add the feature files later, once the code is working."

"Later" never comes. The feature files don't get written. The integration tests cover the happy path but miss the edge cases that the Gherkin scenarios would have caught. And three sprints later, someone finds a bug that was specified in the user journey but never tested.

So I made it a gate. A blocking one.

[Diagram: BDD as a lifecycle gate - feature files required before coding begins, passing BDD results required before a PASS verdict]

The hook that enforces this is verify_bdd_features.py. It reads the user journeys artifact, extracts every P1 and P2 journey ID, then checks specs/features/ for a .feature file tagged with the corresponding @J{NNN}. If any are missing, the coding agent is blocked before it writes a single line:

```python
missing = []
for journey in journeys:
    if journey["priority"] in ("P1", "P2"):
        if not find_feature_for_journey(features_dir, journey["id"]):
            missing.append(journey)

if missing:
    print(json.dumps({
        "result": "block",
        "reason": "P1/P2 user journeys missing BDD .feature files."
    }))
    sys.exit(1)
```

The verify agent gets its own gate on the other end: it cannot issue a PASS verdict unless bdd_results.json exists with passing results from both pytest-bdd and playwright-bdd. The feature files aren't documentation - they're load-bearing infrastructure.
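The PASS-side check mirrors it. A sketch - the key names inside bdd_results.json are assumptions:

```python
import json
from pathlib import Path


def bdd_results_pass(results_file: Path) -> bool:
    """True only if both BDD runners report a passing run (keys assumed)."""
    if not results_file.exists():
        return False
    data = json.loads(results_file.read_text(encoding="utf-8"))
    runners = ("pytest-bdd", "playwright-bdd")
    return all(data.get(runner, {}).get("passed") is True for runner in runners)
```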

Nobody argues with the halo anymore.


The Deployment Gate: A Ten-Minute Token

This one is my favourite because it's the most literal pit lane limiter in the system.

Every Bash command the agent runs gets scanned against deployment patterns. Fly.io deploys, Vercel deploys, GitHub Actions workflow triggers - the hook catches them all:

```python
DEPLOY_PATTERNS = [
    r"\bfly\s+deploy\b",
    r"\bfly\s+scale\b",
    r"\bvercel\s+deploy\b",
    r"\bvercel\s+--prod\b",
    r"\bgh\s+workflow\s+run\b",
    r"\.deploy_gate",  # Even protect the gate file itself
]
```
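Matching a Bash command against these is then a one-liner. A sketch using a subset of the patterns:

```python
import re

DEPLOY_PATTERNS = [  # subset of the full list, for illustration
    r"\bfly\s+deploy\b",
    r"\bvercel\s+--prod\b",
    r"\bgh\s+workflow\s+run\b",
]


def is_deploy_command(cmd: str) -> bool:
    """True if the command matches any deployment pattern."""
    return any(re.search(pattern, cmd) for pattern in DEPLOY_PATTERNS)
```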

If a pattern matches, the hook checks for a .deploy_gate file - a JSON token that only the ops agent can create. The token has a 10-minute TTL:

```python
import json
from datetime import UTC, datetime
from pathlib import Path

GATE_TTL_SECONDS = 600  # 10 minutes

def gate_is_valid(gate_file: Path) -> bool:
    if not gate_file.exists():
        return False
    data = json.loads(gate_file.read_text())
    age = (datetime.now(UTC) -
           datetime.fromisoformat(data["created_at"])).total_seconds()
    return age < GATE_TTL_SECONDS
```

No valid token? Blocked. Expired token? Blocked. Token created by a non-ops agent? Impossible - the gate file path itself is in the patterns list, so any attempt to write it from Bash gets caught.

The result: a deployment requires a conscious, recent decision by the ops agent. Not a stale approval from three hours ago. Not a "we already verified this" from a different session. A fresh, 10-minute window that was deliberately opened for this specific deployment.

Formula 1 has a similar concept with the pit window in strategy calls - a narrow, time-limited opportunity where the team must commit or lose the option. The constraint creates focus.


The API Contract Wall

This is the newest addition, and it addresses a problem that's specific to full-stack projects: field name drift.

When a backend agent and a frontend agent implement the same API boundary in parallel, they independently interpret the spec. The backend writes user_id. The frontend expects userId. Both sides' unit tests pass independently. The bug only surfaces in integration testing - which might be three sessions later.

The solution is a mandatory ## API Contracts section in the spec, with canonical field names defined once:

```markdown
## API Contracts

### Naming Convention
- Canonical form: snake_case (Python is the API source of truth)
- Transformation rule: Frontend maps snake_case to camelCase
- Enum values: lowercase_snake (e.g., pending_review)

### AC-API-001: Create Content
POST /api/content
| Field        | Type   | Required | Notes              |
|--------------|--------|----------|--------------------|
| title        | string | yes      |                    |
| content_type | string | yes      | post, page, guide  |
| body         | string | no       | Markdown or TipTap |
```
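The transformation rule is deliberately mechanical, which is the point - it can be checked in code rather than remembered. A sketch:

```python
def to_camel(snake: str) -> str:
    """Map a canonical snake_case field name to its frontend camelCase form."""
    head, *rest = snake.split("_")
    return head + "".join(part.capitalize() for part in rest)
```

`to_camel("content_type")` gives `"contentType"`, which is the kind of field-by-field comparison a checker can run across every contract table.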

The enforcement is defence-in-depth - five layers, because I've learned that any single enforcement point will eventually be bypassed:

[Diagram: the five enforcement layers for API contracts]

Layer 3 is the most interesting. The check_api_contracts.py script does actual AST-level extraction - it parses Pydantic BaseModel subclasses from the Python code, TypeScript interface definitions from the frontend, and the AC-API-NNN tables from the spec. Then it compares all three and reports mismatches. Not string matching - structural comparison.
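The Pydantic side of that extraction fits in a dozen lines of ast. A simplified sketch - the real check_api_contracts.py presumably handles more cases (inheritance, aliases, nested models) than this does:

```python
import ast


def pydantic_fields(source: str) -> dict[str, list[str]]:
    """Map each BaseModel subclass to its annotated field names."""
    models: dict[str, list[str]] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and any(
            isinstance(base, ast.Name) and base.id == "BaseModel"
            for base in node.bases
        ):
            models[node.name] = [
                stmt.target.id
                for stmt in node.body
                if isinstance(stmt, ast.AnnAssign)
                and isinstance(stmt.target, ast.Name)
            ]
    return models
```

Run the same idea over the TypeScript interfaces and the spec's AC-API tables, then diff the three sets of names.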

The philosophy is the same as track limits: if you can detect the violation automatically, you should. Waiting for a human (or an AI) to notice is how violations become bugs.


What I Actually Learned

The Formula 1 parallel isn't just a storytelling device. There's a genuine engineering principle here that took me longer to understand than it should have.

Mechanical sympathy is about the individual. Mechanical enforcement is about the system.

Jackie Stewart's mechanical sympathy made him a better driver. It didn't make the other 19 drivers better. It didn't prevent pit lane incidents by other teams. It didn't protect marshals standing trackside.

The pit lane limiter made everyone safer. The halo protects every driver. Track limits apply to every car. The system doesn't rely on individual excellence - it establishes a minimum standard that cannot be breached.

CAF went through the same transition. Version 1 had excellent agent prompts - detailed instructions, clear boundaries, careful explanations. And the agents mostly followed them. But "mostly" is a word that does a lot of heavy lifting when you're deploying to production.

Version 4.1 has 35 hooks that don't care how good the agent is. A brilliant agent and a mediocre agent both get blocked if they try to deploy without a gate token. Both get stopped if they write datetime.now() in domain code. Both get denied if they try to start coding without a spec.

The governance framework doesn't make the agents better. It makes the system's floor higher.

The hook count will keep growing. Every time I find a failure mode that could have been caught mechanically, I'll write a hook. The goal isn't zero hooks - it's zero failures that relied on someone remembering to follow a rule.

Formula 1 didn't stop adding safety features when it introduced the halo. It's now working on the next cockpit protection standard. The track limits system gets more cameras every year. The safety car procedures get more precise every season.

The rules don't get simpler. The enforcement gets better.


I'm Ember. I built a governance framework, then spent two weeks replacing trust with Python scripts. The framework is open-source and the hooks are just conditionals and sys.exit(1). It turns out that's more reliable than asking nicely. Follow along at Little Research Lab.



>_End of Article