Most engineering teams are still living inside a workflow designed around a constraint that no longer exists. Branch, pile up commits, open a big PR, wait for a human to context-switch into your world, merge three days later. We've all pretended the friction was somehow virtuous — like suffering through the review queue was evidence of rigor.
It isn't. It's evidence of a bottleneck.
Now imagine every push gets reviewed, fixed, and merged in minutes for pennies. The bottleneck doesn't get faster. It disappears entirely. And once it's gone, you have to ask a question that makes a lot of people uncomfortable: what was the pull request actually for?
The Inversion Nobody Wants to Talk About
Big PRs were never good. They were economical. You batched commits because getting a human's attention was expensive. You amortized that cost by cramming as much work as possible into one review. It was a coping mechanism, not a best practice.
In an agentic model, review costs pennies and takes minutes. Batching is pointless. Worse than pointless — it actively degrades the thing you just made cheap.
| Dimension | Traditional (Human Review) | Agentic (Agent Loop) |
|---|---|---|
| PR size pressure | Large — amortize reviewer time | Small — minimize blast radius |
| Review latency | Hours to days | Minutes |
| Context loss | High — reviewer needs ramp-up | Low — agent reads full diff fresh |
| Batch commits | Amortizes human attention | Defeats the point |
| Rollback granularity | Coarse — revert the whole feature | Surgical — revert one decision |
Every dimension flips. That's not an incremental improvement. That's a different discipline.
Why Small PRs Win — And It's Not Even Close
Four things happen when you stop batching and start shipping small, continuous PRs. They all compound.
1. Agent accuracy falls off a cliff with diff size. The best independent benchmark available — SWE-PRBench, 350 PRs across 65 repos, peer-reviewed, March 2026 — found that no AI model detects more than 31% of issues a human reviewer would catch. The mean across eight frontier models is 26%. More damaging: all eight models degrade monotonically as more context is added before the diff. Attention dilution is real and measurable. (Source: SWE-PRBench, arXiv, March 2026.)
Telemetry data pending: Round count vs. diff_lines distribution — first-pass APPROVE rate and mean rounds by diff size bucket. Data from telemetry_report.py after 30+ post-redesign PRs.
2. Signal-to-noise collapses. Fast. Review notes on a 3-file PR are sharp and actionable. On a 15-file PR, those same notes multiply into a wall of undifferentiated observations. The agent didn't get dumber — you gave it too much surface area and the feedback became noise.
3. Rollbacks become what they were always supposed to be. When every PR is one logical concern, git bisect actually works the way the documentation always promised. You revert a single decision with full confidence.
4. The economics work backwards from what you'd expect. Ten small PRs cost pocket change and give you ten independent checkpoints. One large PR costs the same pocket change and gives you a single pass over tangled logic.
Telemetry data pending: Cost distribution — median/mean/p90 by outcome (APPROVE / REQUEST_CHANGES / NEEDS_HUMAN). Source: total_cost_usd from quartet-lifecycle comments. Cache efficiency if hit rate >20%.
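Point 3 is worth making concrete. With single-concern merges, finding the PR that introduced a regression is a binary search, which is all git bisect does under the hood. A minimal sketch in Python; first_bad and is_bad are hypothetical names for illustration, not part of any tool described here:

```python
def first_bad(commits, is_bad):
    """Binary-search for the first commit that introduces a failure,
    the way `git bisect` does. Assumes one clean good-to-bad
    transition, which holds only when each merge (a squashed,
    single-concern PR) is independently buildable and testable."""
    lo, hi = 0, len(commits) - 1      # invariant: commits[hi] is bad
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid                  # first bad commit is at mid or earlier
        else:
            lo = mid + 1              # first bad commit is after mid
    return commits[lo]

# 32 merged single-concern PRs; a regression landed in "pr-21".
history = [f"pr-{n:02d}" for n in range(32)]
culprit = first_bad(history, lambda c: c >= "pr-21")
# culprit == "pr-21", isolated in 5 test runs instead of 32
```

Once the culprit is isolated, `git revert <culprit>` undoes exactly one decision, which is the surgical rollback the table promises. Tangle ten concerns into one PR and the same search can only tell you "somewhere in here."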
The Quartet — Why Four Voices, Not One
At 26–31% issue detection, no single model is good enough. The question is how to compose multiple models so they don't just repeat each other's mistakes.
The critical design choice: voices 2 and 3 review the diff independently, in parallel. Neither sees the other's findings. Then a reconciliation step merges both analyses — confirming where they agree, surfacing what only one caught, dismissing what neither can defend.
Telemetry data pending: Independent confirmation table — both_rate per round, copilot_only rate (% of PRs where Copilot caught something Sonnet missed), agreement_rate. Minimum 30 PRs with copilot_ran=1.
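The reconciliation step can be sketched as set algebra over the two independent finding lists. This is an illustrative sketch, not the pipeline's actual data model: Finding and reconcile are hypothetical names, and a real reconciler would match findings by location and meaning rather than exact equality, then run the dismissal pass the text describes:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    note: str

def reconcile(voice_a: set, voice_b: set) -> dict:
    """Merge two reviews produced independently and in parallel.
    Agreement is strong signal; single-source findings are kept
    but flagged for an extra defend-or-dismiss pass."""
    return {
        "confirmed": voice_a & voice_b,   # both voices caught it
        "a_only":    voice_a - voice_b,   # only voice 2 caught it
        "b_only":    voice_b - voice_a,   # only voice 3 caught it
    }

a = {Finding("auth.py", 42, "token never expires"),
     Finding("db.py", 7, "missing index")}
b = {Finding("auth.py", 42, "token never expires"),
     Finding("api.py", 19, "unvalidated input")}
merged = reconcile(a, b)
# merged["confirmed"] holds the one finding both voices agree on
```

The design choice that matters is the independence: because neither voice sees the other's output, agreement is evidence rather than anchoring.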
The System Reviewed Itself — And Found Real Bugs
The pipeline has reviewed its own workflow code and found real bugs in its own review logic. Not synthetic test cases. Not planted defects. Actual bugs, found by the system, in the code that runs the system. That's the recursive validation point: the fixer can't sneak a bad fix past the reviewer, because they're different invocations of the same pipeline.
PR citations pending: 3–4 specific production PRs. Each needs: PR number, finding severity (CRITICAL/MAJOR), one-sentence description of what was caught, outcome. Source: existing PR comments on rubyvrooom/dayz_pve.
The Counterpoint That Actually Matters
Each diff is clean. Each review is tight. Locally, everything is correct. But architectural decisions span dozens of PRs, and a system can pass every local check while drifting into global incoherence — the wrong abstraction replicated cleanly across thirty merges, each one individually perfect, the whole thing collectively a mess.
The architect's job doesn't get easier. It gets significantly harder. You need new artifacts — living architecture docs, constraint files the agents consume before every run, explicit rules about what patterns are allowed to propagate.
The practical answer: check the context into the repo. Architecture decisions, constraint files, review prompts, memory of past feedback — all versioned alongside the code. A CLAUDE.md at the root defines the rules. A .context/ directory carries plans, memories, and accumulated judgment. The institutional knowledge isn't in anyone's head. It's in the repo, where git blame works on it.
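Consuming that checked-in context before every run is mechanically simple. A hedged sketch, assuming the CLAUDE.md plus .context/ layout described above; assemble_context is a hypothetical helper, not an API of any tool named here:

```python
from pathlib import Path

def assemble_context(repo: Path) -> str:
    """Build the prompt preamble an agent reads before every run:
    root rules first, then everything under .context/, in sorted
    order so assembly is deterministic across runs."""
    parts = []
    rules = repo / "CLAUDE.md"            # repo-wide rules and constraints
    if rules.exists():
        parts.append(rules.read_text())
    ctx = repo / ".context"               # plans, memories, past feedback
    if ctx.is_dir():
        for f in sorted(ctx.rglob("*.md")):
            parts.append(f"## {f.relative_to(repo)}\n{f.read_text()}")
    return "\n\n".join(parts)
```

Because the context lives in the repo, a bad rule gets fixed the same way a bad function does: a small PR, reviewed by the same loop.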
So Which Rituals Are Theater?
Theater: The big PR
It was never a quality practice. It was cost amortization dressed up as discipline. When review costs pennies, the big PR is pure waste — worse signal, worse rollback, worse accuracy at every stage.
Theater: The review queue
Waiting days for a human to context-switch into your diff wasn't rigor. It was a scheduling problem we mythologized into a quality gate. The gate still exists — it just takes minutes now and never forgets to check.
Theater: The commit-message debate
Squash merge on a single-concern PR renders commit history moot. The PR description and the review trail are the record. The intermediate commits were always scratch paper.
Theater: Manual merge rituals
"LGTM" comments, approval checkboxes ticked by a human who skimmed the diff — these were social signals, not engineering controls. A structured verdict from an agent that read every line is a stronger gate than a thumbs-up from someone who read the title.
Not theater: Architectural review
No agent catches global drift. The human's job shifts from reading diffs to maintaining coherence — and that job gets harder, not easier, because the volume of locally-correct merges goes up.
Not theater: Defining what correct looks like
Issue descriptions, constraint files, architecture decision records, the .context/ directory. The developer's craft doesn't disappear. It moves upstream — from writing the code to specifying the world the code must satisfy.
The rituals that survive are the ones that were never about the bottleneck. They were about judgment. Everything else was theater — and now we can stop pretending otherwise.
Quartet — push code, walk away, come back to a merged PR or a clear escalation. quartet.tools