AI strategy for small accounting firms: build the loop

AI strategy for small accounting firms got harder to fudge this week. In Monday’s weekly roundup I flagged the OpenAI/Thrive tax-agent announcement as the strongest standalone story of the week and promised to come back Friday with the firm-side strategy. This is that piece.

On May 27, OpenAI and Thrive Holdings announced a tax-prep agent built for the Crete Professionals Alliance — a network of more than 30 CPA firms Thrive rolled up. One senior accountant in the pilot went from 180 hours of tax prep last year to 15 this year. She used the freed time to call every client and walk them through their return, and to take on new clients. That’s the headline number, and it’s the one your owner is going to ask you about.

Before you sketch the firm-side version, here is the context. The agent’s 97% draft accuracy didn’t come from a ChatGPT login. For six months, OpenAI’s forward-deployed engineers and researchers worked alongside Thrive Holdings’ engineers inside Crete to build the agent on Codex, with practitioner corrections feeding back into targeted evaluations and code modifications written by the model itself. That’s not a service OpenAI sells; it’s what OpenAI does for portfolio companies it has equity in, like Thrive Holdings. The senior accountant didn’t build any of that. She used something resourced at Tomoro scale.

You cannot replicate the agent at any price, so stop trying

Three pieces of what Thrive built sit outside any 10-person firm’s reach, and pretending otherwise is the fastest way to waste a quarter. The 7,000-return training corpus assembled across 30 firms is the first — your book doesn’t have that volume in that vertical. The six months of embedded OpenAI engineers and researchers writing your evaluation suites and code changes is the second — the OpenAI Deployment Company model, same one running at Tomoro, deployed where OpenAI has equity, not where it has customers. The Codex-written self-improving update mechanism is the third — your team isn’t writing production code modifications to a custom agent stack.

If your AI plan is “the cheap version of what Thrive built,” it’s going to miss. Replicate the loop, not the agent.

The loop is the lesson, and it sits one layer down the stack

The mechanism Crete’s flywheel runs on is structural, not technological. Corrections become captured production traces, traces become evaluations, evaluations become an updated agent, and the cycle repeats. At Thrive’s scale, the evaluations and code modifications get written by Codex against a custom agent stack. At your scale, the same loop runs one layer down — corrections become captured entries in a structured log, the log becomes encoded firm methodology in Claude skills, prompts, and SOPs, and the encoded methodology runs against commodity AI that gets meaningfully better every month on the work you care about.

This is the encoding gap operating at firm scale. Same loop, different layer of the stack, dramatically smaller resource requirement. Discipline, not capital — and that’s a fight your firm can actually pick.

The Champion is the firm’s forward-deployed engineer

OpenAI didn’t ship a product and walk away. They put engineers in those buildings for six months because someone has to operate the loop — corrections don’t capture themselves, methodology doesn’t encode itself, and skills don’t sharpen without a human pushing them.

The firm-scale version of that role isn’t a vendor relationship or an outsourced contractor. The Champion is the firm’s forward-deployed engineer — same work Tomoro’s engineers do for Fidelity International and Virgin Atlantic, against the firm’s own stack instead of a client’s. That’s you.

The PCAOB and PwC have already modeled the move: the regulator is recruiting technologists for its Inspections Modernization Council, and the Big Four firm is embedding 30,000 Claude-certified professionals across delivery. The smallest version is one Champion inside your 10-person firm. Same shape, smaller scale, same job.

Narrow your vertical or the loop produces nothing

Crete’s 7,000 returns weren’t 7,000 different things. They were 1040s and 1041s, repeated. Volume times similarity equals encodable signal. Without the narrowing, you get thin, noisy data on every workflow and meaningful improvement on none.

The firm-side discipline is to pick one workflow — one client industry, one return type, one monthly close pattern — and go deep before opening a second loop. The instinct to encode the whole firm is the wrong instinct.

Monday morning, week 1

Open a shared sheet — five columns: workflow, AI draft, human correction, why AI was wrong, what to encode. Run your chosen workflow through Claude this week and log every correction. By week four, recurring patterns surface — three or four errors pointing at one or two missing pieces of firm methodology. Those become your first Claude skills. That’s the loop running.
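If your shared sheet exports to CSV, or you would rather keep the log in a script, the five columns and the week-four pattern check can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the `Correction` class, the `write_log` and `recurring_patterns` helpers, and every sample entry are hypothetical names and made-up data, and it assumes the "what to encode" column is filled in with consistent wording so that repeats can be counted.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass, asdict, fields


@dataclass
class Correction:
    """One row of the correction log (hypothetical structure, five columns)."""
    workflow: str
    ai_draft: str
    human_correction: str
    why_ai_was_wrong: str
    what_to_encode: str


def write_log(rows, handle):
    """Write the corrections as CSV, one column per field, header first."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(Correction)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))


def recurring_patterns(rows, min_count=2):
    """Count repeated 'what to encode' entries, ignoring case and stray spaces."""
    counts = Counter(row.what_to_encode.strip().lower() for row in rows)
    return [(label, n) for label, n in counts.most_common() if n >= min_count]


# Made-up sample entries, for illustration only.
SAMPLE_LOG = [
    Correction("1040 prep", "Deducted home office in full",
               "Held deduction pending client documents",
               "No firm rule about required documentation",
               "Home office: request documentation before deducting"),
    Correction("1040 prep", "Reused last year's home office figure",
               "Asked client to confirm current figures",
               "Assumed nothing changed year over year",
               "home office: request documentation before deducting "),
    Correction("1040 prep", "Estimated home office share",
               "Requested measurements from client",
               "Guessed instead of asking",
               "Home office: request documentation before deducting"),
    Correction("1040 prep", "Used prior-year mailing address",
               "Updated address from intake form",
               "Did not check the intake form",
               "Intake: reconcile client details against intake form"),
]

if __name__ == "__main__":
    buffer = io.StringIO()
    write_log(SAMPLE_LOG, buffer)
    print(buffer.getvalue())
    for label, n in recurring_patterns(SAMPLE_LOG):
        print(f"{n}x  {label}")
```

Whatever repeats in the last column is a candidate for your first skill. Entries that appear once stay in the log until they recur.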

One reinforcing habit from Wednesday’s piece on prompting drops straight into this column structure: ask AI for the spec before the deliverable. The corrections you catch at the spec stage are the cleanest encoding signal you’ll log all month, because they show you where firm methodology was supposed to constrain the work before generation, not after it. The QC Starter Kit is the matching tool one layer up: where the Encoding Loop Starter Kit captures the corrections that feed methodology, the QC Starter Kit is what your Champion uses to catch AI errors before they ship. They are Layer 4 and Layer 6 of the same operating system, running off the same correction log.

A quarter of this gets you something Thrive doesn’t have

After a quarter you have a documented correction log, two or three Claude skills against your chosen vertical, measurable improvement smaller than 86% but real and yours, and a Champion defending the role with evidence rather than enthusiasm. That is not a 97% tax agent; it is something better positioned. A firm running its own encoding loop builds a moat that thickens every month, and the firm owns it. A firm waiting for the vendor-built agent generates the training data the vendor will eventually sell back to it. The difference in 18 months isn’t a benchmark point on a model release page. It’s whether the work your team does still belongs to your firm.

The Encoding Loop Starter Kit walks the first 90 days — vertical narrowing, the correction-capture template, the Champion role spec, and the first-loop roadmap. Download it and start the loop this quarter.