Vendor Test Pack

Free Resource

Catch model drift before your clients do.

Anthropic just confirmed six weeks of silent Claude Cowork degradation, caught by customers, not by internal monitoring. Three operational tests every CAS firm needs to know when the AI underneath its workflows quietly breaks.

An afternoon to set up
Five minutes a week to run
Three prompts + worked example

The Problem

If your AI vendor's model degraded on Tuesday, would your firm know by Friday?

Most CAS practices can't answer that. Most haven't asked. Anthropic published a post-mortem this week confirming three separate degradations of Claude Cowork between March 4 and April 16: six weeks of silent drift while users complained and internal monitoring stayed quiet. AMD's AI director called the product "dumber, lazier" in public before the vendor confirmed anything.

Every major AI vendor will have a week like this. Models degrade silently. Plans get rewritten. Features ship the same day quality breaks. Your firm has built workflows, deliverables, and client expectations on a category of tool that changes underneath you — and right now, you have no instrument that would tell you when it does.

You need three.

What's Inside

Three prompts, one template, one worked example. All plain text.

Paste directly into Claude, ChatGPT, Gemini, Copilot, or any frontier LLM. No tooling change required.

Prompt 1

The rubric generator

Paste your workflow in, get a workflow-specific scoring rubric out. Specific dimensions, a scoring guide, and a target threshold tuned to that workflow. Run it once per workflow; save the rubric.

Prompt 2

The cross-LLM scorer

Score new outputs against the rubric — using a different AI model than the one that produced them. Self-checking is self-defeating; the cross-LLM step is what makes the score trustworthy.

Template

Version-trail line

One line in your workpaper, full accountability: "AI assistance: Cowork on Opus 4.7, prompt v3, run April 25, 2026." Plus examples and field notes on where to put it.

Worked Example

Monthly close commentary for a SaaS client

A complete rubric, a sample scored output, two drift scenarios, and the version-trail entry — all applied to one repeatable CAS workflow. Adapt the dimensions to your firm.

The Three Tests

An afternoon to set up. Five minutes a week to run.

1. Drift detection

Pick one workflow. Have AI write a scoring rubric for what good output looks like. Once a week, run a stable input through the workflow and have a different LLM score the new output. When the score moves from a steady 8 to a steady 6, something changed in the model before the vendor announced it.
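The pack itself is prompts and plain text, so a chat window and a notes file are enough. For firms that prefer a running log, a minimal Python sketch along these lines (the file name, threshold, and window size are illustrative assumptions, not part of the pack) records each weekly score and flags when the recent average slips below the rubric's target:

import csv, statistics
from datetime import date
from pathlib import Path

LOG = Path("drift_log.csv")   # illustrative file name, not part of the pack
TARGET = 7.0                  # the target threshold from your rubric
WINDOW = 3                    # how many recent weekly runs to average

def record_score(workflow: str, model: str, score: float) -> None:
    """Append one weekly score and warn if the recent average slips below target."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["date", "workflow", "model", "score"])
        writer.writerow([date.today().isoformat(), workflow, model, score])

    # Re-read this workflow's scores and check the rolling average.
    with LOG.open() as f:
        scores = [float(row["score"]) for row in csv.DictReader(f)
                  if row["workflow"] == workflow]
    recent = scores[-WINDOW:]
    if len(recent) == WINDOW and statistics.mean(recent) < TARGET:
        print(f"DRIFT WARNING: {workflow} averaging "
              f"{statistics.mean(recent):.1f} over the last {WINDOW} runs "
              f"(target {TARGET}).")

# Example: the score itself comes from the cross-LLM scorer prompt, not from this script.
record_score("monthly close commentary", "Cowork on Opus 4.7", 6.0)

The tooling doesn't matter; what matters is that the weekly score lands somewhere a partner can see the trend.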

2. Portability evaluation

The rubric you built for the drift test does double duty. New models will land monthly now. Build the rubric once and you can evaluate any new model the day it ships. That's the difference between vendor optionality as a capability and vendor optionality as a wish.

3. Version trail

One line in the workpaper or workflow log: "AI assistance: Cowork on Opus 4.7, prompt v3, run April 25, 2026." When the model changes (and it will, every six weeks now), your sign-off doesn't. The version trail is the difference between defensible AI use and "we used AI somewhere."
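If the workflow log lives in a plain text file rather than inside the workpaper itself, a few lines like this sketch (the helper name and log path are illustrative assumptions, not part of the pack) can stamp the trail automatically after each run:

from datetime import date

def version_trail_line(tool: str, model: str, prompt_version: str) -> str:
    """Build a version-trail entry in the pack's format: tool, model, prompt version, run date."""
    return (f"AI assistance: {tool} on {model}, "
            f"prompt {prompt_version}, run {date.today():%B %d, %Y}")

# Append the stamp to the workflow log (path is illustrative).
with open("workflow_log.txt", "a") as log:
    log.write(version_trail_line("Cowork", "Opus 4.7", "v3") + "\n")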

Download the Vendor Test Pack (PDF)

Stay Current

Get the test pack — plus weekly AI briefings for CAS.

Subscribe to The AI Accountant newsletter and get the Vendor Test Pack delivered to your inbox, along with weekly analysis of the AI developments that matter for your practice.

Your Move

Don't wait for the next bad week to find out.

The firms that build this in April will spend the next year refining it. The firms that don't will spend it hoping the vendor doesn't have another bad week. Pick one workflow, build one rubric, run one weekly score. Start the trail this week.