Evan Windels /rights
A formula

The formula for getting AI right.

R · I · G · H · T · S

Six factors to consider when you design an AI system that has to do real work. When an output isn't good, think RIGHTS — walk them and find the one that's missing.

USE

Six factors to design for. When the output isn't good, think RIGHTS.

Use it as a checklist when you are designing an AI system, agent, or product — make sure every one of the six factors is properly in place before you ship. Use it as a diagnostic when an existing system underperforms — walk the letters and find the one that's missing or weak.

There is no rescuing a system whose grounding is wrong by adding more tokens. There is no rescuing a system without a feedback loop by buying a smarter model. Most "AI doesn't work" stories are not stories of bad models — they are stories of one of these factors being absent. Get all six in place from the start, and the same model that produced thin work for the team next door will produce work indistinguishable from a senior professional's.

Q = ( error ID via reward functions · atomic JSON trials )               R  Refinement loop
  · ( grounding differential · fix · test )                              I  Iteration loop
  · ( detail · relevance · accuracy )                                    G  Grounding
  · ( human-readable · syncs with human tools )                          H  Human cooperation
  · ( lean output · full AI-sense coverage · intuitive design )          T  Tools
  · ( tokens · no usage limits · context mgmt · no time pressure )       S  Substrate

Before the breakdown

R and I are both loops. They are not the same loop.

The two are easy to conflate, and conflating them is why most teams build neither well. Read this before you read R and I.

R · between jobs

The system gets better.

R runs after a job. Look at where the agent failed last week, find the pattern, and edit the agent files, prompts, skills, and tools so that next week's batch goes better. The artefact you shipped is already gone; what you're improving is the machine that produced it.

Without R, every error is one you'll see again next week. The system plateaus at the level of the prompts you wrote on day one.

scope: agent files, prompts, skills, tools, knowledge libraries

I · within a job

The artefact gets better.

I runs during a job. The model produces a draft, a critic compares it to G (the grounding, the explicit picture of done) and reports the gap, and the model fixes the gap and re-submits. Repeat until the artefact converges on the spec. The system doesn't change; only the artefact does.

Without I, you ship first drafts. First drafts are rarely the answer; the answer is what comes back after the model has been told why the first draft fell short.

scope: a single deliverable, a single conversation, a single run

R · Refinement loop

The recursive feedback that tightens the system.

Factor one

You are not finished when the output is right. You are finished when the process that produced it is right.

R is the loop that runs between jobs. The job already shipped — what you're improving is the system that will produce next week's job, and the week after that.

Most teams skip R entirely. They fix bad output by hand, ship it, and move on. The same class of error reappears two days later in another job, and gets hand-fixed again. The system never compounds. The first job and the thousandth job are produced by the same prompt, and each takes the same amount of human babysitting. R is what makes the thousandth job ten times cheaper to ship than the first.

  • 01 Error identification — via reward function, scoring rubric, or grader. You cannot fix a class of errors you have not named, and you cannot prioritise the named errors without a number telling you which class is worst. The first thing I build for any new agent is the grader that scores its output against G — not the agent itself. The grader is the instrument; the agent is the experiment. Build the instrument first.
  • 02 Atomic JSON trials — zoom into the model's behaviour, then fix surgically. JSON is the microscope. Force the model to emit structured output at every step (reasoning chains, tool calls, intermediate findings, grader verdicts) and you can see what it actually did at the atom level, not just what it returned at the surface. The misstep that produced the wrong output is in there. Without JSON visibility you are guessing at which part of the prompt to change. With it, you edit the one tool description, the one piece of grounding, or the one rubric weight that produced the miss, and re-run. Each fix is a contained trial: one variable changed, the JSON re-read, the score re-measured. Most teams write prompts and read responses. The teams that read the JSON between them ship an order of magnitude faster.
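The two steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a definitive implementation: the rubric weights, the error-class names, and the shape of the JSON step trace are all hypothetical, chosen only to show a grader naming the worst error class and a trace exposing the step that went wrong.

```python
import json
from collections import Counter

# Hypothetical rubric: named error classes and how much each one costs.
RUBRIC_WEIGHTS = {"missing_units": 1.0, "wrong_section": 2.0, "bad_total": 3.0}


def grade(errors_per_job):
    """Return (score, worst_class) for a batch of jobs.

    errors_per_job is a list of lists of error-class names found in each job.
    The score is the weighted error count per job (lower is better).
    """
    counts = Counter(name for job in errors_per_job for name in job)
    weighted = {name: counts[name] * RUBRIC_WEIGHTS[name] for name in counts}
    score = sum(weighted.values()) / max(len(errors_per_job), 1)
    worst = max(weighted, key=weighted.get) if weighted else None
    return score, worst


def first_failed_step(trace_json):
    """Read a JSON step trace and return the first step whose verdict is not 'ok'."""
    for step in json.loads(trace_json):
        if step["verdict"] != "ok":
            return step
    return None


# One atomic trial: the same jobs graded before and after changing one variable.
before = [["missing_units", "bad_total"], ["bad_total"], []]
after = [["missing_units"], [], []]

score_before, worst_before = grade(before)
score_after, worst_after = grade(after)

# Hypothetical trace emitted by the agent, one JSON object per step.
trace = json.dumps([
    {"step": 1, "tool": "lookup_rate", "verdict": "ok"},
    {"step": 2, "tool": "sum_section", "verdict": "wrong_total"},
])
culprit = first_failed_step(trace)
```

Here the grader names `bad_total` as the worst class, the trace points at step 2, and re-grading after the single change shows whether the score actually moved.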

I · Iteration loop

The recursive feedback that tightens the output.

Factor two

First drafts are rarely the answer. The answer is what comes back after the model has been told why the first draft fell short.

I is the loop that runs inside a single job. The model produces a draft, a critic compares it to G and reports the gap, the model fixes the gap and re-submits. The artefact converges on the spec.

Without I, every output that ships is a first attempt. First attempts from a model are like first attempts from a person — recognisable, sometimes brilliant, often subtly wrong. The team that builds I is the team whose deliverables look the same on the third draft as a senior professional's first.

  • 01 Grounding differential — measure the same characteristics on the output and the benchmark. If G has been done properly, the benchmark data is not just "an example we like": its characteristics have been measured along named axes. For an architectural drawing: line weights in millimetres, annotation spacing, detail density, which materials correlate with which detail levels. For a cost plan: prelims as a percentage of net, OH&P cascade values, line-item counts per NRM2 section. The differential is what you get when the model measures those same axes on its own output and compares the numbers to the benchmark. "Line weight is 0.5 mm; benchmark range is 0.18–0.35 mm; output is about 43% heavier than the top of the range." That is a differential the model can act on, not a prose impression. The discipline is in the measurement: turn every quality dimension into a number, on both sides, before the loop runs.
  • 02 Fix — a targeted edit, not a rewrite. Propose specific changes aimed at the specific items in the differential. "Make it better" tells the model nothing and produces a different first draft. "Add the prelims section using the standard NRM2 structure, update the BCIS factor to the South-East value, and add quantity units to lines 14, 22, and 31" tells the model exactly what to do — and lets you measure whether it did.
  • 03 Test — re-measure, then commit or iterate. Run the fix and re-measure the differential. If it closed, commit and ship. If it widened, roll back and try a different fix. Loops without measurement are not loops — they're the model rewriting at random, and random rewrites regress as often as they progress. The measurement is what makes I a loop instead of a churn.
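The differential, fix, and test steps above might look like this in code. It is a sketch only: the benchmark ranges reuse the figures quoted in the text, while the axis names, the `iterate` function, and the `clamp_fix` stand-in for the model are hypothetical.

```python
# Hypothetical benchmark: allowed (min, max) range per measured axis.
BENCHMARK = {
    "line_weight_mm": (0.18, 0.35),
    "prelims_pct_of_net": (10.0, 18.0),
}


def differential(measured):
    """Return {axis: signed distance outside the benchmark range}; 0.0 means in range."""
    gaps = {}
    for axis, (low, high) in BENCHMARK.items():
        value = measured[axis]
        if value < low:
            gaps[axis] = value - low
        elif value > high:
            gaps[axis] = value - high
        else:
            gaps[axis] = 0.0
    return gaps


def total_gap(gaps):
    """Collapse the per-axis gaps into one number the loop can compare."""
    return sum(abs(gap) for gap in gaps.values())


def iterate(draft, propose_fix, max_rounds=5):
    """Fix, re-measure, then commit or roll back, until the gap closes."""
    best = draft
    for _ in range(max_rounds):
        gaps = differential(best)
        if total_gap(gaps) == 0:
            break
        candidate = propose_fix(best, gaps)
        # Commit only if the re-measured gap shrank; otherwise keep the old draft.
        if total_gap(differential(candidate)) < total_gap(gaps):
            best = candidate
    return best


def clamp_fix(draft, gaps):
    """Stand-in for the model: move each out-of-range axis back by its gap."""
    return {axis: draft[axis] - gaps[axis] for axis in draft}


first_draft = {"line_weight_mm": 0.5, "prelims_pct_of_net": 8.0}
final = iterate(first_draft, clamp_fix)
```

In a real system `propose_fix` would be a model call that receives the numeric gaps. The commit-or-roll-back comparison is what keeps a bad fix from replacing a better draft.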

G · Grounding

The picture of done the system anchors to.

Factor three

The model can only be as good as the picture of done you give it.

G is the spec — the explicit picture of what this deliverable looks like when it is right. Most projects skip this step. They start with a brief and hope the model figures the rest out from training data.

It usually does — fluently, confidently, and subtly wrong. And without G you will never know which part to fix, because you never wrote down what right looks like in the first place. Every other factor is shooting at a target nobody drew.

  • 01 Detail — a schema, not a vibe. "Professional cost plan" is not detail. Detail is: NRM2-structured, work sections 1 through 41, quantities in the unit each rule specifies, prelims at 10–18% by complexity, contingency at 7–12% of net, OH&P cascading from a named-range cell. The model can reproduce a schema. It cannot reproduce a vibe.
  • 02 Relevance — match this task, not an adjacent one. The grounding has to fit the actual deliverable. A great cost plan is not grounding for a programme. A great residential extension drawing is not grounding for a commercial fit-out. The discipline of relevance is admitting that what you have isn't the same as what you need, and going to find — or build — the right exemplar before you start pricing the work.
  • 03 Accuracy — the gold standard has to actually be gold. If the spec has errors, the model trains itself onto those errors and they become the new ceiling. The model cannot be more accurate than the reference it's measured against. Treat G like the cleanest, most-reviewed document in the system — because everything else gets graded against it, and a wrong G silently caps the quality of every output for as long as it goes uncorrected.
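One way to make "a schema, not a vibe" concrete is to write the spec as data and check a draft against it. The sketch below assumes a heavily simplified cost plan: the percentage ranges are the ones quoted above, but `CostPlanSpec`, the field names, and the sample plan are hypothetical and far thinner than a real specification would be.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostPlanSpec:
    """Hypothetical, simplified picture of done for a cost plan."""
    prelims_pct_range: tuple = (10.0, 18.0)     # prelims as % of net
    contingency_pct_range: tuple = (7.0, 12.0)  # contingency as % of net
    required_fields: tuple = ("section", "description", "quantity", "unit", "rate")


def pct_of_net(amount, net):
    return 100.0 * amount / net


def check_against_spec(plan, spec=CostPlanSpec()):
    """Return a list of named violations; an empty list means the plan matches."""
    problems = []
    for number, line in enumerate(plan["lines"], start=1):
        for name in spec.required_fields:
            if line.get(name) in (None, ""):
                problems.append(f"line {number}: missing {name}")
    low, high = spec.prelims_pct_range
    prelims = pct_of_net(plan["prelims"], plan["net"])
    if not low <= prelims <= high:
        problems.append(f"prelims {prelims:.1f}% outside {low}-{high}%")
    low, high = spec.contingency_pct_range
    contingency = pct_of_net(plan["contingency"], plan["net"])
    if not low <= contingency <= high:
        problems.append(f"contingency {contingency:.1f}% outside {low}-{high}%")
    return problems


# Invented sample data: one line has no unit and prelims are too low.
plan = {
    "net": 200000.0,
    "prelims": 12000.0,
    "contingency": 20000.0,
    "lines": [
        {"section": "Substructure", "description": "Strip foundations",
         "quantity": 42.0, "unit": "m3", "rate": 185.0},
        {"section": "Superstructure", "description": "Blockwork wall",
         "quantity": 120.0, "unit": "", "rate": 62.0},
    ],
}
problems = check_against_spec(plan)
```

Each violation is named and located, which is the form a critic or grader can act on.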

H · Human cooperation

The output has to land in the human's hands, not near them.

Factor four

Correct output in the wrong shape is invisible.

H is the principle that the deliverable has to land — it has to arrive in a form the recipient can use today, in their existing tools, with no translation step between AI output and human action.

An accurate cost plan returned as JSON is a correct cost plan that nobody opens. A perfect risk register returned as a markdown table is a correct register that the PM has to manually retype into Excel. Output that fails H technically works and is functionally useless — and the failure is doubly invisible, because the team that built it sees the AI succeed at the task while the team that's meant to use it sees nothing they can act on.

  • 01 Human-readable — match the shape of the decision. The output is editable, scannable, and immediately comprehensible. No JSON where a table belongs. No prose where a list belongs. No 200-line markdown when the recipient wanted a one-page PDF. The shape of the output should match the shape of the decision it informs — a five-line ranked list when the human is choosing between options; a full document when the human is approving a deliverable.
  • 02 Syncs with human tool stacks — drop into the existing workflow. Lives in the formats the user already lives in: Excel, PDF, Markdown, Jira, Figma, AutoCAD. The AI's deliverable should drop into the user's existing pipeline without anyone having to learn a new tool. If you require the user to install something to consume the output, you have lost — the cost of adoption is now higher than the cost of doing it themselves.
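As a small illustration of reshaping output for the recipient, the sketch below turns a JSON list of cost lines into CSV text that a spreadsheet can open directly. The column names and the sample line are assumptions made for the example. A production pipeline might write a native workbook instead, which needs a third-party library and is left out here.

```python
import csv
import io
import json


def cost_plan_json_to_csv(plan_json):
    """Convert a JSON list of line items into CSV text a spreadsheet can open."""
    lines = json.loads(plan_json)
    columns = ["item", "quantity", "unit", "rate", "total"]
    buffer = io.StringIO(newline="")
    # extrasaction="ignore" drops internal fields the human does not need to see.
    writer = csv.DictWriter(buffer, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for line in lines:
        row = dict(line)
        row["total"] = round(line["quantity"] * line["rate"], 2)
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical agent output, including an internal field the recipient never asked for.
agent_output = json.dumps([
    {"item": "Blockwork wall", "quantity": 120, "unit": "m2",
     "rate": 62.5, "trace_id": "abc"},
])
csv_text = cost_plan_json_to_csv(agent_output)
```

The structured JSON stays available for the refinement loop, while the human receives a file that opens in the tool they already use.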

T · Tools

The AI-native surface the system uses to do work.

Factor five

Models don't fail at thinking. They fail at what you didn't hand them.

T is the AI-native tool surface — the set of capabilities and primitives the model can actually call. A model with a great toolbelt operates like a senior professional. A model without one operates like a smart graduate with a phone and no email account.

This is the factor most teams under-invest in, and it is usually the cheapest to lift with the largest effect. A weaker model with a great toolbelt will out-perform a stronger model with a generic one — every single time.

  • 01 Lean output — every returned byte is a tax. Tools should return exactly what the task needs and not a byte more. Every field a tool returns is a field the model has to read, parse, and reason about — and a token you pay for. A tool that returns the cost of one item should return that cost, not a 600-line JSON with seventeen nested fields the model has to navigate to find the number it asked for. Lean tools make the model faster, cheaper, and more accurate at the same time.
  • 02 Full AI-sense coverage — match the modalities to the task. If the task involves drawings, the model needs vision. If it involves spreadsheets, it needs structured data tools. If it involves audio, it needs to hear. A text-only model on a vision task is doing the work with one sense missing, and you'll wonder why it's slow and wrong. The fix is not a smarter model — it's the missing sense.
  • 03 Composition over derivation — give it pieces, not problems. LLMs are excellent at composition: picking the right piece, snapping it onto the right next piece, sequencing already-solved parts into a working whole. They are mediocre at first-principles reasoning: deriving geometry, writing parsers, solving numerical methods from scratch. The right tool surface gives the model pre-solved primitives and lets it compose. Asking a model to code a wall from scratch (boolean joins, layer offsets, cut patterns, hosting rules) burns an enormous number of tokens on a problem that humans have already solved tens of millions of times. Hand it the parametric component library and let it pick CAV-215-FR60 instead. The model that picks the part is fast and accurate; the model that derives the part is slow and approximate. Same model. Different tool surface.
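The lean-output and composition points above can be sketched as two small tools over a component library. Everything here is illustrative: only the identifier CAV-215-FR60 appears in the text, and its properties, the other entries, and the function names are invented for the example.

```python
# Hypothetical component library. Only the id CAV-215-FR60 comes from the text;
# every property and every other entry is invented for illustration.
COMPONENT_LIBRARY = {
    "CAV-215-FR60": {"kind": "cavity_wall", "thickness_mm": 215,
                     "fire_rating_min": 60, "rate_per_m2": 95.0,
                     "internal_notes": "not needed by the model"},
    "CAV-300-FR120": {"kind": "cavity_wall", "thickness_mm": 300,
                      "fire_rating_min": 120, "rate_per_m2": 140.0,
                      "internal_notes": "not needed by the model"},
    "PAR-100-FR30": {"kind": "partition", "thickness_mm": 100,
                     "fire_rating_min": 30, "rate_per_m2": 48.0,
                     "internal_notes": "not needed by the model"},
}


def find_component(kind, min_fire_rating_min):
    """Composition tool: return the id of the thinnest matching part, or None."""
    matches = [
        (part["thickness_mm"], part_id)
        for part_id, part in COMPONENT_LIBRARY.items()
        if part["kind"] == kind and part["fire_rating_min"] >= min_fire_rating_min
    ]
    return min(matches)[1] if matches else None


def get_rate(part_id):
    """Lean tool: return the one number asked for, not the whole record."""
    return COMPONENT_LIBRARY[part_id]["rate_per_m2"]


picked = find_component("cavity_wall", 60)
rate = get_rate(picked)
```

The model's job shrinks to choosing arguments and chaining two calls. The full record, including fields it never needs, stays out of its context.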

S · Substrate

The conditions under which the model does the work.

Factor six

A loop that has to finish in thirty seconds will produce thirty-second work.

S is the substrate — the operating conditions the model is given. Tokens, context window, time to converge, no mid-task caps. Substrate is the easiest factor to ignore and the most expensive to underprovision: the same agent on a constrained substrate produces visibly worse work than on an open one, and the team that imposed the constraint usually can't see the connection.

  • 01 Token access — give it room to think. Sufficient compute budget per task. Telling an agent to deliver a tender pack on a 4,000-token budget is the equivalent of telling an architect to design the building in five minutes — the work that comes back is not a verdict on the model, it's a verdict on the budget. The instinct to economise on tokens is almost always paid for in human time fixing thin output.
  • 02 No usage limits — no cap mid-job. A cap that fires mid-task is one of the most expensive failures in agent systems. The agent's accumulated context, partial work, and reasoning chain are all lost; the recovery is rarely automatic; the user gets a half-finished deliverable. Provision generously and never let a cap make the decision for you.
  • 03 Context management — what's in it matters more than how big it is. Enough context window to hold the relevant material — and the discipline to know what relevant means. Stuffing everything in is as bad as starving the model: irrelevant context becomes noise the model has to filter past, and filtering past noise costs the same tokens as reasoning over signal. Treat context like RAM, not like a hoard. Curate what goes in.
  • 04 No time pressure — let the loop converge. The loop runs as long as it needs to. A loop with a hard 30-second cutoff produces 30-second work. The first instinct is always to speed the loop up; the right instinct is usually to slow it down and let it converge. Real work takes time — and the team waiting on the output usually values quality over latency, even if the spec says otherwise.
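Of the four conditions above, context management is the easiest to show in code. The sketch below curates context under a token budget, assuming relevance scores are produced upstream and using a crude word count in place of a real tokenizer. Both are simplifications, and the function names are hypothetical.

```python
def estimate_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def curate_context(chunks, budget_tokens):
    """Pick the most relevant chunks that fit the budget, highest relevance first.

    chunks is a list of (relevance, text) pairs; relevance is a hypothetical
    score between 0 and 1 produced upstream.
    """
    chosen, used = [], 0
    for relevance, text in sorted(chunks, key=lambda chunk: chunk[0], reverse=True):
        cost = estimate_tokens(text)
        # Skip irrelevant material even when there is room for it.
        if relevance > 0 and used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return chosen, used


chunks = [
    (0.9, "Spec: prelims 10 to 18 percent of net"),
    (0.7, "Benchmark line weights 0.18 to 0.35 mm"),
    (0.0, "Office party photos from last year"),
    (0.4, "Full email thread about an unrelated tender from another client"),
]
chosen, used = curate_context(chunks, budget_tokens=16)
```

The zero-relevance chunk is dropped even though it would fit, and the low-relevance thread is dropped because it would crowd out the budget. That is the "RAM, not a hoard" discipline in miniature.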

Using RIGHTS on a system that already exists

Walk the letters. Find the factor that's missing or weak. Fix it.

The build reading is in the six sections above — design every factor in from the start. This section is the diagnostic reading, for when you have inherited a system or shipped one that underperforms. The instinct is to add more prompt text or upgrade the model; both are usually wrong. Almost always one of the six factors is missing or weak, and fixing that one factor will unlock more quality than anything you can do to the other five combined. Match the symptom you are seeing to the factor that is failing.

| Factor missing | What you'll see | What it means · what to do |
|---|---|---|
| R | The same class of error keeps showing up, job after job. The system never compounds. | No loop is closing on failures at the system level. Build a grader and run atomic trials against the named errors. |
| I | First drafts ship. Quality is whatever the model produced on the first attempt. | The artefact never gets a second look. Add a critic that reports the differential and a loop that fixes it. |
| G | Output is fluent, confident, and subtly wrong. Nobody can articulate exactly what's off. | The model has nothing real to anchor on. Write the spec (the schema, the exemplar, the gold standard) before the next iteration. |
| H | Output is correct but adoption is zero. The team keeps doing the work themselves. | Wrong shape, wrong format, wrong tool. Re-shape the deliverable into the format the user already opens daily. |
| T | Model burns tokens on basic operations and still gets them wrong. Generic tasks feel hard. | Deriving what it should be composing. Build the primitive library and let the model pick parts instead of inventing them. |
| S | Quality collapses on long jobs. Truncation, rate-limit hits, lost context, half-finished deliverables. | Wrong substrate. Move to a harness with token headroom, no mid-task caps, and the time to converge. |

RIGHTS is the guideline I follow when I build, and the checklist I run when I review.

It is the design checklist I open every time I scope a new agent or tool surface, the diagnostic I walk every time an existing system underperforms, and the conversation I lead with every team I advise. If you want to walk it through your stack with me, write me an email.
