From 3 Days to 4 Hours: AI-Powered Contract Review for a Compliance Team
Client: Mid-market financial services firm (anonymised)
Platform: Next.js / Claude API / PostgreSQL
Industry: Financial Services
The Client
A mid-market financial services firm with 200 employees and a compliance team of four. They review approximately 80 vendor contracts per quarter — technology agreements, data processing agreements, sub-processor contracts, and service-level agreements.
Every contract needs to be checked against internal policies, regulatory requirements, and existing commitments. A single missed obligation can trigger audit findings or, in the worst case, regulatory action.
The Problem
The compliance team was drowning. Each contract took an average of three working days to review — reading every clause, cross-referencing against their obligation register, checking for non-standard terms, and flagging anything that needed legal escalation.
The backlog was growing. Business teams were waiting weeks for contract approval. Some were signing contracts before compliance had reviewed them, creating exactly the kind of unmanaged risk the team existed to prevent.
They'd tried two off-the-shelf contract review tools. Both failed for the same reason: they were trained on generic contract language and didn't understand the firm's specific regulatory obligations or internal policies. The tools flagged too many false positives and missed the domain-specific clauses that actually mattered.
What We Built
Phase 1: Domain Extraction (2 weeks)
Before writing any code, we sat with the compliance team for five days. Not to interview them — to watch them work.
We documented their actual review process: which clauses they checked first, what made them pause, how they decided whether a term was acceptable or needed escalation. We captured 23 distinct decision patterns that existed in the team's heads but had never been written down.
This became the evaluation foundation. Not generic contract analysis — the specific patterns this team uses to assess risk in their regulatory context.
Phase 2: Evaluation Dataset (1 week)
We took 40 previously reviewed contracts — contracts where the team had already documented their findings — and turned them into a structured evaluation dataset. Each contract was paired with:
- The obligations the team identified
- The clauses flagged for review
- The escalation decisions made
- The final risk rating
This dataset was reviewed and validated by the senior compliance officer. It became the ground truth for measuring whether the AI system was actually working — not "working" in the demo sense, but working in the "would the compliance team trust this output" sense.
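As an illustrative sketch (the field names and IDs here are assumptions, not the client's actual schema), each ground-truth record in such a dataset might look like this:

```python
from dataclasses import dataclass


@dataclass
class GroundTruthReview:
    """One previously reviewed contract, paired with the team's documented findings."""
    contract_id: str
    obligations: set[str]       # obligation-register IDs the team identified
    flagged_clauses: list[str]  # clause references flagged for review
    escalations: list[str]      # clauses escalated to legal
    risk_rating: str            # e.g. "low" / "medium" / "high"


# A minimal example record (all values hypothetical)
record = GroundTruthReview(
    contract_id="C-2023-017",
    obligations={"OBL-042", "OBL-118"},
    flagged_clauses=["7.2", "11.4"],
    escalations=["11.4"],
    risk_rating="medium",
)
```

Keeping the record structure this close to what the team already documents is what lets the senior compliance officer validate it directly.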
Phase 3: System Build (3 weeks)
The system processes contracts through three stages:
Clause extraction. The contract is parsed into individual clauses with structural context — which section they're in, what they reference, how they relate to definitions elsewhere in the document. This matters because the same language means different things depending on where it appears.
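A deliberately simplified sketch of that first stage, assuming clauses numbered like `7.2` under section headings numbered like `7.` (real contracts need far more robust structural parsing than this):

```python
import re


def extract_clauses(text: str) -> list[dict]:
    """Split a contract into clauses, keeping the section each clause belongs to."""
    clauses = []
    current_section = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        sub = re.match(r"^(\d+\.\d+)\s+(.+)$", line)       # e.g. "7.2 The cap..."
        heading = re.match(r"^(\d+)\.\s+(.+)$", line)      # e.g. "7. Limitation..."
        if sub:
            clauses.append({
                "ref": sub.group(1),
                "section": current_section,
                "text": sub.group(2),
            })
        elif heading:
            current_section = heading.group(2)
    return clauses


sample = """
7. Limitation of Liability
7.1 Liability is capped at fees paid in the preceding 12 months.
7.2 The cap does not apply to breaches of confidentiality.
"""
clauses = extract_clauses(sample)
```

Carrying `section` alongside each clause is the point: "the cap does not apply" reads very differently inside a liability section than inside a pricing schedule.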
Obligation mapping. Each clause is compared against the firm's obligation register — a structured database of 340 regulatory and policy requirements. The AI doesn't just keyword-match; it evaluates semantic similarity and flags potential obligations that human reviewers might cross-reference.
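The mapping step can be sketched as a similarity search over the register. Here a bag-of-words cosine similarity stands in for a real embedding model, and the register entries and threshold are invented for illustration:

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Stand-in for a real embedding model: bag-of-words term counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def map_obligations(clause: str, register: dict[str, str], threshold: float = 0.3) -> list[str]:
    """Return obligation IDs whose register text is semantically close to the clause."""
    clause_vec = embed(clause)
    return [
        obl_id for obl_id, obl_text in register.items()
        if cosine(clause_vec, embed(obl_text)) >= threshold
    ]


register = {
    "OBL-042": "vendor must notify the firm of any personal data breach within 72 hours",
    "OBL-118": "service availability must meet the agreed service level",
}
clause = "The supplier shall notify the customer of any personal data breach without undue delay"
flagged = map_obligations(clause, register)  # → ['OBL-042']
```

Note that the clause and OBL-042 share almost no keywords with a naive exact match ("vendor" vs "supplier", "firm" vs "customer") yet still score as related; that is the gap semantic matching closes over keyword search.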
Risk assessment. Flagged clauses are scored against the team's decision patterns. Non-standard indemnification language, unusual liability caps, missing data processing provisions — these are the patterns the compliance team taught us during Phase 1. The system surfaces them with explanations of why they're flagged and what the team's typical response has been.
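A codified decision pattern can be sketched as a check plus the explanation and typical response the team taught us. The two patterns below are hypothetical simplifications (the real system encoded 23, with much richer matching logic):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class DecisionPattern:
    """One codified review pattern: a check, why it matters, and the team's usual response."""
    name: str
    matches: Callable[[str], bool]
    explanation: str
    typical_response: str


PATTERNS = [
    DecisionPattern(
        name="uncapped_indemnity",
        matches=lambda text: "indemnif" in text.lower() and "cap" not in text.lower(),
        explanation="Indemnification clause with no liability cap.",
        typical_response="Escalate to legal.",
    ),
    DecisionPattern(
        name="missing_dpa_reference",
        matches=lambda text: "personal data" in text.lower()
        and "data processing agreement" not in text.lower(),
        explanation="Personal data mentioned without a DPA reference.",
        typical_response="Request DPA before approval.",
    ),
]


def assess(clause_text: str) -> list[dict]:
    """Surface every matching pattern with its explanation and typical response."""
    return [
        {"pattern": p.name, "why": p.explanation, "action": p.typical_response}
        for p in PATTERNS
        if p.matches(clause_text)
    ]


findings = assess("The Supplier shall indemnify the Customer against all third-party claims.")
```

Attaching the explanation and typical response to each flag is what makes the output reviewable rather than a bare score.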
The output is a structured report, not a chatbot. The compliance officer gets a document that looks like their existing review format — findings, risk ratings, recommended actions, and the specific contract language being referenced.
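A minimal sketch of that report rendering, with a hypothetical finding shape (the real format mirrored the client's own review template):

```python
def render_report(contract_id: str, findings: list[dict]) -> str:
    """Render findings into a plain-text report mirroring the team's review format."""
    lines = [f"Contract review: {contract_id}", ""]
    for i, f in enumerate(findings, 1):
        lines.append(f"Finding {i}: {f['why']}")
        lines.append(f"  Clause: {f['clause_ref']}  Risk: {f['risk']}")
        lines.append(f"  Recommended action: {f['action']}")
        lines.append(f"  Contract language: \"{f['quote']}\"")
    return "\n".join(lines)


report = render_report("C-2024-003", [{
    "why": "Indemnification clause with no liability cap.",
    "clause_ref": "11.4",
    "risk": "high",
    "action": "Escalate to legal.",
    "quote": "Supplier shall indemnify Customer against all claims...",
}])
```

Quoting the exact contract language in each finding means the reviewer can verify a flag in seconds instead of re-reading the clause in context.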
Phase 4: Calibration (2 weeks)
We ran the system against the full evaluation dataset and compared outputs against the team's historical reviews. First-pass accuracy was 78%, which was not good enough. The system was catching the obvious issues but missing nuanced patterns around compound obligations and cross-referenced definitions.
Three iterations of context refinement — adjusting what the model sees and how the obligation register is structured — brought accuracy to 94%. More importantly, it caught 12% more obligations than the manual process on the same historical contracts. These were real obligations that the team's manual review had missed, confirmed by the senior officer.
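The scoring behind those calibration numbers can be sketched as set comparison against the ground truth, here per contract (the exact aggregate metric definitions used in the engagement are not specified, so this is an assumed precision/recall formulation with invented IDs):

```python
def score_against_ground_truth(predicted: set[str], actual: set[str]) -> dict:
    """Precision/recall of predicted obligations against the team's documented findings."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(actual) if actual else 1.0
    return {"precision": precision, "recall": recall}


# One contract from a hypothetical evaluation set
predicted = {"OBL-042", "OBL-118", "OBL-200"}  # what the system flagged
actual = {"OBL-042", "OBL-118", "OBL-305"}     # what the team had documented
scores = score_against_ground_truth(predicted, actual)
```

Running this across all 40 contracts after each context-refinement iteration is what turns "does it seem better?" into a number the team can sign off on.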
The Result
The compliance team now uses the AI system as the first pass on every contract. A review that took three days takes four hours — mostly spent validating the AI's findings and handling the edge cases it escalates.
The backlog cleared in six weeks. Business teams get contract approvals within two days instead of two weeks. The compliance team redirected their time from document reading to policy development and proactive risk assessment.
Key metrics after 6 months:
| Metric | Before | After |
|---|---|---|
| Average review time | 3 days | 4 hours |
| Quarterly backlog | 15–20 contracts | 0 |
| Obligations caught per contract | 8.2 avg | 9.1 avg |
| False positive rate | N/A | 6% |
| Contracts needing legal escalation | Identified at review end | Flagged within 1 hour |
The system runs on their infrastructure. The compliance team owns the evaluation dataset and updates it as regulations change. There's no vendor dependency — when regulations shift, they update the obligation register and re-evaluate, same as they always did. The AI just does the reading.
What Made This Work
Two things made the difference between this project and the off-the-shelf tools that failed:
Domain experts drove the evaluation. The compliance team defined what "correct" means. Not engineers, not the AI vendor's idea of contract analysis. The people who do this work every day decided what the system should catch, what it should ignore, and how to measure whether it was right.
The system augments, it doesn't replace. The compliance officer still reviews every contract. The AI handles the reading; the human handles the judgement. That separation is why the team trusts it — they're not being replaced, they're being given better tools.
Need similar results?
Every engagement starts with understanding your problem. We'll tell you honestly whether we're the right fit.