May 04, 2026

Why Legal AI Needs a New Standard: Inside Thomson Reuters CoCoBench

By Tyler Alexander, Director of CoCounsel AI Reliability, Thomson Reuters

A lawyer submits a filing supported by a citation that doesn’t exist. The system produced a polished answer. It just wasn’t grounded in reality.

This is the gap facing legal AI today. Not whether systems can generate sophisticated answers, but whether those answers are actually good enough for real legal work.

In practice, there is a consistent and measurable gap between how systems perform on traditional benchmarks and how they perform on real legal work.

Most evaluations still rely on benchmarks that were never designed for how legal work actually happens. Bar exam questions, clause extraction, single-turn prompts. These tests evaluate discrete components of the work. But they fail to capture how a system performs across the iterative spectrum of tasks that make up real legal work.

As a result, systems are often optimized to perform well on benchmarks that do not reflect how legal work is actually done.

And critically, these systems fail in ways those benchmarks are not designed to catch. As agentic systems proliferate, those small errors cascade into more frequent and even harder-to-identify failures.

Starting with the work

When we set out to build the next generation of CoCounsel Legal, we didn’t start with models or features. We started with the work itself: what does legal work actually look like in practice?

“This isn’t build first, ask later. It’s ask first, build second,” our teams often reiterate.

CoCounsel Legal has been in the market since August, already supporting legal professionals in research, drafting, and review. But as we looked ahead to the next generation, now in beta, a clear shift emerged. The focus is moving from point-in-time assistance to systems capable of handling longer unaided task horizons and more end-to-end workflows. That shift required us to rethink not only how we build CoCounsel, but how we evaluate it.

From Single Tasks to Work, Completed

Through research with hundreds of legal professionals and over 100 Practical Law attorney editors, a consistent pattern emerged. The challenge was not any single task being too difficult. It was the number of steps required and the effort of keeping them coherent.

Legal work doesn’t happen in isolated prompts. It moves across research, drafting, review, and revision. Context builds, decisions compound, and small errors early can affect everything that follows. That is not what traditional benchmarks are designed to measure.

A different kind of system

The next generation of CoCounsel Legal reflects that shift. A single instruction can now trigger a complete workflow.

Ask it to draft a motion to dismiss. It plans the work, reviews the relevant documents, conducts legal research, pulls secondary sources, and produces a draft grounded in authority, validating citations for its conclusions throughout the work and returning a final output grounded in those facts.

That’s not a task. It’s a complete workflow. And it’s exactly where traditional benchmarks break down.

And it raises a different question. How do you comprehensively evaluate something like that?

Building CoCoBench

We needed a way to measure performance at the level of real legal work. That’s why we built CoCoBench, an evaluation framework designed for exactly that purpose, and one we are now making more visible externally.

CoCoBench measures whether an AI system can complete real legal tasks to a fiduciary-grade standard. It is built around hundreds of attorney-authored benchmark tasks, with a fixed core dataset used to track performance over time. More than 100 legal subject matter experts have contributed to the legal dataset, alongside research and engineering teams at Thomson Reuters Labs who developed the evaluation infrastructure, representing over 15,000 hours of practitioner and engineering work.

Each test reflects real practice: a query written the way a practitioner would ask it, supporting materials drawn from representative contracts, pleadings, or correspondence, and a gold-standard response drafted and reviewed by attorneys. This approach is grounded in what we internally refer to as ideal-response evaluation, defining what correct, complete legal work actually looks like and measuring system output against that standard.
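One way to picture a test of this kind is as a small record with three parts: the query, the supporting materials, and the gold-standard response. The sketch below is illustrative only. The class name, the field names, and the idea of breaking the ideal response into reviewer-written required points are assumptions made for this example, not the actual CoCoBench schema or scoring method.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class BenchmarkTask:
    """Hypothetical shape of one attorney-authored task (not the real schema)."""
    query: str                       # written the way a practitioner would ask it
    supporting_materials: List[str]  # e.g. contracts, pleadings, correspondence
    gold_response: str               # drafted and reviewed by attorneys
    # Assumption: reviewers list the points a complete answer must cover.
    required_points: List[str] = field(default_factory=list)


def coverage(task: BenchmarkTask, reviewer_marks: Dict[str, bool]) -> float:
    """Fraction of required points a reviewer marked as addressed by the output."""
    if not task.required_points:
        raise ValueError("task has no required points to score against")
    met = sum(1 for p in task.required_points if reviewer_marks.get(p, False))
    return met / len(task.required_points)


# Made-up example task and reviewer marks, for illustration only.
task = BenchmarkTask(
    query="Can the landlord terminate for late rent under this lease?",
    supporting_materials=["lease.pdf"],
    gold_response="(attorney-drafted answer)",
    required_points=["identifies notice clause", "applies cure period"],
)
score = coverage(
    task,
    {"identifies notice clause": True, "applies cure period": False},
)
print(score)
```

In this sketch the judgment stays with the reviewer: the code only tallies marks an attorney has already made against the ideal response.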

The goal is not to measure whether a system can produce a response. It is to measure whether that response (and its sequence of work to reach that response) constitutes complete, accurate legal work.

Evaluating how the work gets done

Legal workflows are multi-step, which means evaluation cannot stop at the final output. A system can produce a coherent answer even while relying on flawed reasoning, and traditional benchmarks often fail to detect this as a failure mode.

In agentic systems, an error in one step carries forward. A result may appear coherent while being built on an error upstream. CoCoBench addresses this by evaluating the final deliverable alongside the citation record the system produced along the way: specifically, what it cited, where it sourced it, and whether the source actually supports the claim.
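The three citation checks named above (what was cited, where it was sourced, and whether the source supports the claim) can be pictured as a simple audit over a list of records. This is a minimal sketch under stated assumptions: `CitationRecord`, `audit`, and `workflow_passes` are hypothetical names, and the `supports_claim` flag stands in for a human reviewer's judgment rather than anything computed automatically. It is not a description of how CoCoBench is implemented.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CitationRecord:
    """Hypothetical record of one citation made during a workflow."""
    claim: str                # statement the system made
    cited_authority: str      # what it cited
    source_id: Optional[str]  # where it sourced it; None if it cannot be located
    supports_claim: bool      # reviewer judgment: does the source support the claim?


def audit(records: List[CitationRecord]) -> List[Tuple[str, str]]:
    """Return (authority, problem) pairs for every citation that fails a check."""
    problems = []
    for r in records:
        if r.source_id is None:
            problems.append((r.cited_authority, "source not found"))
        elif not r.supports_claim:
            problems.append((r.cited_authority, "source does not support claim"))
    return problems


def workflow_passes(deliverable_ok: bool, records: List[CitationRecord]) -> bool:
    """A polished deliverable still fails if any upstream citation fails."""
    return deliverable_ok and not audit(records)


# Made-up records for illustration only.
records = [
    CitationRecord("Claim A", "Case X", "doc-1", True),
    CitationRecord("Claim B", "Case Y", None, False),
]
problems = audit(records)
result = workflow_passes(True, records)
print(problems, result)
```

The design point is that the final deliverable and the citation trail are judged together, so an output that reads well cannot pass on the strength of its prose alone.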

These evaluations span core categories of legal work, including research, drafting, review, and multi-step reasoning across workflows.

A higher standard

Every output is evaluated against what a practicing attorney would consider acceptable. That includes correct application of the law, completeness of analysis, accurate use of sources, and work product that meets fiduciary-grade standards and is usable in practice.

No capability is considered ready until it demonstrates improvement against that standard. Progress is measured through real-world performance, evaluated by the attorneys best positioned to judge it.

What we’re seeing so far

In practice, we are seeing a consistent gap between how systems perform on traditional benchmarks and how they perform on real legal tasks. Systems optimized for general-purpose benchmarks often struggle when evaluated against real workflows, revealing gaps in completeness, source fidelity, and multi-step reasoning that are not visible in standard benchmark results.

When evaluation shifts from task-level performance to the workflow level, the bar changes. What counts as good changes, and which systems actually meet that bar changes as well.

More detailed findings will be shared as CoCoBench continues to evolve. The direction is clear. Evaluating AI at the workflow level changes not only how performance is measured, but what needs to be built.

In the next post in this series, we’ll share what happens when you apply this standard in practice, and how different approaches to legal AI perform when evaluated against real legal work.

Building what comes next

The next generation of CoCounsel Legal, currently in beta, is being built on this foundation. The focus is not on isolated capabilities. It is helping attorneys complete their work reliably, efficiently, and to a fiduciary-grade standard.

As AI systems take on more of that work, how they are evaluated becomes as important as what they can do, because without the right standard, progress can be overstated.

Because in legal work, almost right is not good enough.
