Sep 26, 2025

From Testcase to Trust: Benchmarking CoCounsel with Scorecard

This post was authored by Tyler Alexander, Director of AI Reliability, and Heather Nodler, Lead CoCounsel AI Reliability Manager.

Introduction

At 成人VR视频, we are redefining what it means to deliver professional-grade AI for the legal industry. More than 20,000 law firms, corporations, nonprofits, and government agencies worldwide rely on CoCounsel, our GenAI assistant, which transforms how legal professionals work by automating complex document review, contract analysis, drafting, and other time-intensive tasks with unprecedented speed and accuracy. That trust is earned through a comprehensive evaluation methodology that encompasses dataset rotation, automated testing, expert assessment, continuous monitoring, and strategic partnerships. This post focuses on one critical component of our broader testing framework: how our teams combine attorney expertise with large-scale automated testing through Scorecard, a proprietary evaluation platform originally developed by the engineers behind Waymo's self-driving car testing infrastructure. While Scorecard represents just one pillar of our multi-layered approach, it exemplifies our commitment to proactive system optimization and continuous improvement.

Testing and Benchmarking with Scorecard

Our teams of attorney subject matter experts (SMEs), machine learning experts, and engineers rely on a robust array of testing tools and methodologies, including human legal expertise, specialized testing software, expert prompt engineering, and continuous monitoring of test results. Rather than waiting for performance issues to emerge, we proactively identify and address potential challenges through systematic testing and optimization. When issues arise that may affect CoCounsel's performance, these teams are equipped to mobilize a collaborative, rapid response effort, locating and remedying performance issues before they affect our customers.

A key tool is Scorecard, a specialized application that quantitatively evaluates CoCounsel responses against ideal responses created by our attorney SMEs. Scorecard is the evaluation infrastructure for AI agents in legaltech, fintech, and compliance, and it enables us to supplement our manual testing with large-scale, automated testing against our internal benchmarks. Built by the team behind Waymo's self-driving evaluation infrastructure, Scorecard runs millions of agent simulations to help teams evaluate, optimize, and ship reliable AI agents faster.

Performance issues typically arise from two distinct factors:
(1) the quality of user inputs, such as user prompts, queries, and documents; and
(2) system limitations.

We address the first factor by providing customers with high-quality training, support, and tools, including CoCounsel-created prompts, guided expert workflows, and agentic systems. In contrast, addressing the second factor requires recalibrating the system itself.

Each CoCounsel skill is a precisely engineered legal tool, tailored on the backend to perform a specific legal task. Because we calibrate each skill to reliably extract information by leveraging the unique strengths of its underlying AI model, migrating a skill from one model to another often introduces performance issues that require recalibration. Such migrations may occur, for example, when a third party releases a new AI model with enhanced capabilities. To safeguard our customers, we conduct all migration and recalibration work within testing and staging environments before deploying any changes.

Case Study: AI Model Migration of Review Documents Skill

Large-Scale Testing Using Realistic Scenarios & Manual and Automated Review

Jessica, an attorney SME on the CoCounsel AI Reliability Team (also known as the Trust Team), oversees the evaluation of CoCounsel's Review Documents skill. In just minutes, the Review Documents skill can closely review and analyze large troves of legal information that would ordinarily require hours or even days of manual attorney review.

Jessica proactively prepares for the upcoming migration of the Review Documents skill to a new AI model, which promises significant improvements in CoCounsel's speed and accuracy. Working in a CoCounsel testing environment, Jessica manually reviews and evaluates the skill's responses on the new model using a carefully curated "testset" of sample "testcases" that reflect real-world legal practice scenarios. She checks CoCounsel's response to each testcase query against an ideal, "gold-standard" response that she has personally crafted, drawing on knowledge and expertise gained from years of experience as a practicing attorney.

Because each testset can contain several hundred testcases or more, reviewing each result would ordinarily be prohibitively time-consuming. However, Scorecard enables Jessica to supplement and scale the impact of her manual review by providing an extra layer of automated review.
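For illustration, a testset of this kind can be sketched as a simple data structure. All names below are hypothetical, not CoCounsel's internal schema:

```python
from dataclasses import dataclass

@dataclass
class Testcase:
    """One benchmark item: a realistic user query paired with an
    attorney-written ideal ("gold-standard") response."""
    query: str
    ideal_response: str

@dataclass
class Testset:
    """A curated collection of testcases for one CoCounsel skill."""
    skill: str
    testcases: list[Testcase]

    def __len__(self) -> int:
        return len(self.testcases)

# A miniature, illustrative testset for the Review Documents skill
testset = Testset(
    skill="Review Documents",
    testcases=[
        Testcase(
            query="What medications is the patient currently taking?",
            ideal_response="Aspirin 81MG EC TAB; Aspirin 325MG EC TAB",
        ),
    ],
)
print(len(testset))  # → 1
```

In practice such a testset would hold hundreds of entries, which is exactly what makes automated review valuable.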

Scorecard works by evaluating each response produced by CoCounsel and the AI model against the corresponding ideal response, then assigning the testcase a passing or failing numerical score using several criteria, such as the model's ability to recall information, its precision, and its accuracy.
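A minimal sketch of this style of automated scoring, assuming a simple fact-recall criterion. Scorecard's real evaluators are far richer; every name and threshold below is illustrative:

```python
def score_response(model_response: str, ideal_facts: list[str],
                   pass_threshold: float = 0.8) -> dict:
    """Toy evaluator in the spirit of Scorecard's automated review:
    check how many gold-standard facts the model response recalls,
    map the recall ratio onto a 1-5 score, and mark pass/fail."""
    found = [f for f in ideal_facts if f.lower() in model_response.lower()]
    recall = len(found) / len(ideal_facts) if ideal_facts else 1.0
    score = 1 + round(recall * 4)          # recall 0.0 -> 1, 1.0 -> 5
    return {"recall": recall, "score": score,
            "passed": recall >= pass_threshold}

ideal = ["Aspirin 81MG EC TAB", "Aspirin 325MG EC TAB"]
incomplete = "The patient takes Aspirin 81MG EC TAB daily."
result = score_response(incomplete, ideal)
# Missing one of two current medications: recall 0.5, failing score
print(result["score"], result["passed"])  # → 3 False
```

A response containing both medications would score 5 and pass; the point is that a numeric score makes hundreds of such comparisons cheap to run.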

Reviewing the Scorecard results enables Jessica to compare the full testset's scores on both models for the Review Documents skill. This means she can evaluate CoCounsel's performance at scale much more efficiently.
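The at-scale comparison can be pictured as a per-testcase score diff between the two model runs. This is a hypothetical sketch, not Scorecard's API:

```python
def compare_models(old_scores: dict[str, int],
                   new_scores: dict[str, int]) -> dict[str, int]:
    """Per-testcase score delta between two model runs; negative
    values mark regressions introduced by the migration."""
    return {case: new_scores[case] - old_scores[case]
            for case in old_scores}

# Illustrative scores (1-5 scale) before and after migration
old = {"meds-query": 4, "dates-query": 4}
new = {"meds-query": 1, "dates-query": 3}
print(compare_models(old, new))  # → {'meds-query': -3, 'dates-query': -1}
```

The large negative delta immediately singles out the testcase that regressed hardest, which is what the next step of the case study turns on.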

Fig 1: Attorney SME manual review workflow.

Fig 2: Scorecard automated review workflow.

Reviewing the Scorecard data, Jessica quickly observes that on the new model, Scorecard consistently assigns failing scores to a specific testcase: a 1 out of 5 on every metric. She identifies underperformance in other testcases too, though those still yield higher scores than the problem testcase. Recognizing that the stakes are high, Jessica immediately begins troubleshooting the performance issue.
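Flagging a testcase that fails every metric in every run can be sketched as follows. Field names and the score scale are hypothetical:

```python
def flag_failing(results: dict[str, list[dict[str, int]]],
                 fail_score: int = 1) -> list[str]:
    """Return testcase ids that score `fail_score` on every metric in
    every run - the "least common denominator" cases worth triaging
    first. `results` maps testcase id -> list of per-run metric dicts."""
    flagged = []
    for case_id, runs in results.items():
        if all(score == fail_score
               for run in runs for score in run.values()):
            flagged.append(case_id)
    return flagged

# Illustrative results over three runs on the new model
results = {
    "meds-query": [{"recall": 1, "precision": 1, "accuracy": 1}] * 3,
    "dates-query": [{"recall": 4, "precision": 3, "accuracy": 4}] * 3,
}
print(flag_failing(results))  # → ['meds-query']
```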

Troubleshooting

Jessica and her team of SMEs begin to troubleshoot by homing in on the problem testcase that Scorecard identified.

The testcase user query asks:

What medications is the patient currently taking? Please be specific with prescription names and dosages.

Analyzing CoCounsel's outputs for the testcase, Jessica determines that on the new model, the Review Documents skill is failing to consistently identify all of the patient's medications, causing a clear discrepancy with the ideal response. The new model occasionally includes all the relevant medications, but such inconsistent behavior does not meet the required standard.

[Click image to expand] Fig. 3: Scorecard screenshots of the AI model's failing answer. As can be seen in the expanded "model response" window above, the model was including medications that were no longer currently active and was failing to identify the only two current, active medications (Aspirin 81MG EC TAB and Aspirin 325MG EC TAB).

By digging deeper into the problem testcase response and some of the other underperforming testcase responses, Jessica pinpoints the core issue: the AI model fails to provide a sufficiently comprehensive level of detail. Since the model sometimes does produce a complete response, Jessica notes, as a secondary concern, that the model also struggles to produce consistent results.

Iterative Resolution & Continuous Improvement

Having identified the core issues, Jessica brings them to the CoCounsel engineering team for resolution. She describes the parameters of an ideal response and how the new model's response falls short of target metrics, giving the engineers concrete goals they can use to modify the backend AI prompts. After each prompt change, Jessica evaluates a portion of the continuously updated testset, complemented by independent attorney reviews. Jessica and the engineering team execute multiple rounds of prompt changes, using Scorecard to evaluate the results, until the issue is completely resolved and the new model performs as expected. Scorecard now assigns the problem testcase a 4 out of 5 on all metrics, a good score: it reflects that the model has produced a valid response capturing all relevant substantive data points in the ideal response, though it may differ in subtler ways, such as writing style or level of additional detail. Resolving this core issue also resolves the secondary issue of inconsistent performance. Jessica further conducts manual reviews of CoCounsel's performance on the problem testcase.
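The iterate-and-measure loop described above can be sketched as workflow code, with the human-in-the-loop steps stubbed out as callables. All names are hypothetical stand-ins, not the actual process tooling:

```python
def tune_prompt(evaluate, revise, prompt: str,
                target: float = 0.99, max_rounds: int = 10) -> str:
    """Sketch of the iterate-and-measure loop from the case study:
    evaluate the current backend prompt against the testset, and if
    the pass rate is below target, hand the findings to engineering
    (`revise`) and re-test."""
    for _ in range(max_rounds):
        pass_rate = evaluate(prompt)        # fraction of runs passing
        if pass_rate >= target:
            return prompt                   # meets the bar; ship it
        prompt = revise(prompt, pass_rate)  # engineers adjust wording
    raise RuntimeError("target pass rate not reached; escalate")

# Toy stand-ins: each revision adds an instruction that lifts pass rate
rates = iter([0.40, 0.75, 1.00])
final = tune_prompt(lambda p: next(rates),
                    lambda p, r: p + " Be exhaustive.",
                    "List all current medications.")
print(final)
```

The loop terminates either by hitting the target pass rate or by escalating after a bounded number of rounds, mirroring the bounded back-and-forth between SME and engineers.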

These adjustments have cascading positive effects. When the problem testcase begins passing 99-100% of the time, the other testcases that had experienced the same issues (albeit less frequently) begin passing 100% of the time.

[Click image to expand] Fig 4: Scorecard screenshots of the AI model's passing answer. This was achieved after multiple rounds of testing and prompting changes, which confirmed the engineers were able to pinpoint and fix the issue. As shown in the expanded "model response" window above, the issue was ultimately fixed, and the model began answering this testcase correctly (as well as a few other testcases that had been failing, albeit less frequently, due to the same issue).

Once the model consistently returns results that meet TR's expectations and are suitable for legal work, Jessica feels secure in the knowledge that the Review Documents skill meets necessary standards and can be released to customers.

Even after the skill is released on the new model, Jessica continues to run various Scorecard tests multiple times daily to ensure consistency.
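Such repeated post-release runs amount to a consistency check on pass rate. Here is a toy stand-in, not the production monitoring setup:

```python
def consistency_check(run_once, n_runs: int = 20,
                      required: float = 0.99) -> bool:
    """Repeat a testcase many times and confirm its pass rate stays
    at or above the required level - a minimal version of the
    repeated daily runs described above. `run_once` returns True
    when a single evaluation passes."""
    passes = sum(1 for _ in range(n_runs) if run_once())
    return passes / n_runs >= required

# Stand-in for a real Scorecard run that always passes
print(consistency_check(lambda: True))  # → True
```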

Fig 5: Continuous improvement process between attorney SME and engineers.

Observations

CoCounsel's proactive and continuous iterative improvement process is painstaking but necessary. The problem testcase Jessica identified using Scorecard provided a useful benchmark for improvement because it failed more consistently than other testcases. This "least common denominator" testcase served as a yardstick against which the other testcases could be measured.

Using Scorecard allowed Jessica to extrapolate improvements from the single problem testcase to all other testcases, dramatically increasing the efficiency and speed with which she could iterate and improve CoCounsel鈥檚 performance across the board.

Conclusion

Innovation in AI is never "one and done." Models evolve, new risks emerge, and customer needs grow more complex. While this post has focused on Scorecard as one essential component of our testing infrastructure, it represents just one element of our comprehensive evaluation methodology. Our broader approach integrates dataset rotation, automated testing at scale, expert assessment from legal professionals like Jessica, continuous monitoring of live performance, and strategic partnerships with leading AI providers.

This multi-layered framework is what sets CoCounsel's approach apart. By combining deep legal expertise with world-class technology infrastructure, we're not only raising the standard for AI in professional fields, we're defining it. Through proactive system optimization and evaluation approaches, CoCounsel continues to deliver the transformative professional-grade legal AI capabilities that tens of thousands of legal professionals depend on.

---

About the Authors

Tyler Alexander is the Director of AI Reliability at 成人VR视频, where he leads a team of attorneys to ensure CoCounsel delivers trustworthy, professional-grade performance. He specializes in large-scale testing and benchmarking of AI systems for legal professionals.

Heather Nodler is a Lead CoCounsel AI Reliability Manager at 成人VR视频. With years of experience practicing law, they now apply their expertise to evaluating, calibrating, and continuously improving CoCounsel鈥檚 legal AI skills. Heather works closely with product and engineering teams to ensure that every CoCounsel feature meets the high standards required for real-world legal practice.
