Benchmarking Archives - Thomson Reuters Institute
https://blogs.thomsonreuters.com/en-us/innovation-topics/benchmarking/
Thomson Reuters Institute is a blog from Thomson Reuters, the intelligence, technology, and human expertise you need to find trusted answers.

Thomson Reuters Best Practices for Benchmarking AI for Legal Research
/en-us/posts/innovation/thomson-reuters-best-practices-for-benchmarking-ai-for-legal-research/ | Wed, 12 Feb 2025

At Thomson Reuters, we do an enormous amount of AI testing in our efforts to improve our customers' ability to move through legal work faster and more effectively. We've noticed an increase in interest in AI testing generally, and in benchmarking AI applications for legal research specifically. We've learned a lot in our thousands of hours of AI testing, so we offer the following best practices for anyone considering an updated or differentiated approach when testing or benchmarking AI for legal research.

1. Test for the results you care about most.

This would seem obvious, but we've seen a lot of confusion about it, and if we could only make one recommendation, this would be it. It's foundational for all other recommendations.

If you cared most about determining how long it takes to drive from one place to another, you wouldn't just measure highway time, you'd measure total door-to-door time. If you cared most about car maintenance costs, you wouldn't just measure the cost and frequency of brake repairs and maintenance.

When using AI for legal research, no LLM or LLM-based solution offers 100% accuracy. Because of that, all answers generated by large language models or LLM-based solutions, even those using Retrieval-Augmented Generation (RAG), must be independently verified.

Some assume verification is a simple matter of checking the sources cited in an AI answer, but this is incorrect. We've seen plenty of examples where an AI-generated answer is wrong, and the cited sources simply corroborate the wrong answer. Verification requires using additional tools (like a citator, statute annotations, etc.) to ensure the answer is correct.

This means every time an AI-generated answer is used for research, there is a three-step process the researcher must engage in: (1) review the answer, (2) review the cited material from the answer, (3) use traditional research tools to make sure the answer and cited material are correct.

When we talk with researchers about research generally and this process specifically, what they care about most is (a) getting to a correct answer or understanding of the relevant law, and (b) the time it takes to get to that correct answer or understanding.

Because of this, the two most important measures are:

  • Percentage of times using this three-step process the user can get to the right answer, and
  • Time it takes to complete all three steps

Surprisingly, the percentage of errors in answers in step 1 can have very little impact on the percentage of correct answers the researcher reaches using all three steps, or on the time to complete those steps (unless errors are excessive), as long as citations and links to primary law are good and those primary resources are current and easily verified. Focusing on step one is like trying to figure out door-to-door times by measuring highway speeds only. It's not very useful.

For instance, which of the following systems would you rather use?

  • System where the initial AI answer is 92% accurate, but verification, on average, takes 18 minutes, and post-verification accuracy is 97%, or
  • System where the initial AI answer is 89% accurate, but verification, on average, takes 10 minutes, and post-verification accuracy is 99.9%

It's a clear choice, but there is often a misplaced focus on measurement of the first step in the process to the exclusion of steps two and three. Measure what you care about most.
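To make that concrete, here is a minimal sketch (in Python, using the example figures above) of ranking systems on the end-to-end measures instead of step-one accuracy alone:

    # Example figures from above; the two measures that matter are final
    # (post-verification) accuracy and total verification time.
    systems = {
        "A": {"initial_accuracy": 0.92, "verify_minutes": 18, "final_accuracy": 0.97},
        "B": {"initial_accuracy": 0.89, "verify_minutes": 10, "final_accuracy": 0.999},
    }

    # Rank on final accuracy (higher is better), then verification time (lower is better).
    best = max(systems, key=lambda k: (systems[k]["final_accuracy"], -systems[k]["verify_minutes"]))
    print(best)  # "B" wins despite losing on step-one accuracy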

2. Use realistic, representative questions in your testing.

Presumably you want to evaluate AI for the typical legal research you or your organization does. For instance, if you look at the research your organization does and find the questions are roughly 20% simple questions, 60% medium complexity, and 20% very complex or difficult, and that roughly half are questions about IP law and half are about federal civil procedure, then a benchmark testing 90% simple questions about criminal law would not be very helpful to you.

At Thomson Reuters, we model our testing based on the real-world questions we see from our customers every month. For your own testing, focus on the question types that best represent the researchers you're focused on.

Testing mostly simple questions with clear-cut answers is easiest, but if those question types don't represent what your users do most (they don't represent most AI usage in Westlaw well), then the results are not particularly helpful. Similarly, if you primarily test overly complex, extremely difficult and nuanced questions, or trick questions, those can be useful for testing the limits of a system, but they tend not to be very helpful for most real-world decision making.

3. Test a lot of questions.

In our own testing, we've found that testing small sets of questions is rarely representative of actual performance with a larger set. Large language models can generate different responses each time, even with identical inputs. Additionally, if responses are long and complex, graders may disagree, even when judging identical responses. For just a quick general sense of direction, it's fine to test with a sample of questions as small as 100 or so, but for comparing algorithms/LLMs against each other, we strongly recommend checking the results as you grade and testing until the measure of interest stabilizes. For example, if you are running a comparison between two systems to see which is preferred, you would test until the rate at which one system is preferred over the other stops changing dramatically with each new batch of questions (see the sketch below). Another guide to the number of questions you should test is the confidence level and interval you want (see next section).
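A minimal sketch of that stopping rule, assuming preference grades arrive in batches (the tolerance threshold is an illustrative choice, not a recommendation):

    def preference_stabilized(batches, tolerance=0.02):
        """Return (stable, rate) for a cumulative preference rate.

        `batches` is a list of (wins_for_system_a, total_comparisons) tuples,
        one per grading batch; stop once a new batch moves the cumulative
        rate by less than `tolerance`.
        """
        wins = total = 0
        previous = None
        for batch_wins, batch_total in batches:
            wins += batch_wins
            total += batch_total
            rate = wins / total
            if previous is not None and abs(rate - previous) < tolerance:
                return True, rate
            previous = rate
        return False, wins / total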

4. Calculate and report confidence levels and intervals.

Even with a relatively large set of questions, measurements of accuracy are only so precise. When using these measurements to make decisions, it's important to understand the degree or range of accuracy of the measurement, often referred to as confidence level and confidence interval. You can think of confidence intervals and levels like the margin of error in surveys. They let you know how reliable or repeatable the measurement is expected to be.

For instance, if you tested AI accuracy on 200 questions and then ran the test again with the same questions/answers but different evaluators, or with the same evaluators but a different random, representative sample of 200 questions, would you expect the exact same result? Typically, you wouldn't. You'd expect the result to fall within a certain range, so it's important to report that range along with the results so decision makers understand which differences between algorithms/LLMs are meaningful and which are not. The proper way to report this is with confidence intervals and levels. Using standard assumptions, when measuring an error rate of 10% from a sample of only 100 questions, you can be about 95% confident that the true error rate is between 4.1% and 15.9%. This is called a 95% confidence level, and the "+/- 5.9%" is the margin of error. If you measure an error rate of 10% from a sample of 500 questions, the 95% confidence interval would be between 7.4% and 12.6%, or 10% +/- 2.6%.
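Those intervals can be reproduced with the usual normal approximation to the binomial; a minimal sketch:

    import math

    def margin_of_error(rate, n, z=1.96):
        """Normal-approximation margin of error for a proportion at ~95% confidence."""
        return z * math.sqrt(rate * (1 - rate) / n)

    for n in (100, 500):
        moe = margin_of_error(0.10, n)
        print(f"n={n}: 10% +/- {moe:.1%} -> ({0.10 - moe:.1%}, {0.10 + moe:.1%})")
    # n=100: 10% +/- 5.9% -> (4.1%, 15.9%)
    # n=500: 10% +/- 2.6% -> (7.4%, 12.6%)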

The basic power analysis used to estimate a confidence interval assumes a perfect means of detecting the outcome you are trying to measure. If there is some uncertainty in that detection, e.g., if two independent evaluators disagree about the outcome some percentage of the time, then the margin of error increases. A grading process or measurement that's unreliable ~5% of the time might increase the margin of error from 5.9% to 7.3% in our example above with 100 questions. It's important to note that there are various methods for calculating standard error, and these examples make simplifying assumptions that likely underestimate the confidence intervals observed in practice.
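One way to see the effect of grading noise is simulation. The sketch below assumes a simple model in which graders independently flip a verdict 5% of the time; the exact figures depend on the noise model, so treat the output as illustrative:

    import random

    def simulate_spread(n_questions=100, true_error=0.10, grader_noise=0.05, trials=20000):
        """Empirical 95% range of measured error rates under random grader flips."""
        rates = []
        for _ in range(trials):
            reported = 0
            for _ in range(n_questions):
                wrong = random.random() < true_error      # answer truly has an error
                flipped = random.random() < grader_noise  # grader records the wrong verdict
                reported += wrong != flipped              # what the grader actually reports
            rates.append(reported / n_questions)
        rates.sort()
        return rates[int(0.025 * trials)], rates[int(0.975 * trials)]

    print(simulate_spread(grader_noise=0.0))   # spread from sampling alone
    print(simulate_spread(grader_noise=0.05))  # wider (and shifted) once grading noise is added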

5. Use a combination of automated and manual evaluation efforts.

Having human evaluators pore through lengthy answers to complex questions can be difficult and time-consuming. Ideally, we would just have AI evaluate the accuracy and quality of answers generated by AI. This is sometimes referred to as "LLM as judge." But in the same way that AI makes mistakes when generating an answer, it can also make mistakes when evaluating the quality of an answer against a gold-standard answer written by a human. In our experience, modern LLMs are pretty good at evaluating AI-generated answers against gold-standard answers when answers are clear and relatively short. With length and complexity, we've found the LLM-as-judge approach to be very unreliable.

For instance, research has shown that LLMs tend to struggle when evaluating responses to complex and challenging questions, like those requiring expert knowledge, reasoning, and math.

Since most test sets will contain a sample of simple/easy/clear questions and answers, it makes sense to use AI for automated evaluation of these, then use human evaluators for the rest, at least until AI improves to the point where more can be automated.
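In practice this amounts to a routing rule. A minimal sketch, assuming each test case carries a difficulty tag and a model answer (both field names are illustrative):

    def route_for_evaluation(test_cases, max_answer_chars=1200):
        """Send short, simple cases to the LLM judge; everything else to humans."""
        automated, manual = [], []
        for case in test_cases:
            simple = case["difficulty"] == "simple"
            short = len(case["model_answer"]) <= max_answer_chars
            (automated if simple and short else manual).append(case)
        return automated, manual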

6. For human grading, use two separate human evaluators for each answer, and have a third (ideally more experienced) evaluator to resolve conflicts.

For assessments like these, disagreement between evaluators can be a real issue. In our own testing, we've found attorneys evaluating AI-generated answers for more complex legal research questions can disagree about the accuracy or quality of answers about 25% of the time, which makes single-grader evaluation unreliable. To improve reliability, we have two evaluators separately grade each answer, and where there are conflicts, we have a third, more experienced evaluator resolve the conflict.
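A minimal sketch of that workflow (the adjudicator callable and grade values are illustrative):

    def final_grade(grade_a, grade_b, adjudicate):
        """Two independent grades per answer; a senior evaluator breaks ties."""
        return grade_a if grade_a == grade_b else adjudicate()

    def disagreement_rate(grade_pairs):
        """Fraction of answers where the two graders disagreed - a useful
        health check on how reliable single-grader scores would have been."""
        return sum(a != b for a, b in grade_pairs) / len(grade_pairs)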

7. When answers are wrong, investigate to see if the gold-standard answer might be wrong.

In the same way people make mistakes in evaluating answers, they can also make mistakes in coming up with the gold-standard answer for testing. In our experience, we've found some instances where the AI-generated answer was evaluated as incorrect when compared to the gold-standard answer, but when we dug into it further, it turned out the AI was correct and the person who put together the gold-standard answer was wrong. Sometimes AI makes mistakes and sometimes humans make mistakes, so you should check both.

8. If evaluating multiple algorithms/LLMs/solutions, make sure the evaluators are blind to which algorithm/LLM/solution the answer was generated by.

In our evaluations we try to avoid human bias in grading. Sometimes an evaluator has had bad experiences or great experiences with a certain product or LLM in the past, and we don't want them to bring that bias to the current evaluation. So when evaluating different solutions, we first strip away anything that would identify the source of the solution, so results are not biased by past positive or negative experiences.
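A minimal sketch of that blinding step (field names are illustrative):

    import random

    def blind_for_grading(answers):
        """Strip source identifiers and shuffle presentation order.

        Graders see only an opaque id and the answer text; the id-to-system
        key is held back until grading is complete.
        """
        key = {i: a["system"] for i, a in enumerate(answers)}
        blinded = [{"id": i, "text": a["text"]} for i, a in enumerate(answers)]
        random.shuffle(blinded)
        return blinded, key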

9. Grade the value of answers in addition to making a binary determination of whether the answer has an error.

What's right or wrong in an answer can vary enormously in terms of positive value and negative impact. For instance, consider the following answers:

A. Answer is correct in every way but is short and high level. It just gives a basic description of the legal issue as it relates to the question but doesn't provide any references to primary or secondary law for verification, nor any nuance regarding exceptions or other considerations.

B. Answer is lengthy and nuanced, addressing multiple aspects of the question and discussing important exceptions that might apply; it provides references with citations and links for verification; and it's correct in every way except that one citation has an incorrect date, which is easily verified and corrected by clicking the link in the citation.

C. Answer is incorrect in every way and all its linked references point to primary law that simply corroborate the wrong answer.

If the evaluation is simply a binary view of the number of answers that contain an error, then answer A looks good and answers B and C look equally bad. In reality, answer C is far worse and more harmful than answer B, and Answer B is likely much more valuable to the researcher than answer A.

In our evaluations, we're looking for answer attributes that are helpful to researchers, like depth of the answer and quality of the references, and we don't just evaluate errors in a binary way. We consider answers that are totally wrong to be far worse than erroneous statements in otherwise correct and helpful answers. Similarly, we weigh erroneous statements based on whether they address the core question or are tangential to it, and whether they're contradicted in the answer or easily verified with the linked references. We'd like to eradicate all errors, of course, but some are more harmful than others.
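A minimal sketch of what a non-binary rubric might look like (the fields and weights are illustrative, not our production rubric):

    RUBRIC_WEIGHTS = {
        "core_answer_correct": 5.0,   # is the bottom-line answer right?
        "depth_and_nuance": 2.0,      # exceptions and related considerations covered
        "reference_quality": 2.0,     # citations/links to primary law for verification
        "error_severity": -4.0,       # central, misleading errors cost far more
    }

    def rubric_score(grades):
        """`grades` maps each rubric field to a 0..1 evaluator assessment."""
        return sum(w * grades.get(field, 0.0) for field, w in RUBRIC_WEIGHTS.items())

    # Answer C above (totally wrong, with corroborating references) scores far
    # below answer B (one wrong citation date in an otherwise strong answer).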

10. Look for errors beyond gold-standard answers.

Often LLMs generate answers with information beyond the scope of a gold-standard answer. For instance, the gold-standard answer might specify that the answer to the question is no, that the explanation should cover X, Y, and Z, and that it should specifically cite cases A & B and statute C.

The LLM-generated answer might state the answer is no and explain X, Y, and Z with references to A, B, and C, but it might also add a few statements about exceptions or related issues or an additional case or statute. Sometimes these additional statements are incorrect, even when everything else is correct. So, if an LLM-as-judge or human evaluator only looks at the gold-standard answer to see if the AI-generated answer is correct, that evaluation can miss errors in the additional material. This means evaluators need to do independent research beyond simply looking at the gold-standard answers to determine if an answer has an error.
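One way to operationalize this, as a sketch: treat every statement in the model answer that the gold standard does not cover as unverified and queue it for independent research. The claim-matching step, glossed over here with exact matching, is the hard part in practice:

    def claims_needing_verification(model_claims, gold_claims):
        """Statements in the model answer not covered by the gold standard.

        Assumes answers have already been split into discrete claims upstream;
        exact-match comparison is a stand-in for real claim matching.
        """
        covered = set(gold_claims)
        return [claim for claim in model_claims if claim not in covered]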

11. Consider testing reliability.

LLMs often have some randomness built into them. Many have a temperature setting that can be used to minimize or eliminate this, making answers more consistent when asking the same question multiple times.

But some LLMs are better at this than others, and some integrated solutions that use LLMs in conjunction with other techniques, like RAG, don't set temperature low, to allow for more creativity in answers.

For big decisions you might be making, consider testing reliability by running the same question 20 times and seeing if any of the answers are substantially worse than the other answers to the same question.
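A minimal sketch of that probe (`ask` is whatever callable wraps the system under test):

    def reliability_probe(ask, question, runs=20):
        """Ask the same question repeatedly and collect answers for comparison.

        Review the spread (manually or with a judge) and flag runs whose
        answers are substantially worse than the rest.
        """
        return [ask(question) for _ in range(runs)]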

The above are our best practices and learnings from our extensive experience with AI, GenAI, and LLMs over the past 30 years. At Thomson Reuters we put the customer at the heart of each decision we make, and we are transparent that, at the point of use, all our AI-generated answers must be checked by a human.

As we work through testing our AI products, our teams do not follow each of these steps for every test we do; sometimes we prioritize speed over accuracy of testing, or vice versa, but we make sure we clearly understand the trade-offs in prioritizing some of these steps and communicate them with our teams. The bigger and more important the decision we're trying to make, the more of these steps we follow.

This is a guest post from Mike Dahn, head of Westlaw Product, and Dasha Herrmannova, senior applied scientist, from Thomson Reuters.

Raghu Ramanathan: Reflections on Legal Generative AI One Year In
/en-us/posts/innovation/raghu-ramanathan-reflections-on-legal-generative-ai-one-year-in/ | Mon, 30 Dec 2024

I recently talked with Ben Joyner about generative AI in the legal space, touching on everything from our company's M&A strategy to how CoCounsel is transitioning to a multi-model product. Talking with Ben about how generative AI has shaped our industry over the past year has me reflecting on my first year with Thomson Reuters.

Raghu Ramanathan, president, Legal Professionals, Thomson Reuters.

Continued climb in law firm productivity

I joined Thomson Reuters in February, and a notable way we've seen the impact of generative AI solutions is the uptick in lawyer productivity. For the first time in years, Q2 saw a majority of law firms experience productivity growth. By Q3, an astounding 64% of law firms reported productivity growth, building on the gains made in Q2.

This uptick underscores how technology is key to boosting law firm profitability. Law firms that invest in new technology as well as adopt AI and generative AI solutions to streamline workflows and improve the efficiency and quality of their work are best positioned to improve client satisfaction and drive sustainable productivity growth.

Build, buy, partner strategy

I'm pleased with the progress Thomson Reuters has made on our vision to provide all the legal professionals we serve with a professional-grade GenAI assistant to augment their work. We've committed to investing $100 million annually in AI over the coming years, including investing more than $200 million to incorporate responsible AI into our solutions in the past year alone.

This year we continued investing in the latest technology through our build, buy, partner program. On the buy side in the legal space, our acquisition of Safe Sign Technologies, a UK legal large language model (LLM) startup, in August is proving a great fit. We're incorporating Safe Sign's tech and talent into our industry-leading content and expertise to bring customers even greater quality and performance from our AI solutions.

On the build side, we introduced 19 legal generative AI solutions in 2024. Highlights include CoCounsel 2.0, the professional-grade GenAI assistant; Claims Explorer, a generative AI skill; CoCounsel Drafting, an end-to-end drafting solution that streamlines and improves the drafting process for legal professionals within Microsoft Word; and Mischaracterization Identification in Quick Check and AI Jurisdictional Surveys, two generative AI research features that help customers save substantial time and deliver greater confidence that legal research is accurate, thorough, and complete. We also delivered deeper integration of CoCounsel into Westlaw and Practical Law.

On the partner side, we're working with Microsoft, OpenAI, Google and others on plugins and integrations to enhance the generative AI-powered capabilities in our solutions. Every aspect of our build, buy, partner strategy is geared toward helping our customers automate their workflows, provide powerful insights to their clients and drive efficiencies.

A maturing market

2024 saw the implementation of legal generative AI solutions as well as efforts to benchmark these solutions. Our benchmarking support is reflected in our participation in studies including Vals.ai plus two consortium efforts, from Stanford and Litig, exploring how to best evaluate legal AI.

I believe that benchmarking can improve both the development and the adoption of AI, but it's just one component in how we consider and understand the benefits AI delivers for our customers. I look forward to our ongoing collaboration with customers and industry partners as we continue working to minimize inaccuracies and increase the usefulness of the research outcomes for generative AI solutions.

To date, 15% of law firms have adopted and implemented legal-specific generative AI solutions. I anticipate we'll soon see a wave of fast followers, eager to be perceived as innovative, that will dramatically strengthen generative AI implementation.

I can't think of a more exciting time to have joined a business. Where our industry is now mirrors the early internet era: initial excitement, followed by strategic integration.

We're fast approaching a maturing market where legal professionals will not just desire but require AI capabilities for their workflows. We'll see more implementation of generative AI solutions among legal professionals as they increasingly realize the tangible benefits.

For more on how generative AI is shaping the future of the legal profession, please check out my interview.

This is a guest post from Raghu Ramanathan, president, Legal Professionals, Thomson Reuters.

Legal AI Benchmarking: CoCounsel
/en-us/posts/innovation/legal-ai-benchmarking-cocounsel/ | Wed, 23 Oct 2024

We're excited to share a detailed look into our testing program for CoCounsel, including specific methodologies for evaluating its skills. We aim not only to showcase the steps we take to ensure CoCounsel's reliability, but also to contribute to broader benchmarking efforts in the legal AI industry. Though it's challenging to establish universal benchmarks in such a diverse field, we're engaging with industry stakeholders to work toward the shared goal of elevating the reliability and transparency of AI tools for all legal professionals.

Why evaluating legal skills is complicated

Traditional legal benchmarks usually rely on multiple-choice, true/false, or short-answer formats for easy evaluation. But these methods aren't enough to assess the complex, open-ended tasks lawyers encounter daily and that large language model (LLM)-powered solutions like CoCounsel are built to perform.

CoCounsel's skills produce nuanced outputs that must meet multiple criteria, including factual accuracy, adherence to source documents, and logical consistency. These are difficult outputs to evaluate using true/false tests. On top of that, assessing the "correctness" of legal outputs can be subjective. For instance, some users prefer detailed summaries, others prefer concise ones. Neither is "wrong"; it just comes down to preference, which makes it difficult to consistently automate evaluations.

To make it even more complicated, each CoCounsel skill often involves multiple components, with the LLM handling only the final stage of answer generation. For example, the Search a Database skill first uses various non-LLM-based search systems to retrieve relevant documents before the LLM synthesizes an answer. If the initial retrieval process is substandard, the LLM’s performance will be compromised. So, our evaluation must consider both LLM-based and non-LLM-based aspects, to make sure our assessment of the whole is accurate.
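A minimal sketch of what scoring the stages separately might look like (the function and field names are illustrative, not CoCounsel internals):

    def evaluate_pipeline(case, retrieve, generate, judge):
        """Score retrieval and generation separately so failures are attributable."""
        doc_ids = retrieve(case["question"])                   # non-LLM search stage
        relevant = set(case["relevant_doc_ids"])
        recall = len(set(doc_ids) & relevant) / len(relevant)  # did search find the right docs?
        answer = generate(case["question"], doc_ids)           # LLM synthesis stage
        return {"retrieval_recall": recall,
                "answer_score": judge(answer, case["ideal_response"])}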

How we benchmark

Our benchmarking process begins long before putting CoCounsel through its paces. Whenever a significant new LLM is released, we test it across a wide suite of public and private legal tests, such as the dataset created by our Stanford collaborators, to assess its aptitude for legal review and analysis. We then integrate the LLMs that perform well in these initial tests with the CoCounsel platform, in a staging environment, to evaluate how they perform under real-world conditions.

Then we use an automated platform to run a battery of test cases created by our Trust Team (more on this below), to evaluate the output that comes from this experimental integration. If the results are promising, we conduct additional manual reviews using a skilled team of attorneys. When we see an improvement in performance compared to previous benchmarks, then we start talking as a team about how it might improve the CoCounsel experience for our users.

How we test

Our Trust Team has been around as long as CoCounsel has. This group of experienced attorneys from diverse backgrounds, including in-house counsel, large and small law firms, government, and public policy, is dedicated to continuous, rigorous testing of CoCounsel's performance.

We continue to follow a process that's been integral to all our performance evaluation since CoCounsel's inception: Our Trust Team creates tests representative of the real work attorneys use CoCounsel for and runs these tests against CoCounsel skills. When creating a test, they first consider what the skill's for and how it might be used, based on their own insights, customer feedback, and secondary sources. Once the test is created, the attorney tester manually completes the test task, just as a lawyer would, to create an answer key, which we refer to as an "ideal response." These tests and their corresponding ideal responses then undergo peer review. Being this meticulous is crucial, because the quality of our ideal responses determines the benchmark for a passing score.

Once the ideal response has been created, a member of the Trust Team runs the test, using the applicable CoCounsel skill to complete the task just as a user would. An attorney tester reviews the output, referred to as our "model response." Then they compare CoCounsel's response to the ideal response point by point, identifying differences and assessing whether those differences deviate from the ideal response in a way, or to a degree, that would make the skill's output incomplete, incorrect, or misleading. It is important to note that a test can be failed for many reasons, not just because an answer is outright wrong. For instance, some skills (e.g., Summarize) are more inherently subjective than others and do not lend themselves to a right or wrong answer. Accordingly, even though attorneys' opinions of the "correct" answer can reasonably vary, we try to maintain the highest grading standards and "fail" the test if it is missing a single detail the attorney-tester considers to be Critical.

Based on this manual evaluation, we create a set of evaluation instructions tailored to the ideal output unique to each skill, capturing the needs of each skill so an LLM can match our human reviewers' process. To illustrate this, let's look at a real example using CoCounsel's Extract Contract Data skill. After running the test in our automated evaluation platform, here's what an attorney reviewer would see when reconciling the model response with the LLM's evaluation of it:

  • Question: If the customer is acquired through a merger, can the agreement be assigned to the acquiring entity without obtaining the provider's prior written consent?
    • Trust Team ideal response:
      • Short answer: Yes, with 30 days' prior written notice.
      • Detailed answer: Yes, if the customer is acquired through a merger, the agreement can be assigned to the acquiring entity without obtaining the provider's prior written consent; the customer is only required to provide the provider with 30 days' prior written notice. The clause specifies that either party may assign its rights or delegate its obligations without such consent, provided that 30 days' prior written notice is given to the other party, to an entity that acquires all or substantially all of the business or assets of such party, including by merger.
  • Model response:
    • Short answer: Yes, with notice.
    • Detailed answer: The agreement can be assigned to the acquiring entity without obtaining the provider's prior written consent if the customer is acquired through a merger. The clause provides an exception to the general rule requiring consent for assignment. Specifically, it states that either party may assign its rights or delegate its obligations without such consent to an entity that acquires all or substantially all of the business or assets of the party, including through a merger, provided that 30 days' prior written notice is given to the other party.
    • Automated evaluation analysis: The model response captures the correct information that the agreement can be assigned to the acquiring entity without obtaining the provider's prior written consent but requires 30 days' prior written notice. However, it incorrectly states the short answer as "Yes, with notice" instead of "Yes, with 30 days' prior written notice," as per the ideal answer. This is a minor difference and does not change the meaning significantly.

In this instance, the model response included a minor discrepancy from the attorney-authored ideal response. But the LLM's evaluation of the response accurately determined that the answer remained sufficient, because it captured the complete notice requirement elsewhere in the response.

Our ideal-response approach provides two key advantages over assertion-based evaluations. It excels at identifying deviations from attorney expectations, including hallucinations. And it pinpoints extraneous or inconsistent information that, while not technically a hallucination, could make even a complete response incorrect if that information introduces logical inconsistencies, which would result in a failing score.

We rely on our Trust Team to create well-defined ideal responses and auto-evaluation instructions and to determine if a test case passes or fails. A skill's output definitively fails if it falls short of this ideal because of material omissions, factual incorrectness, or hallucinations. However, we recognize that many legal issues aren't black-and-white, and the "correct" answer could be open to reasonable disagreement. To address this, we peer review ideal responses in cases when the answer might require a second opinion. And we might eliminate tests when we find insufficient agreement among the attorney testers. This is how we both ensure that our passing criteria remain rigorous and account for the nuanced nature of legal analysis.
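Put together, the evaluation loop looks roughly like the sketch below (the judge callable and severity labels are illustrative stand-ins, not our production code):

    def auto_evaluate(model_response, ideal_response, instructions, llm_judge):
        """Ideal-response evaluation: fail on any critical deviation."""
        verdict = llm_judge(
            instructions=instructions,   # tailored per skill by the Trust Team
            ideal=ideal_response,        # attorney-authored answer key
            candidate=model_response,    # output from the skill under test
        )
        failed = any(d["severity"] == "critical" for d in verdict["discrepancies"])
        return {"passed": not failed, "discrepancies": verdict["discrepancies"]}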

Maintenance and improvement

Creating a skill test set is only the beginning. Once we begin using it, the Trust Team continually monitors and refines it by manually reviewing failure cases from the automated tests and spot-checking passing samples to make sure the automated evaluation is in line with human judgments. We also regularly add tests to cover more use cases and capture user-reported issues, which could lead to further iterations of the tests submitted for automated evaluation and their success criteria.

By following this process, every night we can execute, across all CoCounsel skills, more than 1,500 tests on our automated platform under attorney oversight, which combined with manual testing means we've run more than 1,000,000 tests since CoCounsel's launch. And it empowers us to quickly identify areas for improvement, which is vital to ensuring CoCounsel remains the most trustworthy AI legal assistant available.

Conclusion

In a previous post, we explored what it means for an AI tool to be "professional-grade" and why that standard is crucial for professionals in high-stakes fields like law. This post takes that concept further by diving into how we benchmark CoCounsel to ensure it meets those rigorous standards. By understanding the extensive testing that goes into evaluating its performance, you can see how CoCounsel consistently delivers the reliability and accuracy expected of a true professional-grade GenAI solution.

To promote the transparency my team and I believe is necessary in the legal AI field, we've decided to release, for the first time, some of our performance statistics, along with a sample of the tests used to arrive at those figures, applying the criteria referenced within this article. Check out our results.

This is a guest post from Jake Heller, head of CoCounsel, Thomson Reuters.

A Holistic Approach to Advancing Generative AI Solutions
/en-us/posts/innovation/a-holistic-approach-to-advancing-generative-ai-solutions/ | Thu, 17 Oct 2024

At Thomson Reuters, our vision is to deliver an AI assistant for every professional we serve. As part of that, our focus is on delivering benefits for our customers across the breadth of our AI- and non-AI-powered features. We know that our solutions deliver benefits to customers in many ways, including AI-powered automation.

In April of this year, we shared our vision to provide a GenAI assistant for each professional we serve. CoCounsel embodies our ongoing efforts to augment professionals' work with GenAI skills, enabling professionals to accelerate and streamline entire workflows to increase efficiency, produce better work, and deliver more value for their clients. Our continued investment in GenAI is aimed at enabling professionals across industries to accelerate and streamline entire workflows through a single GenAI assistant.

We believe our investment in GenAI, along with our integration with customer data as well as third-party integrations, extends the value customers derive from CoCounsel beyond our connected experience and our verified and trusted content. Our work with Microsoft, for example, includes CoCounsel integrations across Word, Outlook, and Teams, meeting professionals where they're already working.

AI and large language models are proving to be powerful tools that deliver efficiency gains and strengthen research practices for our customers. Yet our efforts to redefine work with GenAI are rooted in our strong foundation of editorial enhancements, authoritative content, and technological expertise, alongside our long history of working closely with customers. That's why we continue to build out AI- and non-AI-powered solutions to help with the entire workflow for legal, tax, and risk and compliance professionals. While AI may not be perfect, it can significantly help professionals reduce routine work and manage more complex, substantive work more efficiently. We collaborate with our customers to help them understand that AI is an accelerant rather than a replacement for their own research.

Benchmarking expectations

As a leader in innovation and AI research, we recognize the role that independent benchmarking plays in ensuring the accuracy, transparency, and accountability of evolving GenAI solutions. We believe that benchmarking can improve both the development and the adoption of AI. We also see it as one component in a broad range of ways we consider and understand the benefits AI delivers for our customers. We work with our customers as their trusted partners for change, helping them to confidently understand and adopt new technologies, looking at both their immediate value and their role in long-term transformation, and leveraging our deep understanding of their businesses.

At Thomson Reuters, our understanding of the holistic value of our products is based on customers' usage and the benefits they derive. Our customers have run more than 2.5M searches through AI-Assisted Research on Westlaw Precision since its launch late last year, and they tell us it's saving time and improving productivity. Similarly, internal testing of CoCounsel's skills has yielded impressive results, particularly with regards to CoCounsel's document review capabilities.

Our benchmarking support is reflected in our participation in studies including Vals.ai as well as two consortium efforts, from Stanford and Litig, exploring how to best evaluate legal AI. We are submitting CoCounsel AI skills to the Vals.ai benchmarking study in five areas of evaluation: Doc Q&A, Data Extraction, Document Summarization, Chronology Generation, and E-Discovery.

The Vals.ai study is a first attempt at establishing a standard, so we should view this work as a first iteration and an opportunity to learn rather than treating it like a gold standard. For example, one limitation of the benchmarking methodology is that each vendor's results are evaluated based on the text output alone, removed from the interface and experiences of the individual products. This discounts the work each vendor has done to design interfaces and safety features to minimize the harms of errors, and it reinforces the need for a holistic evaluation of each product being tested, ideally as designed for the user.

Looking ahead, my expectation is that, while accuracy will continue to improve, no product will produce answers entirely free of errors. And as we've shared with our customers, every AI product requires human expertise for verification and review, regardless of the accuracy rate. Because the current approach to benchmarking reports an accuracy percentage, we need to be very clear on this point: whether the product scores in the low or high 90s, all answers still must be checked 100% of the time.

I look forward to our ongoing collaboration with customers and industry partners as we continue our work towards minimizing inaccuracies and increasing the usefulness of the research outcomes for GenAI tools and all our solutions.
