legal research Archives - Thomson Reuters Institute
https://blogs.thomsonreuters.com/en-us/innovation-topics/legal-research/
Thomson Reuters Institute is a blog from Thomson Reuters, the intelligence, technology and human expertise you need to find trusted answers.

Thomson Reuters Best Practices for Benchmarking AI for Legal Research
/en-us/posts/innovation/thomson-reuters-best-practices-for-benchmarking-ai-for-legal-research/
Wed, 12 Feb 2025 15:38:20 +0000

At Thomson Reuters, we do an enormous amount of AI testing in our efforts to improve our customers' ability to move through legal work faster and more effectively. We've noticed increasing interest in AI testing generally, and in benchmarking AI applications for legal research specifically. We've learned a lot in our thousands of hours of AI testing, so we offer the following best practices for anyone considering a new or updated approach to testing or benchmarking AI for legal research.

1. Test for the results you care about most.

This would seem obvious, but we've seen a lot of confusion about it, and if we could only make one recommendation, this would be it. It's foundational for all other recommendations.

If you cared most about determining how long it takes to drive from one place to another, you wouldn't just measure highway time; you'd measure total door-to-door time. If you cared most about car maintenance costs, you wouldn't just measure the cost and frequency of brake repairs; you'd measure total maintenance costs for the whole car.

When it comes to AI for legal research, no LLM or LLM-based solution offers 100% accuracy. Because of that, all answers generated by large language models or LLM-based solutions, even those that use Retrieval Augmented Generation (RAG), must be independently verified.

Some assume verification is a simple matter of checking the sources cited in an AI answer, but this is incorrect. We've seen plenty of examples where an AI-generated answer is wrong, and the cited sources simply corroborate the wrong answer. Verification requires using additional tools (like a citator, statute annotations, etc.) to ensure the answer is correct.

This means every time an AI-generated answer is used for research, there is a three-step process the researcher must engage in: (1) review the answer, (2) review the cited material from the answer, (3) use traditional research tools to make sure the answer and cited material are correct.

When we talk with researchers about research generally and this process specifically, what they care about most is (a) getting to a correct answer or understanding of the relevant law, and (b) the time it takes to get to that correct answer or understanding.

Because of this, the two most important measures are:

  • The percentage of times, using this three-step process, the user gets to the right answer, and
  • The time it takes to complete all three steps

Surprisingly, the percentage of errors in answers at step 1 can have very little impact on the percentage of correct answers the researcher reaches using all three steps, or on the time to complete those steps (unless errors are excessive), as long as citations and links to primary law are good and those primary sources are current and easily verified. Focusing on step one alone is like trying to figure out door-to-door times by measuring highway speeds only: it's not very useful.

For instance, which of the following systems would you rather use?

  • A system where the initial AI answer is 92% accurate, but verification takes 18 minutes on average, and post-verification accuracy is 97%, or
  • A system where the initial AI answer is 89% accurate, but verification takes 10 minutes on average, and post-verification accuracy is 99.9%

It's a clear choice, but there is often a misplaced focus on measuring the first step in the process to the exclusion of steps two and three. Measure what you care about most.
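To make the comparison concrete, here is a minimal sketch in Python of choosing between the two hypothetical systems above by their end-to-end metrics rather than their step-one answer rates. The selection rule (maximize post-verification accuracy, break ties on verification time) is one reasonable policy, not the only one:

```python
# Hypothetical systems from the example above, end-to-end metrics included.
systems = {
    "A": {"initial_acc": 0.92, "verify_minutes": 18, "final_acc": 0.970},
    "B": {"initial_acc": 0.89, "verify_minutes": 10, "final_acc": 0.999},
}

# Prefer higher post-verification accuracy; break ties on verification time.
best = max(systems, key=lambda s: (systems[s]["final_acc"],
                                   -systems[s]["verify_minutes"]))
print(f"Prefer system {best}")  # Prefer system B, despite its lower step-one accuracy
```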

2. Use realistic, representative questions in your testing.

Presumably you want to evaluate AI for the typical legal research you or your organization does. For instance, if you look at the research your organization does and find the questions are roughly 20% simple questions, 60% medium complexity, and 20% very complex or difficult, and that roughly half are questions about IP law and half are about federal civil procedure, then a benchmark testing 90% simple questions about criminal law would not be very helpful to you.
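As a rough sketch of what matching that distribution looks like in practice, the snippet below draws a stratified test set; the `question_bank` structure, labels, and exact mix are illustrative assumptions, not a prescribed taxonomy:

```python
import random

def stratified_sample(question_bank, n_total, mix):
    """Draw a test set whose (complexity, area) mix matches target fractions."""
    sample = []
    for (complexity, area), frac in mix.items():
        pool = [q for q in question_bank
                if q["complexity"] == complexity and q["area"] == area]
        k = min(len(pool), round(n_total * frac))
        sample.extend(random.sample(pool, k))
    return sample

# Target mix from the example: 20/60/20 by complexity, split evenly by area.
mix = {
    ("simple", "ip"): 0.10,     ("medium", "ip"): 0.30,     ("complex", "ip"): 0.10,
    ("simple", "civpro"): 0.10, ("medium", "civpro"): 0.30, ("complex", "civpro"): 0.10,
}
```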

At Thomson Reuters, we model our testing based on the real-world questions we see from our customers every month. For your own testing, focus on the question types that best represent the researchers you're focused on.

Testing mostly simple questions with clear-cut answers is easiest, but if those question types don't represent what your users do most (they don't represent most AI usage in Westlaw), then the results are not particularly helpful. Similarly, overly complex, extremely difficult, nuanced, or trick questions can be useful for testing the limits of a system, but they tend not to be very helpful for most real-world decision making.

3. Test a lot of questions.

In our own testing, we've found that small sets of questions are rarely representative of actual performance on a larger set. Large language models can generate different responses each time, even with identical inputs. Additionally, if responses are long and complex, graders may disagree, even when judging identical responses. For a quick general sense of direction, it's fine to test with a sample as small as 100 or so questions, but for comparing algorithms/LLMs against each other, we strongly recommend checking the results as you grade and testing until the measure of interest stabilizes. For example, if you are running a comparison between two systems to see which is preferred, you would test until the rate at which one system is preferred over the other stops changing dramatically with each new batch of questions. Another guide to the number of questions you should test is the confidence level and interval you want (see the next section).
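A minimal sketch of that stopping rule, assuming a hypothetical `run_batch()` that grades a fresh batch of questions and returns 1 for each question where system A was preferred over system B, else 0:

```python
def test_until_stable(run_batch, batch_size=50, tol=0.01, max_batches=40):
    """Add batches until the preference rate stops moving by more than tol."""
    wins = total = 0
    prev_rate = None
    for _ in range(max_batches):
        results = run_batch(batch_size)      # list of 0/1 preference grades
        wins += sum(results)
        total += len(results)
        rate = wins / total
        if prev_rate is not None and abs(rate - prev_rate) < tol:
            return rate, total               # measure has stabilized
        prev_rate = rate
    return wins / total, total               # budget exhausted; report anyway
```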

4. Calculate and report confidence levels and intervals.

Even with a relatively large set of questions, measurements of accuracy are only so precise. When using these measurements to make decisions, it's important to understand the degree or range of accuracy of the measurement, often referred to as confidence level and confidence interval. You can think of confidence intervals and levels like the margin of error in surveys: they tell you how reliable or repeatable the measurement is expected to be.

For instance, testing AI accuracy based on 200 questions, if you ran the test again with the same questions/answers but different evaluators, or used the same evaluators but with a different 200-question random, representative sample, would you expect the exact same result? Typically, you wouldn't. You'd expect the result to fall within a certain range, so it's important to report that range along with the results so decision makers understand which differences between algorithms/LLMs are meaningful and which are not. The proper way to report this is with confidence intervals and levels. Using standard assumptions, when measuring an error rate of 10% from a sample of only 100 questions, you can be about 95% confident that the true error rate is between 4.1% and 15.9%. This is called a 95% confidence level, and the "+/- 5.9%" is the margin of error. If you measure an error rate of 10% from a sample of 500 questions, the 95% confidence interval would be between 7.4% and 12.6%, or 10% +/- 2.6%.
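These figures can be reproduced with the usual normal approximation for a binomial proportion, margin = z * sqrt(p(1 - p)/n) with z of about 1.96 at the 95% level; a minimal sketch:

```python
from math import sqrt

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a measured proportion p over n questions."""
    return z * sqrt(p * (1 - p) / n)

for n in (100, 500):
    print(f"n={n}: 10% +/- {margin_of_error(0.10, n):.1%}")
# n=100: 10% +/- 5.9%    n=500: 10% +/- 2.6%
```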

The basic power analysis used to estimate a confidence interval assumes a perfect means of detecting the outcome you are trying to measure. If there is some uncertainty in that detection (e.g., if two independent evaluators disagree about the outcome some percentage of the time), then the margin of error increases. In our example above with 100 questions, a grading process or measurement that's unreliable roughly 5% of the time might increase the margin of error from 5.9% to 7.3%. Note that there are various methods for calculating standard error, and these examples make simplifying assumptions that likely underestimate the confidence intervals observed in practice.
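One simple way to reproduce the 7.3% figure is to treat grader unreliability as an extra binomial variance term added to the sampling variance; this is a rough model that assumes the two noise sources are independent:

```python
from math import sqrt

def margin_with_grading_noise(p, n, q, z=1.96):
    """Margin of error when the grading itself is unreliable a fraction q of the time."""
    return z * sqrt(p * (1 - p) / n + q * (1 - q) / n)

print(f"{margin_with_grading_noise(0.10, 100, 0.05):.1%}")  # ~7.3%, up from 5.9%
```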

5. Use a combination of automated and manual evaluation efforts.

Having human evaluators pore through lengthy answers to complex questions can be difficult and time-consuming. Ideally, we would just have AI evaluate the accuracy and quality of answers generated by AI. This is sometimes referred to as LLM-as-judge. But in the same way that AI makes mistakes when generating an answer, it can also make mistakes when evaluating the quality of an answer against a gold-standard answer written by a human. In our experience, modern LLMs are pretty good at evaluating AI-generated answers against gold-standard answers when answers are clear and relatively short. As answers grow in length and complexity, we've found the LLM-as-judge approach to be very unreliable.

For instance, research has shown that LLMs tend to struggle when evaluating responses to complex and challenging questions, such as those requiring expert knowledge, reasoning, and math.

Since most test sets will contain a sample of simple/easy/clear questions and answers, it makes sense to use AI for automated evaluation of these, then use human evaluators for the rest, at least until AI improves to the point where more can be automated.
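A sketch of that routing, assuming a hypothetical `llm_judge()` grader and a human review queue, with answer length as a crude stand-in for clarity and complexity:

```python
def route_for_evaluation(answer, gold, llm_judge, human_queue, max_words=150):
    """Automate grading of short, clear answers; queue the rest for humans."""
    if len(answer.split()) <= max_words and len(gold.split()) <= max_words:
        return llm_judge(answer, gold)       # automated grade for the easy cases
    human_queue.append((answer, gold))       # long/complex: human evaluators
    return None
```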

6. For human grading, use two separate human evaluators for each answer, and have a third (ideally more experienced) evaluator to resolve conflicts.

For assessments like these, inter-rater disagreement can be a real issue. In our own testing, we've found attorneys evaluating AI-generated answers to more complex legal research questions can disagree about the accuracy or quality of answers about 25% of the time, which makes single-grader evaluation unreliable. To improve reliability, we have two evaluators separately grade each answer, and where there are conflicts, a third, more experienced evaluator resolves the conflict.
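In code, the workflow is simple; a minimal sketch, assuming each grader returns a verdict string and `senior_grader` is called only on conflicts:

```python
def resolve_grade(grade_a, grade_b, senior_grader, answer):
    """Two independent grades per answer; a senior grader breaks conflicts."""
    if grade_a == grade_b:
        return grade_a                       # graders agree: accept the verdict
    return senior_grader(answer)             # conflict: escalate to senior grader
```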

7. When answers are wrong, investigate to see if the gold-standard answer might be wrong.

In the same way people make mistakes in evaluating answers, they can also make mistakes in coming up with the gold-standard answer for testing. In our experience, we've found some instances where the AI-generated answer was evaluated as incorrect when compared to the gold-standard answer, but when we dug into it further, it turned out the AI was correct and the person who put together the gold-standard answer was wrong. Sometimes AI makes mistakes and sometimes humans make mistakes: you should check both.

8. If evaluating multiple algorithms/LLMs/solutions, make sure the evaluators are blind to which algorithm/LLM/solution the answer was generated by.

In our evaluations, we try to avoid human bias in grading. Sometimes an evaluator has had bad or great experiences with a certain product or LLM in the past, and we don't want them to bring that bias to the current evaluation. So when evaluating different solutions, we first strip away anything that would identify the source of the solution, ensuring results are not biased by past positive or negative experiences.

9. Grade the value of answers in addition to making a binary determination of whether the answer has an error.

What's right or wrong in an answer can vary enormously in terms of positive value and negative impact. For instance, consider the following answers:

A. Answer is correct in every way but is short and high level. It just gives a basic description of the legal issue as it relates to the question but doesn't provide any references to primary or secondary law for verification, nor any nuance regarding exceptions or other considerations.

B. Answer is lengthy and nuanced, addressing multiple aspects of the question and discussing important exceptions that might apply. It provides references with citations and links for verification, and it is correct in every way except that one citation has an incorrect date, which is easily verified and corrected by clicking the link in the citation.

C. Answer is incorrect in every way, and all its linked references point to primary law that simply corroborates the wrong answer.

If the evaluation is simply a binary view of the number of answers that contain an error, then answer A looks good and answers B and C look equally bad. In reality, answer C is far worse and more harmful than answer B, and answer B is likely much more valuable to the researcher than answer A.

In our evaluations, we're looking for answer attributes that are helpful to researchers, like depth of the answer and quality of the references, and we don't just evaluate errors in a binary way. We consider answers that are totally wrong to be far worse than answers with erroneous statements embedded in otherwise correct and helpful answers. Similarly, we weigh erroneous statements based on whether they address the core question or are tangential to it, and whether they're contradicted within the answer or easily verified with the linked references. We'd like to eradicate all errors, of course, but some are more harmful than others.
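A sketch of what non-binary scoring might look like; the attribute names and weights below are illustrative assumptions, chosen so that a totally wrong answer (C) scores far below a deep answer with a minor, easily verified slip (B):

```python
# Illustrative severity weights: tangential or self-contradicted errors are
# penalized lightly; answers that are entirely wrong are penalized heavily.
SEVERITY = {"none": 0.0, "tangential": 0.2, "contradicted_in_answer": 0.3,
            "core": 1.0, "entirely_wrong": 3.0}

def answer_score(depth, reference_quality, error_type="none"):
    """Value of the answer (0..1 ratings) minus the harm of its worst error."""
    return depth + reference_quality - SEVERITY[error_type]

print(answer_score(0.2, 0.0))                    # A: correct but shallow ->  0.2
print(answer_score(0.9, 0.9, "tangential"))      # B: deep, minor slip    ->  1.6
print(answer_score(0.5, 0.0, "entirely_wrong"))  # C: wholly wrong        -> -2.5
```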

10. Look for errors beyond gold-standard answers.

Often LLMs generate answers with information beyond the scope of a gold-standard answer. For instance, the gold-standard answer might specify that the correct answer to the question is no, that the explanation should cover X, Y, and Z, and that it should specifically cite cases A and B and statute C.

The LLM-generated answer might state the answer is no and explain X, Y, and Z with references to A, B, and C, but it might also add a few statements about exceptions or related issues or an additional case or statute. Sometimes these additional statements are incorrect, even when everything else is correct. So, if an LLM-as-judge or human evaluator only looks at the gold-standard answer to see if the AI-generated answer is correct, that evaluation can miss errors in the additional material. This means evaluators need to do independent research beyond simply looking at the gold-standard answers to determine if an answer has an error.
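A sketch of separating in-scope from out-of-scope statements so that the extra material gets independently researched; `is_supported_by()` stands in for an LLM call or a human judgment and is an assumption, not a real API:

```python
def statements_needing_research(answer, gold_answer, is_supported_by):
    """Split an answer into statements and flag those the gold answer doesn't cover."""
    statements = [s.strip() for s in answer.split(".") if s.strip()]
    return [s for s in statements
            if not is_supported_by(s, gold_answer)]  # verify these independently
```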

11. Consider testing reliability.

LLMs often have some randomness built into them. Many have a temperature setting that can be used to minimize or eliminate this, making answers more consistent when asking the same question multiple times.

But some LLMs are better at this than others, and some integrated solutions that use LLMs in conjunction with other techniques, like RAG, deliberately don't set the temperature low, in order to allow for more creativity in answers.

For big decisions, consider testing reliability by running the same question 20 times and seeing whether any of the answers are substantially worse than the other answers to the same question.
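A sketch of such a reliability probe, assuming a hypothetical `ask()` that queries the system and a `normalize()` that reduces answers to a comparable form (for long, complex answers, a grader or LLM judge would replace simple normalization):

```python
from collections import Counter

def reliability_probe(ask, normalize, question, runs=20):
    """Ask the same question repeatedly and report answer consistency."""
    answers = [normalize(ask(question)) for _ in range(runs)]
    counts = Counter(answers)
    _, freq = counts.most_common(1)[0]
    return freq / runs, counts    # consistency rate and the full distribution
```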

The above are our insights and learnings from extensive experience with AI, generative AI, and LLMs over the past 30 years. At Thomson Reuters, we put the customer at the heart of each of these decisions we make, and we are transparent that, at the point of use, all our AI-generated answers must be checked by a human.

As we work through testing our AI products, our teams do not follow each of these steps for every test; sometimes we prioritize speed of testing over rigor, or vice versa, but we make sure we clearly understand the trade-offs in prioritizing some of these steps and communicate them across our teams. The bigger and more important the decision we're trying to make, the more of these steps we follow.

This is a guest post from Mike Dahn, head of Westlaw Product, and Dasha Herrmannova, senior applied scientist, from Thomson Reuters.

Quick Check Mischaracterization Identification: New Westlaw Enhancement Furthers the Thomson Reuters Generative AI Vision
/en-us/posts/innovation/quick-check-mischaracterization-identification-new-westlaw-enhancement-furthers-the-thomson-reuters-generative-ai-vision/
Tue, 22 Oct 2024 13:19:05 +0000

Thomson Reuters recently announced deeper integration of CoCounsel 2.0 in Westlaw and Practical Law, as well as new generative AI research features, Mischaracterization Identification in Quick Check and AI Jurisdictional Surveys, that are saving customers significant time and helping them ensure the accuracy of their research. The enhancements build on the Thomson Reuters vision to deliver a comprehensive GenAI assistant for every professional it serves.

Below, CJ Lechtenberg, senior director, Westlaw Product Management, Thomson Reuters, shares her insights on developing Mischaracterization Identification, a generative AI capability to help detect mischaracterizations and omissions in legal briefs.

In the five years since Quick Check was introduced, you've added many enhancements, including Quick Check Contrary Authority Identification, Quick Check Judicial, and Quick Check Quotation Analysis. How did integrating generative AI make the Mischaracterization Identification enhancement different from previous ones?

Lechtenberg: This enhancement takes researchers beyond knowing what might be a potential mischaracterization to an explanation of why something might be a potential mischaracterization, and that is radically different from any feature we've deployed in Quick Check before.

I'm sure it'll come as no surprise when I say that generative AI is just a completely different beast. Lay people may think about the law as being black and white: you can do this; you can't do that. But legal professionals know that the law is really a sea of varying shades of gray. With machine learning, we wrestled with how we could ever give the machine enough data to figure out all the different ways an attorney may mischaracterize the law.

In Quick Check Quotation Analysis prior to the Mischaracterization Identification enhancement, we highlighted the actual textual differences (additions, omissions, and changes) in the quotations and showed the context around the quotes. Doing so certainly saved researchers a significant amount of time and helped them spot issues they might not otherwise find, but the onus was still on researchers to review everything and determine what the precise differences were and how material they might be, if at all. Even with the additional context provided, it could still be difficult to determine whether the quotations were taken out of context, especially if the quotes themselves didn't appear to be different.

In developing Mischaracterization Identification, we recognized that the task of analyzing quotations and their context is so nuanced that attorneys will have different expectations for whether a mischaracterization occurred, so we needed to provide more than just categorizations. We found that large language models (LLMs) can generate nuanced descriptions of potential mischaracterizations, versus just explicit categorizations, and do it well, which is hugely beneficial for this type of task.

How will using Mischaracterization Identification give legal professionals and law firms a competitive advantage? How will judges using it benefit?

Lechtenberg: The advantages of using the new Mischaracterization Identification are substantial for both legal professionals and the judiciary, in terms of both speed of review and quality of work product. When we launched Quick Check Quotation Analysis in 2020, customers, both legal professionals and the judiciary, lamented how time-consuming it is to review quotations and how challenging it is to spot differences. It is a mentally taxing task, and often our brains fill in the blanks, interpreting what we think a brief should say but actually doesn't. Attorneys never have a surplus of time, so the last thing they want to do is spend the little bit they have on the most tedious of tasks and still end up missing potential problems.

For attorneys, Mischaracterization Identification will help them efficiently and accurately make contextual misstatement and omission determinations for their opponents' and their own quotations and the context surrounding those quotations. The fear of missing their own mistakes is very real for attorneys, but the possibility of missing the opportunity to capitalize on their opponents' mistakes is an even larger concern. This new enhancement reduces both of those worries and will help attorneys be even better advocates for their clients.

Judges will also be able to effectively review the filings of parties in matters before them much faster. Attorneys owe a duty of candor to the judiciary, and the Mischaracterization Identification feature will help flag any potential issues quickly. An added benefit, which members of the judiciary or their staff perhaps haven't considered, is the ability to analyze their own orders and opinions to ensure that they haven't made mistakes that could be appealed. This new enhancement will help alert judges and law clerks to potential issues before they finalize their opinions.

What early feedback are you hearing from customers?

Lechtenberg: In a recent survey, 93% of law firm professionals told us they've seen opposing counsel misuse a quotation, 66% said they've seen misrepresentations by an associate or colleague, and 65% of corporate respondents said they check the accuracy of outside counsel's quotations. The need to review opposing counsels' and colleagues' briefs for mischaracterizations of the law is still a very real issue for attorneys. Likewise, attorneys have said they're always concerned about the accuracy of their work and that maintaining their reputation as a credible litigator with courts and opposing counsel is incredibly important.

Customers are extremely excited about this new Quick Check enhancement to help combat these concerns, and we've received positive feedback from them. One law firm managing partner stated that they would use this tool a lot. They cite-check their opponents' briefs, so any shortcuts are beneficial to them. They recognize that most of the time, errors are harmless, but occasionally there are things they want to bring to the court's attention, and this feature will help them spot those issues more quickly and accurately.

Another law firm partner said this new feature is the "ultimate security blanket" because everything attorneys do is based on their credibility, and this feature alerting them to quotes being taken out of context before filing with the court would calm some of those fears.

Any surprising or unexpected moments as the team worked on developing or launching Mischaracterization Identification?

Lechtenberg: The fact that we've accomplished this now with the use of LLMs is exciting, a little surprising and a long time coming. I'm an attorney who leads a team of attorneys; we're literally trained to question everything and have a healthy dose of skepticism. But I have been dreaming about a mischaracterization identification feature in Quick Check ever since we developed Quotation Analysis more than five years ago. At my core, I believed someday this could be achieved, but for years traditional machine learning approaches were just not powerful or nuanced enough to do it well.

Leveraging LLMs for a use case like this is a new frontier like we've never seen before. The LLM's ability to analyze text from an uploaded document and compare that text to the text from the cited case used to support the argument and then go beyond highlighting textual differences and provide an actual explanation of what may be problematic — whether that's a selective quote, omitted context or a misinterpreted holding — has been absolutely astounding.

What's the one thing you want everyone to know about Mischaracterization Identification?

Lechtenberg: Mischaracterization Identification will not only help researchers spot contextual misstatements and omissions in their opponents' or their own quotations and contextual statements faster and with more accuracy, but most importantly it will help them understand why those misstatements or omissions may be problematic. And, spoiler alert: Mischaracterization Identification is just the beginning of how Thomson Reuters will harness the power of generative AI in Quick Check to solve important customer problems.

For more on Mischaracterization Identification, read the press release or check out the blog post by Mike Dahn, head of Westlaw Product Management, Thomson Reuters.

The Transformative Role of AI in Professional Tools: A Conversation With David Wong and Leann Blanchfield
/en-us/posts/innovation/the-transformative-role-of-ai-in-professional-tools-a-conversation-with-david-wong-and-leann-blanchfield/
Wed, 02 Oct 2024 13:33:39 +0000

Leann Blanchfield, head of Editorial, Thomson Reuters, said now is the most exciting time in her 30+ years with the company.

In the latest TechConnect episode, Blanchfield shared how the power of generative AI, and the dramatic leap it's making in how professionals across industries can access large quantities of data, is transforming the legal industry and beyond. Blanchfield credits the more than 1,500 attorney editors on her team, who create and enhance content, with harnessing the power of generative AI for legal research.

Human expertise is just one component of how Thomson Reuters is capitalizing on the potential of generative AI. Three elements are critical, as David Wong, chief product officer, Thomson Reuters, noted in his comments about the launch of CoCounsel 2.0 at ILTACON: "We have the data, the expertise, and the tech. Few have all three in such quantity and depth."

In the new episode, Wong focused on the role of human domain experts, noting they're key to the process of creating and validating data used by AI models for professional research.

鈥淭here’s a lot of both prompt engineering, fine tuning, and system refinement that’s necessary to get quality to a usable spot,鈥 Wong said. 鈥淓xperts, experienced researchers and experienced lawyers can help to gauge whether or not the systems are correct. We couldn’t have an objective, quantified measure of quality on these systems without the editors, without those experts.鈥

Wong and Blanchfield discussed the importance of human experts in ensuring the accuracy and reliability of AI.

"Maintaining accuracy is at the heart of what the editorial team does," Blanchfield said. "It's the number-one priority across every editorial team. We maintain our content to be accurate and trusted."

Wong acknowledged it's challenging for the team to process and update unstructured, constantly changing data in real time. He said that Thomson Reuters ensures its AI models are customized to meet the varying needs of different jurisdictions through a combination of software and algorithms that take advantage of the LLMs.

"So when you ask a question, for example, we are running an end-to-end algorithm which runs search, retrieves data, re-ranks, interprets and then ultimately passes that information to a large language model to synthesize and produce the answer," Wong said. "It's a very complicated system which involves multiple types of technology, multiple types of information retrieval. Processing unstructured, dynamic data and customizing AI models requires integrating multiple technologies and algorithms to optimize performance."

Hear more of Wong and Blanchfield's insights on integrating AI into professional tools and ensuring that information is trustworthy in the latest episode of the TechConnect series, which brings diverse and dynamic perspectives from all corners of the technology world with thought-provoking questions and conversation.

How Harmful Are Errors in AI Research Results?
/en-us/posts/innovation/how-harmful-are-errors-in-ai-research-results/
Fri, 02 Aug 2024 14:19:28 +0000

AI and large language models have proven to be powerful tools for legal professionals. Our customers are seeing gains in efficiency and tell us the technology is greatly beneficial. There has been a lot of discussion lately about errors and hallucinations, but what hasn't been discussed is the extent of harm that comes from errors, or the benefit that answers containing an error can still provide.

First, let's settle on terminology. We should use terms like "errors" or "inaccuracies" instead of "hallucinations." "Hallucination" sounds smart, like we're AI insiders and know the lingo, but the term is often defined narrowly as a fabrication, which is just one type of error. Customers will be as concerned, if not more concerned, about non-fabricated statements from non-fabricated cases that, despite being real, are still incorrect for the question. "Errors" or "inaccuracies" are much better and more encompassing ways to describe the full range of problems we care about.

Next, let's consider types of errors and the risk of harm from each. Error rates are often reported as a single percentage, which reflects a binary view (either an answer has an error or it does not), but that's overly simplistic. It conflates the big differences in risk of harm from different types of errors and ignores the potential benefit of lengthy and nuanced answers that contain a minor error.

There are dozens of ways to categorize errors in LLM-generated answers, but we've found three to be most helpful:

  1. Incorrect references in otherwise correct answers
  2. Incorrect statements in otherwise correct answers
  3. Answers that are entirely incorrect

A fourth category of error that sometimes comes up in discussions with customers is about inconsistency, where the system provides a correct answer one time, then later, when the same exact question is submitted, the answer is different and sometimes less complete or incorrect. Minor differences in wording are very common when submitting the same question. Substantial differences are uncommon, but when they do result in an error, the error simply falls into one of the three categories above.

Incorrect references refer to situations where an answer is correct, but a footnote reference provided for a statement of law does not stand for the precise proposition of the statement. Fortunately, the risk of harm with these types of errors appears to be low, since they're easy to detect when researchers review the primary law cited. Answers with these types of errors still offer substantial benefit to researchers because they get them to the right answer quickly, often with a lot of nuance about the issues, but the researcher still has to use additional searches or other research techniques to find the best source material.

Incorrect statements in otherwise correct answers are often obvious in the answer. An answer might say the law is X in paragraphs 1-4, inexplicably declare the law is Y in paragraph 5, then go back to stating the law is X in paragraph 6. The risk of harm with these errors also appears to be low, since the inconsistency is obvious and prompts the researcher to dig into the primary law to figure it out. Answers with these types of errors still offer some benefit, since they point the user to highly relevant primary law, explain the issues, and help the researcher know what to look for when reviewing primary law.

Answers that are entirely wrong are more problematic. These are quite rare in our testing, but they do occur. Often a simple check of the primary sources cited will resolve the error quickly, but sometimes additional research is needed beyond that. These answers still offer some benefit to researchers, since they often point to relevant primary law in a way that is more effective and useful than traditional searching, but they also come with greater risk of harm, since the incorrectness of the answer is not obvious, and simply reviewing cited sources does not always resolve the issue.

These sound scary, but researchers have been dealing with this type of issue for ages. For instance, secondary sources can be incredibly helpful for summarizing complex areas of law and offering insights, but they sometimes fail to discuss important nuance, and sometimes the law has changed since they were written. If researchers relied on them alone, without doing further research, they would be at risk of harm, even if they consulted cited primary sources.

Yet we would never tell researchers to avoid using secondary sources because they can sometimes be beautifully written, very convincing, and utterly wrong. What we tell researchers is that they can be enormously helpful for research but must be used as part of a sound research process where primary law is reviewed, and tools like KeyCite, Key Numbers, and statute annotations are used to make sure the researcher has a complete understanding of the law.

Individual research tools have rarely been perfect. Their value has been in improving sound research practices. Stephen Embry captured this idea well in a recent blog post:

"The point is not whether Gen AI can provide perfect answers. It's whether, given the speed and efficiency of using the tools and their error rates compared to those of humans, we can develop mitigation strategies that reduce errors. That's what we do with humans. (I.E. read the cases before you cite them, please)."

But if you must check primary resources and engage in sound research practices when using a research tool, is there really any benefit to using it? If it improves overall research times or helps surface important nuance that might otherwise be missed, the answer is yes.

Prior to launching AI-Assisted Research, we knew large language models would not produce error-free answers 100% of the time, so we asked attorneys whether the tool would be valuable even with an occasional error, and whether we should release it now or wait until it was perfect.

Most of the attorneys said, "I want this now." They saw clear benefits and thought an occasional error was worth it for the extraordinary benefits of the new tool, since they would easily uncover an error when reading through primary law. They said that if they knew the answers were generated by AI, they would never trust them and would verify by checking primary sources. If there was an error, those primary sources (and further standard research checks, like looking at KeyCite flags, statute annotations, etc.) would reveal it. That's why we put AI in the name of this CoCounsel skill, so researchers would be encouraged to check primary sources.

Our customers have submitted over 1.5 million questions to AI-Assisted Research in Westlaw Precision. Generally, three big research benefits come up in discussions:

  1. It gives them a helpful overview before diving into primary sources.
  2. It uncovers sub-issues, related issues, or other nuances they might not have found as quickly with traditional approaches.
  3. It points them to the best primary sources for the question more quickly and efficiently than traditional methods of research.

Customers have described these benefits with great enthusiasm, telling us AI-Assisted Research "saves hours" and is a "game changer."

Lawyers know they need to rely on the law when writing a brief or advising a client, and the law lies in primary law documents (cases, statutes, regulations, etc.). Researchers have always known that when they're looking at something that is not a primary law document, such as a treatise section, a bar journal article, or an answer from AI, they must check the primary law before relying on it to advise a client or write a brief. That's why we cite to primary law in the answers and why we provide an even greater selection of relevant primary and secondary sources under the answers: to make this checking easy.

But what about the now-famous lawyer who filed a ChatGPT-drafted brief citing fabricated cases? That lawyer submitted his brief without ever reading any of the cases he was citing.

That can't be the standard for considering the value of products like Westlaw, which provide a rich set of research tools that make it easy to check primary sources, understand their validity, and find related material. If the standard were that a user might not read any of the primary law, many high-value research capabilities today would be deemed useless.

The way to dramatically reduce the risk of harm from LLM-based results or any other individual research tool, like secondary sources, is what it has always been: sound research practices.

Jean O'Grady conveyed this beautifully in a recent post:

"Does generative AI pose truly unique risks for legal research? In my opinion, there is no risk that could not be completely mitigated by the use of traditional legal research skills. The only real risk is lawyers losing the ability to read, comprehend and synthesize information from primary sources."

At Thomson Reuters, we're continuing to work on ways to reduce all types of errors in generative AI results, and we expect rapid improvement in the coming months. Because of the way large language models work, even with retrieval augmented generation, eliminating errors is difficult, and it's going to be quite some time before answers are completely free of errors. That's the bad news.

The good news is that harm from these types of errors can be reduced dramatically with common research practices. That's why we're not only investing in generative AI projects; we're also continuing to build out a full suite of research tools that help with the entire research process, because that process will continue to be important.

Even when errors get reduced to just 1%, that will still mean that 100% of answers need to be checked, and thorough research practices employed.

We're currently involved in two consortium efforts to provide benchmarking for generative AI products. When generative AI products for legal research are tested against these benchmarks, I expect we'll see the following:

  • None of the products will produce error-free answers 100% of the time.
  • All the products will require sound research practices, including checking primary law documents, to reduce risk of harm.
  • When sound research practices are employed, the risk of harm from errors in the answers is small and no different in magnitude from the risks we see with traditional research tools like secondary sources or Boolean search.

Even in the age of generative AI, sound research practices remain important and are here to stay. As Aravind Srinivas, CEO and cofounder of Perplexity, said:

"The journey doesn't end once you get an answer... the journey begins after you get an answer."

I think Aravind's statement applies perfectly to legal research and to the art of crafting legal arguments. Even as our teams strive to reduce errors further, we should keep in mind the benefits of generative AI and weigh them against the new and traditional risks of harm in tools that are less than perfect. When used as part of a thorough research process, these new tools offer tremendous benefits with very little risk of harm.

This is a guest post from Mike Dahn, head of Westlaw Product Management, Thomson Reuters.

Two years of unprecedented progress: Law firms deriving tangible value from Thomson Reuters AI
/en-us/posts/innovation/two-years-of-unprecedented-progress-law-firms-deriving-tangible-value-from-thomson-reuters-ai/
Tue, 02 Jul 2024 16:32:06 +0000

As we approach the two-year mark since we launched Westlaw Precision, the industry has seen unprecedented development, in many ways instigated by the launch of ChatGPT in November 2022. Customers and software developers alike have never experienced such exponential opportunity (and, some would argue, risk).

And here at Thomson Reuters, we haven't stood still; in fact, we have never moved faster!

Within this 24-month period, professionals no longer need to speculate about how AI could affect their work, because they now have a better sense of how it will, and in some cases already is.

For our customers, in November 2023 we launched AI-Assisted Research, which allows customers to ask complex legal research questions in natural language and quickly receive synthesized answers, with links to supporting authority from Westlaw content and links to further examine that authority. AI-Assisted Research streamlines the initial phase of legal research with sophisticated answers to questions and the authority those answers are based on, saving hours of work. Here is how one of our valued customers describes the solution:

"Because Thomson Reuters has the best case law database, lawyers can feel confident that the answer AI-Assisted Research is generating in response to our questions is well supported. The fact that AI-Assisted Research delivers all the resources it relied upon in coming up with the answer, right beneath the answer, amplifies the confidence we all can have in using the program to help with our research needs." Andrew Bedigian, Larson LLP

Since launch, 6,000 customers have run more than 1.5 million searches through AI-Assisted Research. The Thomson Reuters closed-loop LLM is trained on millions of terabytes of our trusted and verified content, rather than publicly available information, and this generates the most trusted and reliable answers on the market today.

"I did go through and compare the ChatGPT paid version as compared to this AI-Assisted Research. What I can tell you is there is a major difference in the libraries that Westlaw has versus any other program. There is no other program that has the secondary sources, the court orders, the appellate documents, the primary sources, every single thing that Westlaw offers, which is not only on point and published, you have the citations, there's a source of truth from where the information comes from and it's only as good as the prompts you give it and the parameters you put." Jesse Guth, owner, Guth Law Office

Our customers tell us each day what a critical tool AI-Assisted Research is for their legal research, both for those new to the profession and those experienced in the field. By design, our intuitive user experience guides customers to run follow-up research: AI-Assisted Research provides customers with a comprehensive answer that can be easily interrogated, linking to more sources for validation.

"At Blank Rome, we are committed to providing the highest levels of innovative client service. As part of this effort, last year we were excited to implement Westlaw Precision and Practical Law Dynamic AI capabilities for our attorneys, which has resulted in increased efficiencies and enhanced results." Frank Spadafino, chief information officer, Blank Rome

Over the years, Thomson Reuters has always been at the forefront of legal research innovation, helping customers reduce research times and ensure nothing important is missed. AI-Assisted Research is among the very best of these tools, and when it's used as intended, it offers enormous benefits with very little risk of harm. I strongly encourage you to try it yourself; you will find it's a powerful research tool you'll want to employ regularly in your research processes.

Thomson Reuters Launches Westlaw Edge UK with CoCounsel
/en-us/posts/innovation/thomson-reuters-launches-westlaw-edge-uk-with-cocounsel/
Mon, 22 Apr 2024 16:17:51 +0000

Thomson Reuters is expanding customers' access to AI-Assisted Research with the introduction of Westlaw Edge UK with CoCounsel. The first Thomson Reuters generative AI legal research offering in the UK, Westlaw Edge UK with CoCounsel streamlines the initial phase of legal research by allowing customers to ask complex questions in natural language and delivering synthesized answers with detailed insights from top results along with a list of key cases, legislation, and topics.

The UK rollout follows last year's launch of AI-Assisted Research in the US, which helps legal professionals quickly get answers to complex research questions.

"Westlaw Edge UK with CoCounsel will help users save time and work smarter by jumpstarting — not replacing — their current methods for legal research," said Andrew Buckley, vice president, Research and Commentary, Thomson Reuters. "AI-Assisted Research enables practitioners to ask a question in everyday language and get an answer grounded in the powerful combination of trusted, comprehensive Westlaw content and the latest in large language models. Working with increased efficiency is key for practitioners, who need the right research tools to be competitive as they navigate complex legal issues for their clients in a constantly evolving environment."

In addition to empowering customers with AI-Assisted Research, Westlaw Edge UK with CoCounsel gives customers access to an AI assistant, called CoCounsel. It's integrated with Westlaw Precision and Practical Law Dynamic Tool Set, and soon will be integrated with Document Intelligence and HighQ.

"Westlaw Edge UK with CoCounsel furthers our long-standing leadership in delivering the most sophisticated legal research solutions in the UK," Buckley said. "We launched Westlaw UK nearly 25 years ago, marking the globalization of Westlaw, and we introduced Westlaw Edge UK in 2020. Adding AI capabilities to the Westlaw Edge UK portfolio of solutions represents the next chapter in our rich history of legal innovation in the UK and using AI technology to help legal researchers be more efficient."

For more on Westlaw Edge UK with CoCounsel and additional new CoCounsel capabilities for legal and tax professionals, check out the news release.

Thomson Reuters Completes Acquisition of Casetext, Inc.
/en-us/posts/innovation/thomson-reuters-completes-acquisition-of-casetext-inc/
Thu, 17 Aug 2023 15:47:05 +0000

Thomson Reuters announced today that it has closed on its previously announced acquisition of Casetext, Inc., a provider of technology for legal professionals, for a purchase price of $650 million in cash.

Founded in 2013, Casetext uses advanced AI and machine learning to build technology for legal professionals, creating solutions that help them work more efficiently and provide higher-quality representation to more clients. Casetext's customers include more than 10,000 law firms and corporate legal departments. Its key products include CoCounsel, an AI legal assistant powered by GPT-4 that delivers document review, legal research memos, deposition preparation, and contract analysis in minutes.

The acquisition supports the Thomson Reuters "build, partner and buy" strategy to bring generative AI solutions to its customers and the company's efforts to redefine the future of professionals through applications of generative AI. Other recent developments include the Thomson Reuters commitment to invest $100 million-plus annually to integrate AI into its flagship content and technology solutions, as well as its work with Microsoft on a new plugin for Microsoft 365 Copilot, with the two companies collaborating on a legal drafting solution that leverages Westlaw, Practical Law and Document Intelligence. In July, Thomson Reuters also launched a beta program to pilot new generative AI capabilities in Westlaw with select customers.

For more details, read the press release.
