large language models Archives - Thomson Reuters Institute
https://blogs.thomsonreuters.com/en-us/innovation-topics/large-language-models/
Thomson Reuters Institute is a blog from Thomson Reuters, the intelligence, technology and human expertise you need to find trusted answers.

The Rise of Large Language Models in Automatic Evaluation: Why We Still Need Humans in the Loop
/en-us/posts/innovation/the-rise-of-large-language-models-in-automatic-evaluation-why-we-still-need-humans-in-the-loop/
Tue, 21 Jan 2025 17:23:20 +0000

In recent years, the field of Natural Language Processing (NLP) has seen remarkable advancements, primarily driven by the development of Large Language Models (LLMs) such as GPT-4, Gemini, and Llama. These models, with their astounding generation capabilities, have transformed a wide range of applications, from chatbots to content generation. One exciting and increasingly prevalent application of LLMs is the automatic evaluation of Natural Language Generation (NLG) tasks. However, while LLMs offer impressive potential for evaluating domain-specific tasks, a human in the loop remains essential.

The Emergence of LLMs in Automatic Evaluation

Traditional evaluation metrics in NLG primarily rely on comparing the generated text to reference texts using word-overlap measures. These metrics, while useful, often fall short of capturing the nuances of language quality, coherence, and relevance. For example, suppose we have a reference summary "The cat is on the mat." and a model-generated summary "A feline is resting on a rug." ROUGE considers only lexical overlap and cannot capture the semantic similarity between words or phrases, so these two summaries, despite having the same meaning, would score poorly due to low word overlap.
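To make the limitation concrete, here is a minimal sketch of a ROUGE-1 F1 score computed from raw unigram counts. Real evaluations typically use a dedicated package such as `rouge-score`; this toy version exists only to show why the paraphrase scores poorly:

```python
# Toy ROUGE-1 F1: unigram-overlap precision/recall between a reference
# and a candidate summary. This is exactly the weakness described above:
# it rewards shared words, not shared meaning.
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    ref = Counter(reference.lower().replace(".", "").split())
    cand = Counter(candidate.lower().replace(".", "").split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(rouge1_f1("The cat is on the mat.", "The cat is on the mat."))
# 1.0 for an identical summary
print(rouge1_f1("The cat is on the mat.", "A feline is resting on a rug."))
# ≈ 0.31, despite the identical meaning
```

Only "is" and "on" overlap between the two summaries, so the lexical score collapses even though a human would judge them equivalent.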

LLMs have demonstrated an exceptional understanding of language, context, and semantics, making them attractive candidates for evaluating generated text. They can assess factors like fluency, coherence, and even factual accuracy, which are crucial for more sophisticated and context-aware evaluations. For instance, LLMs can be fine-tuned to understand the specific jargon and style of a particular domain, such as medical or legal texts, making them great potential evaluators.

The Promise of LLMs in Automated Evaluation

The recent technological advancements of LLMs have encouraged the development of LLM-based evaluation methods in various tasks and systems. LLMs can offer several advantages in the evaluation of NLG tasks:

  • Context-Aware Evaluation: Unlike traditional metrics, LLMs can comprehend the context and generate evaluations that account for the subtleties and intricacies of human language.
  • Scalability: LLMs can evaluate large volumes of text quickly and consistently, offering scalability that human evaluators cannot match.
  • Reduced Subjectivity: Automated evaluation can minimize the subjective bias that human evaluators might introduce, leading to more consistent and objective assessments.

How to Use an LLM as an Evaluator

LLM-based evaluators are conceptually much simpler than traditional automatic evaluation methods. While traditional methods rely on predefined metrics and comparisons to reference datasets, LLM-based evaluators directly assess the generated text.

Figure 1 shows an overview of the LLM-based evaluation frameworks. To evaluate the quality of the text, you embed it into a prompt template that contains the evaluation criteria, then provide this prompt to an LLM. The LLM then analyzes the text based on the given criteria and provides feedback on its quality. This approach bypasses the need for extensive preprocessing and reference comparisons, making the evaluation process more straightforward and versatile.

Figure 2 illustrates the step-by-step workflow of evaluating summaries using clarity as a metric. A document and its corresponding summary are the inputs. The summary is embedded into a pre-formulated prompt template that includes detailed evaluation criteria, such as clarity defined on a 1-to-5 scale. This prompt is then provided to an LLM for automated analysis and evaluation. Based on the specified criteria, the LLM evaluator reviews the summary, assigns a score, and provides justification for the rating.
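As a rough illustration of the workflow in Figures 1 and 2, the sketch below embeds a document-summary pair into a clarity prompt. `call_llm` is a hypothetical placeholder for whatever chat-completion API is in use, and the criteria wording is illustrative, not an actual template:

```python
# Sketch of prompt-based evaluation: embed the text to be judged into a
# template that states the criteria and scale, then hand it to an LLM.
CLARITY_PROMPT = """You are evaluating the clarity of a summary.

Document:
{document}

Summary:
{summary}

Rate the summary's clarity on a scale of 1 (incomprehensible) to 5
(perfectly clear), then justify the score in one or two sentences.
Respond as: SCORE: <1-5> JUSTIFICATION: <text>"""

def build_clarity_prompt(document: str, summary: str) -> str:
    return CLARITY_PROMPT.format(document=document, summary=summary)

def evaluate_summary(document: str, summary: str, call_llm) -> str:
    # call_llm is any function that sends a prompt to an LLM and
    # returns its text response; the model does the actual judging.
    return call_llm(build_clarity_prompt(document, summary))
```

No reference summary or preprocessing is needed, which is what makes this approach more straightforward than overlap metrics.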

The Limitations of LLMs: Why Humans Are Still Indispensable

Despite the promising capabilities of LLMs, there are significant limitations that necessitate the continued involvement of human experts in the evaluation process, especially for domain-specific tasks:

  • Lacking Specialized Domain Knowledge: Domain-specific tasks often involve complex knowledge. LLMs are typically trained as general-purpose assistants and still lack specialized domain knowledge. In contrast, subject matter experts bring in-depth domain knowledge gained through years of dedicated training and education.
  • Evolving Knowledge: Especially in fast-evolving fields like medicine and technology, staying up-to-date with the latest information is challenging for static models. Human experts, however, continuously learn and adapt to new knowledge and standards.
  • Handling Ambiguities: In specialized domains, the language can be highly ambiguous and complex, and the ability to disambiguate based on deep contextual knowledge is something LLMs still struggle with.
  • Ethical and Bias Concerns: LLMs can inadvertently reinforce biases present in their training data. Human oversight is crucial to identify and mitigate these biases, ensuring fair and ethical evaluations.

The Human-in-the-Loop Model: Best of Both Worlds

To harness the strengths of LLMs while addressing their limitations, a human-in-the-loop approach is essential. This combines the efficiency and scalability of LLMs with the expertise and judgment of human evaluators:

  • Initial Screening: LLMs can perform initial screenings and provide preliminary evaluations, identifying clear cases of high or low quality.
  • Expert Review: Human experts then review and refine these evaluations, focusing on cases that require nuanced understanding or where the LLM's assessment is inadequate.
  • Continuous Feedback Loop: Feedback from human evaluators can be used to fine-tune and improve LLMs, creating a continuous improvement cycle.
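The triage implied by the first two steps might be sketched as follows, with illustrative thresholds on an assumed 1-to-5 scoring scale:

```python
# Human-in-the-loop triage sketch: the LLM screens every output first,
# and only the ambiguous middle band is routed to a human expert.
# The thresholds and scale here are illustrative assumptions.
def route_for_review(llm_score: float, low: float = 2.0, high: float = 4.0) -> str:
    """Decide who makes the final call on one evaluated output."""
    if llm_score <= low:
        return "auto-fail"        # clearly low quality: LLM verdict stands
    if llm_score >= high:
        return "auto-pass"        # clearly high quality: LLM verdict stands
    return "expert-review"        # nuanced case: a human expert decides

assert route_for_review(1.0) == "auto-fail"
assert route_for_review(5.0) == "auto-pass"
assert route_for_review(3.2) == "expert-review"
```

The expert's rulings on the middle band can then feed the continuous feedback loop that fine-tunes the evaluator over time.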

Conclusion

The integration of LLMs into the automatic evaluation of NLG tasks marks a significant step forward in the field of NLP. However, for domain-specific evaluations, the complexity and nuance of human language still necessitate human experts. By adopting a human-in-the-loop approach, we can leverage the best of both worlds: the speed and scalability of LLMs and the depth and discernment of human evaluators. This constructive interaction ensures that we maintain high standards of accuracy, fairness, and relevance in evaluating natural language generation tasks, ultimately driving the field towards more sophisticated and reliable applications.

This post was written by Grace Lee, lead applied scientist at Thomson Reuters Labs (TR Labs).

Note: This work was done as part of Hossein A. (Seed) Rahmani's internship at Thomson Reuters Labs (TR Labs).

CoCounsel Drafting Is Now Available for the UK Market
/en-us/posts/innovation/cocounsel-drafting-is-now-available-for-the-uk-market/
Tue, 26 Nov 2024 08:32:05 +0000

I'm excited to share that Thomson Reuters has rolled out CoCounsel Drafting for UK legal professionals: an end-to-end drafting solution that streamlines and improves the drafting process within Microsoft Word. CoCounsel Drafting allows users to move easily and quickly through the phases of contract creation, and our latest version is tailored for the UK market.

The generative AI-driven solution enables UK legal professionals to find the best starting point from their own databases or Practical Law templates, use content from Practical Law alongside their legal department's or firm's contract repository to draft or revise clauses, leverage Practical Law contract playbooks, and correct common drafting errors. It uses AI to refine and review documents, producing more accurate, higher-quality work without leaving Microsoft Word.

Rawia Ashraf, Vice President of Product Management, Thomson Reuters

I recently shared a CoCounsel Drafting demo with Richard Tromans.

"The offering has been especially shaped for legal needs here and will directly engage with TR's contract data in Practical Law," Tromans said.

I highlighted CoCounsel Drafting capabilities and features for Tromans, and we discussed how "LLM-supported drafting could be truly transformative." We talked about how users will be able to use the large language model (LLM) to modify a clause, for example, and how CoCounsel Drafting uses generative AI grounded in retrieval-augmented generation (RAG) to access Practical Law content and notes.

We also discussed how CoCounsel Drafting delivers a level of detail similar to what a junior associate may produce when asked to draft a contract. I noted that CoCounsel Drafting allows customers to use their own playbook to draft a preferred set of terms for any contract type or use a Practical Law playbook for a range of popular contract types.

Finally, Tromans and I discussed integrations, including with HighQ and with document management systems (DMS) such as iManage, as well as security.

On the other side of the pond, U.S. customer feedback on CoCounsel Drafting, which we introduced earlier this year, has been overwhelmingly positive. I hope that UK legal professionals are equally pleased with the time savings and productivity increases CoCounsel Drafting offers.

As a former lawyer turned AI product leader, I've lived the pain of tedious legal tasks firsthand. That's why I'm passionate about harnessing generative AI to transform legal drafting. I know how much time and energy gets wasted on repetitive tasks, taking away from the work that really matters. Now, I'm excited to help lawyers reclaim that time and focus on what they love, whether that's helping clients, developing strategy, or simply enjoying a better work-life balance.

You can learn more about CoCounsel Drafting and check out Tromans' blog.

This is a guest post from Rawia Ashraf, vice president of Product Management, Thomson Reuters.

Tailoring Large Language Models for Professional-Grade Work
/en-us/posts/innovation/tailoring-large-language-models-for-professional-grade-work/
Thu, 14 Nov 2024 09:15:17 +0000

Data curation is crucial for training large language models (LLMs) to operate effectively, especially in professional settings. Generative AI tools like GPT-4 and other mass-market LLMs can get tripped up by nuanced or specialized tasks, such as navigating the intricacies of U.S. tax codes.

LLMs for professional-grade AI solutions must be tailored with the right mix of data sources and go through a rigorous data architecture process. For enterprise tasks, developers need specialized data plus the domain expertise to organize it in such a way that the eventual outputs will be helpful for end-user professionals. Developing a tool for accountants or tax attorneys, for example, involves gathering a wide array of tax codes, regulatory filings, legal interpretations, and more, as well as integrating and standardizing this data into a format that LLMs can digest.

As I recently shared, getting raw data to a place where it can be used to power a generative AI solution requires two steps: grounding and the human factor. Grounding is like giving an LLM a specialized education (analogous to an individual going from an undergraduate degree to law school) by augmenting it with use-case-specific information. Human experts, of course, are irreplaceable when it comes to domain expertise, which is essential for creating industry-specific LLMs.
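Grounding, as described here, amounts to putting use-case-specific material into the model's context before it answers. The sketch below illustrates the idea with a deliberately naive keyword-overlap retriever and a hypothetical corpus; production systems would use curated content and semantic search:

```python
# Toy grounding sketch: rank documents by keyword overlap with the query,
# then build a prompt that confines the model to that retrieved context.
def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    q = set(query.lower().split())
    scored = sorted(corpus,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def grounded_prompt(query: str, corpus: list[str]) -> str:
    context = "\n".join(retrieve(query, corpus))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

# Illustrative mini-corpus (not real curated content)
docs = [
    "Section 179 allows expensing of qualifying equipment purchases.",
    "The standard deduction for 2023 differs by filing status.",
    "Practical Law notes cover common contract clauses.",
]
print(grounded_prompt("What does Section 179 allow for equipment purchases", docs))
```

The point is the shape of the flow, not the retriever: the model answers from the specialized material it is handed rather than from its general training alone.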

Leading the Technology Services team of engineers at Thomson Reuters is an incredibly rewarding experience. We tackle the unique challenges of creating professional-grade AI solutions that meet the high standards of accuracy and reliability demanded by legal and tax professionals.

Our team is deeply committed to bridging the gap between cutting-edge technology and specialized domain knowledge. We understand that our work doesn’t just involve writing code or developing algorithms; it’s about empowering professionals with tools that enhance their expertise and efficiency. Guiding this team has shown me the impact that thoughtful, well-crafted AI can have in transforming the way professionals work, and it reinforces our dedication to continuous innovation in this space.

Check out my article to learn more about the need for data curation and data stewardship in developing professional-grade AI solutions.

This is a guest post from Noah Pruzek, head of Technology Systems, Thomson Reuters.

The Progressive Rise of Generative AI: A Conversation With David Wong and Joel Hron
/en-us/posts/innovation/the-progressive-rise-of-generative-ai-a-conversation-with-david-wong-and-joel-hron/
Wed, 30 Oct 2024 09:50:36 +0000

In honor of the one-year anniversary of the first episode of TechConnect, the latest episode highlights the progressive rise of generative AI in the past year.

"As fast as it started, it really feels like in the last year there's been an even more rapid acceleration, and many companies racing to become leaders in this field, including Thomson Reuters," said Joel Hron, chief technology officer, Thomson Reuters.

Hron and David Wong, chief product officer, Thomson Reuters, shared their takes on the most significant advances in generative AI technology, including improvements in accessibility to the technology, with more developer tools alongside reduced costs and more out-of-the-box capabilities.

Wong said he's most excited about large language models' ability to have longer context windows, enabling them to keep more information in their short-term memory and answer ever-more complex questions.

"That's critical for the way Thomson Reuters uses a lot of these models," Wong said.

"The agentic behaviors of the models have become more robust in their ability to plan and ability to use reason over complex information," Hron added.

They also discussed balancing the need to innovate and go fast with the need for ethical, responsible and high-quality AI development.

Wong noted how Thomson Reuters is best positioned to develop professional-grade AI, grounded in fact and data. He emphasized customers' need for measurable solutions, so they can discern tools' accuracy rates, as well as the need for security and privacy.

Wong said Thomson Reuters has the scale and infrastructure to understand customers' needs and develop solutions to solve their biggest challenges, guided by a philosophy and process that ensures the right balance between moving fast and ensuring quality.

Hron said the company鈥檚 human-centric approach to AI development is key.

"Our human expertise at Thomson Reuters and the level of rigor and quality we put behind both our content and our products for many years has really been a cornerstone of our brand," Hron said.

Hron said the iterations between technology and domain experts are crucial to how Thomson Reuters helps customers streamline their workflows with AI, such as with AI-Assisted Research on Westlaw Precision and CoCounsel Core.

They also highlighted the Thomson Reuters acquisition of Materia, an AI assistant and platform for accounting and auditing professionals.

"It's a reinforcement of our belief in AI assistants being in the hands of every professional and a reinforcement of our commitment around AI across our entire product portfolio," Hron said.

He added that Materia's strengths have included leaning into the long context and multimodal capabilities of generative AI as well as enabling agentic behavior.

Hear more of Wong and Hron's insights on Materia, as well as the evolution of generative AI, in this episode of the TechConnect series, which brings diverse and dynamic perspectives from all corners of the technology world with thought-provoking questions and conversation.

Legal AI Benchmarking: CoCounsel
/en-us/posts/innovation/legal-ai-benchmarking-cocounsel/
Wed, 23 Oct 2024 14:04:16 +0000

We're excited to share a detailed look into our testing program for CoCounsel, including specific methodologies for evaluating its skills. We aim not only to showcase the steps we take to ensure CoCounsel's reliability, but also to contribute to broader benchmarking efforts in the legal AI industry. Though it's challenging to establish universal benchmarks in such a diverse field, we're engaging with industry stakeholders to work toward the shared goal of elevating the reliability and transparency of AI tools for all legal professionals.

Why evaluating legal skills is complicated

Traditional legal benchmarks usually rely on multiple-choice, true/false, or short-answer formats for easy evaluation. But these methods aren't enough to assess the complex, open-ended tasks lawyers encounter daily and that large language model (LLM)-powered solutions like CoCounsel are built to perform.

CoCounsel's skills produce nuanced outputs that must meet multiple criteria, including factual accuracy, adherence to source documents, and logical consistency. These outputs are difficult to evaluate with true/false tests. On top of that, assessing the "correctness" of legal outputs can be subjective. For instance, some users prefer detailed summaries; others prefer concise ones. Neither is "wrong"; it comes down to preference, which makes it difficult to consistently automate evaluations.

To make it even more complicated, each CoCounsel skill often involves multiple components, with the LLM handling only the final stage of answer generation. For example, the Search a Database skill first uses various non-LLM-based search systems to retrieve relevant documents before the LLM synthesizes an answer. If the initial retrieval process is substandard, the LLM’s performance will be compromised. So, our evaluation must consider both LLM-based and non-LLM-based aspects, to make sure our assessment of the whole is accurate.
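One way to keep the LLM-based and non-LLM-based stages separable in an evaluation is to score retrieval on its own. Below is a sketch of that idea; the metric choice (recall@k) and document IDs are illustrative, not CoCounsel's actual instrumentation:

```python
# Because a skill like Search a Database chains retrieval and generation,
# a single end-to-end score can hide where a failure happened. Scoring the
# retrieval stage separately (here: recall against known relevant
# documents) tells you whether the LLM ever saw the right material.
def retrieval_recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of the relevant documents that the search stage returned."""
    if not relevant_ids:
        return 1.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(relevant_ids)

# If recall is poor, the generation stage never had a chance, and the
# LLM should not be blamed for the miss.
assert retrieval_recall(["d1", "d3", "d9"], {"d1", "d2"}) == 0.5
```

Grading the final answer then measures the LLM's synthesis on top of whatever the retriever actually delivered.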

How we benchmark

Our benchmarking process begins long before putting CoCounsel through its paces. Whenever a significant new LLM is released, we test it across a wide suite of public and private legal tests, such as the dataset created by our Stanford collaborators, to assess its aptitude for legal review and analysis. We then integrate the LLMs that perform well in these initial tests with the CoCounsel platform, in a staging environment, to evaluate how they perform under real-world conditions.

Then we use an automated platform to run a battery of test cases created by our Trust Team (more on this below), to evaluate the output that comes from this experimental integration. If the results are promising, we conduct additional manual reviews using a skilled team of attorneys. When we see an improvement in performance compared to previous benchmarks, then we start talking as a team about how it might improve the CoCounsel experience for our users.

How we test

Our Trust Team has been around as long as CoCounsel has. This group of experienced attorneys from diverse backgrounds (in-house counsel, large and small law firms, government, public policy) is dedicated to continually and rigorously testing CoCounsel performance.

We continue to follow a process that's been integral to all our performance evaluation since CoCounsel's inception: Our Trust Team creates tests representative of the real work attorneys use CoCounsel for and runs these tests against CoCounsel skills. When creating a test, they first consider what the skill is for and how it might be used, based on their own insights, customer feedback, and secondary sources. Once the test is created, the attorney tester manually completes the test task, just as a lawyer would, to create an answer key, which we refer to as an "ideal response." These tests and their corresponding ideal responses then undergo peer review. Being this meticulous is crucial, because the quality of our ideal responses determines the benchmark for a passing score.

Once the ideal response has been created, a member of the Trust Team runs the test, using the applicable CoCounsel skill to complete the task just as a user would. An attorney tester reviews the output, referred to as our "model response," and compares CoCounsel's response to the ideal response point by point, identifying differences and assessing whether those differences deviate from the ideal response in a way, or to a degree, that would make the skill's output incomplete, incorrect, or misleading. It is important to note that a test can be failed for many reasons, not just because an answer is outright wrong. For instance, some skills (e.g., Summarize) are inherently more subjective than others and do not lend themselves to a right or wrong answer. Accordingly, even though attorneys' opinions of the "correct" answer can reasonably vary, we try to maintain the highest grading standards and "fail" the test if it is missing a single detail the attorney tester considers to be Critical.

Based on this manual evaluation, we create a set of evaluation instructions tailored to the ideal output unique to each skill, capturing the needs of each skill so an LLM can match our human reviewers' process. To illustrate this, let's look at a real example using CoCounsel's Extract Contract Data skill. After running the test in our automated evaluation platform, here's what an attorney reviewer would see when reconciling the model response with the LLM's evaluation of it:

  • Question: If the customer is acquired through a merger, can the agreement be assigned to the acquiring entity without obtaining the provider's prior written consent?
    • Trust Team ideal response:
      • Short answer: Yes, with 30 days' prior written notice.
      • Detailed answer: Yes, if the customer is acquired through a merger, the agreement can be assigned to the acquiring entity without obtaining the provider's prior written consent; the customer is only required to provide the provider with 30 days' prior written notice. The clause specifies that either party may assign its rights or delegate its obligations without such consent, provided that 30 days' prior written notice is given to the other party, to an entity that acquires all or substantially all of the business or assets of such party, including by merger.
  • Model response:
    • Short answer: Yes, with notice.
    • Detailed answer: The agreement can be assigned to the acquiring entity without obtaining the provider's prior written consent if the customer is acquired through a merger. The clause provides an exception to the general rule requiring consent for assignment. Specifically, it states that either party may assign its rights or delegate its obligations without such consent to an entity that acquires all or substantially all of the business or assets of the party, including through a merger, provided that 30 days' prior written notice is given to the other party.
    • Automated evaluation analysis: The model response captures the correct information that the agreement can be assigned to the acquiring entity without obtaining the provider's prior written consent but requires 30 days' prior written notice. However, it incorrectly states the short answer as "Yes, with notice" instead of "Yes, with 30 days' prior written notice," as per the ideal answer. This is a minor difference and does not change the meaning significantly.

In this instance, the model response included a minor discrepancy from the attorney-authored ideal response. But the LLM's evaluation of the response accurately determined that the answer remained sufficient, because it captured the complete notice requirement elsewhere in the response.

Our ideal-response approach provides two key advantages over assertion-based evaluations. It excels at identifying deviations from attorney expectations, including hallucinations. And it pinpoints extraneous or inconsistent information that, while not technically a hallucination, could make even a complete response incorrect if that information introduces logical inconsistencies, which would result in a failing score.

We rely on our Trust Team to create well-defined ideal responses and auto-evaluation instructions and to determine if a test case passes or fails. A skill's output definitively fails if it falls short of this ideal because of material omissions, factual incorrectness, or hallucinations. However, we recognize that many legal issues aren't black-and-white, and the "correct" answer could be open to reasonable disagreement. To address this, we peer review ideal responses in cases when the answer might require a second opinion. And we might eliminate tests when we find insufficient agreement among the attorney testers. This is how we both ensure that our passing criteria remain rigorous and account for the nuanced nature of legal analysis.
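The pass/fail rule described here, failing a response that omits any detail marked Critical, can be sketched as follows. The data structure and the simple substring matching are illustrative stand-ins for the attorney-written evaluation instructions, which in practice are applied by an LLM under attorney oversight:

```python
# Sketch of ideal-response grading: a model response fails if any detail
# the attorney tester marked Critical is absent; minor details are noted
# but do not fail the test on their own.
from dataclasses import dataclass, field

@dataclass
class IdealResponse:
    critical_details: list[str]              # must all appear for a pass
    minor_details: list[str] = field(default_factory=list)

def grade(model_response: str, ideal: IdealResponse) -> str:
    text = model_response.lower()
    missing = [d for d in ideal.critical_details if d.lower() not in text]
    return "fail" if missing else "pass"

# Mirroring the Extract Contract Data example above:
ideal = IdealResponse(critical_details=["30 days", "prior written notice"])
assert grade("Assignment is allowed with 30 days' prior written notice.", ideal) == "pass"
assert grade("Assignment is allowed with notice.", ideal) == "fail"
```

Note how "Yes, with notice" alone would fail here; in the real example above, the response passed only because the full 30-day requirement appeared elsewhere in the detailed answer.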

Maintenance and improvement

Creating a skill test set is only the beginning. Once we begin using it, the Trust Team continually monitors and refines it by manually reviewing failure cases from the automated tests and spot-checking passing samples to make sure the automated evaluation is in line with human judgments. We also regularly add tests to cover more use cases and capture user-reported issues, which can lead to further iterations of the tests submitted for automated evaluation and their success criteria.

By following this process, we execute more than 1,500 tests every night on our automated platform, across all CoCounsel skills, under attorney oversight; combined with manual testing, that means we've run more than 1,000,000 tests since CoCounsel's launch. This empowers us to quickly identify areas for improvement, which is vital to ensuring CoCounsel remains the most trustworthy AI legal assistant available.

Conclusion

In an earlier post, we explored what it means for an AI tool to be "professional-grade" and why that standard is crucial for professionals in high-stakes fields like law. This post takes that concept further by diving into how we benchmark CoCounsel to ensure it meets those rigorous standards. By understanding the extensive testing that goes into evaluating its performance, you can see how CoCounsel consistently delivers the reliability and accuracy expected of a true professional-grade GenAI solution.

To promote the transparency my team and I believe is necessary in the legal AI field, we've decided to release some of our performance statistics for the first time, along with a sample of the tests used to arrive at those figures by applying the criteria referenced in this article. Check out our results.

This is a guest post from Jake Heller, head of CoCounsel, Thomson Reuters.

Quick Check Mischaracterization Identification: New Westlaw Enhancement Furthers the Thomson Reuters Generative AI Vision
/en-us/posts/innovation/quick-check-mischaracterization-identification-new-westlaw-enhancement-furthers-the-thomson-reuters-generative-ai-vision/
Tue, 22 Oct 2024 13:19:05 +0000

Thomson Reuters recently announced deeper integration of CoCounsel 2.0 in Westlaw and Practical Law, as well as new generative AI research features, Mischaracterization Identification in Quick Check and AI Jurisdictional Surveys, that are saving customers significant time and helping them ensure the accuracy of their research. The enhancements build on the Thomson Reuters vision to deliver a comprehensive GenAI assistant for every professional it serves.

Below, CJ Lechtenberg, senior director, Westlaw Product Management, Thomson Reuters, shares her insights on developing Mischaracterization Identification, a generative AI capability to help detect mischaracterizations and omissions in legal briefs.

In the five years since Quick Check was introduced, you've added many enhancements, including Quick Check Contrary Authority Identification, Quick Check Judicial, and Quick Check Quotation Analysis. How did integrating generative AI make the Mischaracterization Identification enhancement different from previous ones?

Lechtenberg: This enhancement takes researchers beyond the step of knowing what might be a potential mischaracterization to an explanation of why something might be a potential mischaracterization, and that is radically different from any feature we've deployed in Quick Check before.

I'm sure it'll come as no surprise when I say that generative AI is just a completely different beast. Lay people may think about the law as being black and white: you can do this; you can't do that. But legal professionals know that the law is really a sea of varying shades of gray. With machine learning, we wrestled with how we could ever give the machine enough data to figure out all the different ways an attorney may mischaracterize the law.

In Quick Check Quotation Analysis prior to the Mischaracterization Identification enhancement, we highlighted the actual textual differences (additions, omissions, and changes) in the quotations and showed the context around the quotes. Doing so certainly saved researchers a significant amount of time and helped them spot issues they might not otherwise find, but the onus was still on researchers to review everything and determine what the precise differences were and how material they might be, if at all. Even with the additional context provided, it could still be difficult to determine whether the quotations were taken out of context, especially if the quotes themselves didn't appear to be different.

In developing Mischaracterization Identification, we recognized that the task of analyzing quotations and their context is so nuanced that attorneys will have different expectations for whether a mischaracterization occurred, so we needed to provide more than just categorizations. We found that large language models (LLMs) can generate nuanced descriptions of potential mischaracterizations, versus just explicit categorizations, and do it well, which is hugely beneficial for this type of task.
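As an illustration of the kind of textual comparison described above (and emphatically not the actual Westlaw implementation), Python's standard-library difflib can surface words omitted from a quotation; deciding whether the omission is material is the nuanced judgment the LLM-generated descriptions address:

```python
# Illustrative sketch: find words present in the source passage but
# dropped from the quotation, the kind of omission Quick Check
# Quotation Analysis highlights for the researcher.
import difflib

source = "The court may in its discretion award fees to the prevailing party."
quoted = "The court may award fees to the prevailing party."

diff = list(difflib.ndiff(source.split(), quoted.split()))
omitted = [tok[2:] for tok in diff if tok.startswith("- ")]
print("Omitted from quotation:", " ".join(omitted))
# prints: Omitted from quotation: in its discretion
```

Dropping "in its discretion" makes a permissive fee award sound closer to mandatory, which is exactly the contextual materiality question that goes beyond mechanical diffing.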

How will using Mischaracterization Identification give legal professionals and law firms a competitive advantage? How will judges using it benefit?

Lechtenberg: The advantages of using the new Mischaracterization Identification are substantial for both legal professionals and the judiciary, in terms of both speed of review and quality of work product. When we launched Quick Check Quotation Analysis in 2020, customers, both legal professionals and the judiciary, lamented how time-consuming it is to review quotations and how challenging it is to spot differences. It is a mentally taxing task, and often our brains fill in the blanks, interpreting what we think a brief should say but actually doesn't. Attorneys never have a surplus of time, so the last thing they want to do is spend the little they have on the most tedious of tasks and still end up missing potential problems.

For attorneys, Mischaracterization Identification will help them efficiently and accurately make contextual misstatement and omission determinations for their opponents' and their own quotations and the context surrounding those quotations. The fear of missing their own mistakes is very real for attorneys, but the possibility of missing the opportunity to capitalize on their opponents' mistakes is an even larger concern. This new enhancement reduces both of those worries and will help attorneys be even better advocates for their clients.

Judges will also be able to review the filings of parties in matters before them much faster and more effectively. Attorneys owe a duty of candor to the judiciary, and the Mischaracterization Identification feature will help flag any potential issues quickly. An added benefit, which members of the judiciary or their staff perhaps haven't considered, is the ability to analyze their own orders and opinions to ensure that they haven't made mistakes that could be appealed. This new enhancement will help alert judges and law clerks to potential issues before they finalize their opinions.

What early feedback are you hearing from customers?

Lechtenberg: In a recent survey, 93% of law firm professionals told us they've seen opposing counsel misuse a quotation, 66% said they've seen misrepresentations by an associate or colleague, and 65% of corporate respondents said they check the accuracy of outside counsel's quotations. The need to review opposing counsels' and colleagues' briefs for mischaracterizations of the law is still a very real issue for attorneys. Likewise, attorneys have said they're always concerned about the accuracy of their work and that maintaining their reputation as a credible litigator with courts and opposing counsel is incredibly important.

Customers are extremely excited about this new Quick Check enhancement to help combat these concerns, and we've received positive feedback from them. One law firm managing partner stated that they would use this tool a lot. They cite-check their opponents' briefs, so any shortcuts are beneficial to them. They recognize that most of the time, errors are harmless, but occasionally there are things they want to bring to the court's attention, and this feature will help them spot those issues more quickly and accurately.

Another law firm partner said this new feature is the "ultimate security blanket" because everything attorneys do is based on their credibility, and this feature alerting them to quotes being taken out of context before filing with the court would calm some of those fears.

Any surprising or unexpected moments as the team worked on developing or launching Mischaracterization Identification?

Lechtenberg: The fact that we've accomplished this now with the use of LLMs is exciting, a little surprising, and a long time coming. I'm an attorney who leads a team of attorneys; we're literally trained to question everything and have a healthy dose of skepticism. But I have been dreaming about a mischaracterization identification feature in Quick Check ever since we developed Quotation Analysis more than five years ago. At my core, I believed someday this could be achieved, but for years traditional machine learning approaches were just not powerful or nuanced enough to do it well.

Leveraging LLMs for a use case like this is a new frontier like we've never seen before. The LLM's ability to analyze text from an uploaded document, compare that text to the text from the cited case used to support the argument, and then go beyond highlighting textual differences to provide an actual explanation of what may be problematic – whether that's a selective quote, omitted context, or a misinterpreted holding – has been absolutely astounding.

What's the one thing you want everyone to know about Mischaracterization Identification?

Lechtenberg: Mischaracterization Identification will not only help researchers spot contextual misstatements and omissions in their opponents' or their own quotations and contextual statements faster and with more accuracy, but most importantly it will help them understand why those misstatements or omissions may be problematic. And, spoiler alert: Mischaracterization Identification is just the beginning of how Thomson Reuters will harness the power of generative AI in Quick Check to solve important customer problems.

For more on Mischaracterization Identification, read the press release or check out the post by Mike Dahn, head of Westlaw Product Management, Thomson Reuters.

A Holistic Approach to Advancing Generative AI Solutions
/en-us/posts/innovation/a-holistic-approach-to-advancing-generative-ai-solutions/
Thu, 17 Oct 2024 11:42:10 +0000

At Thomson Reuters, our vision is to deliver an AI assistant for every professional we serve. As part of that, our focus is on delivering benefits for our customers across the breadth of our AI- and non-AI-powered features. Our solutions deliver those benefits in many ways, including AI-powered automation.

In April of this year, we shared our vision to provide a GenAI assistant for each professional we serve. CoCounsel embodies our ongoing efforts to augment professionals' work with GenAI skills, enabling professionals to accelerate and streamline entire workflows to increase efficiency, produce better work, and deliver more value for their clients. Our continued investment in GenAI aims to bring that same acceleration to professionals across industries through a single GenAI assistant.

We believe our investment in GenAI – along with our integration with customer data as well as third-party integrations – extends the value customers derive from CoCounsel beyond our connected experience and our verified and trusted content. Our work with Microsoft, for example, includes CoCounsel integrations across Word, Outlook, and Teams – meeting professionals where they're already working.

AI and large language models are proving to be powerful tools that deliver efficiency gains and strengthen research practices for our customers. Yet our efforts to redefine work with GenAI are rooted in our strong foundation of editorial enhancements, authoritative content, and technological expertise, alongside our long history of working closely with customers. That's why we continue to build out AI- and non-AI-powered solutions to help with the entire workflow for legal, tax, and risk and compliance professionals. While AI may not be perfect, it can significantly reduce the amount of routine work and help professionals manage more complex, substantive work more efficiently. We collaborate with our customers to help them understand that AI is an accelerant rather than a replacement for their own research.

Benchmarking expectations

As a leader in innovation and AI research, we recognize the role that independent benchmarking plays in ensuring the accuracy, transparency, and accountability of evolving GenAI solutions. We believe that benchmarking can improve both the development and the adoption of AI. We also see it as one component in a broad range of ways we consider and understand the benefits AI delivers for our customers. We work with our customers as their trusted partners for change, helping them confidently understand and adopt new technologies, looking at both their immediate value and their role in long-term transformation, and leveraging our deep understanding of their businesses.

At Thomson Reuters, our understanding of the holistic value of our products is based on customers' usage and the benefits they derive. Our customers have run more than 2.5M searches through AI-Assisted Research on Westlaw Precision since its launch late last year, and they tell us it's saving time and improving productivity. Similarly, internal testing of CoCounsel's skills has yielded impressive results, particularly with regard to CoCounsel's document review capabilities.

Our benchmarking support is reflected in our participation in studies including Vals.ai as well as two consortium efforts – from Stanford and Litig – exploring how to best evaluate legal AI. We are submitting CoCounsel AI skills to the Vals.ai benchmarking study in five areas of evaluation: Doc Q&A, Data Extraction, Document Summarization, Chronology Generation, and E-Discovery.

The Vals.ai study is a first attempt at establishing a standard, so we should view this work as a first iteration and an opportunity to learn rather than treat it as a gold standard. For example, one limitation of the benchmarking methodology is that each vendor's results are evaluated based on the text output alone, removed from the interface and experiences of the individual products. This discounts the work each vendor has done to design interfaces and safety features that minimize the harms of errors, and it reinforces the need for a holistic evaluation of each product being tested, ideally as designed for the user.

Looking ahead, my expectation is that, while accuracy will continue to improve, no products will produce answers entirely free of errors. And as we've shared with our customers, every AI product requires human expertise for verification and review – regardless of the accuracy rate. Because the current approach to benchmarking reports an accuracy percentage, we need to be very clear on this point: whether the product scores in the low or high 90s, all answers still must be checked 100% of the time.

I look forward to our ongoing collaboration with customers and industry partners as we continue our work toward minimizing inaccuracies and increasing the usefulness of research outcomes for GenAI tools and all our solutions.

Unlocking the full potential of professional-grade GenAI for your work
/en-us/posts/innovation/unlocking-the-full-potential-of-professional-grade-genai-for-your-work/
Tue, 15 Oct 2024 12:06:01 +0000

Today, nearly two years since ChatGPT debuted, GenAI continues to dominate our cultural and professional conversations. Even as its adoption for work steadily increases, the biggest concern for most professionals – 70% of them – is accuracy of output.

However, not using GenAI for work at all is a non-option. 77% of professionals believe AI will have a high or transformational impact on their work over the next five years, and 78% call AI a "force for good" in their profession. In fact, 50% of law firms named AI among their top five strategic priorities for the next 18 months. If there were still doubt, there definitely isn't anymore: GenAI is here to stay.

So how can conscientious – and forward-thinking – professionals make the most of this generational technology while guarding against its drawbacks? How do you know if the GenAI solution you're considering will live up to your professional obligation to work ethically and ensure your clients' data is securely handled? Is any GenAI product trustworthy? Is it even possible for tools built on large language models (LLMs) such as GPT-4o from OpenAI and Google's Gemini, all of which are known to hallucinate, to be safe enough to use professionally?

Yes, it is possible. "Built on" is the key. When we launched our GenAI assistant, CoCounsel, our product and engineering teams delivered on the challenge of creating a product that could take advantage of LLMs' tremendous raw power while eliminating as many as possible of the serious limitations – like hallucinations – that curb the professional utility of models used on their own. What makes the current generation of LLMs truly extraordinary, then, is not what they alone can do, but what they enable.

Using a model directly should be done with great caution; it exposes users to risk if they use the output professionally. CoCounsel, on the other hand, harnesses that power and has engineered robust, well-tested accuracy, privacy, and security controls around it. In short: LLMs are the world's most incredible engines, and CoCounsel uses them to take you incredible places – places you couldn't reach without these LLMs – safely.

Why can professionals trust CoCounsel?

We've applied our technical and domain expertise to leading LLMs in creating and continuing to optimize CoCounsel, a first-of-its-kind product that both does more than LLMs can and corrects the problems that make them unsuitable on their own.

In short, CoCounsel is a professional-grade GenAI assistant. And no professional should use a GenAI solution that isn't.

What does it take for a GenAI assistant to be professional-grade? At a bare minimum – without which it should not be trusted for your work – it must be:

  1. Built for domain-specific use and grounded in reliable sources of data relevant to that use. A professional-grade solution, such as CoCounsel, harnesses the power of LLMs but limits the source of knowledge to known, reliable data sources – such as profession-specific domains or professionals' or their clients' databases – which sharply reduces the possibility of inaccuracies.
  2. Built to make verifying its output easy. CoCounsel was not designed to replace the role of the professional, but rather to help them accomplish more and higher-quality work in less time. So just as lawyers review all work delegated to a junior associate or paralegal, they must validate CoCounsel's output. We've made it easy to do so: all answers link to their origin in the source documents, so it's simple to "trust, but verify."
  3. Developed by technical teams with deep GenAI expertise. Though GenAI has only been broadly talked about since 2022, it's been around since 1961. Thomson Reuters AI engineers and research teams have worked with LLMs since their invention, were among the first to build with GPT-4, and have invented patented approaches to applying LLMs to professional use cases.
  4. Continually and consistently tested and authenticated by a dedicated team of domain experts. Thomson Reuters AI engineers and Trust Team attorneys together filter, rank, and score CoCounsel's responses to a daily battery of thousands of tests developed to simulate real-life legal use cases and ensure the assistant's answers are consistent and accurate. To date we've run more than 1,000,000 such tests against CoCounsel.
  5. Secure and private, because it interacts with third-party LLMs the right way. Thomson Reuters GenAI solutions access third-party LLMs through dedicated, private servers and through an "eyes off" API. No LLM partner employees can see customer queries or documents, and our LLM access is contractually "zero retention": our LLM partners cannot store customer data longer than it takes to process the request. Our product data is never used to train any third-party models. And all product data is encrypted in transit and at rest, and subject to Thomson Reuters' rigorous security policies and practices.
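The daily test battery described in point 4 can be sketched, in spirit, as a regression harness that scores each response against facts a good answer must contain and routes anything imperfect to expert review. All names, the scoring rule, and the sample cases below are simplified stand-ins, not the actual Thomson Reuters harness:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One simulated use case: a prompt plus facts a good answer must contain."""
    prompt: str
    required_facts: list

def score_response(case, response):
    """Fraction of required facts present in the response (0.0 to 1.0)."""
    hits = sum(fact.lower() in response.lower() for fact in case.required_facts)
    return hits / len(case.required_facts)

def run_battery(cases, generate):
    """Score every case; anything short of a perfect score goes to expert review."""
    results = [(case, score_response(case, generate(case.prompt))) for case in cases]
    flagged = [case for case, score in results if score < 1.0]
    return results, flagged

# Toy run: a canned lookup stands in for the real assistant.
cases = [EvalCase("What is the filing deadline?", ["30 days"]),
         EvalCase("Who bears the burden of proof?", ["plaintiff"])]
canned = {"What is the filing deadline?": "The deadline is 30 days after service.",
          "Who bears the burden of proof?": "The defendant bears it."}
results, flagged = run_battery(cases, canned.get)
print(len(flagged))  # 1 - the burden-of-proof answer would go to attorney review
```

In practice the scoring would be far richer than substring matching, but the shape – thousands of cases, automatic scoring, human review of failures – is the point.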

Why "makes minimal mistakes" isn't enough

As important as the above five characteristics are, they've become price-of-entry criteria for professional-grade GenAI. And given how rapidly the technology is evolving, what you expect from a professional-grade solution should evolve as well. Remember: GenAI has the power to do so much more than help you complete jobs. It can transform what it means to be a professional, freeing you for more strategic, creative, valuable work that a machine just cannot do – which can transform not only how you do business, but also how much business you do.

To take full advantage of this potential, you need a GenAI assistant that fulfills two more key requirements:

1. Professional-grade means intelligently and seamlessly handling workflows, not just completing a series of tasks. A true GenAI assistant goes beyond responding to your requests, instead guiding you through the steps required to finish long, complex, even open-ended projects. Only this kind of product, such as CoCounsel, can truly unlock the full potential of GenAI.

Through an expanding set of capabilities and deep connections with both your documents and the tools you use every day – e.g., Microsoft 365 – a professional-grade GenAI assistant can carry an entire deliverable from task to task, program to program, in a continuous stream, prompting you forward through the next steps and simultaneously handling multiple pieces of the work itself.

CoCounsel is built for workflows. It's accessible across multiple Thomson Reuters products, bringing together fundamental capabilities such as summarization and document review with specialized functions such as legal research, for a smooth transition from one type of work to the next. And it's integrated with both Microsoft 365 and document management systems, available literally wherever you're working, from client communication to research to document drafting and beyond.

2. Professional-grade means providing partnership, not just product. Without in-depth, sustained support, you're unlikely to get the most possible value from your investment. A professional-grade vendor offers a success and support team that will be there for you long after you've signed a subscription agreement. They work with you to deeply understand your most prevalent use cases and how their solution will help you tackle them, and of course are available when something's not working the way it should. They keep you informed about product changes and improvements and continue creating content you can use to increase your knowledge, such as videos, webinars, and written materials. A truly professional-grade team has a vision for their GenAI assistant, ensuring it will increase in power and capability as the technology advances. And most important, they are as invested in you as you are in them.

Upon adopting CoCounsel, legal professional users are trained by the Thomson Reuters Customer Success team, made up of licensed attorneys, many still practicing, with dozens of years of combined experience in both litigation and transactional law. Many of these trainers are prompt engineering specialists who, in addition to being licensed attorneys, have a background in computer science. After onboarding, we ensure everyone using CoCounsel has the opportunity to attend live trainings and watch recorded webinars, get individual help through live chat and email, and access dozens of video tutorials – a resource pool that will only keep growing.

This is a guest post from Jake Heller, head of CoCounsel, Thomson Reuters, and Erin Nelson, CoCounsel content strategist, Thomson Reuters.

The Transformative Role of AI in Professional Tools: A Conversation With David Wong and Leann Blanchfield
/en-us/posts/innovation/the-transformative-role-of-ai-in-professional-tools-a-conversation-with-david-wong-and-leann-blanchfield/
Wed, 02 Oct 2024 13:33:39 +0000

Leann Blanchfield, head of Editorial, Thomson Reuters, said now is the most exciting time in her 30+ years with the company.

In the latest episode, Blanchfield shared how the power of generative AI – and the dramatic leap it's making in how professionals across industries can access large quantities of data – is transforming the legal industry and beyond. Blanchfield credits the more than 1,500 attorney editors on her team, who create and enhance content, with harnessing the power of generative AI for legal research.

Human expertise is just one component of how Thomson Reuters is capitalizing on the potential of generative AI. Three elements are critical, as David Wong, chief product officer, Thomson Reuters, noted in his comments about the launch of CoCounsel 2.0 at ILTACON: "We have the data, the expertise, and the tech. Few have all three in such quantity and depth."

In the new episode, Wong focused on the role of human domain experts, noting they're key to the process of creating and validating data used by AI models for professional research.

鈥淭here’s a lot of both prompt engineering, fine tuning, and system refinement that’s necessary to get quality to a usable spot,鈥 Wong said. 鈥淓xperts, experienced researchers and experienced lawyers can help to gauge whether or not the systems are correct. We couldn’t have an objective, quantified measure of quality on these systems without the editors, without those experts.鈥

Wong and Blanchfield discussed the importance of human experts in ensuring the accuracy and reliability of AI.

"Maintaining accuracy is at the heart of what the editorial team does," Blanchfield said. "It's the number-one priority across every editorial team. We maintain our content to be accurate and trusted."

Wong acknowledged it's challenging for the team to process and update unstructured, constantly changing data in real time. He said that Thomson Reuters ensures its AI models are customized to meet the varying needs of different jurisdictions through a combination of software and algorithms that take advantage of LLMs.

"So when you ask a question, for example, we are running an end-to-end algorithm which runs search, retrieves data, re-ranks, interprets and then ultimately passes that information to a large language model to synthesize and produce the answer," Wong said. "It's a very complicated system which involves multiple types of technology, multiple types of information retrieval. Processing unstructured, dynamic data and customizing AI models requires integrating multiple technologies and algorithms to optimize performance."
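The pipeline Wong describes – search, retrieve, re-rank, then hand the top context to an LLM for synthesis – can be sketched in a few lines. Every component here is a toy stand-in; the production system involves many more stages and safeguards:

```python
def answer_question(question, search, rerank, llm, k=3):
    """Toy retrieve -> re-rank -> synthesize loop."""
    candidates = search(question)                     # search: fetch candidate passages
    ranked = sorted(candidates, key=lambda p: rerank(question, p), reverse=True)
    context = "\n\n".join(ranked[:k])                 # keep only the top-k passages
    prompt = (f"Answer using only the sources below.\n\nSources:\n{context}\n\n"
              f"Question: {question}")
    return llm(prompt)                                # the LLM synthesizes the answer

# Toy components: a fixed corpus, word-overlap re-ranking, and an "LLM" that
# simply echoes its prompt so we can inspect what it would receive.
passages = ["Statutes of limitation vary by state.", "Cats are mammals."]
overlap = lambda q, p: len(set(q.lower().split()) & set(p.lower().split()))
result = answer_question("Do statutes of limitation vary?",
                         lambda q: passages, overlap, lambda prompt: prompt, k=1)
print("Statutes of limitation" in result)  # True
```

The re-ranking step is what keeps irrelevant passages out of the prompt, which is one concrete reason the quality of retrieval matters as much as the model itself.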

Hear more of Wong and Blanchfield's insights on integrating AI into professional tools and ensuring that information is trustworthy in the latest episode of the TechConnect series, which brings diverse and dynamic perspectives from all corners of the technology world with thought-provoking questions and conversation.

Thomson Reuters Labs: Training large language models using Amazon SageMaker HyperPod
/en-us/posts/innovation/thomson-reuters-labs-training-large-language-models-using-amazon-sagemaker-hyperpod/
Tue, 17 Sep 2024 09:12:44 +0000

2023 proved to be an inflection point for AI, prompting Thomson Reuters to consider how our high-value, curated data could improve general language models on customer-specific tasks. Training and fine-tuning a large language model (LLM) is compute-intensive and requires specialized hardware.

We quickly discovered that it was extremely difficult to acquire these resources on demand and at scale in our cloud environments. Further, looking to other third parties presented its own set of risks and challenges.

We turned to Amazon Web Services (AWS), which has long been a trusted partner in secure and scalable solutions, to get early access to Amazon SageMaker HyperPod. With our computing platform in place, we were ready to roll up our sleeves and do the hard work of exploring how to optimally train and fine-tune models for our domain. In our first phase of experimentation, we peaked at 16 compute instances – 128 A100 GPUs – with the longest job taking 36 days to complete training a 70 billion parameter model.
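The figures in this paragraph support a quick back-of-envelope calculation of the scale involved. The GPU count and wall-clock time come from the post itself; the A100 peak throughput, the 40% utilization, and the 6·N·D training-FLOPs rule of thumb are outside assumptions, so treat the result as an order-of-magnitude estimate:

```python
# Back-of-envelope for the HyperPod run described above.
gpus = 16 * 8                    # 16 instances x 8 A100s each = 128 GPUs
days = 36
gpu_days = gpus * days           # total GPU-days of compute

peak_tflops = 312                # assumed A100 BF16 peak, TFLOP/s
utilization = 0.40               # assumed sustained fraction of peak
seconds = days * 24 * 3600
total_flops = gpus * peak_tflops * 1e12 * utilization * seconds

# Rule of thumb: training FLOPs ~= 6 * parameters * tokens
params = 70e9
tokens = total_flops / (6 * params)
print(f"{gpu_days} GPU-days, ~{tokens / 1e9:.0f}B tokens at these assumptions")
```

That works out to roughly 4,600 GPU-days and on the order of a hundred billion training tokens, which is consistent with domain adaptation of a 70B model rather than pretraining one from scratch.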

Initial results of our custom models look promising, and our research continues, supported by the release of Amazon SageMaker HyperPod. Our post explores the journey that Thomson Reuters took to enable cutting-edge research in training domain-adapted LLMs using Amazon SageMaker HyperPod.

This is a guest post from John Duprey, distinguished engineer, Thomson Reuters.
