LLMs Archives - Thomson Reuters Institute
https://blogs.thomsonreuters.com/en-us/innovation-topics/llms/
Thomson Reuters Institute is a blog from Thomson Reuters, the intelligence, technology and human expertise you need to find trusted answers.

Legal AI Benchmarking: Evaluating Long Context Performance for LLMs
/en-us/posts/innovation/legal-ai-benchmarking-evaluating-long-context-performance-for-llms/
Mon, 14 Apr 2025 13:00:43 +0000

The Importance of Long Context

In the legal profession, many daily tasks revolve around document analysis: reviewing contracts, transcripts, and other legal documents. As the leading AI legal assistant, CoCounsel's ability to automate document-centric tasks is one of its core legal capabilities. Users can upload documents, and CoCounsel can automatically perform various tasks on them, saving lawyers valuable time.

Legal documents are often extensive. Deposition transcripts can easily exceed a hundred pages, and merger and acquisition agreements can be similarly lengthy. Additionally, some tasks require the simultaneous analysis of multiple documents, such as comparing contracts or testimonies. To perform these legal tasks effectively, solutions must handle long documents without losing track of critical information.

When GPT-4 was first released in 2023, it featured a context window of 8K tokens, equivalent to approximately 6,000 words or 20 pages of text. To process documents longer than this, it was necessary to split them into smaller chunks, process each chunk individually, and synthesize the final answer. Today, most major LLMs have context windows ranging from 128K to over 1M tokens. However, the ability to fit 1M tokens into an input window does not guarantee effective performance with that much text. Often, the more text included, the higher the risk of missing important details. To ensure CoCounsel's effectiveness with long documents, we've developed rigorous protocols to test long context performance.
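The chunk-process-synthesize pattern described above can be sketched in a few lines. This is an illustrative outline only, not CoCounsel's actual pipeline; the paragraph-based splitting, the rough 4-characters-per-token approximation, and the `ask_llm` callback are all simplifying assumptions.

```python
def chunk_text(text: str, max_tokens: int = 8000, chars_per_token: int = 4) -> list[str]:
    """Split a long document into chunks that fit a model's context window.

    Token counts are approximated as len(text) / chars_per_token; a real
    pipeline would count tokens with the model's own tokenizer.
    """
    max_chars = max_tokens * chars_per_token
    paragraphs = text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        # Flush the current chunk when adding this paragraph would overflow it.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

def summarize_long_document(text: str, ask_llm) -> str:
    """Map-reduce synthesis: answer per chunk, then combine the partial answers."""
    partials = [ask_llm(f"Summarize the key points:\n\n{c}") for c in chunk_text(text)]
    return ask_llm("Synthesize these partial summaries:\n\n" + "\n\n".join(partials))
```

A single paragraph longer than the budget would still become one oversized chunk; a production splitter would fall back to sentence-level splits in that case.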

Why a Multi-LLM Strategy and a Trusted Testing Ground Matter

At Thomson Reuters, we don't believe in putting all our bets on a single model. The idea that one LLM can outperform across every task, especially in high-stakes professional domains, is a myth. That's why we've built a multi-LLM strategy into the core of our AI infrastructure. It's not a fallback. It's one of our competitive advantages.

Some models reason better. Others handle long documents more reliably. Some follow instructions more precisely. The only way to know what's best for any given task is to test them relentlessly. And that's exactly what we do.

Because of our rigor and persistence in this space, Thomson Reuters is a trusted, early tester and collaborator for leading AI labs. When major providers want to understand how their newest models perform in high-stakes, real-world scenarios, they turn to us. Why? Because we're uniquely positioned to pressure-test these models against the complexity, precision, and accountability that professionals demand, in a way few others can match.

  • Our legal, tax, and compliance workflows are complex, unforgiving, and grounded in real-world stakes
  • Our proprietary content, including Westlaw and Reuters News, gives us gold-standard input data for model evaluation
  • Our SME-authored benchmarks and skill-specific test suites reflect how professionals actually work, not how a model demo looks on paper

When OpenAI was looking to train and validate a custom model built on o1-mini, Thomson Reuters was among the first testers. And when each new generation of long-context models hits the market, we are routinely among the early testers.

A multi-model strategy only works if you know which model to trust, and when. Benchmarking is how we turn optionality into precision.

This disciplined, iterative approach isn't just the best way to stay competitive; it's the only way in a space that's evolving daily. The technology is changing fast. New models are launching, improving, and shifting the landscape every week. Our ability to rapidly test and integrate the best model for the job at any moment isn't just a technical strategy. It's a business advantage.

And all of this isn't just about immediate performance gains. It's about building the foundation for truly agentic AI: the kind of intelligent assistant that can plan, reason, adapt, and act across professional workflows with precision and trust. That future won't be built on rigid stacks or static decisions. It will be built by those who can move with the market, test with integrity, and deliver products that perform in the real world.

RAG vs. Long Context

In developing CoCounsel's capabilities, a significant question was whether and how to utilize retrieval-augmented generation (RAG). A common pattern in RAG applications is to split documents into passages (e.g., sentences or paragraphs) and store them in a search index. When a user requests information from the application, the top N search results are retrieved and fed to the LLM in order to ground the response.
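The split-index-retrieve pattern can be sketched as follows. As a simplification, passages are scored by raw token overlap with the query; a real RAG system would use a dense embedding index, and the `ask_llm` callback stands in for an LLM API call.

```python
def build_index(documents: list[str]) -> list[tuple[set, str]]:
    """Split documents into paragraph passages and index each passage's token set."""
    passages = [p for doc in documents for p in doc.split("\n\n") if p.strip()]
    return [(set(p.lower().split()), p) for p in passages]

def retrieve(index, query: str, top_n: int = 3) -> list[str]:
    """Return the top-N passages ranked by token overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(index, key=lambda item: len(item[0] & q), reverse=True)
    return [passage for _, passage in scored[:top_n]]

def answer_with_rag(index, query: str, ask_llm) -> str:
    """Ground the LLM's response in the retrieved passages only."""
    context = "\n---\n".join(retrieve(index, query))
    return ask_llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```

This works well for factoid lookups, but, as the next paragraph notes, a query like a contradiction check has no single passage to retrieve, which is where the pattern breaks down.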

RAG is effective when searching through a vast collection of documents (such as all case law) or when looking for simple factoid answers easily found within a document (e.g., specific names or topics). However, some complex queries require a more sophisticated discovery process and more context from the underlying documents. For instance, the query "Did the defendant contradict himself in his testimony?" requires comparing each statement in the testimony against all others; a semantic retrieval using that query would likely only return passages explicitly discussing contradictory testimony.

In our internal testing (more on this later), we found that inputting the full document text into the LLM's input window (and chunking extremely long documents when necessary) generally outperformed RAG for most of our document-based skills. This finding is supported by studies in the literature [1, 2]. Consequently, in CoCounsel 2.0 we leverage long context LLMs to the greatest extent possible to ensure all relevant context is passed to the LLM. At the same time, RAG is reserved for skills that require searching through a repository of content.

Comparing and Testing the Current Long-Context Models

As discussed in our previous post, before deploying an LLM into production, we conduct multiple stages of testing, each more rigorous than the last, to ensure peak performance.

Our initial benchmarks measure LLM performance across key capabilities critical to our skills. We use over 20,000 test samples from open and private benchmarks covering legal reasoning, contract understanding, hallucinations, instruction following, and long context capability. These tests have easily gradable answers (e.g., multiple-choice questions), allowing for full automation and easy evaluation of new LLM releases.
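Because these tests have easily gradable answers, the whole benchmark can run unattended. A minimal sketch of such a harness is below; the sample format and the first-letter parsing heuristic are illustrative assumptions, not the actual test infrastructure (a real harness would parse model replies more robustly).

```python
def grade_multiple_choice(samples: list[dict], model) -> float:
    """Run a model over multiple-choice samples and return accuracy.

    Each sample is a dict with 'question', 'choices' (four strings), and
    'answer' (a gold letter A-D); `model` maps a prompt string to the model's
    raw reply. Grading keys on the first A-D letter found in the reply, so no
    human review is needed.
    """
    correct = 0
    for s in samples:
        options = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", s["choices"]))
        reply = model(f"{s['question']}\n{options}\nAnswer with one letter.")
        predicted = next((ch for ch in reply.upper() if ch in "ABCD"), None)
        correct += predicted == s["answer"]
    return correct / len(samples)
```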

For our long context benchmarks, we use tests from LOFT [3], which measures the ability to answer questions from Wikipedia passages, and NovelQA [4], which assesses the ability to answer questions from English novels. Both tests accommodate up to 1M input tokens and measure key long context capabilities critical to our skills, such as multihop reasoning (synthesizing information from multiple locations in the input text) and multitarget reasoning (locating and returning multiple pieces of information). These capabilities are essential for applications like interpreting contracts or regulations, where the definition of a term in one part of the text determines how another part is interpreted or applied.

We track and evaluate all major LLM releases, both open and closed source, to ensure we are using the latest and most advanced models, such as the newly updated GPT-4.1 model with its much-improved long context capabilities.

Skill-Specific Benchmarks

The top-performing LLMs from our initial benchmarks are tested on our actual skills. This stage involves iteratively developing (sometimes very complex) prompt flows specific to each skill to ensure the LLM consistently generates accurate and comprehensive responses required for legal work.

Once a skill flow is fully developed, it undergoes evaluation using LLM-as-a-judge against attorney-authored criteria. For each skill, our team of attorney subject matter experts (SMEs) has generated hundreds of tests representing real use cases. Each test includes a user query (e.g., "What was the basis of Panda's argument for why they believed they were entitled to an insurance payout?"), one or more source documents (e.g., a complaint and demand for jury trial), and an ideal minimum viable answer capturing the key data elements necessary for the answer to be useful in a legal context. Our SMEs and engineers collaborate to create grading prompts so that an LLM judge can score skill outputs against the ideal answers written by our SMEs. This is an iterative process, where LLM-as-a-judge scores are manually reviewed, grading prompts are adjusted, and ideal answers are refined until the LLM-as-a-judge scores align with our SME scores. More details on our skill-specific benchmarks are discussed in our previous post.
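The calibration loop described above hinges on measuring how often the LLM judge agrees with the SME grades. The sketch below shows one way to frame that check; the prompt wording and the agreement metric are hypothetical, not the production grading prompts.

```python
def judge_agreement(llm_scores: list[int], sme_scores: list[int], tolerance: int = 0) -> float:
    """Fraction of tests where the LLM judge's score matches the SME's score.

    Prompts and ideal answers are revised, and this is re-measured, until
    agreement is acceptably high.
    """
    assert len(llm_scores) == len(sme_scores)
    matches = sum(abs(a - b) <= tolerance for a, b in zip(llm_scores, sme_scores))
    return matches / len(llm_scores)

def build_grading_prompt(query: str, ideal_answer: str, output: str) -> str:
    """Pair a skill output with the SME's ideal answer for the LLM judge."""
    return (
        "You are grading a legal AI assistant.\n"
        f"User query: {query}\n"
        f"Ideal minimum viable answer: {ideal_answer}\n"
        f"Assistant output: {output}\n"
        "Score 1-5 for whether the output covers the key data elements. "
        "Reply with the score only."
    )
```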

Our test samples are carefully curated by our SMEs to be representative of the use cases of our users, including context length. For each skill, we have test samples utilizing one or more source documents with a total input length of up to 1M tokens. Additionally, we have constructed specialized long context test sets where all test samples use one or more source documents totaling 100K–1M tokens in length. These long context tests are crucial because we have found that the effective context windows of LLMs, where they perform accurately and reliably, are often much smaller than their available context window.

In our testing, we have observed that the more complex and challenging a skill, the smaller an LLM's effective context window for that skill. For more straightforward skills, where we search a document for one or a few data elements, most LLMs can accurately generate answers at input lengths up to several hundred thousand tokens. However, for more complex tasks, where many different data elements must be tracked and returned, LLMs may struggle with recall to a greater degree. Therefore, even with long context models, we still split documents into smaller chunks to ensure important information isn't missed.

When you look at the advertised context window for leading models today, don't be fooled into thinking this is a solved problem. It is exactly the kind of complex, reasoning-heavy real-world problem where that effective context window shrinks. Our challenge to the model builders: keep stretching and stress-testing that boundary!

Final Manual Review

All new LLMs undergo rigorous manual review by our attorney SMEs before deployment. Our SMEs can capture nuanced details missed by automated graders and provide feedback to our engineers for improvement. These SMEs further provide the final check to verify that the new LLM flow performs better than the previously deployed solution and meets the exacting standards for reliability and accuracy in legal use.

Looking Ahead: From Benchmarks to Agents

Our long-context benchmarking work is more than just performance testing; it's a blueprint for what comes next. We're not just optimizing for prompt-and-response AI. We're laying the technical foundation for truly agentic systems: AI that can not only read and reason, but plan, execute, and adapt across complex legal workflows.

Imagine an AI assistant that doesn't just answer a question, but knows when to dig deeper, when to ask for clarification, and how to take the next step, whether that's reviewing a deposition, cross-referencing contracts, or preparing a case summary. That's where we're headed.

This next chapter requires everything we've built so far: long-context capabilities, multi-model orchestration, SME-driven evaluation, and deep integration into the professional's real-world tasks. We're closer than you think.

Stay tuned; more on that soon.

 

References



  1. Li, Xinze, et al. "Long Context vs. RAG for LLMs: An Evaluation and Revisits." arXiv preprint arXiv:2501.01880 (2025).
  2. Li, Zhuowan, et al. "Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach." Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track. 2024.
  3. Lee, Jinhyuk, et al. "Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?" arXiv preprint arXiv:2406.13121 (2024).
  4. Wang, Cunxiang, et al. "NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens." arXiv preprint arXiv:2403.12766 (2024).

The Rise of Large Language Models in Automatic Evaluation: Why We Still Need Humans in the Loop
/en-us/posts/innovation/the-rise-of-large-language-models-in-automatic-evaluation-why-we-still-need-humans-in-the-loop/
Tue, 21 Jan 2025 17:23:20 +0000

In recent years, the field of Natural Language Processing (NLP) has seen remarkable advancements, primarily driven by the development of Large Language Models (LLMs) such as GPT-4, Gemini, and Llama. These models, with their astounding generation capabilities, have transformed a wide range of applications, from chatbots to content generation. One exciting and increasingly prevalent application of LLMs is the automatic evaluation of Natural Language Generation (NLG) tasks. However, while LLMs offer impressive potential for evaluating domain-specific tasks, a human in the loop remains essential.

The Emergence of LLMs in Automatic Evaluation

Traditional evaluation metrics in NLG primarily rely on comparing the generated text to reference texts using word overlap measures. These metrics, while useful, often fall short of capturing the nuances of language quality, coherence, and relevance. For example, suppose we have a reference summary "The cat is on the mat." and a generated summary "A feline is resting on a rug." ROUGE considers only lexical overlap and cannot capture the semantic similarity between words or phrases, so these two summaries, despite having the same meaning, would score poorly due to low word overlap.
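The claim is easy to verify with a simplified ROUGE-1 recall (clipped unigram matches divided by reference length; real ROUGE implementations also report precision and F1):

```python
from collections import Counter
import re

def rouge1_recall(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 recall: clipped unigram matches / reference length."""
    tokenize = lambda s: re.findall(r"[a-z']+", s.lower())
    ref, cand = Counter(tokenize(reference)), Counter(tokenize(candidate))
    overlap = sum(min(ref[tok], cand[tok]) for tok in ref)
    return overlap / sum(ref.values())

score = rouge1_recall("The cat is on the mat.", "A feline is resting on a rug.")
# Only 'is' and 'on' match, so recall is 2/6 ≈ 0.33 despite identical meaning.
```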

LLMs have demonstrated an exceptional understanding of language, context, and semantics, making them attractive candidates for evaluating generated text. They can assess factors like fluency, coherence, and even factual accuracy, which are crucial for more sophisticated and context-aware evaluations. For instance, LLMs can be fine-tuned to understand the specific jargon and style of a particular domain, such as medical or legal texts, making them great potential evaluators.

The Promise of LLMs in Automated Evaluation

The recent technological advancements of LLMs have encouraged the development of LLM-based evaluation methods in various tasks and systems. LLMs can offer several advantages in the evaluation of NLG tasks:

  • Context-Aware Evaluation: Unlike traditional metrics, LLMs can comprehend the context and generate evaluations that account for the subtleties and intricacies of human language.
  • Scalability: LLMs can evaluate large volumes of text quickly and consistently, offering scalability that human evaluators cannot match.
  • Reduced Subjectivity: Automated evaluation can minimize the subjective bias that human evaluators might introduce, leading to more consistent and objective assessments.

How to Use an LLM as an Evaluator

LLM-based evaluators are conceptually much simpler than traditional automatic evaluation methods. While traditional methods rely on predefined metrics and comparisons to reference datasets, LLM-based evaluators directly assess the generated text.

Figure 1 shows an overview of LLM-based evaluation frameworks. To evaluate the quality of a text, you embed it into a prompt template that contains the evaluation criteria, then provide this prompt to an LLM. The LLM analyzes the text based on the given criteria and provides feedback on its quality. This approach bypasses the need for extensive preprocessing and reference comparisons, making the evaluation process more straightforward and versatile.

Figure 2 illustrates the step-by-step workflow of evaluating summaries using clarity as a metric. A document and its corresponding summary are the inputs. The summary is embedded into a pre-formulated prompt template that includes detailed evaluation criteria, such as clarity defined on a 1-to-5 scale. This prompt is then provided to an LLM for automated analysis and evaluation. Based on the specified criteria, the LLM evaluator reviews the summary, assigns a score, and provides a justification for the rating.
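The Figure 2 workflow can be sketched as follows; the prompt wording, the "Score: N" reply format, and the `ask_llm` callback are illustrative assumptions, not a specific evaluation framework's API.

```python
import re

CLARITY_PROMPT = """\
Evaluate the clarity of the summary below on a 1-5 scale, where 1 is
incomprehensible and 5 is perfectly clear. Reply as "Score: N" followed
by a one-sentence justification.

Document:
{document}

Summary:
{summary}
"""

def evaluate_clarity(document: str, summary: str, ask_llm) -> tuple[int, str]:
    """Embed the inputs in the prompt template, then parse the judge's verdict."""
    reply = ask_llm(CLARITY_PROMPT.format(document=document, summary=summary))
    match = re.search(r"Score:\s*([1-5])", reply)
    if match is None:
        raise ValueError(f"Unparseable judge reply: {reply!r}")
    return int(match.group(1)), reply[match.end():].strip()
```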

The Limitations of LLMs: Why Humans Are Still Indispensable

Despite the promising capabilities of LLMs, there are significant limitations that necessitate the continued involvement of human experts in the evaluation process, especially for domain-specific tasks:

  • Lacking Specialized Domain Knowledge: Domain-specific tasks often involve complex knowledge. LLMs are typically trained as general-purpose assistants and still lack specialized domain knowledge. In contrast, subject matter experts bring in-depth domain knowledge gained through years of dedicated training and education.
  • Evolving Knowledge: Especially in fast-evolving fields like medicine and technology, staying up-to-date with the latest information is challenging for static models. Human experts, however, continuously learn and adapt to new knowledge and standards.
  • Handling Ambiguities: In specialized domains, the language can be highly ambiguous and complex, and the ability to disambiguate based on deep contextual knowledge is something LLMs still struggle with.
  • Ethical and Bias Concerns: LLMs can inadvertently reinforce biases present in their training data. Human oversight is crucial to identify and mitigate these biases, ensuring fair and ethical evaluations.

The Human-in-the-Loop Model: Best of Both Worlds

To harness the strengths of LLMs while addressing their limitations, a human-in-the-loop approach is essential. This combines the efficiency and scalability of LLMs with the expertise and judgment of human evaluators:

  • Initial Screening: LLMs can perform initial screenings and provide preliminary evaluations, identifying clear cases of high or low quality.
  • Expert Review: Human experts then review and refine these evaluations, focusing on cases that require nuanced understanding or where the LLM's assessment is inadequate.
  • Continuous Feedback Loop: Feedback from human evaluators can be used to fine-tune and improve LLMs, creating a continuous improvement cycle.
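Taken together, the first two stages amount to a triage policy: confident judgments are resolved automatically and borderline ones are escalated to experts. A minimal sketch, with hypothetical score thresholds:

```python
def triage(evaluations: list[dict], low: float = 2.0, high: float = 4.0) -> dict:
    """Route LLM evaluations: clear passes and fails are resolved automatically,
    borderline cases go to human experts (the human-in-the-loop stage)."""
    routed = {"auto_accept": [], "auto_reject": [], "expert_review": []}
    for ev in evaluations:
        if ev["llm_score"] >= high:
            routed["auto_accept"].append(ev)
        elif ev["llm_score"] <= low:
            routed["auto_reject"].append(ev)
        else:
            routed["expert_review"].append(ev)
    return routed
```

The expert decisions collected on the borderline cases then feed the continuous improvement cycle described above.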

Conclusion

The integration of LLMs into the automatic evaluation of NLG tasks marks a significant step forward in the field of NLP. However, for domain-specific evaluations, the complexity and nuance of human language still necessitate human experts. By adopting a human-in-the-loop approach, we can leverage the best of both worlds: the speed and scalability of LLMs and the depth and discernment of human evaluators. This constructive interaction ensures that we maintain high standards of accuracy, fairness, and relevance in evaluating natural language generation tasks, ultimately driving the field towards more sophisticated and reliable applications.

This post was written by Grace Lee, lead applied scientist at Thomson Reuters Labs (TR Labs).

Note: This work was done as part of the internship of Hossein A. (Seed) Rahmani at Thomson Reuters Labs (TR Labs).

Raghu Ramanathan: Reflections on Legal Generative AI One Year In
/en-us/posts/innovation/raghu-ramanathan-reflections-on-legal-generative-ai-one-year-in/
Mon, 30 Dec 2024 09:02:06 +0000

I recently talked with Ben Joyner about generative AI in the legal space, touching on everything from our company's M&A strategy to how CoCounsel is transitioning to a multi-model product. Talking with Ben about how generative AI has shaped our industry over the past year has me reflecting on my first year with Thomson Reuters.

Raghu Ramanathan, president, Legal Professionals, Thomson Reuters.

Continued climb in law firm productivity

I joined Thomson Reuters in February, and a notable way we've seen the impact of generative AI solutions is the uptick in lawyer productivity. For the first time in years, Q2 saw a majority of law firms experience productivity growth. By Q3, an astounding 64% of law firms reported productivity growth, building on the gains made in Q2.

This uptick underscores how technology is key to boosting law firm profitability. Law firms that invest in new technology as well as adopt AI and generative AI solutions to streamline workflows and improve the efficiency and quality of their work are best positioned to improve client satisfaction and drive sustainable productivity growth.

Build, buy, partner strategy

I'm pleased with the progress Thomson Reuters has made on our vision to provide all the legal professionals we serve with a professional-grade GenAI assistant to augment their work. We've committed to investing $100 million annually in AI over the coming years, including investing more than $200 million to incorporate responsible AI into our solutions in the past year alone.

This year we continued investing in the latest technology through our build, buy, partner program. On the buy side in the legal space, our acquisition of Safe Sign Technologies, a UK legal large language model (LLM) startup, in August is proving a great fit. We're incorporating Safe Sign's tech and talent into our industry-leading content and expertise to bring customers even greater quality and performance from our AI solutions.

On the build side, we introduced 19 legal generative AI solutions in 2024. Highlights include CoCounsel 2.0, the professional-grade GenAI assistant; Claims Explorer, a generative AI skill available in …; CoCounsel Drafting, an end-to-end drafting solution that streamlines and improves the drafting process for legal professionals within Microsoft Word; and Mischaracterization Identification in Quick Check and AI Jurisdictional Surveys, two generative AI research features that help customers save substantial time and deliver greater confidence that legal research is accurate, thorough, and complete. We also delivered deeper integration of CoCounsel into Westlaw and Practical Law.

On the partner side, we're working with Microsoft, OpenAI, Google and others on plugins and integrations to enhance the generative AI-powered capabilities in our solutions. Every aspect of our build, buy, partner strategy is geared toward helping our customers automate their workflows, provide powerful insights to their clients and drive efficiencies.

A maturing market

2024 saw the implementation of legal generative AI solutions as well as efforts to benchmark these solutions. Our benchmarking support is reflected in our participation in studies including Vals.ai plus two consortium efforts, from Stanford and Litig, exploring how to best evaluate legal AI.

I believe that benchmarking can improve both the development and the adoption of AI, but it's just one component in how we consider and understand the benefits AI delivers for our customers. I look forward to our ongoing collaboration with customers and industry partners as we continue working to minimize inaccuracies and increase the usefulness of the research outcomes for generative AI solutions.

To date, 15% of law firms have adopted and implemented legal-specific generative AI solutions. I anticipate we'll soon see a wave of fast followers, eager to be perceived as innovative, that will dramatically strengthen generative AI implementation.

I can't think of a more exciting time to have joined a business. Where our industry is now mirrors the early internet era: initial excitement, followed by strategic integration.

We're fast approaching a maturing market where legal professionals will not just desire but require AI capabilities for their workflows. We'll see more implementation of generative AI solutions among legal professionals as they increasingly realize the tangible benefits.

For more on how generative AI is shaping the future of the legal profession, please check out my interview.

This is a guest post from Raghu Ramanathan, president, Legal Professionals, Thomson Reuters.

2024 Reflections: Top Innovation Highlights From Thomson Reuters
/en-us/posts/innovation/2024-reflections-top-innovation-highlights-from-thomson-reuters/
Wed, 11 Dec 2024 10:59:55 +0000

Thomson Reuters closed out 2024 with thousands of corporate, legal, tax, audit and accounting customers focusing on the year's theme: generative AI and innovation. They convened at SYNERGY 2024, the premier annual technology conference for professionals, for eight days of product and innovation announcements, thought-leadership insights and networking opportunities. Below are 2024 product and innovation highlights plus a sneak peek of what's to come in 2025.

Thomson Reuters President and CEO Steve Hasker shared a state of the industry outlook, noting generative AI is as disruptive and transformative as previous technology shifts yet is happening even faster. He emphasized what differentiates Thomson Reuters, including investments the company is making in generative AI to enable professionals to accelerate and streamline entire workflows and deliver more value for clients.

Hasker said Thomson Reuters has invested more than $200M in AI in the last year. He discussed the company's vision to provide each professional it serves with an AI assistant; the launch of CoCounsel 2.0, which generates answers three times faster than the previous version; and new work with Microsoft on autonomous agents to increase revenue, reduce costs, and scale impact for customers.

Tax, Audit & Accounting

"AI is not just changing the landscape of accounting, it's reshaping it." That was the message from Elizabeth Beastrom, president of Tax & Accounting at Thomson Reuters.

While the profession sees AI as a game-changer to help them work differently, tax and accounting professionals also continue to wrestle with the perennial challenge of a talent shortage. This, combined with escalating complexity and more tax regulations, as well as changing client expectations, leaves tax professionals in need of a critical solution.

Thomson Reuters sees the potential of AI to help alleviate these challenges by augmenting human capabilities. Automating mundane, time-consuming tasks will enhance efficiency for tax professionals, helping them reclaim time to channel into higher-value tasks. Thomson Reuters is working to bring the power of generative AI, machine learning and automation into its solutions in the following ways:

1. Saving time in tax preparation:

Coming in beta during the upcoming busy season, Thomson Reuters will launch an AI-assisted tax preparation experience to increase firm efficiency. The solution combines the power of CoCounsel, the Thomson Reuters professional-grade generative AI assistant, with workflow automation and software integrations. It supports the delegation of data gathering to simplify mundane tasks and automate tax preparation. Thomson Reuters research shows that customers using this solution will save at least two hours per 1040 tax return on average.

2. Supporting firms' growth with advisory:

As client expectations continue to evolve, clients are increasingly looking to their accountants as trusted advisors. Firms of all sizes are focusing on growing their advisory practices to bring their clients additional value, as well as supporting their growth. In 2025, the Thomson Reuters Advisory Solution will combine the power of CoCounsel and Checkpoint content to identify advisory opportunities. Advisory services are integrated directly into a firm's practice, with technology empowering junior staff to take on higher-value advisory work and seasoned professionals to move beyond technical expertise to value-added synthesis.

"It helps firms build their advisory practice with confidence to deliver unprecedented value to meet clients' evolving needs," said Nancy Hawkins, vice president of Product Management, Research.

3. Transforming audit efficiency:

Halving sample sizes, boosting efficiency and sharpening the focus on high-risk areas are all at the heart of the Thomson Reuters Audit Intelligence Analyze solution, which launched in October. Further functionality will be coming in 2025 as the Audit Intelligence suite expands. 'Test' will automate substantive testing with dynamic transaction tracing, while 'Plan' will harness full data populations with cutting-edge analytics for superior risk assessment. Both will launch with beta programs next year, along with the addition of CoCounsel to the Audit Intelligence suite.

All three solutions (Review Ready, the Thomson Reuters Advisory Solution and the Audit Intelligence suite) will be further enhanced with Materia's generative and agentic AI capabilities.

Corporates

Laura Clayton McDonnell, president of the Corporates segment, shared how enterprise technology, including AI and generative AI, is revolutionizing the profession with innovative and emerging solutions. She emphasized that companies that take a streamlined and proactive approach to addressing risk and compliance across the enterprise, while driving towards their business goals, will maintain their competitive advantage. Clayton McDonnell also shared how organizations are using solutions including ONESOURCE Pagero, CoCounsel Core, Legal Tracker, Checkpoint Edge with CoCounsel and CLEAR to solve challenges and realize value for their business.

In addition, Ray Grove, head of Corporate Tax and Trade, Thomson Reuters, highlighted the company's efforts to build a seamless, integrated compliance network, and Kevin Appold, vice president of US Public Records, Thomson Reuters, shared how the company's risk and fraud solutions play a critical role in the convergence of compliance and commerce. Also, Valerie McConnell, senior director of CoCounsel Customer Success, discussed how CoCounsel is transforming the general counsel's office.

Legal

A highlight from the Legal Professionals segment included an in-depth look at the Thomson Reuters 2025 AI product roadmap from David Wong, chief product officer; Mike Dahn, head of Westlaw Product; and Valerie McConnell, senior director of CoCounsel Customer Success. They outlined upcoming generative AI features and innovations to support legal professionals, including deeper integration of CoCounsel 2.0 in Westlaw and Practical Law plus generative AI research features including Claims Explorer, Mischaracterization Identification in Quick Check and AI Jurisdictional Surveys.

Legal SYNERGY attendees also participated in interactive sessions and CLE courses on advanced prompting techniques, the science behind large language models, and optimizing generative AI for tasks like drafting and legal research. Sessions offered attendees a comprehensive view of the future of AI in law.

SYNERGY 2024 also included several customer panels and executive briefing sessions. Watch the Innovation Blog for highlights from these sessions and for 2025 product and innovation highlights.

The Progressive Rise of Generative AI: A Conversation With David Wong and Joel Hron
/en-us/posts/innovation/the-progressive-rise-of-generative-ai-a-conversation-with-david-wong-and-joel-hron/
Wed, 30 Oct 2024 09:50:36 +0000

In honor of the one-year anniversary of the first episode of TechConnect, this episode highlights the progressive rise of generative AI in the past year.

"As fast as it started, it really feels like in the last year, there's been an even more rapid acceleration, and many companies racing to become leaders in this field, including Thomson Reuters," said Joel Hron, chief technology officer, Thomson Reuters.

Hron and David Wong, chief product officer, Thomson Reuters, shared their takes on the most significant advances in generative AI technology, including improvements in accessibility to the technology, with more developer tools alongside reduced costs and more out-of-the-box capabilities.

Wong said he's most excited about large language models' ability to have longer context windows, enabling them to keep more information in their short-term memory and answer ever-more complex questions.

"That's critical for the way Thomson Reuters uses a lot of these models," Wong said.

"The agentic behaviors of the models have become more robust in their ability to plan and ability to use reason over complex information," Hron added.

They also discussed balancing the need to innovate and go fast with the need for ethical, responsible and high-quality AI development.

Wong noted how Thomson Reuters is best positioned to develop professional-grade AI, grounded in fact and data. He emphasized customers' need for measurable solutions, so they can discern tools' accuracy rates, as well as the need for security and privacy.

Wong said Thomson Reuters has the scale and infrastructure to understand customers' needs and develop solutions to solve their biggest challenges, guided by a philosophy and process that ensures the right balance between moving fast and ensuring quality.

Hron said the company's human-centric approach to AI development is key.

"Our human expertise at Thomson Reuters and the level of rigor and quality we put behind both our content and our products for many years has really been a cornerstone of our brand," Hron said.

Hron said the iterations between technology and domain experts are crucial to how Thomson Reuters helps customers streamline their workflows with AI, such as with AI-Assisted Research on Westlaw Precision and CoCounsel Core.

They also highlighted the Thomson Reuters acquisition of Materia, an AI assistant and platform for accounting and auditing professionals.

"It's a reinforcement of our belief in AI assistants being in the hands of every professional and a reinforcement of our commitment around AI across our entire product portfolio," Hron said.

He added that Materia's strengths have included leaning into the long context and multimodal capabilities of generative AI as well as enabling agentic behavior.

Hear more of Wong and Hron's insights on Materia, as well as the evolution of generative AI, in this episode of the TechConnect series, which brings diverse and dynamic perspectives from all corners of the technology world with thought-provoking questions and conversation.

Legal AI Benchmarking: CoCounsel /en-us/posts/innovation/legal-ai-benchmarking-cocounsel/ Wed, 23 Oct 2024 14:04:16 +0000 https://blogs.thomsonreuters.com/en-us/?post_type=innovation_post&p=63580 We're excited to be sharing a detailed look into our testing program for CoCounsel, including specific methodologies for evaluating its skills. We aim not only to showcase the steps we take to ensure CoCounsel's reliability, but also to contribute to broader benchmarking efforts in the legal AI industry. Though it's challenging to establish universal benchmarks in such a diverse field, we're engaging with industry stakeholders to work toward the shared goal of elevating the reliability and transparency of AI tools for all legal professionals.

Why evaluating legal skills is complicated

Traditional legal benchmarks usually rely on multiple-choice, true/false, or short-answer formats for easy evaluation. But these methods aren't enough to assess the complex, open-ended tasks lawyers encounter daily and that large language model (LLM)-powered solutions like CoCounsel are built to perform.

CoCounsel's skills produce nuanced outputs that must meet multiple criteria, including factual accuracy, adherence to source documents, and logical consistency. These outputs are difficult to evaluate using true/false tests. On top of that, assessing the "correctness" of legal outputs can be subjective. For instance, some users prefer detailed summaries while others prefer concise ones. Neither is "wrong"; it just comes down to preference, which makes it difficult to consistently automate evaluations.

To make it even more complicated, each CoCounsel skill often involves multiple components, with the LLM handling only the final stage of answer generation. For example, the Search a Database skill first uses various non-LLM-based search systems to retrieve relevant documents before the LLM synthesizes an answer. If the initial retrieval process is substandard, the LLM’s performance will be compromised. So, our evaluation must consider both LLM-based and non-LLM-based aspects, to make sure our assessment of the whole is accurate.
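The dependency described above, where the LLM can only be as good as the retrieval that feeds it, can be sketched in a few lines. This is a minimal illustration with hypothetical names; the real CoCounsel search systems and synthesis step are not public, so the retriever here is a naive term-overlap ranking and the LLM call is a placeholder:

```python
def keyword_search(query: str, documents: list[str], k: int = 3) -> list[str]:
    """Stand-in retriever: rank documents by naive term overlap.
    In production this would be a dedicated (non-LLM) search system."""
    terms = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(terms & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def answer(query: str, documents: list[str]) -> str:
    """The synthesis stage only sees what retrieval returns, so a weak
    retriever caps answer quality no matter how strong the model is."""
    context = keyword_search(query, documents)
    if not context:
        return "No relevant documents found."
    # Placeholder for the LLM synthesis call over `context`.
    return f"Answer synthesized from {len(context)} retrieved document(s)."
```

Evaluating only the final answer would conflate these two stages, which is why the assessment has to cover both the retriever and the LLM.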

How we benchmark

Our benchmarking process begins long before putting CoCounsel through its paces. Whenever a significant new LLM is released, we test it across a wide suite of public and private legal tests, such as the dataset created by our Stanford collaborators, to assess its aptitude for legal review and analysis. We then integrate the LLMs that perform well in these initial tests with the CoCounsel platform, in a staging environment, to evaluate how they perform under real-world conditions.

Then we use an automated platform to run a battery of test cases created by our Trust Team (more on this below), to evaluate the output that comes from this experimental integration. If the results are promising, we conduct additional manual reviews using a skilled team of attorneys. When we see an improvement in performance compared to previous benchmarks, then we start talking as a team about how it might improve the CoCounsel experience for our users.

How we test

Our Trust Team has been around as long as CoCounsel has. This group of experienced attorneys from diverse backgrounds (in-house counsel, large and small law firms, government, public policy) is dedicated to continually and rigorously testing CoCounsel's performance.

We continue to follow a process that's been integral to all our performance evaluation since CoCounsel's inception: Our Trust Team creates tests representative of the real work attorneys use CoCounsel for and runs these tests against CoCounsel skills. When creating a test, they first consider what the skill is for and how it might be used, based on their own insights, customer feedback, and secondary sources. Once the test is created, the attorney tester manually completes the test task, just as a lawyer would, to create an answer key, which we refer to as an "ideal response." These tests and their corresponding ideal responses then undergo peer review. Being this meticulous is crucial, because the quality of our ideal responses determines the benchmark for a passing score.

Once the ideal response has been created, a member of the Trust Team runs the test, using the applicable CoCounsel skill to complete the task just as a user would. An attorney tester reviews the output, referred to as our "model response." They then compare CoCounsel's response to the ideal response point by point, identifying differences and assessing whether these differences deviate from the ideal response in a way, or to a degree, that would make the skill's output incomplete, incorrect, or misleading. It is important to note that a test can be failed for many reasons, not just because an answer is outright wrong. For instance, some skills (e.g., Summarize) are more inherently subjective than others and do not lend themselves to a right or wrong answer. Accordingly, even though attorneys' opinions of the "correct" answer can reasonably vary, we try to maintain the highest grading standards and fail the test if it is missing a single detail the attorney tester considers to be critical.

Based on this manual evaluation, we create a set of evaluation instructions tailored to the ideal output unique to each skill, capturing the needs of each skill so an LLM can match our human reviewers' process. To illustrate this, let's look at a real example using CoCounsel's Extract Contract Data skill. After running the test in our automated evaluation platform, here's what an attorney reviewer would see when reconciling the model response with the LLM's evaluation of it:

  • Question: If the customer is acquired through a merger, can the agreement be assigned to the acquiring entity without obtaining the provider's prior written consent?
    • Trust Team ideal response:
      • Short answer: Yes, with 30 days' prior written notice.
      • Detailed answer: Yes, if the customer is acquired through a merger, the agreement can be assigned to the acquiring entity without obtaining the provider's prior written consent; the customer is only required to provide the provider with 30 days' prior written notice. The clause specifies that either party may assign its rights or delegate its obligations without such consent, provided that 30 days' prior written notice is given to the other party, to an entity that acquires all or substantially all of the business or assets of such party, including by merger.
  • Model response:
    • Short answer: Yes, with notice.
    • Detailed answer: The agreement can be assigned to the acquiring entity without obtaining the provider's prior written consent if the customer is acquired through a merger. The clause provides an exception to the general rule requiring consent for assignment. Specifically, it states that either party may assign its rights or delegate its obligations without such consent to an entity that acquires all or substantially all of the business or assets of the party, including through a merger, provided that 30 days' prior written notice is given to the other party.
    • Automated evaluation analysis: The model response captures the correct information that the agreement can be assigned to the acquiring entity without obtaining the provider's prior written consent but requires 30 days' prior written notice. However, it incorrectly states the short answer as "Yes, with notice" instead of "Yes, with 30 days' prior written notice," as per the ideal answer. This is a minor difference and does not change the meaning significantly.

In this instance, the model response included a minor discrepancy from the attorney-authored ideal response. But the LLM's evaluation of the response accurately determined that the answer remained sufficient, because it captured the complete notice requirement elsewhere in the response.

Our ideal-response approach provides two key advantages over assertion-based evaluations. It excels at identifying deviations from attorney expectations, including hallucinations. And it pinpoints extraneous or inconsistent information that, while not technically a hallucination, introduces logical inconsistencies that can render even an otherwise complete response incorrect and result in a failing score.

We rely on our Trust Team to create well-defined ideal responses and auto-evaluation instructions and to determine whether a test case passes or fails. A skill's output definitively fails if it falls short of this ideal because of material omissions, factual incorrectness, or hallucinations. However, we recognize that many legal issues aren't black-and-white, and the "correct" answer can be open to reasonable disagreement. To address this, we peer review ideal responses in cases where the answer might require a second opinion, and we may eliminate tests when we find insufficient agreement among the attorney testers. This is how we both ensure that our passing criteria remain rigorous and account for the nuanced nature of legal analysis.
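The ideal-response grading loop described above can be sketched in code. This is an illustrative sketch only: the LLM-as-judge step is stubbed out as a keyword check so it runs, and the data-class fields, function names, and pass/fail rule (any missing critical detail fails) are assumptions standing in for the non-public CoCounsel evaluation instructions:

```python
from dataclasses import dataclass

@dataclass
class TestCase:
    question: str            # task given to the skill
    ideal_response: str      # attorney-authored answer key
    critical_details: list   # details whose omission fails the test

def judge(model_response: str, case: TestCase) -> dict:
    """Stand-in for the LLM-as-judge step. A real pipeline would prompt an
    LLM with the ideal response, the model response, and skill-specific
    evaluation instructions, then parse its verdict. A keyword check keeps
    this sketch runnable."""
    missing = [d for d in case.critical_details if d not in model_response]
    return {"passed": not missing, "missing_critical": missing}

def run_suite(cases, skill) -> float:
    """Run every test case through a skill and return the pass rate."""
    verdicts = [judge(skill(c.question), c) for c in cases]
    return sum(v["passed"] for v in verdicts) / len(verdicts)
```

A single missing critical detail fails the test, mirroring the strict grading standard described above, while a response that states the same facts in different words still passes.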

Maintenance and improvement

Creating a skill test set is only the beginning. Once we begin using it, the Trust Team continually monitors and refines it by manually reviewing failure cases from the automated tests and spot-checking passing samples to make sure the automated evaluation is in line with human judgments. We also regularly add tests to cover more use cases and capture user-reported issues, which can lead to further iterations of the tests submitted for automated evaluation and their success criteria.
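The spot-check comparison between automated verdicts and human judgments reduces to a simple agreement rate. This is a hedged sketch: the function names and the 0.95 recalibration threshold are illustrative assumptions, not figures Thomson Reuters has published:

```python
def agreement_rate(auto_verdicts, human_verdicts):
    """Fraction of spot-checked tests where the automated evaluator's
    pass/fail verdict matches the human attorney reviewer's judgment."""
    if len(auto_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be paired one-to-one")
    matches = sum(a == h for a, h in zip(auto_verdicts, human_verdicts))
    return matches / len(auto_verdicts)

def needs_recalibration(auto_verdicts, human_verdicts, threshold=0.95):
    """Flag the automated evaluation for review when it drifts too far
    from human judgments (threshold is illustrative, not published)."""
    return agreement_rate(auto_verdicts, human_verdicts) < threshold
```

When agreement drops, either the evaluation instructions or the test cases themselves get revised, which matches the iteration loop described above.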

By following this process, we can execute more than 1,500 tests every night across all CoCounsel skills on our automated platform under attorney oversight; combined with manual testing, that means we've run more than 1,000,000 tests since CoCounsel's launch. This empowers us to quickly identify areas for improvement, which is vital to ensuring CoCounsel remains the most trustworthy AI legal assistant available.

Conclusion

In a previous post, we explored what it means for an AI tool to be "professional-grade" and why that standard is crucial for professionals in high-stakes fields like law. This post takes that concept further by diving into how we benchmark CoCounsel to ensure it meets those rigorous standards. By understanding the extensive testing that goes into evaluating its performance, you can see how CoCounsel consistently delivers the reliability and accuracy expected of a true professional-grade GenAI solution.

To promote the transparency my team and I believe is necessary in the legal AI field, we've decided to release some of our performance statistics for the first time, along with a sample of the tests used, applying the criteria referenced in this article, to arrive at those figures. Check out our results.

This is a guest post from Jake Heller, head of CoCounsel, Thomson Reuters.

Quick Check Mischaracterization Identification: New Westlaw Enhancement Furthers the Thomson Reuters Generative AI Vision /en-us/posts/innovation/quick-check-mischaracterization-identification-new-westlaw-enhancement-furthers-the-thomson-reuters-generative-ai-vision/ Tue, 22 Oct 2024 13:19:05 +0000 https://blogs.thomsonreuters.com/en-us/?post_type=innovation_post&p=63572 Thomson Reuters recently announced deeper integration of CoCounsel 2.0 in Westlaw and Practical Law, as well as new generative AI research features, Mischaracterization Identification in Quick Check and AI Jurisdictional Surveys, that are saving customers significant time and helping them ensure the accuracy of their research. The enhancements build on the Thomson Reuters vision to deliver a comprehensive GenAI assistant for every professional it serves.

Below, CJ Lechtenberg, senior director, Westlaw Product Management, Thomson Reuters, shares her insights on developing Mischaracterization Identification, a generative AI capability to help detect mischaracterizations and omissions in legal briefs.

In the five years since Quick Check was introduced, you've added many enhancements, including Quick Check Contrary Authority Identification, Quick Check Judicial, and Quick Check Quotation Analysis. How did integrating generative AI make the Mischaracterization Identification enhancement different than previous ones?

Lechtenberg: This enhancement takes researchers beyond the step of knowing what might be a potential mischaracterization to an explanation of why something might be a potential mischaracterization, and that is radically different from any feature we've deployed in Quick Check before.

I'm sure it'll come as no surprise when I say that generative AI is just a completely different beast. Lay people may think about the law as being black and white: You can do this; you can't do that. But legal professionals know that the law is really a sea of varying shades of gray. With machine learning, we wrestled with how we could ever give the machine enough data to figure out all the different ways an attorney may mischaracterize the law.

In Quick Check Quotation Analysis prior to the Mischaracterization Identification enhancement, we highlighted the actual textual differences (additions, omissions, and changes) in the quotations and showed the context around the quotes. Doing so certainly saved researchers a significant amount of time and helped them spot issues they might not otherwise find, but the onus was still on researchers to review everything and determine what the precise differences were and how material they might be, if at all. Even with the additional context provided, it could still be difficult to determine whether the quotations were taken out of context, especially if the quotes themselves didn't appear to be different.

In developing Mischaracterization Identification, we recognized that the task of analyzing quotations and their context is so nuanced that attorneys will have different expectations for whether a mischaracterization occurred, so we needed to provide more than just categorizations. We found that large language models (LLMs) can generate nuanced descriptions of potential mischaracterizations, versus just explicit categorizations, and do it well, which is hugely beneficial for this type of task.

How will using Mischaracterization Identification give legal professionals and law firms a competitive advantage? How will judges using it benefit?

Lechtenberg: The advantages of using the new Mischaracterization Identification are substantial for both legal professionals and the judiciary, both in terms of speed of review and quality of work product. When we launched Quick Check Quotation Analysis in 2020, customers, both legal professionals and the judiciary, lamented how time-consuming it is to review quotations and how challenging it is to spot differences. It is a mentally taxing task, and often our brains fill in the blanks, interpreting what we think a brief should say but actually doesn't. Attorneys never have a surplus of time, so the last thing they want to do is spend the little bit they have on the most tedious of tasks and still end up missing potential problems.

For attorneys, Mischaracterization Identification will help them efficiently and accurately make contextual misstatement and omission determinations for their opponents' and their own quotations and the context surrounding those quotations. The fear of missing their own mistakes is very real for attorneys, but the possibility of missing the opportunity to capitalize on their opponents' mistakes is an even larger concern. This new enhancement reduces both of those worries and will help attorneys be even better advocates for their clients.

Judges will also be able to effectively review the filings of parties in matters before them much faster. Attorneys owe a duty of candor to the judiciary, and the Mischaracterization Identification feature will help flag any potential issues quickly. An added benefit, which members of the judiciary or their staff perhaps haven't considered, is the ability to analyze their own orders and opinions to ensure that they haven't made mistakes that could be appealed. This new enhancement will help alert judges and law clerks to potential issues before they finalize their opinions.

What early feedback are you hearing from customers?

Lechtenberg: In a recent survey, 93% of law firm professionals told us they've seen opposing counsel misuse a quotation, 66% said they've seen misrepresentations by an associate or colleague, and 65% of corporate respondents said they check the accuracy of outside counsel's quotations. The need to review opposing counsels' and colleagues' briefs for mischaracterizations of the law is still a very real issue for attorneys. Likewise, attorneys have said they're always concerned about the accuracy of their work and that maintaining their reputation as a credible litigator with courts and opposing counsel is incredibly important.

Customers are extremely excited about this new Quick Check enhancement to help combat these concerns, and we've received positive feedback from them. One law firm managing partner stated that they would use this tool a lot. They cite-check their opponents' briefs, so any shortcuts are beneficial to them. They recognize that most of the time, errors are harmless, but occasionally there are things they want to bring to the court's attention, and this feature will help them spot those issues more quickly and accurately.

Another law firm partner said this new feature is the "ultimate security blanket" because everything attorneys do is based on their credibility, and this feature alerting them to quotes being taken out of context before filing with the court would calm some of those fears.

Any surprising or unexpected moments as the team worked on developing or launching Mischaracterization Identification?

Lechtenberg: The fact that we've accomplished this now with the use of LLMs is exciting, a little surprising, and a long time coming. I'm an attorney who leads a team of attorneys; we're literally trained to question everything and have a healthy dose of skepticism. But I have been dreaming about a mischaracterization identification feature in Quick Check ever since we developed Quotation Analysis more than five years ago. At my core, I believed someday this could be achieved, but for years traditional machine learning approaches were just not powerful or nuanced enough to do it well.

Leveraging LLMs for a use case like this is a new frontier like we've never seen before. The LLM's ability to analyze text from an uploaded document, compare that text to the text from the cited case used to support the argument, and then go beyond highlighting textual differences to provide an actual explanation of what may be problematic, whether that's a selective quote, omitted context or a misinterpreted holding, has been absolutely astounding.

What鈥檚 the one thing you want everyone to know about Mischaracterization Identification?

Lechtenberg: Mischaracterization Identification will not only help researchers spot contextual misstatements and omissions in their opponents' or their own quotations and contextual statements faster and with more accuracy, but most importantly it will help them understand why those misstatements or omissions may be problematic. And, spoiler alert: Mischaracterization Identification is just the beginning of how Thomson Reuters will harness the power of generative AI in Quick Check to solve important customer problems.

For more on Mischaracterization Identification, read the press release or check out the post by Mike Dahn, head of Westlaw Product Management, Thomson Reuters.

A Holistic Approach to Advancing Generative AI Solutions /en-us/posts/innovation/a-holistic-approach-to-advancing-generative-ai-solutions/ Thu, 17 Oct 2024 11:42:10 +0000 https://blogs.thomsonreuters.com/en-us/?post_type=innovation_post&p=63467 At Thomson Reuters, our vision is to deliver an AI assistant for every professional we serve. As part of that, our focus is on delivering benefits for our customers across the breadth of our AI- and non-AI-powered features. We know that our solutions deliver benefits to customers in many ways, including AI-powered automation.

In April of this year, we shared our vision to provide a GenAI assistant for each professional we serve. CoCounsel embodies our ongoing efforts to augment professionals' work with GenAI skills, enabling professionals to accelerate and streamline entire workflows to increase efficiency, produce better work, and deliver more value for their clients. Our continued investment in GenAI is driven by the goal of enabling professionals across industries to accelerate and streamline entire workflows through a single GenAI assistant.

We believe our investment in GenAI, along with our integration with customer data as well as third-party integrations, extends the value customers derive from CoCounsel beyond our connected experience and our verified and trusted content. Our work with Microsoft, for example, includes CoCounsel integrations across Word, Outlook, and Teams, meeting professionals where they're already working.

AI and large language models are proving to be powerful tools that deliver efficiency gains and strengthen research practices for our customers. Yet our efforts to redefine work with GenAI are rooted in our strong foundation of editorial enhancements, authoritative content and technological expertise, alongside our long history of working closely with customers. That's why we continue to build out AI- and non-AI-powered solutions to help with the entire workflow for legal, tax, and risk and compliance professionals. While AI may not be perfect, it can significantly help professionals reduce the amount of work and manage more complex and substantive work more efficiently. We collaborate with our customers to help them understand that AI is an accelerant rather than a replacement for their own research.

Benchmarking expectations

As a leader in innovation and AI research, we recognize the role that independent benchmarking plays in ensuring the accuracy, transparency, and accountability of evolving GenAI solutions. We believe that benchmarking can improve both the development and the adoption of AI. We also see it as one component in a broad range of ways we consider and understand the benefits AI delivers for our customers. We work with our customers as their trusted partners for change, helping them to confidently understand and adopt new technologies, looking at both their immediate value and role in long-term transformation, and leveraging our deep understanding of their businesses.

At Thomson Reuters, our understanding of the holistic value of our products is based on customers' usage and the benefits they derive. Our customers have run more than 2.5M searches through AI-Assisted Research on Westlaw Precision since its launch late last year, and they tell us it's saving time and improving productivity. Similarly, internal testing of CoCounsel's skills has yielded impressive results, particularly with regard to CoCounsel's document review capabilities.

Our benchmarking support is reflected in our participation in studies including Vals.ai, as well as two consortium efforts from Stanford and Litig exploring how to best evaluate legal AI. We are submitting CoCounsel AI skills to the Vals.ai benchmarking study in five areas of evaluation: Doc Q&A, Data Extraction, Document Summarization, Chronology Generation, and E-Discovery.

This study is a first attempt at establishing a standard, so we should view this work as a first iteration and an opportunity to learn, rather than treating it as a gold standard. For example, one limitation of the benchmarking methodology is that each vendor's results are evaluated based on the text output alone, removed from the interface and experiences of the individual products. This discounts the work each vendor has done to design interfaces and safety features to minimize the harms of errors. It reinforces the need for a holistic evaluation of each product being tested, ideally as designed for the user.

Looking ahead, my expectation is that, while accuracy will continue to improve, no products will produce answers entirely free of errors. And as we've shared with our customers, every AI product requires human expertise for verification and review, regardless of the accuracy rate. Because the current approach to benchmarking reports an accuracy percentage, we need to be very clear on this point: whether the product produces a score in the low or high 90s, all answers still must be checked 100% of the time.

I look forward to our ongoing collaboration with customers and industry partners as we continue our work towards minimizing inaccuracies and increasing the usefulness of the research outcomes for GenAI tools and all our solutions.

Unlocking the full potential of professional-grade GenAI for your work /en-us/posts/innovation/unlocking-the-full-potential-of-professional-grade-genai-for-your-work/ Tue, 15 Oct 2024 12:06:01 +0000 https://blogs.thomsonreuters.com/en-us/?post_type=innovation_post&p=63470 Today, nearly two years since ChatGPT debuted, GenAI continues to dominate our cultural and professional conversations. Even as its adoption for work steadily increases, the biggest concern for most professionals (70% of them) is accuracy of output.

However, not using GenAI for work at all is a non-option: 77% of professionals believe AI will have a high or transformational impact on their work over the next five years, and 78% call AI a "force for good" in their profession. In fact, 50% of law firms named AI among their top five strategic priorities for the next 18 months. If there were still doubt, there definitely isn't anymore: GenAI is here to stay.

So how can conscientious, and forward-thinking, professionals make the most of this generational technology while guarding against its drawbacks? How do you know if the GenAI solution you're considering will live up to your professional obligation to work ethically and ensure your clients' data is securely handled? Is any GenAI product trustworthy? Is it even possible for tools built on large language models (LLMs) such as GPT-4o from OpenAI and Google's Gemini, all of which are known to hallucinate, to be safe enough to use professionally?

Yes, it is possible. "Built on" is the key. When we launched our GenAI assistant, CoCounsel, our product and engineering teams delivered on the challenge of creating a product that could take advantage of LLMs' tremendous raw power while eliminating as many as possible of their serious limitations, like hallucinations, that curb the professional utility of models used on their own. What makes the current generation of LLMs truly extraordinary, then, is not what they alone can do, but what they enable.

Using a model directly should be done with great caution and exposes users to risk if they use the output professionally. CoCounsel, on the other hand, harnesses that power and has engineered robust, well-tested accuracy, privacy, and security controls around it. In short: LLMs are the world's most incredible engines. CoCounsel uses that engine to take you incredible places (places you couldn't reach without these LLMs) safely.

Why can professionals trust CoCounsel?

We've applied our technical and domain expertise to leading LLMs in creating and continuing to optimize CoCounsel, a first-of-its-kind product that both does more than LLMs can and corrects the problems that make them unsuitable on their own.

In short, CoCounsel is a professional-grade GenAI assistant. And no professional should use a GenAI solution that isn't.

What does it take for a GenAI assistant to be professional-grade? At a bare minimum, without which it should not be trusted for your work, it must be:

  1. Built for domain-specific use and grounded in reliable sources of data relevant to that use.鈥疉 professional-grade solution, such as CoCounsel, harnesses the power of LLMs but limits the source of knowledge to known, reliable data sources 鈥 such as profession-specific domains or professionals鈥 or their clients鈥 databases 鈥 which rigorously limits the possibility of inaccuracies.
  2. Built to make verifying its output easy. CoCounsel was not designed to replace the role of the professional, but rather to help them accomplish more and higher-quality work in less time. So just as lawyers review all work delegated to a junior associate or paralegal, they must validate CoCounsel鈥檚 output. We鈥檝e made it easy to do so: all answers link to their origin in the source documents, so it鈥檚 simple to 鈥渢rust, but verify.鈥
  3. Developed by technical teams with deep GenAI expertise.鈥 Though GenAI has only been broadly talked about since 2022, it鈥檚 been around since 1961. 成人VR视频 AI engineers and research teams have worked with LLMs since their invention, were among the first to build with GPT-4, and have invented patented approaches to applying LLMs to professional use cases.鈥
  4. Continually and consistently tested and authenticated by a dedicated team of domain experts.鈥 成人VR视频AI engineers and Trust Team attorneys together filter, rank, and score CoCounsel鈥檚 responses to a daily battery of thousands of tests developed to simulate real-life legal use cases and ensure the assistant鈥檚 answers are consistent and accurate. To date we鈥檝e run more than 1,000,000 such tests against CoCounsel.鈥
  5. Secure and private, because it interacts with third-party LLMs the right way. Thomson Reuters GenAI solutions access third-party LLMs through dedicated, private servers, and through an "eyes off" API. No LLM partner employees can see customer queries or documents, and our LLM access is contractually "zero retention": our LLM partners cannot store customer data longer than it takes to process the request. Our product data is never used to train any third-party models. And all product data is encrypted in transit and at rest, and subject to Thomson Reuters' rigorous security policies and practices.
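The continuous testing described above can be pictured as a simple evaluation harness: run the assistant against a fixed battery of simulated legal tasks and score each response, flagging weak answers for expert review. The sketch below is a generic illustration only; the case format, scoring rule, and threshold are assumptions, not the actual CoCounsel test suite.

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    prompt: str                              # simulated legal task
    required_phrases: list = field(default_factory=list)  # facts a correct answer must contain

def score(response: str, case: TestCase) -> float:
    # Fraction of required facts present in the response (0.0 to 1.0).
    hits = sum(p.lower() in response.lower() for p in case.required_phrases)
    return hits / len(case.required_phrases)

def run_battery(assistant, cases, threshold: float = 0.8):
    # Score every case; flag any response below the threshold for expert review.
    results = [(case, score(assistant(case.prompt), case)) for case in cases]
    flagged = [case for case, s in results if s < threshold]
    return results, flagged

# Toy assistant and one-case battery, purely for illustration.
cases = [TestCase("When does the contract's limitation period expire?",
                  ["six years"])]
fake_assistant = lambda prompt: "The limitation period is six years."
results, flagged = run_battery(fake_assistant, cases)
```

In a real harness the scoring step would itself combine automated checks with the expert filtering, ranking, and scoring the post describes.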

Why "makes minimal mistakes" isn't enough

As important as the above five characteristics are, they've become price-of-entry criteria for professional-grade GenAI. And given how rapidly the technology is evolving, what you expect from a professional-grade solution should evolve as well. Remember: GenAI has the power to do much more than help you complete jobs. It can transform what it means to be a professional, freeing you for more strategic, creative, valuable work that a machine simply cannot do, which can transform not only how you do business, but also how much business you do.

To take full advantage of this potential, you need a GenAI assistant that fulfills two more key requirements:

1. Professional-grade means intelligently and seamlessly handling workflows, not just completing a series of tasks. A true GenAI assistant goes beyond responding to your requests, instead guiding you through the steps required to finish long, complex, even open-ended projects. Only this kind of product, such as CoCounsel, can truly unlock the full potential of GenAI.

Through an expanding set of capabilities and deep connections with both your documents and the tools you use every day, such as Microsoft 365, a professional-grade GenAI assistant can carry an entire deliverable from task to task and program to program in a continuous stream, prompting you forward through the next steps while simultaneously handling multiple pieces of the work itself.

CoCounsel is built for workflows. It's accessible across multiple Thomson Reuters products, bringing together fundamental capabilities such as summarization and document review with specialized functions such as legal research, for a smooth transition from one type of work to the next. And it's integrated with both Microsoft 365 and document management systems, available wherever you're working, from client communication to research to document drafting and beyond.

2. Professional-grade means providing partnership, not just product. Without in-depth, sustained support, you're unlikely to get the most possible value from your investment. A professional-grade vendor offers a success and support team that will be there for you long after you've signed a subscription agreement. They work with you to deeply understand your most prevalent use cases and how their solution will help you tackle them, and of course are available when something's not working the way it should. They keep you informed about product changes and improvements and continue creating content you can use to increase your knowledge, such as videos, webinars, and written materials. A truly professional-grade team has a vision for their GenAI assistant, ensuring it will increase in power and capability as the technology advances. And most important, they are as invested in you as you are in them.

Upon adopting CoCounsel, legal professional users are trained by the Thomson Reuters Customer Success team, made up of licensed attorneys, many still practicing, with dozens of years of combined experience in both litigation and transactional law. Many of these trainers are prompt engineering specialists who, in addition to being licensed attorneys, have a background in computer science. After onboarding, we ensure everyone using CoCounsel has the opportunity to attend live trainings and watch recorded webinars, get individual help through live chat and email, and access dozens of video tutorials, a resource pool that will only keep growing.

This is a guest post from Jake Heller, head of CoCounsel, Thomson Reuters, and Erin Nelson, CoCounsel content strategist, Thomson Reuters.

The Transformative Role of AI in Professional Tools: A Conversation With David Wong and Leann Blanchfield
/en-us/posts/innovation/the-transformative-role-of-ai-in-professional-tools-a-conversation-with-david-wong-and-leann-blanchfield/ Wed, 02 Oct 2024 13:33:39 +0000

Leann Blanchfield, head of Editorial, Thomson Reuters, said now is the most exciting time in her 30+ years with the company.

In the latest episode, Blanchfield shared how the power of generative AI, and the dramatic leap it's making in how professionals across industries can access large quantities of data, is transforming the legal industry and beyond. Blanchfield credits the more than 1,500 attorney editors on her team, who create and enhance content, with harnessing the power of generative AI for legal research.

Human expertise is just one component of how Thomson Reuters is capitalizing on the potential of generative AI. Three elements are critical, as David Wong, chief product officer, Thomson Reuters, noted in his comments about the launch of CoCounsel 2.0 at ILTACON: "We have the data, the expertise, and the tech. Few have all three in such quantity and depth."

In the new episode, Wong focused on the role of human domain experts, noting they're key to the process of creating and validating data used by AI models for professional research.

鈥淭here’s a lot of both prompt engineering, fine tuning, and system refinement that’s necessary to get quality to a usable spot,鈥 Wong said. 鈥淓xperts, experienced researchers and experienced lawyers can help to gauge whether or not the systems are correct. We couldn’t have an objective, quantified measure of quality on these systems without the editors, without those experts.鈥

Wong and Blanchfield discussed the importance of human experts in ensuring the accuracy and reliability of AI.

"Maintaining accuracy is at the heart of what the editorial team does," Blanchfield said. "It's the number-one priority across every editorial team. We maintain our content to be accurate and trusted."

Wong acknowledged it's challenging for the team to process and update unstructured, constantly changing data in real time. He said Thomson Reuters ensures its AI models are customized to meet the varying needs of different jurisdictions through a combination of software and algorithms that take advantage of the LLMs.

"So when you ask a question, for example, we are running an end-to-end algorithm which runs search, retrieves data, re-ranks, interprets and then ultimately passes that information to a large language model to synthesize and produce the answer," Wong said. "It's a very complicated system which involves multiple types of technology, multiple types of information retrieval. Processing unstructured, dynamic data and customizing AI models requires integrating multiple technologies and algorithms to optimize performance."
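The pipeline Wong describes (search, retrieve, re-rank, then hand the top passages to an LLM for synthesis) is the standard retrieval-augmented generation pattern. A minimal sketch follows; the toy lexical search, overlap-based re-ranker, and prompt format are illustrative assumptions, not the actual Thomson Reuters system, and in production the final prompt would be sent to an LLM rather than returned.

```python
# Retrieval-augmented generation (RAG) sketch: search a corpus, re-rank the
# candidates, and build a grounded prompt from the top passages.

def search(query: str, corpus: list) -> list:
    # Toy lexical search: keep documents sharing at least one query term.
    terms = set(query.lower().split())
    return [doc for doc in corpus if terms & set(doc.lower().split())]

def rerank(query: str, candidates: list, top_k: int = 3) -> list:
    # Toy re-ranker: order candidates by term overlap with the query.
    terms = set(query.lower().split())
    ranked = sorted(candidates,
                    key=lambda d: len(terms & set(d.lower().split())),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(query: str, passages: list) -> str:
    # Ground the model in numbered source passages before asking the question,
    # which also lets answers cite [n] back to their origin.
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only these sources:\n{context}\n\nQuestion: {query}"

def answer(query: str, corpus: list) -> str:
    passages = rerank(query, search(query, corpus))
    return build_prompt(query, passages)  # in production, an LLM consumes this

corpus = ["The statute of limitations for breach of contract is six years.",
          "Employment agreements often include non-compete clauses."]
prompt = answer("What is the statute of limitations for contract claims?", corpus)
```

Numbering the retrieved passages in the prompt is also what makes the "answers link to their origin in the source documents" behavior possible: the model can be instructed to cite the bracketed source indices.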

Hear more of Wong and Blanchfield's insights on integrating AI into professional tools and ensuring that information is trustworthy in this episode of the TechConnect series, which brings diverse and dynamic perspectives from all corners of the technology world with thought-provoking questions and conversation.
