Technically Speaking

How To Work On Legal Generative AI's Black Box Problem

Steven Lerner covers legal technology for Law360 Pulse
Many legal technology providers have boasted that their generative artificial intelligence tools were incapable of "hallucination." That illusion shattered in May when an academic study, originally criticized and eventually revised, reported alarming rates at which some industry tools produce false or misleading information.

The study by Stanford University's Human-Centered AI Group found that AI research tools from LexisNexis and Thomson Reuters each hallucinate more than 17% of the time. In the original study, LexisNexis gave accurate responses 65% of the time, three times the rate of a tool from Thomson Reuters.

A revised version of the study, released a week later, found that a different tool from Thomson Reuters provided false information 33% of the time.

Hallucination refers to false outputs from a generative AI tool. Some legal tech providers previously touted their generative AI tools as free from hallucination due to the natural language processing technique known as retrieval-augmented generation, or RAG, but the Stanford study showed that this was overstated.

Legal tech providers have not been fully transparent about how their AI tools work or about the systems that power their platforms. This kind of opacity in an AI system is known as the "black box problem."

Researchers from the Stanford study suggested there needs to be rigorous, transparent benchmarking and public evaluations of AI tools in law. Benchmarking refers to evaluating multiple tools on the same metrics.
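To make the idea concrete, here is a minimal sketch of what scoring several tools on the same metric could look like. It is not the Stanford researchers' method. The tool names, the labels and the graded results are all hypothetical placeholders, and the sketch assumes a human reviewer has already labeled each tool's response to an identical set of queries.

```python
from collections import Counter

# Hypothetical graded results: each tool answered the same four queries,
# and a reviewer labeled every response. Illustrative values only,
# not data from any real study or product.
GRADED = {
    "tool_a": ["correct", "hallucinated", "correct", "incomplete"],
    "tool_b": ["correct", "correct", "hallucinated", "hallucinated"],
}


def rates(labels):
    """Return the share of responses that carry each label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def benchmark(graded):
    """Apply the same metric (hallucination rate) to every tool."""
    return {
        tool: rates(labels).get("hallucinated", 0.0)
        for tool, labels in graded.items()
    }


if __name__ == "__main__":
    for tool, rate in benchmark(GRADED).items():
        print(f"{tool}: {rate:.0%} hallucinated")
```

Because every tool is graded on the same queries with the same labels, the resulting rates can be compared directly, which is the property a shared benchmark is meant to provide.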

"What is striking about legal technology is that no such benchmark exists, and instead legal technology providers can make all sorts of claims that are really not grounded and haven't been corroborated or substantiated," Daniel Ho, a Stanford law professor and one of the researchers behind the recent study, told Law360 Pulse. "Given the high documented rate of hallucinations, it is absolutely critical that we move to a system [that] has happened elsewhere in AI, that is more transparent and has benchmarking to really understand whether these improvements have been made."

Other AI fields rely on a common set of benchmarks to assess the models that power AI platforms. For example, a benchmark known as Massive Multitask Language Understanding, or MMLU, is used to test public AI models on questions spanning many academic subjects.

In contrast to other fields, the legal tech ecosystem is fundamentally more closed off. Legal vendors aren't making their tools available for these evaluations, and certainly not to users unwilling to pay exorbitant subscription fees.

"Being able to conduct an evaluation like this is incredibly resource-intensive, and it should not be incumbent on independent academic researchers to try to substantiate these claims," Ho said. "There's also a responsibility by companies to actually provide evidence when they make claims like, 'Our system does not hallucinate.'"

The initial release of Stanford's study in May prompted criticism online about the results and how the study was conducted. Notably, a spokesman for Thomson Reuters said that the initial study used one of its tools, Practical Law's Ask Practical Law AI, when it should have used another, Westlaw's AI-Assisted Research. Researchers said that Thomson Reuters denied them access when they originally requested to use that tool.

LexisNexis told Law360 Pulse that it had not been in contact with the researchers before the study and that its internal research showed a lower hallucination rate.

Greg Lambert, chief knowledge services officer at the law firm Jackson Walker, wrote on LinkedIn about the need to redo the original report's benchmarking by using Westlaw's AI-Assisted Research tool.

In addition, Lambert told Law360 Pulse that he supports independent researchers evaluating the current slate of legal AI tools.

"Typically, an academic institution could come in and do the benchmarking, but I think the Stanford Human-Centered Artificial Intelligence study may have put a stain on academics conducting this type of research and understanding the legal industry in a way that would establish a trust for these types of studies," Lambert wrote to Law360 Pulse.

"If academics did run the study," he continued, "I think programs like Vanderbilt's VAILL program, Berkeley Law's AI Institute, or other practical technology or AI programs within a law school might have the ability, prestige and trust level that a study like this would need to be taken seriously."

Lambert added that law firms have already developed a "gut check" for testing legal research tools through their long-standing relationships with vendors, though that may not be the best way to evaluate these AI tools.

If not law firms, who should be responsible for testing legal AI tools?

One option is to follow the model set forth by the National Institute of Standards and Technology, or NIST. Ho said that NIST's testing of facial recognition software has kept vendors in that space more honest, but developing a NIST version for the legal field would require a lot of resources.

Another option is to have law librarians take the lead in evaluating these tools.

In the wake of Stanford's study, a quartet of law librarians from Harvard, Ohio State University and the University of Oklahoma proposed using legal research tasks based on existing data to test generative AI platforms. The proposal was announced in a blog post in late May, with plans to develop the project into an academic study.

Whether through law librarians, academic researchers or publicly funded organizations, benchmarking the hallucination rates of AI platforms is important because law firms are trying to determine the best tools to acquire.

"It's really difficult if you can't compare products and you just have to make your way through a set of marketing claims," Ho said.

Ho added that law firms should use their purchasing power to demand that vendors publicly benchmark their AI products to ensure their marketing claims are warranted.

Failure to publicly evaluate vendors' hallucination claims could be devastating to the legal industry.

"We could get a race to the bottom, where those with the most grandiose marketing claims capture the market," Ho said. "That could really harm firms that are really trying to do this right and ultimately cause grave harm in legal practice and to clients."

--Editing by Robert Rudinger.

Update: This story has been updated with additional information about the study.

Law360 is owned by LexisNexis Legal & Professional, a RELX Group company.

Technically Speaking is a column by Steven Lerner. The opinions expressed are those of the author and do not necessarily reflect the views of Portfolio Media Inc. or any of its respective affiliates.
