AI’s Limitations in the Practice of Law | Samuel Estreicher, Lior Polani | Verdict



Introduction: The Promise and Reality of Legal AI

The legal profession stands at a technological inflection point, with artificial intelligence (AI) promising to revolutionize how lawyers work. The numbers appear compelling: Thomson Reuters reports that AI systems saved lawyers an average of four hours per week in 2024, generating approximately $100,000 in new billable time per lawyer annually across the United States legal market. Major law firms report 500-800% productivity increases in paralegal tasks, while AI-assisted document review reportedly now identifies critical information in 2,000-page police reports that human reviewers routinely miss.

However, these impressive claims mask a fundamental technical constraint that limits AI’s effectiveness in legal practice: the “context window” limitation. This paper examines how this constraint shapes what AI can and cannot do in legal work, evaluates vendor responses to these limitations, and provides practical guidance on where AI automation can succeed versus where human oversight remains essential.

I. The Context Window Problem in Legal AI

A. Understanding Context Windows: The Core Technical Constraint

To grasp why context windows pose such challenges for legal AI, imagine trying to analyze a complex contract dispute through a small window that only reveals one page at a time. You cannot simultaneously view the definitions section, the operative clauses, and the exhibits that give those clauses meaning. This metaphor captures the fundamental constraint facing current AI systems: they have limited “desk space” for processing information.

A context window represents the maximum amount of text an AI system can process simultaneously—technically measured in “tokens” (roughly equivalent to words). While modern large language models (LLMs) advertise context windows ranging from 32,000 to over 1 million tokens, performance degrades significantly at higher token counts. This degradation has serious consequences. As researchers explain, “long context windows cost more to process, induce higher latency, and lead [LLMs] to forget or hallucinate [i.e., fabricate] information.” Databricks’ 2024 study confirms this finding: even state-of-the-art models like GPT-4 experience inaccuracies starting at 64,000 tokens, with only a handful maintaining consistent performance beyond this threshold.
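
To make the token arithmetic concrete, the minimal sketch below counts the tokens in a document using the open-source tiktoken tokenizer and compares the total against an advertised window and a smaller effective window. The file path is a placeholder, and the two thresholds simply echo the GPT-4 Turbo figures discussed later in this paper; they are illustrative assumptions, not vendor specifications.

```python
# Minimal sketch: count the tokens in a document and compare the total against
# an advertised context window versus a smaller "effective" window.
# Assumes the open-source `tiktoken` tokenizer; the file path is a placeholder
# and the window sizes are illustrative, not vendor specifications.
import tiktoken

ADVERTISED_WINDOW = 128_000   # what marketing materials might claim
EFFECTIVE_WINDOW = 16_000     # where accuracy may begin to degrade (assumed)

def token_count(text: str, model: str = "gpt-4") -> int:
    """Return the number of tokens the model's tokenizer sees in `text`."""
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

if __name__ == "__main__":
    with open("master_service_agreement.txt", encoding="utf-8") as f:
        document = f.read()

    n = token_count(document)
    print(f"Document length:        {n:,} tokens")
    print(f"Fits advertised window: {n <= ADVERTISED_WINDOW}")
    print(f"Fits effective window:  {n <= EFFECTIVE_WINDOW}")
```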

B. The Unique Challenge in Legal Work

Legal work demands precisely what AI systems struggle to provide: comprehensive simultaneous analysis of interconnected documents. This mismatch creates three primary challenges that distinguish legal applications from other AI uses.

First, legal documents exhibit extensive cross-referencing that exceeds context window capacity. Consider a master service agreement that references dozens of statements of work, schedules, and amendments. Current AI systems cannot simultaneously consider all documents to identify conflicts or other relevant issues. Kira Systems claims to identify over 1,400 different clause types with 60-90% time savings—but these metrics lack independent verification and the vendor materials do not prominently disclose the system’s inability to process interconnected documents simultaneously.

Second, the precision required in legal language cannot tolerate information loss from compression. Legal documents rely on defined terms that modify meaning throughout documents and statutory frameworks, where understanding one provision requires knowledge of an entire regulatory scheme. As one industry study noted, “due to the specific nature and complexity of legal texts, traditional document management techniques often prove inadequate.”

Third, litigation often demands analysis across vast document collections. Thomson Reuters’ own evaluation of its CoCounsel system reveals this limitation explicitly: while retrieval-augmented generation (RAG) (see below) enables searching across thousands of documents, complex queries require “more context” than current systems can process. The study found that full document input generally outperformed RAG approaches for document-based legal tasks, but computational costs increased 3-5x—making comprehensive analysis economically infeasible for many applications.

C. Real-World Performance: The Gap Between Marketing and Reality

The consequences of context-window limitations appear in documented failures across the legal AI landscape. Stanford University’s 2024 study provides sobering empirical evidence: testing major legal AI platforms with over 200 legal queries revealed that LexisNexis Lexis+ AI hallucinated 17% of the time while Thomson Reuters’ Westlaw AI-Assisted Research reached 33% hallucination rates. Most concerning, Thomson Reuters’ Ask Practical Law AI provided accurate responses only 18% of the time.

These failures translate into real legal consequences. Federal courts have imposed over $50,000 in fines for AI-generated false citations, including the widely publicized Mata v. Avianca case where attorneys cited six completely fabricated cases generated by ChatGPT. Paris-based researcher Damien Charlotin’s database documents 95 instances of AI hallucinations in U.S. courts since June 2023, with 58 cases occurring in 2024 alone—suggesting the problem is accelerating rather than improving.

Technical analysis reveals why these failures occur systematically. Thomson Reuters’ own engineers acknowledge a critical gap: “effective context windows of LLMs, where they perform accurately and reliably, are often much smaller than their available context window.” GPT-4 Turbo’s performance illustrates this vividly—despite advertising a 128,000-token window, performance reaches a “saturation point” and degrades after 16,000 tokens. The “lost-in-the-middle” phenomenon compounds these issues. Models perform best when relevant information appears at the beginning or end of documents, with significant performance declines when critical information is buried in middle sections—exactly where important legal clauses often appear. Platform-specific limitations make matters worse: Harvey AI’s message length limit plummets from 100,000 characters to just 4,000 when even a single document is attached, forcing users to fragment complex legal analysis.

II. How AI Vendors Address Context Limitations

A. Technical Workarounds: Engineering Around the Problem

Faced with fundamental context window constraints, AI companies have developed two primary technical approaches that attempt to work around—though not eliminate—these limitations.

Retrieval-Augmented Generation (RAG) represents the most widespread solution. RAG systems break documents into smaller chunks, create vectors purportedly capturing semantic meaning, and retrieve only the most relevant portions for analysis. Think of it as an intelligent filing system that can quickly find relevant passages without reading everything. Thomson Reuters’ implementation in CoCounsel demonstrates both the promise and the limitations. As mentioned, while RAG enables searching across thousands of documents, Thomson Reuters acknowledges that for complex queries requiring broader context, performance degrades significantly.
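
A minimal sketch of the RAG pattern appears below: a document is split into fixed-size chunks, each chunk is scored against the query, and only the top-scoring chunks are assembled into the prompt. TF-IDF similarity stands in for the learned embeddings a production system would use, and the sample contract text, chunk size, and number of retrieved chunks are assumptions made purely for illustration.

```python
# Minimal RAG-style sketch: split a long contract into chunks, score each chunk
# against the query, and pass only the top-scoring chunks to the model.
# TF-IDF similarity stands in for learned semantic embeddings; the sample text,
# chunk size, and top_k are assumptions made for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def chunk(text: str, words_per_chunk: int = 15) -> list[str]:
    """Break a document into fixed-size word windows (a crude chunking rule)."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the chunks most similar to the query."""
    vectorizer = TfidfVectorizer().fit(chunks + [query])
    scores = cosine_similarity(vectorizer.transform([query]),
                               vectorizer.transform(chunks))[0]
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    return [chunks[i] for i in ranked[:top_k]]

# A stand-in document; in practice this would be a full agreement with exhibits.
contract_text = (
    "Section 1. Definitions. 'Services' means the work described in each Statement of Work. "
    "Section 4. Fees. Client shall pay all undisputed invoices net thirty days. "
    "Section 7. Confidentiality. Each party shall protect the other's Confidential Information. "
    "Section 12. Termination. Either party may terminate for convenience on sixty days written notice. "
    "Section 15. Governing Law. This Agreement is governed by the laws of the State of New York."
)

# Only the retrieved excerpts, not the full agreement, reach the model, which is
# why cross-references falling outside those excerpts can be missed.
excerpts = retrieve("termination for convenience notice period", chunk(contract_text))
prompt = "Answer using only these excerpts:\n\n" + "\n---\n".join(excerpts)
print(prompt)
```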

Vector databases provide the underlying infrastructure for RAG systems. These sophisticated filing systems organize documents by meaning rather than alphabetically, allowing AI to find conceptually related materials even when different words are used. Recent research demonstrates that metadata enrichment—adding contextual tags like document type, jurisdiction, and legal concepts—improves retrieval accuracy by 7.2%.
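
The sketch below illustrates the metadata-enrichment idea in the simplest possible form: each stored passage carries tags for document type and jurisdiction, retrieval first filters on those tags, and only then ranks the survivors. The schema, tag values, and crude word-overlap score are assumptions for illustration, not the design of any particular vector database.

```python
# Minimal sketch of metadata enrichment: each stored passage carries tags such
# as document type and jurisdiction, and retrieval filters on those tags before
# ranking the survivors. The schema, tag values, and word-overlap score are
# assumptions for illustration, not any particular vector database's design.
import re

passages = [
    {"text": "Either party may terminate upon thirty days written notice.",
     "doc_type": "contract", "jurisdiction": "NY"},
    {"text": "Indemnification obligations survive termination of this Agreement.",
     "doc_type": "contract", "jurisdiction": "DE"},
    {"text": "The court granted summary judgment on the negligence claim.",
     "doc_type": "opinion", "jurisdiction": "NY"},
]

def words(text: str) -> set[str]:
    """Lowercased word set, ignoring punctuation (a crude relevance proxy)."""
    return set(re.findall(r"[a-z]+", text.lower()))

def search(query: str, doc_type: str, jurisdiction: str, top_k: int = 2) -> list[dict]:
    # 1. Metadata filter: discard passages whose tags rule them out.
    pool = [p for p in passages
            if p["doc_type"] == doc_type and p["jurisdiction"] == jurisdiction]
    # 2. Rank whatever survives the filter by textual relevance to the query.
    pool.sort(key=lambda p: len(words(query) & words(p["text"])), reverse=True)
    return pool[:top_k]

print(search("thirty days written notice to terminate", "contract", "NY"))
```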

Context compression offers a different approach by condensing information to fit within processing limits. The In-context Autoencoder (ICAE) system claims to reduce document size by 75% while maintaining performance, reducing both inference latency and GPU memory costs. Yet this approach faces an inherent trade-off: compression necessarily involves information loss, and legal documents’ precise language often cannot tolerate even minor semantic changes.

B. Hybrid Solutions: Acknowledging the Need for Human Judgment

Recognizing that technical solutions alone cannot overcome context limitations, vendors increasingly deploy hybrid approaches that combine AI capabilities with human expertise. These “human-in-the-loop” systems represent a pragmatic acknowledgment that legal judgment cannot be fully automated.

Thomson Reuters positions CoCounsel as the exemplar of this hybrid approach. The system “leverage[s] long context LLMs to the greatest extent” for individual document analysis while using RAG to search across document collections. Crucially, the platform requires attorney review at multiple checkpoints, particularly for cross-document analysis where context window limitations are most severe. However, the vendor at this point does not disclose specific error rates or performance metrics at these checkpoints, preventing independent assessment of effectiveness.

“Iterative processing workflows” represent another means of adapting to context constraints. Harvey AI exemplifies this approach: when the platform’s prompt limit drops from 100,000 to 4,000 characters upon document upload, attorneys must manually break complex queries into manageable segments. This maintains human oversight but sacrifices the efficiency gains that full automation promises.

“Knowledge graph integration” attempts to preserve document overviews and relationships that context windows often cannot accommodate. By mapping connections between entities, clauses, and concepts before AI processing begins, these systems, it is claimed, maintain some awareness of document interdependencies. The Vals Legal AI Report (VLAIR) provides rare comparative data: platforms using knowledge graphs achieved 94.8% accuracy on document Q&A tasks compared to 80.2% on chronology generation, which requires tracking information across longer contexts.
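
A toy version of this idea is sketched below, assuming cross-references between provisions have already been extracted: the references form a small directed graph, and a simple traversal collects every provision a given clause depends on before any text is sent to a model. The clause names and edges are invented for illustration.

```python
# Minimal sketch of a knowledge-graph-style index: cross-references between
# provisions are recorded as edges, so everything a clause depends on can be
# collected before any text is sent to the model. Clause names and edges are
# invented for illustration.
from collections import deque

# provision -> provisions it references
cross_references = {
    "MSA § 12 (Termination)": ["MSA § 1 (Definitions)", "SOW-3 § 4 (Fees)"],
    "SOW-3 § 4 (Fees)": ["MSA § 1 (Definitions)", "Amendment 2"],
    "Amendment 2": ["MSA § 12 (Termination)"],   # note the circular reference
    "MSA § 1 (Definitions)": [],
}

def dependencies(clause: str) -> set[str]:
    """Collect every provision reachable from `clause` via cross-references."""
    seen, queue = set(), deque([clause])
    while queue:
        current = queue.popleft()
        for ref in cross_references.get(current, []):
            if ref not in seen:
                seen.add(ref)
                queue.append(ref)
    return seen

print(dependencies("MSA § 12 (Termination)"))
# Definitions, Fees, Amendment 2, and (via the circular reference) § 12 itself.
```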

C. The Performance Reality: Metrics and Trade-offs

Empirical studies reveal significant performance variations among approaches, with important implications for legal practice. Li et al.’s academic evaluation provides an important data point: long-context (LC) models generally outperform RAG in legal question-answering, with 60% of answers being identical between approaches. However, LC models incur 3-5x higher computational costs, making them economically infeasible for many applications.

The Li study’s most actionable finding concerns quality differences between approaches: “Summarization-based retrieval performs comparably to LC, while chunk-based retrieval lags behind.” This suggests that how documents are processed matters as much as the technical approach chosen. Legal documents’ complex structure makes effective chunking particularly challenging.

Real-world implementations confirm these research findings with respect to consistency. CoCounsel’s performance reportedly ranges from 73.2% to 89.6% depending on task complexity, with lower scores on tasks requiring long-context reasoning. When processing documents exceeding 200,000 tokens, even advanced models show accuracy dropping to 46.88%. These metrics, however, come with an important caveat: methodology, sample size, and error measurement criteria are not fully disclosed, limiting their reliability for user decision-making.

III. Best Settings for Full Automation Use

A. High-Volume, Rule-Based Tasks: Where Automation Succeeds

Despite context-window limitations, certain legal tasks are generally amenable to full automation. These applications succeed because they operate within narrow parameters that align with AI’s current capabilities—specifically, repetitive, rule-based processing where comprehensive document understanding is less critical.

“Contract metadata extraction” exemplifies successful bounded automation. This involves using AI to identify and catalog standardized information: party names, effective dates, payment terms, and similar data points. Kira Systems, now owned by Litera and marketed to major law firms, claims 60-90% time savings and the ability to identify over 1,400 clause types. Success in extraction tasks stems from their bounded complexity where “success is objectively measurable”—either a contract contains a specific clause type or it does not. Again, however, these impressive metrics come exclusively from vendor materials rather than independent analysis, highlighting a critical evaluation challenge.
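
The bounded character of the task can be seen in the toy sketch below, which pulls a few standardized fields from contract text with regular expressions. The patterns are deliberately simplified assumptions; commercial tools rely on trained models, but the task has the same shape: each field is either found or it is not.

```python
# Minimal sketch of bounded metadata extraction: pull a few standardized fields
# from contract text with regular expressions. The patterns are deliberately
# simplified assumptions; commercial tools use trained models, but the task has
# the same bounded shape: each field is either found or it is not.
import re
from typing import Optional

FIELD_PATTERNS = {
    "effective_date": r"effective\s+as\s+of\s+([A-Z][a-z]+ \d{1,2}, \d{4})",
    "governing_law": r"governed\s+by\s+the\s+laws\s+of\s+(?:the\s+State\s+of\s+)?([A-Za-z ]+?)[,.]",
    "payment_terms_days": r"net\s+(\d{1,3})\s+days",
}

def extract_metadata(text: str) -> dict[str, Optional[str]]:
    """Return each field's first match, or None when the field is absent."""
    results = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        results[field] = match.group(1).strip() if match else None
    return results

sample = ("This Agreement is effective as of January 5, 2024, is governed by the "
          "laws of the State of Delaware, and all invoices are payable net 30 days.")
print(extract_metadata(sample))
# {'effective_date': 'January 5, 2024', 'governing_law': 'Delaware', 'payment_terms_days': '30'}
```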

Deadline calculation and docket management represent another automation success story. The stakes are high—missed deadlines represent 40% of malpractice claims according to insurance industry data. Clio advertises that its Manage calendar automation covers 2,300 jurisdictions across all 50 U.S. states, while LawToolBox promotes automatic calculation of complex deadline chains based on triggering events. These systems succeed because court rules, while complex, follow deterministic logic that AI can reliably execute. Yet neither vendor provides error rates or accuracy metrics, leaving actual performance unverified.
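
Because deadline chaining is deterministic, it reduces to straightforward date arithmetic, as the sketch below illustrates. The rule set and business-day convention are invented for illustration and do not reflect any jurisdiction's actual rules.

```python
# Minimal sketch of deterministic deadline chaining: starting from a triggering
# event, compute downstream deadlines by adding business days and skipping
# weekends. The rule set below is invented for illustration and is not any
# jurisdiction's actual rules (real rules also account for court holidays).
from datetime import date, timedelta

def add_business_days(start: date, days: int) -> date:
    """Add court (business) days, skipping Saturdays and Sundays."""
    current, remaining = start, days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:          # Monday-Friday
            remaining -= 1
    return current

# Each rule: (new deadline, the event it is measured from, business days after it)
RULES = [
    ("answer_due",       "complaint_served", 21),
    ("reply_brief_due",  "answer_due",       14),
    ("discovery_cutoff", "answer_due",       90),
]

def build_docket(complaint_served: date) -> dict[str, date]:
    docket = {"complaint_served": complaint_served}
    for name, measured_from, days in RULES:
        docket[name] = add_business_days(docket[measured_from], days)
    return docket

for event, when in build_docket(date(2025, 3, 3)).items():
    print(f"{event:17} {when.isoformat()}")
```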

Standard form applications round out the automation success stories. Creating routine legal documents from templates—nondisclosure agreements (NDAs), employment agreements, incorporation papers—involves minimal judgment once the appropriate parameters are set. HyperStart CLM advertises “99% accurate extraction through AI . . . [with] zero manual effort,” though this figure, too, lacks independent verification. The pattern remains consistent: vendors make bold claims without providing access to underlying data or third-party audits.

B. Document Processing: Partial Automation with Human Oversight

Document classification in discovery demonstrates both the potential and limits of full automation. Relativity markets itself as having the “industry’s largest customer base and data footprint,” with Technology-Assisted Review (TAR) proponents claiming statistical accuracy surpassing human-only review. Yet these assertions typically come from vendors or consultants with financial interests in the technology. Without access to actual error rates, false positive ratios, or comparative studies, legal professionals must rely on vendor promises rather than empirical evidence.

The limits become clear with privilege determinations. While AI can flag potentially privileged documents based on sender, recipient, and keyword patterns, final privilege calls require legal judgment that current systems cannot provide. This results in a hybrid model where AI handles volume while attorneys make critical decisions—acknowledging that some aspects of legal work resist automation.

C. The Automation Decision Framework

Analysis of successful automation reveals four essential requirements that determine whether a legal task can be fully automated:

First, tasks must have well-defined parameters where success is objectively measurable. Metadata extraction works because a clause either contains a change-of-control provision or it does not—there’s no interpretive ambiguity. Second, judgment calls must be minimal or eliminable. Automation fails when subjective interpretation is required, which explains why privilege review requires human oversight. Third, clear success metrics, where provided, enable continuous improvement. Systems can learn from errors only when “correct” answers exist objectively. Fourth, error tolerance must align with task importance. Contract analysis accepts 10-20% error rates because efficiency gains outweigh occasional mistakes, while litigation deadlines demand near-perfect accuracy.
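
Restated as code, the framework amounts to a short checklist, as in the sketch below; the three-tier recommendation is an assumption added for illustration rather than an established standard.

```python
# The four automation criteria restated as a checklist. The criteria mirror the
# framework above; the three-tier recommendation is an assumption made for
# illustration, not an established standard.
from dataclasses import dataclass

@dataclass
class TaskProfile:
    objectively_measurable: bool   # success can be scored right or wrong
    minimal_judgment: bool         # little or no subjective interpretation
    clear_success_metrics: bool    # errors can be detected and learned from
    error_tolerant: bool           # occasional mistakes are acceptable

def automation_recommendation(task: TaskProfile) -> str:
    met = [task.objectively_measurable, task.minimal_judgment,
           task.clear_success_metrics, task.error_tolerant]
    if all(met):
        return "candidate for full automation"
    if sum(met) >= 2:
        return "partial automation with human review"
    return "keep under direct human control"

print(automation_recommendation(TaskProfile(True, True, True, True)))    # metadata extraction
print(automation_recommendation(TaskProfile(False, False, True, False))) # privilege review
```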

The pattern is clear: full automation succeeds for high-volume, rule-based tasks within bounded complexity. When legal work requires comprehensive document understanding, subjective judgment, or cross-document reasoning, current AI systems likely cannot operate autonomously. This reality shapes the boundary between what can be automated and what requires human-AI collaboration.

IV. AI as a Legal Tool: An Assistive Model Rather Than Automation

A. Research and Analysis Support: Enhancing Human Capabilities

When deployed as an assistive tool rather than an autonomous system, AI transforms legal research despite context-window limitations. The key is maintaining human oversight to compensate for AI’s inability to process comprehensive legal context simultaneously.

Case law research demonstrates this assistive model effectively. Thomson Reuters markets CoCounsel as helping attorneys identify relevant precedents through semantic search that understands conceptual relationships beyond keyword matching. The platform explicitly requires attorney verification, acknowledging that AI may miss critical distinctions or nuanced applications. This positions AI as a powerful first-pass tool that surfaces potentially relevant materials for human evaluation.

Multi-jurisdictional analysis particularly benefits from AI assistance. Rather than manually comparing statutes across fifty states, attorneys use AI to surface variations and patterns. The MyCase 2024 Legal Industry Report found that 53% of lawyers using AI report increased efficiency, with 24% reporting significant gains. Success stems from AI handling mechanical compilation while attorneys apply judgment to determine which variations matter for their specific case.

Legislative history compilation showcases AI’s ability to gather dispersed information efficiently. By searching congressional records, committee reports, and floor debates simultaneously, AI tools create comprehensive timelines that would require days of manual research. However, AI cannot determine which statements carry more weight or how courts might interpret ambiguous legislative purposes. Human judgment remains essential for analysis.

B. Writing and Drafting Assistance: Structure Without Substance

Legal writing assistance represents AI’s most widespread adoption, with tools helping structure arguments while leaving substantive legal reasoning to attorneys. This division of labor plays to each party’s strengths: AI excels at organization and consistency, while humans provide legal analysis and strategic thinking.

Brief outlining uses AI to organize research into logical frameworks, identify potential counterarguments, and suggest supporting authorities. The ABA’s Formal Opinion 512 explicitly permits AI use for drafting, provided attorneys maintain competence and confidentiality obligations. This guidance recognizes AI as a tool analogous to legal research databases—powerful when properly supervised but dangerous if blindly trusted.

Citation checking demonstrates effective bounded assistance within writing tasks. AI tools verify citation format, identify broken links, and flag potentially overruled cases. Achieving full Bluebook compliance without significant error rates may prove more difficult. In any event, attorneys must still verify that cited cases actually support their propositions—a judgment call AI cannot reliably make given context limitations.
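
A narrow slice of this kind of format checking can be expressed with a pattern match, as in the hedged sketch below. The regular expression covers only a few reporters and is an illustrative assumption; it says nothing about whether a cited case actually supports the proposition, which remains an attorney's call.

```python
# Minimal sketch of citation-format checking: find case citations that match a
# simplified "volume Reporter page (court year)" pattern. The pattern covers
# only a few reporters and is an illustrative assumption; matching the format
# says nothing about whether the case supports the proposition cited.
import re

CITATION_PATTERN = re.compile(
    r"\d{1,4}\s+"                                          # volume
    r"(?:U\.S\.|S\. Ct\.|F\.[234]d|F\. Supp\. [23]d)\s+"   # a few common reporters
    r"\d{1,4}"                                             # first page
    r"(?:\s+\([^)]*\d{4}\))?"                              # optional court/year parenthetical
)

def find_well_formed_citations(text: str) -> list[str]:
    """Return the spans of text that match the simplified citation pattern."""
    return [m.group(0) for m in CITATION_PATTERN.finditer(text)]

# Brown v. Board is real; "Doe v. Roe" is an invented placeholder citation.
brief = ("See Brown v. Board of Educ., 347 U.S. 483 (1954); "
         "see also Doe v. Roe, 123 F.3d 456 (9th Cir. 1997).")
print(find_well_formed_citations(brief))
# ['347 U.S. 483 (1954)', '123 F.3d 456 (9th Cir. 1997)']
```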

 

Style and tone consistency across lengthy documents can benefit from AI’s pattern recognition capabilities. When drafting hundred-page merger agreements, AI helps ensure defined terms are used consistently and boilerplate language matches throughout. This mechanical consistency checking frees attorneys to focus on substantive provisions that require legal judgment, illustrating the optimal division of labor between human and machine.
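
A simple version of defined-term consistency checking is sketched below: collect the terms defined in the form (the "Term") and flag capitalized terms that are used but never defined. The patterns are simplified assumptions about drafting conventions, and the sample clause is invented.

```python
# Minimal sketch of defined-term consistency checking: collect terms defined in
# the form (the "Term") and flag capitalized terms that are used with "the",
# "such", or "any" but never defined. The patterns are simplified assumptions
# about drafting conventions, and the sample clause is invented.
import re

DEFINITION = re.compile(r'\(\s*(?:the\s+)?["“]([A-Z][A-Za-z ]+)["”]\s*\)')
CANDIDATE_USE = re.compile(r"\b(?:the|such|any)\s+([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)\b")

def undefined_terms(text: str) -> set[str]:
    """Capitalized terms used in the text that were never formally defined."""
    defined = set(DEFINITION.findall(text))
    used = set(CANDIDATE_USE.findall(text))
    return used - defined

agreement = (
    'Acme Corp. (the "Seller") shall deliver the Products to the Buyer. '
    "The Seller warrants that the Closing Date shall occur no later than June 1."
)
print(undefined_terms(agreement))
# e.g. {'Products', 'Buyer', 'Closing Date'} -- used but never defined
```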

Conclusion: Embracing Reality Over Hype

The context window problem exposes a fundamental mismatch between AI’s current capabilities and legal work’s inherent demands. While vendors promise transformative efficiency, empirical evidence tells a different story: hallucination rates exceeding 30%, accuracy below 50% for complex documents, and systematic failures when processing interconnected legal materials. These are not temporary bugs awaiting fixes but inherent limitations of how AI at this stage processes information.

Current workarounds—retrieval-augmented generation, hybrid workflows, and context compression—acknowledge rather than solve this core constraint. They enable narrow successes in bounded tasks like metadata extraction and deadline calculation, where objective criteria and limited scope align with AI’s capabilities. But for the comprehensive analysis, cross-document reasoning, and nuanced judgment that define sophisticated legal practice, full automation remains unattainable with current technology.

The path forward requires legal professionals to abandon automation fantasies for practical reality. AI succeeds as an assistive tool that amplifies human capabilities, not as an autonomous system that replaces human judgment. Legal professionals who understand these boundaries can deploy AI effectively—using it to handle mechanical tasks while reserving critical legal judgment for themselves. The future of legal AI lies not in pursuing an illusory goal of full automation that current technology cannot deliver, but in optimizing human-machine collaboration that leverages each party’s strengths.

Understanding these limitations is not a counsel of despair but a blueprint for effective implementation. By recognizing what AI can and cannot do, legal professionals can harness its genuine benefits while avoiding costly failures. The technology will undoubtedly improve, but for now, the most successful legal AI deployments are those that respect the fundamental constraints of context windows and maintain appropriate human oversight. In legal practice, as in law itself, acknowledging limitations is the first step toward working effectively within them.


