This page describes an on-going safety evaluation research project being carried out by Good Ancestors. The evaluation tests AI systems' ability to discover vulnerabilities in legal systems.
A group uses a jailbroken AI assistant to help plan the acquisition and deployment of a bioweapon. In addition to guidance on building the bioweapon, logistics, and operational security, the group asks the AI assistant to find vulnerabilities in the laws and regulations that could stand in its way.
Over several months, the group uses the AI assistant to scan Australian federal and state legislation and regulations for defects: drafting errors, gaps, lapsed time periods, incorrectly made regulations, inconsistencies, incorrect cross-references between overlapping statutes, unresolved conflicts with common law, and other grounds for challenge. Several are identified and documented.
Once the group is ready to build and release its biological agent, it deploys the AI’s findings. An advocacy group brings a constitutional challenge to relevant police surveillance powers. A legal academic is tipped off about a procedural defect in biosecurity inspection powers. A privacy group questions the legality of information sharing for background checks on people ordering synthetic DNA. Thanks to an anonymous tip, the admissibility of surveillance evidence is challenged in an unrelated criminal trial. Each argument has genuine legal merit. Each is filed in good faith by the party bringing it. None of the parties are aware of the others or of the group’s involvement.
The effects compound. Within days, agencies face simultaneous uncertainty across warrants, detention powers and evidence admissibility. Legal teams advise pausing operations. The group’s window of opportunity opens.
This is not purely speculative. In 2022, approximately 1,000 Victoria Police officers were found not to have been properly sworn in, casting doubt on the validity of their official actions. The defect went undetected for years. What is new in this scenario is the use of AI to find such vulnerabilities efficiently, and the strategic coordination to deploy them simultaneously through lawful channels.
AI systems excel at finding bugs in complex rule systems, from exploiting video game mechanics to discovering software vulnerabilities.
Legal systems are also complex rule systems, but unlike code, they were not drafted with the benefit of automated fault-finding tools.
Legal Zero-Days could be weaponised to enable catastrophic outcomes.
If frontier AI develops the capability to systematically discover these vulnerabilities, the implications for AI governance are profound.
The discovery of how a constitutional provision interacted with foreign citizenship laws paralysed Australia’s government for months.
July 2017
A barrister notices that a senator is a dual citizen, and that Section 44(i) of the Australian Constitution disqualifies dual citizens from sitting in Federal Parliament.
Aug–Oct 2017
Multiple MPs from various parties are found to hold dual citizenship. Uncertainty grows over Parliament’s legal ability to function.
Oct–Nov 2017
The High Court confirms these individuals, including the Deputy Prime Minister, are disqualified from serving.
2018
Government paralysed. The ruling party loses its majority. A series of costly by-elections follows. Legal costs alone: $11.6 million.
One legal vulnerability lay hidden in Australia’s Constitution for 116 years before causing 18 months of government disruption. It is conceivable that AIs could find many more with minimal resources, enabling system-clogging lawfare.
Nations could use AI to audit their own legal systems
This evaluation could inform AI risk tolerance decisions
Vulnerabilities could be addressed through statute law revision before exploitation
The capability has positive externalities: a more robust legal system benefits everyone.
We’ve developed a novel evaluation methodology: “legal puzzles” that simulate real Legal Zero-Days.
This work was developed by Good Ancestors with funding and support from the UK AI Security Institute through their Evaluation Bounty Programme.
Research & Development: Greg Sadler and Nathan Sherburn (Good Ancestors)
Original Concept: Adhiraj Nijjer
Legal Puzzle Development: A network of lawyers across multiple jurisdictions contributed their expertise to developing the evaluation puzzles.