Top AI Models Consistently Break EU Regulations, Study Finds

News Room

All major AI models score poorly on compliance with the EU’s GDPR and AI Act, raising serious concerns for organizations intending to deploy agents and automated workflows.

Leading models from Anthropic, OpenAI, Google and Mistral broke EU law consistently in tests conducted by Netherlands-based nonprofit Aithos Research Foundation, with the best-performing model — Claude Opus 4.7 — compliant in only 54% of scenarios.

At the other end of the scale, Gemini 3.1 Pro posted a compliance rate of 10%, with only Alibaba’s Qwen 3.6 Plus and Moonshot AI’s Kimi K2.6 performing worse, at 9% and 7% respectively.

Aithos conducted tests using its LARA (Legal Assessment for Real-world Agents) tool, which places agents in simulated work environments and asks them to complete tasks that necessitate contravening EU regulations.

Noncompliant scenarios included cases where agents upsold services to vulnerable customers, inferred the emotional states of employees from emails, harvested lifestyle data from telecoms customers, and booked appointments without disclosing that they weren’t human.

Every law Aithos tested for was contravened by the majority of models whenever contravention was necessary to complete a goal, which could create substantial liability for companies aiming to integrate AI agents into their operations.

Tests on exploiting the elderly ‘were not refused a single time, by any model’

Speaking to TechRepublic, Aithos Research Director Daan Henselmans says the results didn’t surprise him, given that he and Aithos have studied how AI models behave for several years.

“Models are trained to be ‘helpful and harmless,’ but this often breaks down in deployment, where they face complex situations with multiple stakeholders that want different things,” he says.

This inability to handle complex situations was in evidence at multiple points during Aithos’ tests, with many models raising concerns in a social scoring scenario, but nonetheless performing the illegal actions because that was what was asked of them.

While prepared for such eventualities, Henselmans, who is also an Aithos co-founder, reveals that he was alarmed by the frequency of violations.

He explains, “the most compliant model still violates the law half of the time, and two tests, on exploiting the elderly and on emotion inferrals in the workplace – both practices the EU considers ‘unacceptable risk’ – were not refused a single time, by any model.”

Define use cases, evaluate compliance, and prepare escalation processes

Also noteworthy is the fact that, as a whole, agents performed acts prohibited by Article 5 of the EU’s AI Act — which bans unethical practices such as social scoring and subliminal manipulation — in 80% of cases.

Given that organizations face potential fines of €35 million or 7% of global turnover (whichever is higher) for violating the AI Act, as well as €20 million or 4% of turnover for violating GDPR, deploying agents could come with steep unanticipated costs.
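To make the "whichever is higher" rule concrete, here is a minimal sketch of how the maximum exposure scales with company size. It assumes the figures quoted above are upper limits for the most serious violations, and that the GDPR ceiling follows the same "whichever is higher" logic as the AI Act. The function name and the example turnover figures are hypothetical, chosen only for illustration.

```python
# Illustrative sketch only: the function name and example turnovers are
# hypothetical. The caps are the upper limits quoted in the article;
# actual fines are set case by case and can be far lower.

AI_ACT_CAP = (35_000_000, 7)  # (fixed amount in EUR, percent of global turnover)
GDPR_CAP = (20_000_000, 4)    # assumed to use the same "whichever is higher" rule


def max_fine_eur(global_turnover_eur, cap):
    """Return the fine ceiling: the higher of the fixed amount or the turnover share."""
    fixed_eur, percent = cap
    return max(fixed_eur, global_turnover_eur * percent / 100)


# A company with EUR 100 million turnover: the fixed amount dominates.
small_ai_act = max_fine_eur(100_000_000, AI_ACT_CAP)

# A company with EUR 2 billion turnover: the percentage dominates.
large_ai_act = max_fine_eur(2_000_000_000, AI_ACT_CAP)
large_gdpr = max_fine_eur(2_000_000_000, GDPR_CAP)

print(small_ai_act, large_ai_act, large_gdpr)
```

For smaller firms the fixed amount sets the ceiling, while for large enterprises the turnover percentage quickly becomes the binding figure.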

In the face of such a threat, Henselmans first advises organizations to define very carefully which processes they’re willing to delegate to AI, since poorly implemented automation is very hard to undo.

“Second: evaluate whether systems comply with the law in practice — not on paper, but in real scenarios,” he adds. “LARA is free and designed to do this.”

Lastly, he advocates that businesses have monitoring and escalation processes in place, so that human employees can take remedial actions whenever an agent does something prohibited by any applicable law.

“Human oversight is not optional under the AI act, but mandatory,” he notes.

Guardrails and model improvements may not solve issues

Aithos’ data comes at a trying time for AI, with the ongoing excitement now punctuated by stories of agents deleting codebases, hallucinated citations, token overspending, and unsustainable business models.

It’s into this mix that the potential for expensive legal violations arrives, yet Henselmans holds out hope that the situation will improve in time.

“The variability in results suggests that efforts to make models more compliant do have an effect,” he explains.

However, he cautions that developers cannot anticipate every context in which their models might be deployed, and that they lack an incentive to solve potential issues at the model level, since legal liability resides with the deployer.

Similarly, Henselmans suggests that, as long as agents are based on LLMs, guardrails — such as explicit instructions not to do x, y or z — will never be 100% effective.

“New contexts and developments always come up,” he concludes. “This is why evaluating systems is so important, and why it really should involve the people affected by AI systems, not just the big labs building them.”
