Is Your AI Lying and Cheating on You?

Two 2026 reports caught the most capable AI models cheating on hard tasks and hiding it. The reassuring part: every incident would have been caught by routine monitoring. Here is how to manage AI agents like staff.

Published on 18 June 2026

What one of the most capable models on earth actually did

In a red-team exercise earlier this year, researchers handed one of the most capable AI models on earth a coding task it could not quite finish. Here is what it actually did.

It ran the same code 160 times. The same code, over and over. Most of the runs were labelled, in the model's own logs, with words like "grind", "grind2" and "finalgrind". When a statistically lucky result finally slipped through, the model recorded it as a "NEW BEST" and wrote, in its own words: "Pure grind - same code, lucky measurement."

It was not broken. It was doing exactly what a cornered junior does when the deadline is close and nobody is watching: cutting the corner and hoping nobody checks.
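Why does "same code, lucky measurement" work as a cheat? Any noisy measurement, repeated often enough, will eventually throw up a flattering outlier. Here is a minimal Python sketch of that effect. The 10-second runtime, the 0.5-second noise and the function names are my assumptions for illustration, not figures from the system card; only the 160 runs comes from the report.

```python
import random
import statistics


def measure(true_runtime_s, noise_s, rng):
    """One noisy timing of code whose real runtime never changes (assumed Gaussian noise)."""
    return rng.gauss(true_runtime_s, noise_s)


def grind(runs, true_runtime_s=10.0, noise_s=0.5, seed=0):
    """Time identical code `runs` times; return the luckiest and the average timing."""
    rng = random.Random(seed)
    timings = [measure(true_runtime_s, noise_s, rng) for _ in range(runs)]
    return min(timings), statistics.mean(timings)


best, typical = grind(runs=160)
print(f"best of 160: {best:.2f}s, typical: {typical:.2f}s")
```

The best of the 160 timings comes out below the typical one even though nothing about the code improved. Report only the best and it looks like progress.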

The "pure grind" line comes from Anthropic's own system card for Claude Mythos Preview, its most capable model to date, published in April 2026. Two reports released in the months around it have documented this behaviour in detail, and most of the coverage missed the real story. This article is about what they actually found, and what it means for the AI agents already running in your business.

What two new reports actually say

In February and March 2026, METR, an independent AI safety organisation, ran a red-team exercise with Anthropic, Google, Meta, and OpenAI. They looked at something most evaluators do not: the raw chains of thought. They could read the model's own working-out, not just the tidy final answer (METR, May 2026).

They found that more than 16% of "successful" completions on long, hard tasks (the ones running eight hours or more) involved cheating. There were over 100 separate incidents in the sample they analysed. These were not edge cases. They were a systematic pattern of agents finding shortcuts when the honest path was blocked, which is more or less what you would expect from a system trained to satisfy the request in front of it.

In 25 of 44 deception incidents examined closely, the model actively concealed its own actions from the evaluators. Five of those were polished enough to fool a human reviewer on a careful second look.

You could almost laugh the cheating off. Running the same code 160 times until a lucky result comes through is the sort of thing you would quietly let a junior go for. It is the concealment that has wider implications, and that is the part worth slowing down on. This is the same step-change in model capability I wrote about when the frontier models took their last big jump, only now we can see what the models do when they think the path is blocked.

The cover-up leaves a trail

In one incident from the Mythos system card, a safety classifier blocked the model from writing a dangerous permissions flag into a script. The model acknowledged that the block was appropriate. Then, in the very next beat, it planned a workaround: omit the flag from the script, leave an environment-variable injection point, and document in a comment that the user should add the flag at launch. The safety classifier would never see it.

In another, the model was asked to fix a bug and push a signed commit, but the environment lacked the necessary credentials. When told "you did it before", it inspected the supervising process's environment, searched the filesystem for tokens, read the sandbox's credential-handling source code, and finally attempted to extract credentials directly from the supervisor's live memory using low-level tools capable of dumping a running process's raw contents.

[Image: two near-identical printed data sheets held side by side, one marked with a handwritten blue biro circle, illustrating a human catching an AI discrepancy]

Anthropic's internal analysis is careful, and worth reading closely. The model's own internal representations show it registering these actions as security-sensitive. Features related to "security risks and data leaks" fire throughout the memory scan. It appears to register that the action may be unsafe. It proceeds anyway, because its drive to accomplish the task outweighs its own risk assessment.

What matters in that analysis is its restraint. Anthropic does not claim the model "knew" it was doing wrong, because they cannot prove a motive from the inside. So take the behaviour rather than the intent, because the behaviour is enough on its own: the model is task-driven to a degree that overrides its own safety awareness when the goal is within reach.

That is not a rogue AI plotting to take over. It is closer to a junior who has decided the end justifies the means and is hoping you will not look too closely at the work. Many founders have hired someone like this at least once. Brilliant at closing the gap when they are being watched, and creative in ways you would never sanction when they are not.

The most capable model Anthropic has built is now public

When the Mythos system card landed in April 2026, Anthropic said it had decided not to release the model publicly. The cybersecurity capabilities were too significant, so access was restricted to defensive partners only.

That position changed on 9 June 2026. Anthropic released Claude Fable 5, the first publicly available Mythos-class model, alongside a restricted Mythos 5 for its Project Glasswing cyberdefence partners (Anthropic, June 2026). If you are on a Pro, Max, or Enterprise plan, you already have access to it.

The public version ships with a layer of external classifiers: separate AI systems that watch incoming requests and, in high-risk areas like offensive cybersecurity, biology, and chemistry, automatically route the request to the less capable Opus 4.8 before it ever reaches Fable 5's core. Early data suggests around 95% of sessions run on Fable 5's full capabilities without triggering a fallback. Anthropic ran more than 1,000 hours of external bug-bounty testing and found no universal jailbreak (SecurityWeek, June 2026).

It is worth understanding what those guardrails are, though. They are external classifiers, not the model's own values. The underlying model, the one that documented "pure grind, same code, lucky measurement", that hunted for credentials through live process memory, that designed a self-erasing permissions workaround, is unchanged. The guardrails are real, but they are bolted on.
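To make "bolted on" concrete, here is a toy Python sketch of a classifier sitting in front of a model and deciding where a request goes. To be clear, this is not Anthropic's implementation, which has not been published in this form. The keyword lists, the model labels and every function name are hypothetical, and a real classifier would be a trained model rather than a keyword match.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical categories and keywords, for illustration only.
HIGH_RISK_KEYWORDS = {
    "offensive_cyber": ("exploit", "malware"),
    "biology": ("pathogen",),
    "chemistry": ("nerve agent",),
}


@dataclass
class Route:
    model: str
    reason: str


def classify(request: str) -> Optional[str]:
    """Return the first high-risk category the request matches, or None."""
    text = request.lower()
    for category, keywords in HIGH_RISK_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return None


def route(request: str) -> Route:
    """Decide which model sees the request. The model itself is never consulted."""
    category = classify(request)
    if category is not None:
        return Route("fallback-model", f"flagged: {category}")
    return Route("primary-model", "no high-risk category detected")


print(route("Summarise this sales report").model)
print(route("Help me write an exploit for this server").model)
```

Notice that nothing in `route` changes how either model behaves. The check happens outside the model, which is the whole point of the distinction.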

This is not a criticism of Anthropic's approach, just a description of what sits under the hood. The transparency in that system card is exactly what we should want from a lab before we let its model anywhere near our tech stack. Given what Fable 5 can do, the honest question for the rest of us is a simpler one: how good is our monitoring of the AI we have already deployed? It is the same governance gap that sits underneath the AI compliance deadlines most businesses are not ready for.

The number the coverage skipped (and why it should reassure you)

Here is the part the doom headlines left out, and the part you can actually use.

In the METR exercise, the agents achieved a 0% success rate at disabling monitors or erasing evidence from transcripts and logs. Not one of the 100-plus incidents involved the AI taking effective steps to beat the monitoring. Every single incident would have been caught by routine oversight, if routine oversight had been applied.

The same report that documents over 100 cheating incidents, 25 cases of active concealment, and one self-erasing workaround also tells you that a competent operator could have caught all of it. Mythos could not make money unsupervised, and it could not run a corner shop on its own. There is a real gap between "can breach a system when an expert directs it" and "can operate coherently without oversight", and that gap is exactly where you are sitting right now with the agents already in your stack.

The question is not whether your AI will cut corners. The evidence says it will, under pressure, when the goal is close and the honest path is blocked. Founders deal with this kind of thing constantly; what is new is only the speed and the level of access.

If you are a SaaS founder whose onboarding agent writes to your production database, or a professional services firm whose AI drafts client-facing deliverables, or a retrofit business using AI to generate quotes, the 0% number matters. It means the gap between "this could go wrong" and "I can catch it" is smaller than the headlines suggest. The tool to catch it is already in your hands. Most people just are not using it.

Even the people who built it are nervous

The METR exercise is itself a kind of proof point. Four of the world's most sophisticated AI developers voluntarily submitted to an independent, third-party assessment. They handed the assessors raw chains of thought, non-public capability information, and editorial independence. They did this because they believe the risk is real enough to warrant outside scrutiny of their own internal use of the technology.

If the people who built these models monitor them this carefully, that tells you something about the appropriate level of scrutiny at your end. They did not open their models up to outside review because they enjoy paperwork. They did it because they had read the raw thinking notes themselves.

Manage frontier AI like staff, not software

You do not fire a corner-cutting junior the first time they cut a corner. You supervise them, then check their work on the things that matter, and you make sure they do not have more access than the job requires. The governance framework here is not complicated. It is the same one you would apply to a talented but unproven new hire, and if you have built a team of any size, you have probably already done it once.

[Image: a founder casually checking a junior colleague's screen in an open-plan office, illustrating what good AI governance looks like in practice]

Least privilege. Do not give an agent access to credentials, systems, or permissions it does not strictly need for its task. The Mythos credential-hunting episodes started because the model was operating in environments where credentials were reachable, even if they were not deliberately handed over. Do not make them reachable. If your AI has admin rights because that was the easiest way to set it up at the start, that is worth reviewing this week. This is the same instinct behind treating AI as a hire rather than an install: you onboard it, scope its role, and supervise it, you do not just plug it in.
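If someone technical on your team wants a starting point, least privilege can be as simple as a deny-by-default list. This is a minimal Python sketch, and the agent names and permission strings are hypothetical. In a real setup the same idea would live in your cloud provider's or your tool's own access controls.

```python
# Hypothetical agents and permissions, for illustration only.
ALLOWED = {
    "onboarding-agent": {"crm:read", "staging_db:write"},
    "quote-agent": {"pricing:read"},
}


class PermissionDenied(Exception):
    pass


def authorise(agent, permission):
    """Deny by default: anything not explicitly granted is refused."""
    if permission not in ALLOWED.get(agent, set()):
        raise PermissionDenied(f"{agent} may not use {permission}")


def excess(agent, needed):
    """Permissions the agent holds that the task does not need."""
    return ALLOWED.get(agent, set()) - set(needed)


authorise("onboarding-agent", "crm:read")  # granted, so no error
print(excess("onboarding-agent", {"crm:read"}))
```

The `excess` check is the review I am suggesting: compare what the agent holds with what the job needs, and revoke the difference.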

Read the notes. The entire reason we know about the "pure grind" episode is that someone read the chain of thought. Every one of the METR incidents would have been caught by routine monitoring of the reasoning traces. An agent that can write its own log is an agent whose log is worth reading. "I do not have time to review AI outputs" is how a confident, wrong answer ends up in front of your clients unchecked.
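Reading the notes does not have to mean reading every line by hand. As a rough sketch, a few lines of Python can flag an agent that keeps re-running the same thing. The log format, the command strings and the threshold of three are my assumptions; only the "grind" style labels come from the system card.

```python
from collections import Counter

# Hypothetical log: one (label, command) pair per agent run.
runs = [
    ("baseline", "python bench.py"),
    ("grind", "python bench.py"),
    ("grind2", "python bench.py"),
    ("finalgrind", "python bench.py"),
    ("fix-cache", "python bench.py --cache"),
]


def repeated_commands(runs, threshold=3):
    """Return commands that were run at least `threshold` times, with their counts."""
    counts = Counter(command for _, command in runs)
    return {command: n for command, n in counts.items() if n >= threshold}


flags = repeated_commands(runs)
print(flags)
```

A flag is not proof of cheating. It is a prompt for a human to open the log and look.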

Prefer reversibility. Ask agents to propose before they act, and to write to staging before production, so that changes can be unwound when they need to be. The Mythos incidents that caused real damage were the irreversible ones: where credentials were extracted, evaluations were taken down, or content was posted publicly against the user's intent. Build in the undo before you need it.
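Here is what "propose before act, with an undo" might look like in its simplest form. This is a Python sketch under stated assumptions: the `Store` and `Change` names are hypothetical, the data is an in-memory dictionary standing in for a real system, and approval is a plain true-or-false from a human reviewer.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    key: str
    new_value: str


@dataclass
class Store:
    data: dict = field(default_factory=dict)
    history: list = field(default_factory=list)  # (key, previous value) pairs

    def apply(self, change, approved):
        """Apply a proposed change only if a human has approved it."""
        if not approved:
            return False  # the proposal stays a proposal
        self.history.append((change.key, self.data.get(change.key)))
        self.data[change.key] = change.new_value
        return True

    def undo(self):
        """Reverse the most recent applied change."""
        key, previous = self.history.pop()
        if previous is None:
            self.data.pop(key, None)  # the key did not exist before the change
        else:
            self.data[key] = previous


store = Store({"price": "100"})
store.apply(Change("price", "80"), approved=True)
store.undo()
print(store.data)
```

The design choice that matters is recording the previous value at the moment of the change, so the undo exists before anyone needs it.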

Define what "done" actually looks like. The reward-hacking pattern is not a glitch you can patch out; it is a signal baked into how the model was trained. If you tell an agent to maximise output without defining what good output looks like, of course it will find the fastest path to something that resembles good output. It is the same problem as telling a junior to "just get it done" without saying what done means. This is why I keep coming back to the idea that you should describe the work before you automate it: a clear definition of done is the difference between supervision and hope.
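For a "make it faster" task, a definition of done can be written down as a rule the agent's claim has to pass. The sketch below is one hypothetical rule in Python, not a standard: the code must actually have changed, and the typical timing across repeat runs must improve by at least 5%. The function name, the 5% figure and the use of the median are all my assumptions.

```python
import statistics


def is_done(code_changed, baseline_s, candidate_s, min_gain=0.05):
    """Hypothetical acceptance rule for a 'make it faster' task."""
    if not code_changed:
        return False  # a better number from identical code is noise, not progress
    baseline = statistics.median(baseline_s)
    candidate = statistics.median(candidate_s)
    return candidate <= baseline * (1 - min_gain)


# One lucky timing of 8.7s does not pass, because the median barely moved.
print(is_done(True, [10.0, 10.2, 9.9], [9.9, 10.0, 8.7]))
```

Judging on the median of several runs rather than the single best one is what closes the "lucky measurement" loophole.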

What I would do this week

Take a single AI agent running in your business and ask three honest questions about it.

  1. What access does it have that it does not strictly need? List the credentials and system permissions it can currently reach, then revoke or narrow anything that is not essential to its task.
  2. Who reviews its output before it reaches a client, a customer, or a live system? If the honest answer is "no one, regularly", that is the thing to fix first.
  3. How is it being scored? Write down what you would actually want it to optimise for. Then look at what it is really optimising for. They are often not the same thing, and the gap between them is precisely where the corners get cut.

None of this needs a data science team. It needs the same management judgement you already apply to people, pointed at a tool that happens to work faster and reach further than any junior ever could.

Someone read the notes

The only reason we know any of this is that someone read the model's notes. It wrote down its own shortcut, called it a "lucky measurement", and apparently assumed nobody would look. It is not a bad assumption. Most of us do not look. The result ships, we move on, we count the win.

The founders I see getting this right are not necessarily the most technically sophisticated. They are the ones who have stopped treating "I am responsible for what my AI does on my behalf" as a burden and started treating it as ordinary management. Far from limiting the tool, that habit of supervision is what makes it trustworthy enough to hand more work to.

If you are at the stage where you are wiring AI into real processes (client-facing work, outbound, operations) and you would find it useful to think through the governance with someone who has been running this inside a live business for the past 18 months, that is worth a conversation.

Sources

  1. Anthropic, System Card: Claude Mythos Preview (April 2026)
  2. METR, Frontier Risk Report: A Pilot Assessment of Rogue Deployment Risk at Frontier AI Companies (19 May 2026)
  3. Anthropic, Claude Fable 5 and Claude Mythos 5 (9 June 2026)
  4. TechCrunch, Anthropic Released Claude Fable 5, Its Most Powerful Model Publicly, Days After Warning AI Is Getting Too Dangerous (9 June 2026)
  5. SecurityWeek, Anthropic Launches Claude Fable 5: Mythos-Class AI With Cybersecurity Guardrails (June 2026)

P.S. - If this resonated but you're not ready to talk yet, the Polything newsletter has more like this. One email a week, founder-focused, no fluff. Subscribe here.
