Thomas Wiegold - AI & Web Development Blog

MiniMax M3 Review: Finally Matching GPT-5.5 & Opus?

contact@thomas-wiegold.com (Thomas Wiegold) — Mon, 01 Jun 2026 00:00:00 GMT

I don't really enjoy writing model reviews. There, I said it. After the tenth "this new model is faster and smarter than the last one" post, you start to feel like you're describing the same car with a fresh coat of paint. So when I tell you this MiniMax M3 review is one I actually wanted to write, take it as a signal. M3 is interesting. Not "interesting for a Chinese open-weights model." Just interesting, full stop.

I've been in the MiniMax corner for a while now. I liked M2.5 when it landed, and I liked M2.7 even more. But there was always the same asterisk in the back of my mind: genuinely good, just not GPT-or-Opus good. A gap you could feel. This time the gap might have closed. So I ran my usual battery of tests, watched the thing think for an uncomfortably long time, and came away mostly impressed. Here's the whole story.

What Is MiniMax M3?

MiniMax M3 is an open-weights, natively multimodal model (text, image, and video in, text out) that launched on June 1, 2026 with a 1 million token context window. It's the course-correction in the M-series: where the M2 generation deliberately ditched sparse attention over production worries, M3 brings it back as the headline feature.

That feature is called MiniMax Sparse Attention, or MSA. The short version for anyone who doesn't want the linear-algebra lecture: a lightweight index branch scans incoming tokens, picks which key-value blocks actually deserve attention, and only runs the expensive math on those. The clever bit is that it does this on the real, uncompressed key-values, so you don't pay the long-context precision tax that something like DeepSeek's latent attention does. MiniMax claims a roughly 9x speedup on prefill and 15x on decode at 1M tokens, with quality holding steady in their ablations.

Why should you care about that more than the benchmark numbers? Because a quadratic-attention model can technically hold a million tokens, but actually using them is miserable. Prefill alone can take minutes. If MSA's speedups hold up under real load, that's the difference between "1M context exists on the spec sheet" and "1M context is something you'd actually build an agent around." That's the part that matters.

On pricing, it's aggressive. Standard pay-as-you-go is $0.60 per million input tokens and $2.40 per million output, with a 50% launch promo for the first week. That's somewhere between a tenth and a twentieth of what closed frontier models cost. You can run it right now through the MiniMax API, OpenRouter (OpenAI-compatible, easiest path), and a handful of launch partners.

One honest flag before we go further: at launch the parameter count is undisclosed, and the "open-weights" part is still a promise. The weights weren't on Hugging Face yet (MiniMax says "within 10 days"). So keep your enthusiasm calibrated. More on that later.

Putting MiniMax M3 Through My Usual Tests

Here's my process, which never changes, because that's the only way I can compare across releases instead of just vibing off first impressions. I run the same three tasks on every serious model: two website builds, a poker simulation terminal program, and a full code audit of my own site, thomas-wiegold.com. Same prompts, same expectations, every time.

Website one: the Sydney coffee roaster

This is one of those prompts I've run so many times I could recite the output styles in my sleep. Funny thing about it: every single model picks more or less the same color palette for a Sydney coffee roaster. GPT, Opus, Gemini, now MiniMax. There must be something deep in the training data that screams "warm browns and cream" the moment you say "coffee." I've stopped fighting it.

What separates the models is everything else, and M3 nailed everything else. The layout was clean and considered, the technical execution was solid, and honestly it was one of the best results I've gotten for this prompt to date. Right up there with the closed frontier models. That alone made me sit up, because this is exactly the kind of task where MiniMax used to be "fine, but."

Website two: the pop-culture online store

I push the complexity up here. More interactivity, more visual flair, more chances to fall apart. M3 handled it well. Nice animations, good structure, the sort of result you'd be happy to hand off as a starting point rather than a throwaway demo. Probably the second-best result I've ever gotten for this particular prompt.

Second-best, because Gemini still had a slight edge on the design polish. If you've read my take on Gemini 3.5 Flash in Google Antigravity, you'll know I rate Gemini's web design specifically while preferring GPT-5.5 and Opus for most other work. M3 didn't beat Gemini at its own game, but getting within arm's reach is a real result.

The poker simulation

And now the part where I stop gushing. The poker sim was a mixed bag.

First problem: it took forever. I'm talking 30 to 40 minutes of the model thinking and working. I sat there reading the reasoning output, partly out of curiosity and partly out of disbelief, and it was a parade of "actually..." and "oh but wait, maybe..." and then contradicting the thing it had just decided. It would talk itself into a corner, talk itself back out, and burn a frightening number of tokens doing it. At times it felt less like reasoning and more like brute-forcing its way to an answer by sheer persistence.

To be fair, this isn't a MiniMax-specific disease. A lot of the newer models do this now. They over-think, second-guess, and treat token budgets like they're free. M3 is just a particularly patient offender.

The result itself? Okay. Not a full success, it didn't completely nail the task. But I'll give it this much context: no model has ever one-shot this poker challenge to 100%. Not one. So "okay but slow" puts it in the same bucket as everything else, just with a longer coffee break in the middle.

The code audit

This is where M3 won me back. Auditing thomas-wiegold.com is a hard test, and I mean that genuinely. The site is already heavily optimized. I've run these audits over and over and fixed the problems, so finding something new and real is not easy. The bar is high precisely because the obvious stuff is long gone.

GPT-5.5 has been my favorite for audits for a while. MiniMax M3 got remarkably close. No filler, no padding, no inventing problems to look busy. Every finding made sense and was worth my time. Compare that to when I tested DeepSeek V4, which buried the good observations under a pile of non-issues that I had to wade through and dismiss. M3 didn't waste a single line on a fake problem. For a model at this price, that's genuinely impressive.

How It Stacks Up on Benchmarks

The numbers are good, and I'm going to tell you why you should still squint at them.

On MiniMax's own reporting, M3 hits 59.0% on SWE-Bench Pro, which puts it behind Claude Opus 4.7 (64.3%) and GPT-5.5 (58.6%) by a hair, and ahead of Gemini 3.1 Pro (54.2%). On SWE-Bench Verified it's at 80.5%, and on Terminal-Bench 2.1 it sits at 66.0%, where the closed models pull ahead more clearly. The pattern is consistent with what I saw by hand: close to GPT and Opus on real coding, not quite past them.

Here's the squint. Every one of those numbers is vendor-run, on MiniMax's own infrastructure, with baselines they picked, often using Claude Code as the scaffolding. That's not an accusation of cheating, it's just how launch-day benchmarks work, and you should treat all of them with the same healthy suspicion you'd apply to any company grading its own homework. The independent scores from LMArena and Artificial Analysis were still pending at launch. When those land, that's the real test.

One known soft spot worth naming: abstract, fluid reasoning. The whole family of Chinese models has lagged here, and the ARC Prize ARC-AGI-2 results from earlier this year had the MiniMax line scoring low single digits. M3 is a strong coder and a strong agent. It is not, on the available evidence, a great abstract reasoner. Good to know before you point it at a problem that needs genuine novel reasoning rather than competent execution.

The Catches

Three things to keep in your head before you get too comfortable.

The "open-weights" label was a promise on launch day, not a fact. No weights on Hugging Face yet, and the license is the bigger worry. M2.7 shipped under a "Modified-MIT" license that blocked commercial use without written permission, which got roundly mocked as faux-open-source. M3 is expected to follow the same playbook. So if your plan involves self-hosting for commercial work, do not commit to anything until the actual weights ship and you've read the actual terms. Hope is not a deployment strategy.

The token-burning I hit in the poker test is a real cost factor, not just an annoyance. The headline price is cheap, but if the model wanders through 40 minutes of self-doubt on a hard problem, your effective cost-per-task climbs. Measure the whole task, not the per-token rate.

And a 1M context window is wonderful, but it is not a memory system. For long-running agents you still want real persistence. A big window helps; it doesn't replace architecture.

The MiniMax M3 Review Verdict: Should You Use It?

Yes, with both eyes open.

I think M3 is a real winner. The results across my tests were strong, the pricing is excellent, and for the first time a MiniMax model genuinely sits in the conversation with GPT and Opus rather than a tier below it. I'm going to use it for coding and other work. I'm also keeping my Claude and ChatGPT subscriptions, because the smart move here is hybrid: route the bulk, cost-sensitive, long-context work to M3, and reserve a closed frontier model for the slice where the last few quality points actually matter.

For a bit of field context, since "is it better than X" is the only question anyone really asks: I wasn't especially moved by Claude Opus 4.8. I honestly couldn't tell you with confidence that it beats its predecessor, and during testing it fumbled something as basic as setting up a project with linting and formatting, which is not a great look. Gemini 3.5 Flash remains my pick for web design specifically, while GPT-5.5 and Opus stay in rotation for most else. M3 doesn't dethrone any of them outright. It earns a seat at the table, and at this price that's the whole point.

If you want to try it without spending anything, it's free in OpenCode right now, and it'll be part of the OpenCode Go plan too. That's the cheapest way to form your own opinion, which, as always, is the only opinion that should actually drive your decision.

I came in skeptical because I usually do, and a MiniMax M3 review was not on my list of things I expected to enjoy writing. It turned out to be one of the more genuinely interesting models I've tested this year. Run it through your own tasks before you believe me, though. That's the entire job.

Google Antigravity 2.0 Review: I Tested Gemini 3.5 Flash

contact@thomas-wiegold.com (Thomas Wiegold) — Wed, 27 May 2026 00:00:00 GMT

I've lost count of how many AI coding tools I've installed this year, used twice, and quietly uninstalled. So when Google dropped Antigravity 2.0 at I/O, my first reaction wasn't excitement. It was a tired sigh. But I tested it anyway, and this Google Antigravity 2.0 review is what came out the other side: two landing pages, half a poker app, and then a wall.

I want to be honest about that up front. I did not get a long run with this thing. I managed roughly two and a half real tasks before I ran out of tokens, faster than I've ever run out of tokens with any coding tool. That's a short test, and I'll tell you where it limits what I can claim. But it's also, in its own way, the most useful data point in the whole review. More on that later.

Spoiler: it's fast. It's also more complicated, and more expensive, than the launch slides let on.

What Antigravity 2.0 Actually Is (And Why It's Two Apps Now)

Quick reset, because the naming around this launch is genuinely confusing.

Google Antigravity first appeared in November 2025 as a single app: an agentic coding IDE built on a heavily modified VS Code fork. Inside that one app sat three surfaces that worked together. There was the Editor, a full IDE for actually reading and tweaking code. There was the Agent Manager, a command-center dashboard where you launched agents, watched them work, and reviewed the plans and artifacts they produced. And there was a Browser, an agent-controlled browser instance the agents could drive to test web pages and pull data.

Antigravity 2.0, announced May 19, 2026 at I/O, breaks that single app apart. It's no longer one product, it's five: a standalone desktop app, the original IDE, a Go-based CLI, an SDK, and a Managed Agents API you can call straight from the Gemini API. Each one is a separate download now.

The split that matters most is the desktop app versus the IDE. Antigravity 2.0, the flagship desktop app, is basically the old Agent Manager promoted to its own program. There is no code editor in it at all. It exists purely to launch, monitor, and orchestrate agents, run them in parallel, and schedule background tasks. The Antigravity IDE, meanwhile, is the original VS Code-based editor, still available and the one Google actually recommends for hands-on developers. The intended workflow is dual-wield: orchestrate agents in the desktop app, drop into the IDE when you want to touch code yourself.

Google's reasoning is that they proved the agent-first surface works, millions of developers adopted it, so now they're separating the two jobs into separate tools. They've even said the long-term plan is to strip the Agent Manager out of the IDE entirely, leaving a purely agent-powered editor behind. Whether you want your tooling pulled apart like that is a taste question. I'm not sold on running two windows where one used to do, but I get the logic.

The model running underneath all of it, by default, is Gemini 3.5 Flash. And yes, it's "Gemini 3.5 Flash", not "Flash 3.5". The API ID is gemini-3.5-flash, and it shipped generally available on day one, no preview suffix, no waitlist. If you see anyone write "Flash 3.5" they've just reversed the words. Small thing, but it tells you whether someone actually read the announcement.

What It Can Actually Do, and How It Stacks Up

The real idea here, the thing the whole platform is built around, is multi-agent orchestration. You describe a task, a manager agent breaks it into subtasks, and several specialised agents work in parallel: one writes code, another runs terminal commands, a third tests in the browser, and they verify each other's work in a loop until the task passes its checks. On top of that you get scheduled tasks, where you hand an agent timed instructions to run in the background, and voice input for short prompts. This is genuinely different from how Claude Code or Codex work, which run a single agent through tasks sequentially.

As a Go developer, I'll admit the Go-based CLI got my attention more than the desktop app did. It's the direct successor to the Gemini CLI, a superset of its features, and there's an antigravity migrate --from-gemini-cli command to bring your old config across. It works alongside whatever editor you already use, Vim, Neovim, JetBrains, whatever, so you're not forced into Google's apps to get the agent harness.

That migration isn't optional, by the way. The existing Gemini CLI is being retired for consumers. Access for AI Pro, AI Ultra, and free-tier users ends June 18, 2026, with only Enterprise Code Assist keeping it. If your scripts or pipelines lean on the old CLI, that's a calendar entry, not a someday-maybe.

So how does this stack up against Claude Code and Codex? The parallel multi-agent model is the one real differentiator, and on the right kind of work, fanning tasks out is genuinely faster than grinding through them one at a time. Where Antigravity loses is code quality. Google hasn't published any Antigravity-versus-Claude-Code benchmarks, so this rests on third-party reviews, but the independent reads are remarkably consistent: Antigravity wins on speed and breadth (desktop app plus CLI plus SDK), Claude Code still leads on raw code quality, and Codex sits somewhere in between. I'll get to the actual benchmark numbers in the verdict, but that ranking won't surprise anyone who's used all three. Fast and parallel is great. It just isn't the same thing as correct.

Why I Was Skeptical Before Even Installing It

Let me be upfront about my bias here, because it shapes everything that follows.

Google has a product graveyard you could get genuinely lost in. I've been burned enough times that I no longer trust the longevity of any Google developer tool until it's survived a couple of years in the wild. Building your daily workflow around something Google might quietly kill is its own kind of technical debt, and it's the kind that doesn't show up in any benchmark.

It's not just the abstract track record either. I used the original Antigravity when it launched last November. Unimpressed. I tried Jules, Google's earlier coding agent. Also unimpressed. So I went into this Google Antigravity 2.0 review fully expecting to write a polite "it's fine, but" piece and move on with my day.

I'm telling you this so you know what kind of reviewer you're reading. Not a hype account chasing affiliate clicks. Not a reflexive Google hater either. Just a developer with fifteen-odd years behind him, too many subscriptions, and a very low tolerance for tools that waste my time. A skeptic giving the thing a fair shot is still a fair shot.

Hands-On: Building Two Landing Pages

The fastest way to judge a coding agent is to give it real work, so I skipped the toy prompts and built two landing pages from scratch.

The Sydney coffee roaster site

First up, a landing page for a fictional Sydney coffee roaster. Antigravity 2.0 one-shot it. And it was genuinely fast, the kind of speed where you blink and there's already a working page sitting in front of you. Technically the output was solid: clean structure, sensible markup, and some nice animation work that I hadn't even asked for. Nothing I'd be embarrassed to ship the bones of.

But the visual style felt dated. Not broken, not ugly exactly, just five to ten years behind. It looked like a competent template from 2017. The spacing was a little too tight, the typography a little too safe, the colour choices a little too corporate-stock. If you handed this to a client today they'd politely ask for "something a bit more modern", and they'd be completely right to. The model knows how to build a page. It just doesn't seem to know what year it is.

The pop-culture clothing store

Then I tried a pop-culture themed clothing store. Different brief, bolder, more playful, the kind of thing where you actually want some personality on the page. And this one genuinely impressed me. The design was good enough that I caught myself thinking I'd actually shop there, which is a reaction I almost never have to an AI-generated frontend. (For comparison, the design output I've gotten out of Claude Code with its design tooling has been my benchmark, and this wasn't far off it on the right brief.)

So here's the honest takeaway. The output quality from Gemini 3.5 Flash is real, but it's inconsistent. Give it a bold, modern, opinionated brief and it shines. Give it a "professional default" brief and it reaches for something stale. That's worth knowing before you trust it with anything client-facing, because you can't predict which version you'll get until the page renders.

Where It Fell Apart: Token Limits and the Desktop App

Here's where my testing ended. And I mean ended.

After the two landing pages and a half-built poker terminal app, I ran out of tokens. Quota exhausted. I literally could not finish testing the thing I was supposed to be reviewing, which is a special and slightly absurd kind of frustration. Three real tasks. That was the whole run.

And I'm not an outlier. The research backs this up loudly. On Google's own developer forum, people reported an entire daily quota burning on a single trivial prompt. One user said that just asking for a single AGENTS.md file ate the whole allowance. Others described a Flash-to-Pro escalation loop, where the agent quietly bumps itself up to the more expensive Pro model mid-task and drains a week of Pro quota in a couple of days. If you've ever watched a usage meter spin and had no idea why, you'll recognise the feeling instantly.

The desktop app itself? Barebones. I kept waiting to find the feature that made it meaningfully different from every other coding-agent app I've used, and I didn't find it. It's an agent runner with a window around it. Functional, sure, but "functional" isn't a reason to switch, and after fifteen years of tooling churn I've learned to be suspicious of anything that launches looking this generic. A 2.0 release should feel like a confident product. This felt like a 1.0 that got renamed.

I didn't personally hit the OAuth problems, but I'd be doing you a disservice not to mention them, because they were everywhere at launch. Multiple developers reported paid Pro subscriptions failing to authenticate against the desktop app, with the OAuth redirect simply never completing. The app working fine on a free account and then breaking on a paid one is close to the worst possible failure mode for a launch. People paid Google money and got a worse experience for it.

The Pricing Catch Nobody Mentions Upfront

This is the part that actually changed my mind, so stay with me.

On paper, Gemini 3.5 Flash pricing looks reasonable. It's $1.50 per million input tokens and $9.00 per million output, roughly 25% cheaper per token than Gemini 3.1 Pro. Cheap-ish. Flash-ish. The sort of number you skim past without thinking.

Then Artificial Analysis ran the numbers properly, and the story falls apart. Running their Intelligence Index benchmark suite cost $1,552 on Gemini 3.5 Flash. That's 5.5 times what a previous Flash model cost for the exact same suite. And it's 74% more expensive than Gemini 3.1 Pro, the supposedly pricier model.

Read that again, because I had to. The "cheap" model cost more to run than the expensive one.

Why does this happen? Verbosity and turn count. The Decoder, summarising the same data, points out that Flash averages 49 agentic turns per task, more than any other model tested. Gemini 3.1 Pro needs 23 for the same work. Every one of those extra turns is tokens, and reasoning tokens bill at the output rate. The model is fast, but it is relentlessly chatty, and chatty gets expensive fast when you're paying per word. It's the same pricing-creep pattern I flagged when I reviewed DeepSeek V4, except this time it's hiding inside a model that's marketed as the budget option.

Here's my honest read. The entire point of a Flash-tier model is "cheap and fast". If Flash is this token-hungry, the cheap half of that promise is just broken. And if it isn't cheap, the main reason to add it to your stack mostly evaporates. Speed alone doesn't pay the API bill at the end of the month. You can claw some of this back with tighter prompting, and good prompt discipline matters more than ever here, but you shouldn't have to fight the model to hit the price it advertises.

Should You Switch? My Google Antigravity 2.0 Review Verdict

Let me give credit where it's genuinely due first.

Google's published benchmarks hold up. When independent testers re-ran them, the scores matched to the decimal. No benchmark fudging, no creative interpretation. Gemini 3.5 Flash really does beat Gemini 3.1 Pro on most of the tests Google chose to put on its slides.

The catch is that phrase, "chose to put on its slides". On Artificial Analysis's broader Intelligence Index, Flash ranks around #7 to #8 overall, sitting behind GPT-5.5 and Claude Opus 4.7. And on SWE-Bench Pro, the benchmark closest to real-world refactoring work, Opus 4.7 leads it comfortably, 64.3% to 55.1%. Fast and smart, yes. Frontier-leading, no. The gap between "wins the benchmarks Google picked" and "wins the benchmarks generally" is the whole review in one sentence.

So would I switch? No. And it's honestly not really about the launch bugs, which Google will presumably patch.

I already run Claude, ChatGPT, OpenCode, and Cursor. That's four subscriptions, four tools I know well, and frankly more than I need already. For Antigravity 2.0 to earn a fifth slot it would have to do something those four can't, and even with the short run the quota allowed me, I can't tell you what that thing is. The parallel multi-agent orchestration is the closest candidate, but it isn't worth a fifth bill when the token cost is this unpredictable. I've said the same about other shiny new model launches that arrived with more hype than substance.

Who might it actually suit? Someone with no existing agentic-coding stack who wants quick scaffolding and doesn't mind a few rough edges. If you're starting from zero, the free public preview is a reasonable look and costs you nothing but time. If you've already got tools you trust, Antigravity 2.0 doesn't add anything you're missing.

Fair caveat to close on: this is one week post-launch. I couldn't fully stress-test pricing in practice because, well, I ran out of tokens trying. Google has said Gemini 3.5 Pro is coming around June 2026, and that's the point where I'd revisit all of this properly. Until then, I'm keeping my four subscriptions and skipping the fifth.

Notion Workers for Small Business: A Hands-On Guide

contact@thomas-wiegold.com (Thomas Wiegold) — Mon, 18 May 2026 00:00:00 GMT

Some of the small businesses I work with run their operations on Notion. Docs, project tracking, content calendars, lightweight CRMs, the whole operational layer of the company. So when Notion shipped a real developer platform last week, the obvious question was whether it changes the automation advice I give those clients.

I watched the videos walking through Workers and the new CLI, read the docs, and built a real worker against a Shopify store I help with. This article is the version of that work I'd hand to a client asking whether they should care.

I'll cover what Notion actually shipped, why it matters if you're running a small business on top of Notion already, and where Custom Agents plus Workers genuinely change what's possible for SMB automation. Then I'll walk through the build with code excerpts and the CLI commands that mattered.

The thesis up front, because I hate articles that bury it. Workers don't make Notion the right tool if it wasn't already. But if Notion is already where your business operates, this is the most consequential thing Notion has shipped for developers, period.

What Notion Actually Launched

On May 13, Notion launched four things at once under the Developer Platform banner. Notion Workers is a hosted Node/TypeScript runtime where you write a small program and Notion runs it on their infrastructure (Vercel Sandbox under the hood, according to Vercel). The ntn CLI is the only way you interact with it. The External Agent API brings Claude Code, Cursor, Codex, and Decagon into Notion as workspace participants. And the Agent SDK, still on a waitlist, goes the other direction so you can embed Notion's agents into your own products.

The Workers SDK gives you six primitives. A worker is a single TypeScript file that wires up some combination of them:

worker.database() declares a Notion database whose schema lives in your code.
worker.sync() pulls data from an external API into that database on a schedule.
worker.tool() exposes a function a Notion Custom Agent can call.
worker.webhook() exposes an HTTPS endpoint other services can post to.
worker.oauth() configures OAuth 2.0 against any third-party API.
worker.pacer() is a built-in rate limiter for your outbound calls.

You write one src/index.ts that exports a Worker, run ntn workers deploy, and that's the whole pipeline. No Dockerfile. No CI. No log shipping setup. ntn workers runs logs is your observability layer, which is either liberating or terrifying depending on your background.

Why this isn't just another Notion API update: until now, if you wanted Notion AI to do something Notion didn't ship as a built-in connector, you either hosted a Model Context Protocol server yourself or you didn't. Workers replace that pattern entirely. You write a function, attach the worker to a Custom Agent, and the agent calls it. That's the actual unlock.

What are Notion Workers?

Notion Workers are TypeScript programs that Notion hosts and runs in a sandboxed runtime. A single worker can sync external data into a Notion database, expose deterministic functions as tools for Notion's AI agents, and receive webhooks from outside services. You deploy them with the ntn CLI and they run on Notion's infrastructure with no servers, queues, or containers to manage.

What is the Notion CLI?

The ntn CLI is Notion's command-line tool for the Developer Platform. It scaffolds Worker projects, deploys them, manages secrets and OAuth, triggers syncs for testing, and inspects run logs. It's available on every Notion plan including Free for direct API calls. Deploying Workers requires the Business plan or higher.

Why Notion Already Owns Small Business Operations

Step back from the developer angle for a second. The reason Workers matter at all is that a surprising number of small businesses already run their operational layer on Notion. Not just docs. Not just notes. The whole "where work actually happens" surface.

Documentation and wikis is the obvious one. Onboarding docs, SOPs, meeting notes, runbooks. Notion's block editor and nested pages beat every alternative for non-engineering teams, and AI search across the workspace makes that content genuinely useful instead of a graveyard of stale pages nobody opens.

Then there's the lightweight database story. Content calendars, light CRMs, project trackers, customer feedback logs. You get table, kanban, calendar, gallery, and timeline views from the same underlying data, and most small teams never need anything more sophisticated than that.

Agencies and consultancies have been quietly building entire client-facing operations on Notion for years. Per-client workspaces, deliverable tracking, project portals, retainer dashboards. With Workers, you can now sync each client's billing and time tracking into their portal without resorting to Zapier glue.

Marketing teams run content calendars where briefs, drafts, SEO targets, and publishing dates all live in one place. Sales teams under a hundred contacts run lightweight CRMs from it. Customer support uses it for knowledge base pages that the team and customers both reference. Finance teams build dashboards summing Stripe and bank data into a monthly view. None of these are best-in-class against dedicated tools, but the value of having them all in one searchable workspace beats best-in-class fragmentation for most small teams.

The honest counterweight: heavy relational data above a few thousand rows gets slow. Real project management with dependencies and sprint planning stays Linear or Jira territory. CRM at any serious scale needs Attio or HubSpot. The Electron app is famously laggy on older hardware. Data portability is a known weak point because Notion's export-to-Markdown degrades structure and the underlying block format is proprietary.

If you've ever tried to migrate off Notion after a few years, you know.

What is Notion best for in small business?

Notion is best for small businesses needing a unified workspace covering wikis, lightweight project management, content calendars, simple CRMs, and AI-augmented knowledge search. It particularly shines for teams under fifty people, agencies running client portals, and content-driven businesses where docs and operational data live side by side. It's a poor fit for heavy relational data, scaled CRM, and engineering-team project management.

The pattern I keep seeing in SMB stacks: Notion as the operational hub, Stripe or Shopify as the financial source of truth, GitHub or Linear for the engineering work that doesn't belong in Notion, Slack for the synchronous stuff. Workers connect the first one to all the others in a way that finally feels native. If you want the bigger picture on how this fits the broader landscape, I wrote a piece on AI agents for small business in 2026 covering the surrounding ecosystem.

The AI Automation Shift That Workers Unlocks

This is where I think the article needs to land hardest, because this is the actual story. Three phases of small business automation, roughly.

The first era was Zapier-style trigger and action. Something happens in app A, do something in app B. Mostly mechanical, mostly worked, mostly billed by the task. Fine until your "if this then that" turned into "if this then check three things and maybe do one of four things depending on context", and then you were eight Zaps deep in something nobody understood six months later.

The second era was AI agents that could read your data. Notion Agent, ChatGPT with connectors, Claude with MCP. Useful for summarisation and Q&A across your workspace, but they couldn't actually do anything beyond what their built-in connectors supported. Want the agent to look up an order in your custom commerce platform? Tough.

The third era is right now: AI agents that can call your code. That's what Workers add. A Custom Agent runs in your workspace, you ask it a question, and underneath it calls a deterministic function you wrote that hits Shopify, Stripe, your internal database, whatever, and returns structured data the agent then summarises for you. From Notion's own framing, Workers are deterministic, which makes them more reliable than LLM reasoning and a fraction of the token cost.

The before-and-after is concrete. Before Workers, a "what's this customer's history?" workflow meant gluing together a Shopify lookup, a support inbox query, an email thread search, and somebody's memory. After Workers, you type the question in Notion and the agent calls a function that aggregates all of it. Same data, different effort.

For online retail specifically the implications spiral outward fast. Order lookups, refund pattern analysis, customer LTV summaries, churn risk flags, restock recommendations. I went deeper on what this means for online stores in agentic commerce in 2026, but the short version is that the agent layer is finally getting cheap enough to deploy on actual SMB workflows.

Can Notion Workers replace Zapier?

For workflows where Notion is one end of the pipe, often yes. Workers handle Notion-attached automation more cleanly than Zapier with better code quality, version control, and AI agent integration. For arbitrary SaaS-to-SaaS workflows where Notion isn't involved, Zapier still wins on prebuilt integrations and accessibility for non-developers. The honest split is to keep your "Slack to Twilio to Google Sheets" Zaps on Zapier and build new Notion-adjacent automation as Workers.

A Hands-On Build

OK enough abstract. Here's the actual worker.

The brief: I wanted to ask a Custom Agent in Notion "what's the history of jane@example.com?" or "how bad was our refund week?" and get a real answer, backed by actual Shopify data, without anyone alt-tabbing to Shopify admin. One worker, three capabilities. A managed Notion database that holds Shopify orders, a sync that keeps it current every fifteen minutes, and two tools the agent can call.

The whole worker is one file. Here's the skeleton, with the verbose middle bits trimmed so you can see the shape:

import { Worker } from "@notionhq/workers";
import * as Schema from "@notionhq/workers/schema";

const worker = new Worker();
export default worker;

// 1. Pacer: stay under Shopify's 2 req/sec REST limit
const shopify = worker.pacer("shopifyApi", {
  allowedRequests: 2,
  intervalMs: 1000,
});

// 2. Managed database: schema lives in code
const orders = worker.database("orders", {
  type: "managed",
  initialTitle: "Shopify Orders",
  primaryKeyProperty: "Order ID",
  schema: {
    properties: {
      /* ... */
    },
  },
});

// 3. Sync: keep the database fresh
worker.sync("shopifyOrders", {
  /* ... */
});

// 4. Tools: what the agent can call
worker.tool("getCustomerSnapshot", {
  /* ... */
});
worker.tool("getRecentRefundReport", {
  /* ... */
});

Four blocks. The full source is on GitHub, so I'll skip the rest and call out the bits worth understanding.

The sync. The execute function pulls orders from Shopify, transforms them into upsert changes, and returns them. The piece worth knowing about is the cursor pattern. Shopify's updated_at index is eventually consistent, which means if you naively store "the latest updated_at I saw" as your cursor, you'll occasionally miss records that get indexed a second or two late. The Notion docs recommend holding your cursor a buffer behind "now" (I use fifteen seconds) so those records have time to settle before the next cycle picks them up. Within a single cycle, I follow Shopify's Link: <…>; rel="next" header for pagination because page_info can't be combined with other filters. This took me longer to get right than the rest of the worker combined.

The tools. Each tool is a function with a JSON schema for its input. The agent picks one based on the conversation and calls it with arguments it generates. Here's the meat of getCustomerSnapshot:

worker.tool("getCustomerSnapshot", {
  description: "Get an order history snapshot for a Shopify customer by email...",
  hints: { readOnlyHint: true },
  schema: j.object({
    email: j.email().describe("The customer's email address."),
  }),
  execute: async (input, { notion }) => {
    const results = await collectPaginatedAPI(notion.dataSources.query, {
      data_source_id: ORDERS_DATA_SOURCE_ID,
      filter: { property: "Customer Email", email: { equals: input.email } },
      sorts: [{ property: "Order Date", direction: "descending" }],
    });
    // aggregate orders, compute lifetime value, return structured payload
  },
});

Two things to internalise. First, readOnlyHint: true lets the agent execute this tool without asking the user for permission each time, which is right for read-only lookups. Write tools should leave it off so the agent has to ask. Second, the description and the schema descriptions are what the LLM reads to decide whether to call the tool. Treat them like API documentation written for a literal-minded colleague who hasn't had coffee yet.

The CLI commands that mattered. Not a tour, just the five I actually touched:

ntn login                                       # once
ntn workers env set SHOPIFY_STORE=...            # secret setup
ntn workers deploy                              # the whole pipeline
ntn workers sync trigger shopifyOrders --local   # test against Shopify, no writes to Notion
ntn workers runs logs <run-id>                   # debug a specific run

The local sync trigger is the most underrated. It runs my code against Shopify but doesn't write to Notion, so I can dry-run, inspect transformed output, and fix bugs without polluting the database. --preview does the same thing post-deploy.

The moment it actually works. After deploying, I created a Custom Agent in Notion, attached the worker, and enabled both tools. Then I typed in a Notion page: "What's the order history for [customer email]?" The agent called getCustomerSnapshot, got back a structured payload (eight orders, $340 lifetime value, two refunds, VIP flag), and wrote that into the page as a clean summary. Question to answer, maybe four seconds.

This is the bit I struggled to communicate to people who weren't already on Notion. It's not "automated Slack message when a customer signs up." It's a real-time, conversational interface to your own business data, where the agent figures out which tool to call and what arguments to use. The Zapier comparison breaks down because Zapier was never trying to do this.

Going further. A webhook for refund events is in the repo as webhook.going-further.ts. It receives Shopify's refunds/create event, verifies the HMAC, and creates a triage page in a separate managed database. From there the obvious next moves are more tools (product performance, churn risk, cohort metrics) and a scheduled Custom Agent that builds a daily briefing page.

The whole thing took me about two hours, most of which was reading docs and untangling Shopify's pagination. The code is shorter than this article section about the code.

The Honest Verdict

I've been using this for less than a week, so anything I tell you about reliability or hidden gotchas is preliminary. Here's where I've landed after one real build and several reads of the docs.

Adopt now if you're already on Notion Business or Enterprise, you currently glue with Zapier or hand-rolled scripts, you use Custom Agents and have hit the limits of built-in connectors, or you run an agency delivering client-facing dashboards. The August 2026 free window is a gift. Use it to learn the surface before the meter starts.

Wait if your source of truth lives somewhere else (Salesforce, HubSpot, a custom app) and you use Notion lightly. You'll be better off keeping integration logic in your actual source of truth instead of building a Notion-shaped layer on top.

Skip if vendor lock-in is a hard line for you, or your workflows don't touch Notion at all. Workers code is @notionhq/workers-specific and won't run anywhere else without rewriting. Every worker you ship deepens the commitment.

The cost reality is straightforward but the unknowns matter. Business plan at $20 per user per month is the floor. Custom Agents already moved to credit-metered billing on May 4 at $10 per thousand credits. Workers join the same meter on August 11, 2026. Notion has said per-call Workers cost will be a fraction of agent token cost but hasn't published numbers. For a small store running something like the example above, I'd budget $5 to $15 a month in credits, but verify against your own usage during the free beta.

How much do Notion Workers cost?

Notion Workers are free during the beta through August 11, 2026. After that they join the Custom Agent credit system at $10 per 1,000 credits, purchased as a workspace add-on. Workers themselves cost less per call than full agent runs because they're deterministic code, not LLM reasoning. The Business plan at $20 per user per month is required to deploy Workers at all.

What I'm watching over the next six months: the August credit pricing, whether Notion publishes hard operational limits (concurrent executions, CPU time, response size are all currently unspecified), the first SLA, and whether a marketplace emerges. The platform is two days old at the time of writing. It deserves benefit of the doubt and skepticism in equal measure.

Workers won't replace your entire stack. They probably won't replace Zapier either, unless most of your Zaps already had Notion on one end. What they will do, if Notion is already where your business actually operates, is turn it from a great document-and-database tool into a genuinely programmable workspace where AI agents can call your code on real data. For the Shopify stores I work with, that's enough.

Claude Code Hooks: From Linting to Hardened AI Workflows

contact@thomas-wiegold.com (Thomas Wiegold) — Sun, 10 May 2026 00:00:00 GMT

A couple of months ago I wrote a piece arguing that CLAUDE.md is helpful but expensive noise. The short version: you can't trust Claude to follow your instructions consistently, and every paragraph you write to plead with it costs you tokens on every turn. People asked the obvious follow-up. If CLAUDE.md isn't reliable, what is?

This is the answer.

Claude Code hooks let you wire deterministic shell commands and LLM evaluators into specific points in the agent's lifecycle. Not suggestions. Guarantees. The thing that's supposed to happen, happens, every time, without the model deciding whether to bother.

I've spent the last few weeks migrating things out of CLAUDE.md and into hooks, and the difference is hard to overstate. Claude went from "fast typist I have to babysit" to "teammate I can leave alone for a couple of hours". This piece walks you through how to get there. It's structured as four stages of adoption, from the 5-line formatter every project should ship today, all the way up to forced verification that genuinely changes how you work with AI.

We'll also cover what translates to Codex and OpenCode at the end, because hooks are slowly becoming a category, not a Claude-specific feature.

How Claude Code Hooks Actually Work

A hook is a configuration entry that tells Claude Code to run something (a shell command, an HTTP call, an MCP tool, a Claude prompt, or a full subagent) when a specific event fires in its lifecycle.

If you've used git hooks, the mental model is the same. You wire a script to a known checkpoint, and that script runs every time Claude reaches it. The difference is that Claude Code has dozens of these checkpoints, where git gives you a handful, and the script has structured data about what Claude was about to do or just did.

The simplest possible hook looks like this: "before any Bash command, run my script". Your script gets the command Claude wants to execute on stdin as JSON, and decides whether to allow it, modify it, or block it. That's the whole API. Everything else is just variations on which event you're hooking, and what your script does with the JSON it receives.

So the question becomes: which events are available, and what shape is the data?

The lifecycle is bigger than most articles let on. As of v2.1.116 there are 27 events, but for practical purposes they fall into five buckets:

Session lifecycle: SessionStart, SessionEnd, Setup
Agentic loop: PreToolUse, PostToolUse, PostToolBatch, Stop, StopFailure
Permissions: PermissionRequest, PermissionDenied
Subagents and teammates: SubagentStart, SubagentStop, TeammateIdle, TaskCreated, TaskCompleted
System events: PreCompact, PostCompact, FileChanged, ConfigChange, Elicitation, and a handful of others

If you're wondering which one to use for a given idea, 90% of the time it's PreToolUse, PostToolUse, or Stop. Worry about the rest later. The full list lives in the official hooks reference if you want it.

There are five handler types: command (shell command, the default), http (POST to a URL), mcp_tool (call a connected MCP server tool), prompt (single-turn LLM evaluator, runs Haiku by default), and agent (full subagent verifier, marked experimental). Most production hooks are command type, and that's what I'll show in examples below.

Hooks live in JSON config files in this priority order: user-global at ~/.claude/settings.json, project-committed at .claude/settings.json, project-gitignored at .claude/settings.local.json, then managed enterprise policy, then plugins. All hooks merge additively. Higher priority doesn't replace lower priority; everything that matches an event runs.

The protocol is simple. Your hook reads JSON from stdin (session ID, tool input, transcript path, and so on), does its thing, and communicates back through:

Exit code 0: success, optionally with JSON or plain text on stdout
Exit code 2: blocking error, stderr is fed back to Claude
Structured JSON like {"hookSpecificOutput": {"permissionDecision": "deny", "permissionDecisionReason": "..."}} for fine-grained control

One concept catches everyone out: matchers. They look like regex but aren't, unless they contain regex metacharacters. mcp__memory matches no tool. You need mcp__memory__.*. The cleaner alternative added in v2.1.85 is the if field, which uses permission-rule syntax: "if": "Bash(git *)" only fires for git commands, "if": "Edit(*.ts)" only for TypeScript edits. Use it.

Stage 1: Format and Lint on Every Edit

Start here. Every project gets this hook on day one, and you keep it forever.

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path' | xargs -I{} sh -c 'case \"{}\" in *.ts|*.tsx|*.js|*.jsx|*.json) npx --no-install oxfmt \"{}\" && npx --no-install oxlint --fix \"{}\" ;; esac'",
            "timeout": 30
          }
        ]
      }
    ]
  }
}

That's it. After every file edit Claude makes, oxfmt formats and oxlint runs with autofix. Prettier and ESLint work fine here too, but I've moved everything to the oxc tools lately because they're written in Rust and absurdly fast. That matters because hook latency lives in Claude's hot path. A hook that takes 3 seconds adds 3 seconds to every single tool call.

A few details that matter:

Use --no-install so a missing devDep fails fast instead of triggering a mid-session install you didn't ask for.

Anchor scripts to $CLAUDE_PROJECT_DIR when you have anything more complex than this. Quote the path. Spaces happen.

Don't put tsc --noEmit here. It's the most common trap I see. Typecheck takes 10 to 30 seconds on a real codebase. Multiply that by the 50 file edits Claude makes during a feature, and you've added 25 minutes of wall-clock time for type errors that were going to be caught by the Stop hook anyway. We'll get to that in Stage 4.

If you're on Bun like me, swap npx --no-install for bun x. Same semantics, slightly faster cold start, and you stop seeing npm's "found 0 vulnerabilities" output noise in your hook logs. For Go projects, the equivalent is gofmt -w "$(jq -r .tool_input.file_path)" plus optionally goimports. Same shape, different toolchain.

The whole thing is honestly less interesting than the section length suggests. But it's the obvious starting point, and it's also the only hook 90% of developers ever set up. The next three stages are where the actual leverage is.

One unsexy upside that surprised me: PR review feels different. You stop reading auto-formatted diffs, and Claude stops "fixing" formatting that wasn't broken in the first place. Signal-to-noise on AI commits goes up immediately.

Stage 2: Security and Guardrails

This is the section that justifies hooks not being optional.

In October 2025, GitHub issue #10077 was filed by a developer who watched Claude Code run rm -rf on their home directory. November 2025 brought #12637, where Claude created a literal ~ directory and a later glob expansion took down everything in the user's actual home. Both incidents happened in standard permission mode, not bypass mode. Both are tagged area:security/bug in Anthropic's tracker. They are not theoretical.

The fix is one PreToolUse hook, deployed at user level so it applies to every project:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "INPUT=$(cat); CMD=$(echo \"$INPUT\" | jq -r '.tool_input.command // empty'); if echo \"$CMD\" | grep -qE '\\brm\\b.*-rf|git\\s+push.*--force.*\\b(main|master)\\b|>\\s*\\.env|--no-verify'; then jq -n --arg r \"Blocked: dangerous command pattern\" '{hookSpecificOutput:{hookEventName:\"PreToolUse\",permissionDecision:\"deny\",permissionDecisionReason:$r}}'; fi; exit 0"
          }
        ]
      }
    ]
  }
}

Add patterns as you find them. Mine has grown to about 15 entries over time. Things that get caught: rm -rf of any flavor, force-pushes to protected branches, writes to .env files, git commit --no-verify, chmod 777 on anything, and the occasional dd if=/dev/zero from when Claude got creative about "freeing up disk space". (True story, my SSD survived.)

The reason this is so important goes deeper than "Claude makes mistakes". The model genuinely doesn't understand the OS-level consequences of path expansion. When it sees rm -rf $TARGET_DIR and TARGET_DIR=~/proj, the unsetting of that variable somewhere upstream is invisible to the model. The hook sees the expanded form and stops it.

Now the killer feature: a PreToolUse hook returning permissionDecision: "deny" blocks the tool even under --dangerously-skip-permissions.

That sentence is worth re-reading. I've written separately about how dangerous that flag actually is, so I won't rehash it here. The short version: bypass mode disables the interactive prompts and the auto-mode classifier. It does not disable hooks. So you can run Claude in a sandboxed worktree with --dangerously-skip-permissions for genuine YOLO speed, and your destructive-command guard still has the final word. This is the basis of what the community calls "Safe YOLO". I do most of my heavy refactoring this way now. It's fast, and it's hard-to-impossible to nuke your machine with.

A complementary pattern: .env and .git/ write protection. Add to the same hook. Different incident, same logic. You don't want Claude curiously inspecting your AWS credentials, and you definitely don't want it editing .git/objects because it decided the repo state was confusing.

Stage 3: Logging and Observability

Now we're past safety and into "what is Claude actually doing". Hooks are the only honest answer.

A PostToolUse hook with no matcher, appending each tool call to a JSONL file:

{
  "hooks": {
    "PostToolUse": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "jq -c '{ts: now, session: .session_id, tool: .tool_name, input: .tool_input, ok: (.tool_response.is_error | not)}' >> ~/.claude/audit.jsonl",
            "async": true
          }
        ]
      }
    ]
  }
}

Five lines, complete audit trail. The async: true means it doesn't block the agent loop, which matters because you'll have hundreds of these per session.

The HTTP hook variant posts to a centralized collector instead. Slack webhook, Datadog, your own logger, whatever. Watch the allowedEnvVars field carefully here. This is the credential-leak guard. If you reference $SLACK_TOKEN in your headers without listing it in allowedEnvVars, Claude Code silently replaces it with the empty string. Frustrating the first time, prevents a bad day the second.

Cost tracking is a nice extension. Parse tool_response, increment a counter in ~/.claude/state/usage.json, optionally return decision: "block" once you cross a threshold. I haven't bothered. My usage is well within limits, but if you're running multiple agentic loops overnight it's an obvious win.

Once you have a JSONL audit trail, the answers it gives you are surprisingly useful. "What did Claude do during last night's overnight refactor?" One jq query. "Did Claude touch the auth module while it was supposed to be working on something else?" One grep. Past me would have read the entire transcript file looking for that. Present me reads three lines of audit log.

The honest takeaway: there's no other tamper-evident way to capture what Claude actually did across long sessions. The transcript file works for one session, but it's local and gets compacted. Hooks are the export layer.

Stage 4: Forced Verification with Stop Hooks

This is the section that genuinely changed how I work with Claude. If you stop reading after one part, make it this one.

The problem: Claude says "All done!" and the build is broken. You've all seen this. The standard prompt-engineering response is to add stronger language to CLAUDE.md ("YOU MUST RUN THE TESTS BEFORE STOPPING"), and (per my earlier piece) it does not work. Not reliably. Not for long.

The fix is a Stop hook that runs your real verification commands and forces Claude to keep working if anything fails:

{
  "hooks": {
    "Stop": [
      {
        "hooks": [
          {
            "type": "command",
            "command": ".claude/hooks/verify.sh",
            "timeout": 180
          }
        ]
      }
    ]
  }
}

And the script:

#!/bin/bash
INPUT=$(cat)
[ "$(echo "$INPUT" | jq -r '.stop_hook_active')" = "true" ] && exit 0

cd "$CLAUDE_PROJECT_DIR" || exit 0
OUTPUT=$(npx tsc --noEmit 2>&1 && npx vitest run --reporter=basic --no-watch 2>&1)
if [ $? -ne 0 ]; then
  jq -n --arg out "$(echo "$OUTPUT" | tail -50)" \
    '{decision:"block",reason:("Verification failed:\n" + $out)}'
fi
exit 0

A few things to call out. The stop_hook_active check at the start is critical. Without it, you get an infinite loop the first time the gate fails and Claude can't immediately fix it. Every developer who tries Stop hooks has done this once. I've done it twice.

The tail -50 is because Claude doesn't need the full test output, just enough to debug. Keeps the context window lean.

--reporter=basic matters for vitest specifically. The default reporter writes interactive escape sequences that pollute the stderr Claude reads back, and sometimes confuses its parsing of the failure output.

For projects where the verification needs more judgment (does the new code actually solve the user's request, not just "compiles and tests pass"), upgrade to a prompt type hook. It spawns a Haiku evaluator that reads the conversation and returns {"ok": true} or {"ok": false, "reason": "..."}. About a tenth of a cent per fire. Honest about subjective things in a way deterministic shell-out can't be.

A concrete example. When I'm having Claude work on something I built, the deterministic gate runs bun test and bun x tsc --noEmit. That catches the mechanical stuff. But it doesn't catch "Claude implemented the wrong scoring algorithm because it misunderstood the spec". For that, the prompt hook re-reads the conversation and the changed files, then asks itself whether the implementation actually matches what was requested. Slower than shell-out, smarter than tests, and dramatically cheaper than me discovering the issue tomorrow morning.

For the heaviest case, where you need to actually run a tool-using subagent rather than just shell out, there's agent type hooks. Still marked experimental in May 2026, 60-second default timeout, up to 50 tool turns. I use them sparingly because the cost is real, but for overnight refactor verification they're earning their keep.

The Stage 4 pattern is the one that makes Claude feel like a teammate. Once it's wired in, you stop reading "I've completed the task" and start trusting it. Or rather, you don't trust it. The hook does. Same outcome, less anxiety.

If this conceptual neighborhood interests you, my Ralph Loop article covers a related pattern: recursive deterministic loops over probabilistic agents.

Hooks vs Skills, MCP, and Subagents

Hooks aren't always the right answer. Quick decision matrix:

Hooks: when something must always happen on a lifecycle event. Formatting, blocking, verification, logging, audit.
Skills: when you're giving Claude how-to knowledge it should consult when relevant. The repository pattern for data access. Your team's commit message conventions. See my Claude Skills guide for the deeper version.
MCP: when you need external system access. Database queries, APIs, custom tools Claude calls during the task.
Subagents: when a task needs an isolated context window. Research subagents feeding writer subagents.
Plugins: when you're packaging hooks, skills, MCP, or subagents for team distribution.

The honest version: sometimes the right answer is CLAUDE.md plus one hook for the invariant that actually matters. Don't over-engineer. Three hooks that always run beat 30 pages of advisory documentation Claude might or might not follow. (Especially when Claude has not, historically, followed the advisory documentation.)

A heuristic I use: if you're typing "Claude must always..." or "Claude should never..." in your CLAUDE.md, that's a hook. If you're typing "When working on X, prefer Y", that's a skill.

What About Codex and OpenCode?

Hooks are slowly becoming a category, not a Claude-specific feature. Worth knowing what travels.

Codex CLI ships hooks now, and the design is almost a direct port of Claude Code's. Same JSON-on-stdin protocol, same exit codes, same additionalContext shape, same hookSpecificOutput structure. The lifecycle is much smaller (six events: SessionStart, UserPromptSubmit, PreToolUse, PermissionRequest, PostToolUse, Stop), and only command hooks are implemented. Prompt and agent types are parsed and silently skipped. Stage 1 and Stage 2 hooks port across with config translation only. Stage 4 verification works for the deterministic case. The LLM-evaluated version waits until Codex implements prompt hooks.

OpenCode does it differently. Plugin-based, not config-based. You write TypeScript modules in .opencode/plugins/ that export an async function returning event handlers. The closest equivalents to PreToolUse and PostToolUse are tool.execute.before and tool.execute.after, and you block by throwing an Error (no JSON permission decision). About 25 events available, plus a clean session.idle and an experimental compaction hook that lets you replace the compaction prompt entirely.

OpenCode is genuinely lovely if you're already in TypeScript-land. Type-safe Plugin interface, Bun's $ shell helper baked in, npm distribution. The format-and-lint and security guardrail patterns translate directly. Forced verification with Stop hooks does not, because OpenCode doesn't let you force the agent to keep working after it's idle.

Same fundamental insight in all three (deterministic shell-out at lifecycle events beats prompting), three different implementations, and the Claude Code surface is by some distance the most mature.

Practical bottom line for portability: if you're picking up Claude Code hooks because you might switch tools later, write your hooks as standalone shell scripts in a .claude/hooks/ folder, not as inline command strings. The scripts move to Codex with config translation only. They don't move to OpenCode (different runtime, different API), but at least you're rewriting logic, not archaeology.

Gotchas Worth Knowing Before You Ship

The kind of thing worth a screenshot:

Stop hook infinite loops. Always check stop_hook_active at the top of the script and exit 0 when it's true. Every developer learns this once.
PostToolUse can't undo. By the time it fires, the file is already written. Use PreToolUse for prevention, PostToolUse for reaction.
Last-write-wins on updatedInput. If multiple PreToolUse hooks rewrite the same tool's input, the order is non-deterministic. Don't have two hooks fighting over the same field.
additionalContext has a 10,000-character cap and stales on resume. Time-sensitive data goes in SessionStart (which re-runs with source: "resume"), not in PostToolUse (which replays the saved string).
Shell profile pollution. An unconditional echo in ~/.zshrc will end up in your hook's stdout and break JSON parsing. Wrap interactive output in [[ $- == *i* ]] and your hooks will stop mysteriously failing.
macOS notification permission. osascript -e 'display notification "..."' routes through Script Editor, which needs explicit notification permission in System Settings. Your hook doesn't tell you this is missing. Notifications just silently don't show up, and you wonder why your beautiful Stop alert never fires.

Where to Start

Add the Stage 1 formatter today. It takes five minutes. Add the Stage 2 Bash guard this week, in your user-global ~/.claude/settings.json so it applies everywhere. Add Stage 3 logging when you start running Claude unattended or shipping its work to teammates. Add Stage 4 verification when you want to trust Claude with longer sessions.

The whole thing is the deterministic layer that makes the rest of your AI coding setup actually reliable. Once it's in place, you start running Claude longer, paying less attention to formatting nitpicks, and reading "I'm done" with the confidence that comes from knowing the gate confirmed it. Worth the fifteen minutes of config.

DeepSeek V4 Review: I Tested It on Real Code

contact@thomas-wiegold.com (Thomas Wiegold) — Tue, 05 May 2026 00:00:00 GMT

The wait is finally over. After months of silence from the lab that genuinely shook the AI world with V3 (and made half of Silicon Valley refresh their CapEx slides in early 2025), DeepSeek V4 dropped on April 24, 2026.

I'll be honest, I had started to wonder if a new DeepSeek model was even coming. They went so quiet that I assumed they were either cooking something extraordinary or had hit a wall. Turns out it was the first one. Mostly.

Short version up top, because that's what I'd want from a colleague: DeepSeek V4 is the best value AI model on the market right now, but it's not the best coder. If you've got volume, it's a no-brainer. If you've got a hard problem, you're still better off with Claude or GPT-5.5.

Here's what I found after running V4 through the three tests I now use for every new model release.

What DeepSeek V4 Actually Is

DeepSeek V4 isn't one model, it's two. Both are open weights, both are MIT-licensed, both ship with a 1 million token context window by default.

	V4-Pro	V4-Flash
Total params	1.6T	284B
Active per token	49B	13B
Context	1M	1M
Input price (per 1M tokens)	$1.74	$0.14
Output price (per 1M tokens)	$3.48	$0.28
License	MIT	MIT

V4-Pro at 1.6 trillion parameters is the largest open-weights model anyone has shipped to date, ahead of Kimi K2.6 and GLM-5.1. The architecture is genuinely new, not just V3 with more layers. DeepSeek built a hybrid attention system (CSA + HCA) that uses about 27% of the per-token compute of V3.2 at 1M context, plus they trained directly in FP4 instead of quantising afterwards. The full technical report is on the Hugging Face model card if you want the actual maths.

The reasoning model line is folded in too. Instead of picking between deepseek-chat and deepseek-reasoner like before, V4 has three reasoning modes: Non-Think, Think High, Think Max. And tool calls now work inside thinking mode, which R1 couldn't do.

What's missing

No multimodal. Text in, text out. If you need vision, look at Kimi K2.6 or Gemini.

Also, mildly annoying: the model card doesn't ship with a Jinja chat template, so plan for that in your tokenisation pipeline. Small thing, but it'll catch you out if you assume it works like every other recent release.

Hands-On Testing: My Three-Workload Rig

Benchmarks tell you about benchmarks. They don't tell you whether the thing will actually do your job. So I've settled on three tests I run on every new model, picked because they cover most of what I use AI for in a normal week.

Codebase audit. I have it audit my own blog, which is React Router 7 framework with TypeScript. Real code, real complexity, things I genuinely care about being right.
Logic-heavy terminal app. A poker simulation that runs thousands of hands and returns statistics. Tests reasoning, structure, edge cases.
Web design from cold. Two different prompts to see how the model handles aesthetics and layout.

Here's how V4 did on each.

Test 1: Codebase Audit

Okay, but not great.

V4-Pro found a handful of real things, but it also flagged a bunch of stuff that wasn't actually a problem. Things like style nitpicks, or "consider extracting this" suggestions in places where extraction would make the code worse. Meanwhile it missed a couple of things that both GPT-5.5 and Claude caught the first time around.

GPT-5.5 is still my pick for code audits. It's the most thoughtful about what's actually a bug versus what's just different from how it would have written it. V4 tends toward over-flagging, which is exhausting when you've got a real codebase to triage.

This matches the benchmarks, by the way. On SWE-Bench Pro (the harder, more realistic coding eval), V4-Pro lands around 55%, behind Claude Opus 4.7 (64.3%), Kimi K2.6 (58.6%), and GLM-5.1 (58.4%). The headline SWE-Bench Verified number is essentially tied with Claude, but the harder benchmark tells the truer story.

Test 2: Poker Simulation

This one was closer. The code worked, the statistics came out right, the structure was reasonable. V4 didn't fall over.

But Claude and GPT-5.5 both did it better. Cleaner separation between the simulation core and the reporting layer, fewer iterations to get to working code, slightly more idiomatic Go. V4's version felt like a competent junior engineer's first pass that you'd then refactor. Theirs felt like something a senior would commit.

Not bad. Just not first.

Test 3: Web Design (Two Builds)

This is where it got interesting.

I gave V4 two prompts. First, a coffee roaster website. Second, a modern pop culture online shop.

The coffee roaster came out almost spookily similar to what Claude would produce. Same warm earth-tone palette, similar serif-and-sans pairing, that whole "we take our beans seriously" vibe. But the layout was cookie cutter. Hero section, three feature cards, story block, footer. Boring. The kind of design you've seen a thousand times.

The pop culture shop, though, was genuinely good. Striking layout, confident typography, played with grid in interesting ways. I'd happily ship it as a starting point for a real project.

The takeaway I keep chewing on: V4 can clearly do great design. It just defaults to safe templates unless the prompt subject pulls it somewhere distinctive. Coffee roaster apparently lives in the boring-template region of its training data. Pop culture shop apparently doesn't. Worth knowing.

What I learned from testing

Pattern: V4 is competent on everything, outstanding on nothing. Which is exactly the right shape for a value model. You wouldn't pay Opus prices for "competent." But at $0.14 per million input tokens for Flash, "competent" is an absolute steal.

For coding specifically, Claude (Opus 4.7) and GPT-5.5 still win on quality. For everything else where the answer doesn't have to be perfect, V4 is hard to beat.

Benchmarks That Actually Matter

A quick run through the benchmarks worth paying attention to, because there's signal in here even if it doesn't override what I saw in testing.

SWE-Bench Verified: V4-Pro 80.6%, basically tied with Claude Opus 4.6 (80.8%).
SWE-Bench Pro (the hard one): V4-Pro ~55%, behind Opus 4.7 at 64.3%, Kimi K2.6 at 58.6%, GLM-5.1 at 58.4%.
Artificial Analysis Intelligence Index: 52, which makes V4-Pro the second-best open-weights model behind Kimi K2.6 at 54. Full breakdown on Artificial Analysis.
LiveCodeBench: 93.5, the highest reported number on this benchmark.

One footnote that doesn't get enough airtime. The US government's CAISI evaluation at NIST ran V4-Pro on held-out, non-public benchmarks and placed it closer to GPT-5 (about 8 months old) than to GPT-5.4 or Opus 4.6. Treat the headline benchmark equivalence as an upper bound. There's likely some public-benchmark overfitting going on, which is normal but worth knowing.

The other thing to flag: V4 hallucinates more than its peers when it doesn't know something. The AA-Omniscience eval clocks it at a 94% hallucination rate when uncertain. Translation, when V4 isn't sure, it doesn't tell you, it just answers. For RAG and research workflows, ground it explicitly.

How V4 stacks up against the open-weights pack

Model	Open weights	Context	SWE-Pro
DeepSeek V4-Pro	Yes (MIT)	1M	~55%
Kimi K2.6	Yes (mod. MIT)	256K	58.6%
GLM-5.1	Yes (MIT)	200K	58.4%
MiniMax M2.7	Mixed	200K	56.2%
Claude Opus 4.7	No	200K	64.3%

Quick read: Kimi K2.6 is the smartest open model overall, GLM-5.1 wins long-horizon agentic work, V4 wins on price plus context length. They're all genuinely useful for different things.

Pricing: Where V4 Actually Wins

This is the part that matters most.

Model	Input ($/1M)	Output ($/1M)
DeepSeek V4-Flash	$0.14	$0.28
DeepSeek V4-Pro	$1.74	$3.48
Gemini 3.1 Pro	~$2.00	~$12.00
GPT-5.5	$5.00	$30.00
Claude Sonnet 4.6	$3.00	$15.00
Claude Opus 4.7	$5.00	$25.00

V4-Flash is the cheapest input price you'll find on a frontier-tier model anywhere, beating even GPT-5.4 Nano. V4-Pro is the cheapest of the larger frontier models. And cache hits are 99% off, which is huge for agentic workflows that resend big system prompts.

To make this concrete: a Hacker News commenter on Simon Willison's V4 review ran a full layer-by-layer audit of a TypeScript endpoint (API, DTOs, service, database models) for $0.09 on V4-Pro. The same audit on Claude Opus 4.7 would have cost roughly $9 to $13. That's a 100x ratio.

One caveat though, V4 is verbose. To complete the Artificial Analysis Intelligence Index, V4-Pro burned 4 to 5 times the median output tokens. The headline per-token price oversells the real bill. It's still cheaper than the alternatives, just not by quite as wild a margin as the sticker suggests.

Easiest ways to actually use V4

If you just want to try it, two easy paths.

The simplest is the OpenCode Go subscription, which I reviewed a while back. V4 is included in the plan, no API keys to set up, and you get a proper terminal coding agent out of the box.

Otherwise, the DeepSeek API itself is genuinely cheap and trivial to set up. Point any Anthropic-compatible client (Claude Code, OpenCode, OpenClaw) at the DeepSeek base URL with your API key and it works as a drop-in. That's it.

The Verdict: When to Use V4, When Not To

Use V4 for:

High-volume API workloads where cost matters more than tail quality
Agentic background work running 24/7
Long-context tasks above 200K tokens
Anywhere open weights or on-prem deployment is a hard requirement

Don't use V4 for:

Code audits where missing a real bug is expensive (GPT-5.5 still wins here for me)
Hard reasoning steps like research-grade math (GPT-5.4 or Gemini 3.1 Pro)
Big-codebase production edits where SWE-Bench Pro matters (Claude Opus 4.7)
Anything multimodal

V4 is the new default workhorse, not the new champion. The bar for "good enough cheap model" just got a lot higher, and that pulls the price-quality curve into a useful new shape. You can route 80% of your agentic and coding traffic to V4 and reserve Opus or GPT-5.5 for the genuinely hard sub-tasks where the 10x to 20x cost premium actually buys you something.

For my own work, the stack I'm settling into looks like: V4-Flash for bulk and background stuff, V4-Pro for medium-hard work, Claude Opus 4.7 or GPT-5.5 when the answer has to be right the first time.

DeepSeek themselves admit in their technical report that V4 trails the absolute frontier by 3 to 6 months. CAISI's data suggests it might be more like 8. Honestly, for most of what most of us do, that gap doesn't matter. The price ratio matters more.

And selfishly, I'm just happy a new strong model shipped. More competition makes everything better for everyone shipping with these tools. Bring on V5.

The Ralph Loop: How Recursive AI Agents Actually Work

contact@thomas-wiegold.com (Thomas Wiegold) — Sun, 03 May 2026 00:00:00 GMT

Here's the entire technique, in one line of bash:

while :; do cat PROMPT.md | claude -p --dangerously-skip-permissions; done

That's it. That's a Ralph loop. The first time I saw it I assumed I was missing something, because surely a recursive AI agent had to be more complicated than four words and a pipe. It isn't.

What's actually happening here is genuinely strange. You've got a coding agent reading the same prompt over and over, modifying a codebase on disk, and using the file system as its memory instead of conversation history. Run it overnight on a well-specified task and you wake up to a working program, fifty git commits, and a journal of everything the model tried, broke, and fixed. It feels closer to science fiction than anything else I currently use as a developer.

In this article I'll walk you through what a Ralph loop is, how it works, how to actually run one in Claude Code or Codex, and (the part I find most fascinating) what you can read in its journal in the morning. I'll also be honest about where it falls over and where it's the wrong tool entirely.

Geoffrey Huntley, who coined the term, calls Ralph "deterministically bad in an undeterministic world." Once you understand why that's a feature rather than a bug, the rest of this clicks into place.

What is a Ralph loop?

A Ralph loop is a recursive AI agent pattern where a coding agent runs in an infinite shell loop, reading the same prompt file each iteration, modifying the codebase on disk, and using the file system instead of conversation history as its memory. Each iteration starts with a fresh context window. State survives between iterations through the codebase, a TODO file, and git history.

The technique was named by Huntley in his July 2025 blog post, which is still the canonical reference. The name comes from Ralph Wiggum, the cheerful Simpsons character who rams his head into doorframes and announces "I'm helping!" Huntley's framing is that this kind of dumb, persistent loop is surprisingly effective. As Dex Horthy puts it in his history of the technique, "dumb things can work surprisingly well."

(There's a second origin story for the name. "Ralph" is also slang for vomiting, and Huntley has said the realisation of how cheap autonomous code generation had become made him want to. So: Ralph the character, Ralph the verb, both apply.)

Why it's not just a `while true` loop

Here's the part that took me a few reads to get. The reason this works is fresh context every iteration. That's not a side effect, it's the point.

LLMs degrade as their context fills up. Past somewhere between 100k and 150k tokens, depending on the model, quality measurably drops. Practitioners call this the Dumb Zone. Long agent sessions inevitably drift there, and Claude's auto-compaction, when it kicks in, is lossy. Your specs can quietly get summarised into vagueness without you noticing.

A bash loop sidesteps all of that. Each iteration starts with the exact same allocated context: the same PROMPT.md, the same AGENTS.md, the same specs/*.md files. What changes between iterations is the codebase on disk and a small TODO file. The loop's input is stable while the world (the repo) converges toward the spec. That's why "deterministically bad" matters. The model is reliably mediocre at every iteration, and over time it grinds the codebase into shape.

How it actually works

A single iteration looks like this:

Bash reads PROMPT.md and pipes it into the agent.
The agent reads fix_plan.md and picks the single most important pending task.
It searches the codebase, implements the change, runs the relevant tests.
If tests pass, it commits with a structured message and updates fix_plan.md.
If something useful was learned about the build or the project, it updates AGENTS.md briefly.
The agent exits.
Bash restarts the whole thing. Fresh context. Modified codebase.

The single most important rule, repeated across every Ralph implementation I've read, is one thing per iteration. Not "one thing plus a quick refactor while you're in there." One thing. Ask the agent to do exactly one task per loop, and trust it to decide what's most important from the plan file. Try to cram more in and you'll watch it pick the easiest item every time and ignore everything hard.

State survives between iterations through five places:

The codebase. A green build is the strongest signal that previous work landed.
fix_plan.md or progress.txt, the working TODO list.
AGENTS.md or CLAUDE.md, operational learnings.
specs/*.md, frozen requirements that don't change between loops.
Git history. Structured commit messages double as a journal.

There are variations. You can run a bounded loop with --max-iterations 50. You can have the agent emit a sigil like COMPLETE and grep for it to break out. You can do a two-phase pattern where one prompt does gap analysis with no code changes, and a separate loop implements. You can cron it to "one small refactor every morning" instead of an overnight blitz. They're all Ralph, just different cadences.

Running Ralph in Claude Code, Codex, and other tools

Claude Code

The bash one-liner I opened with is the canonical Claude Code Ralph. The two flags that matter are -p for headless mode (read prompt from stdin, write to stdout, exit) and --dangerously-skip-permissions to bypass approval prompts. Without the second flag, the loop just stops on the first file write.

The trade-off should be obvious. You're handing Claude Code unrestricted shell access to whatever directory you ran it in. Don't do this on your main machine without isolation. Use a Docker sandbox, a devcontainer, a fresh git worktree, or a remote VM. Huntley's framing is that it's not a question of if your loop gets compromised, it's when, and what the blast radius is. He's right.

The supporting cast of files matters more than the bash. CLAUDE.md at the repo root gets auto-loaded into every session and is where you put operational rules. specs/*.md is for frozen requirements. fix_plan.md is the mutable TODO. Keep PROMPT.md short and reference the rest with @filename syntax instead of inlining everything.

Anthropic shipped an official ralph-wiggum plugin for Claude Code in December 2025. It's worth knowing about, but there's a real debate over whether it's the same thing. The plugin re-feeds the prompt inside one growing session via a Stop hook. That's not fresh context per iteration. Horthy's review is blunt: it misses the point. I'd pick the bash version. The plugin does lower the barrier, though, if you want to dip a toe in.

Cost reality: running Sonnet 4.5 in a bash loop with autonomous tool use lands around ten dollars an hour on metered API. If you're going to do this regularly, the Claude Max 20x plan at $200 a month is usually cheaper than running it metered.

OpenAI Codex CLI and the new `/goal` command

This is the newest piece of news in this space. On April 30, 2026 (three days ago, as of writing) OpenAI shipped Codex CLI 0.128.0 with a /goal command that is essentially Ralph as a first-class primitive.

The behaviour: you set a goal, Codex keeps looping until it self-evaluates the goal as complete, or until the configured token budget runs out. You don't write the bash. The internal templates that drive this are visible in the Codex repo if you want to see how OpenAI prompts the self-evaluation step.

Two ways to ralph in Codex now:

# The new way, since 0.128.0
codex /goal "Make all tests pass and commit each green checkpoint"

# The classic bash loop, still works
while :; do cat PROMPT.md | codex exec --yolo -; done

The /goal version is friendlier and probably what most people will reach for. The bash version is more robust for the same reason the Anthropic plugin is debated. /goal is in-session, one growing context window, while the bash version gets fresh context every iteration. If you're doing a long overnight run, I'd still pick bash. For shorter, well-bounded tasks, /goal is great.

Other tools, briefly

The pattern travels to almost every coding agent.

Cursor has a cursor-agent headless CLI that ralphs cleanly. There's also a Cursor plugin that does the in-session Stop-hook variant. Aider doesn't read stdin natively, but you can wrap it: bash -c 'aider --yes-always --message "$(cat PROMPT.md)"'. Aider's --architect mode pairs a planner model with a coder model, which is a natural plan-build split. Goose (Block's open-source agent) ships a first-class Ralph tutorial with cross-model review built in. One model implements, a different model reviews. This is the closest thing to a Reflexion pattern out of the box, and it's the easiest way to ralph with a local LLM via Ollama. GitHub Copilot CLI works for short Ralph runs in programmatic mode. And if you're building a product around this, the Vercel AI SDK has a ralph-loop-agent example that's a clean TypeScript starting point.

Why the journal is where it gets interesting

Most Ralph coverage focuses on what gets built. I think that's the wrong half of the story. The interesting half is what you can read in the morning.

When every iteration commits with a structured message and writes failures to a log, you end up with something I haven't seen from any other AI tooling: a readable, narrative record of what the model tried, what broke, and why it thought it broke. A successful commit is, frankly, boring. The interesting reads are the failures, where the agent spends three iterations chasing a wrong hypothesis, finally figures out the test was misnamed, fixes the test, and moves on.

Here's the logging stack I'd set up:

Git commit per success, with a structured message: what changed, which test passed, iteration number. This becomes the primary journal.
.ralph/errors.log per failure, with the agent's own reflection on the cause. Tell it explicitly in PROMPT.md to append to this file when it gives up on an approach, and to include why.
Append-only progress.md, where the agent is allowed to be opinionated about what's hard or surprising. This is where the personality leaks through.

One concrete tip lifted from Huntley's CURSED prompt: instruct the agent to write the why into test docstrings. Future iterations won't have its reasoning in their context, so the only way to teach the next Ralph anything is to make it readable from disk. After a few hundred iterations, your test suite ends up commenting itself with the agent's archaeology of past mistakes. That's pretty cool, and it's also the most AGI-adjacent thing on my laptop right now.

The morning experience is what makes me keep coming back to this. Coffee, scroll the commits, find three places the agent surprised me, two it embarrassed itself, and one bug it caught that I'd have missed. I won't oversell it as artificial general intelligence. But it's a process running on its own that produces a journal worth reading, and that's a category of experience I didn't have a year ago.

What Ralph is actually good for

The clearest case: measurable, mechanical work

The shared property of every Ralph success story I trust is that the success criterion is a number. Tests passing. Lint count down. Types clean. Coverage up. Build green.

In that mode it works, sometimes spectacularly. The repomirror team shipped six framework ports overnight at a YC hackathon, running six concurrent loops in git worktrees, racking up a thousand-plus commits and about $600 in API spend. Huntley shipped a $50,000 client MVP for $297 in API costs. The CURSED programming language, including a self-hosting compiler, was built over roughly three months of continuous Ralph runs.

The boring middle of the road still works fine: TypeScript strict-mode adoption across a monorepo, ESLint flag flips, dependency upgrades like Jest to Vitest or React 17 to 19, alt-text generation for product images, internal-link passes across a content site. Anywhere the work is repetitive and the verifier is mechanical, Ralph eats it.

The interesting case: fuzzy success, with a judge

Here's the part the standard Ralph coverage skips. The success signal doesn't have to be a function exit code. It can be:

An LLM as judge. Slower and weaker, but works for prose, copy, and vague aesthetic targets.
A data source. Lighthouse score, conversion rate, Web Vitals, search ranking.
A human. You, with coffee, tapping thumbs-up or thumbs-down on yesterday's variant.

Imagine a Ralph loop iterating on your website's design. One small variant per night. In the morning you give it a thumbs up or thumbs down, and the result feeds back into progress.md. Over a month, you've made thirty small, judged design changes, each one a tiny convergent step.

The constraint that makes this work is the same as the coding case: small, focused changes per iteration. Telling the agent to redo the entire site every night is a coin flip. Telling it to nudge one component, one heading, one CTA, one section per night gives you compounding improvement. The same logic applies to copywriting, ad creative, and probably a dozen other places I haven't tried yet.

I'll be honest, though: this is harder to set up well than the measurable-result version. The judge loop has more moving parts. A way to deliver the variant. A way to capture your vote. A way to feed it back to the agent without poisoning context. Today, the function-as-judge case is much easier. The judge case is where I'd bet the interesting product opportunities live.

When Ralph is the wrong tool

A short list, because being honest about this matters more than the tips.

One-shot tasks (just use Claude or Cursor interactively). True exploration where you don't know what you want (Ralph optimises for a green build, not for taste or insight). Brownfield code with strict review processes (the bottleneck is human review of forty-thousand-line PRs, not API tokens). Anything irreversible: production database migrations, deletes you can't undo, financial transactions. UX copy with no judge wired in (the agent will mark itself complete and move on, confidently wrong).

The right mental shortcut, borrowed from Meag Tessmann's writing, is to ask: is the output machine-verifiable? If yes, loop. If no, get a human, or wire up a judge.

Tips so it doesn't waste your token budget

A few hard-won pieces of advice from people who've run thousands of iterations.

Sandbox always. Docker, devcontainer, git worktree, or remote VM. Don't argue yourself out of this. You're handing a non-deterministic process unrestricted shell access. The blast radius is whatever it can reach.

Keep the primary context under 100k tokens. Use subagents for anything that returns a lot of tokens you don't need to keep. A subagent grepping the codebase and returning "here's the function" is much better than the primary agent reading a 4,000-line file. Once the main context drifts past 100k, quality drops measurably.

Two-phase plan/build. Have a separate, one-shot prompt that refreshes fix_plan.md based on a gap analysis against specs/*.md. The loop only ever implements. This is the single biggest reliability upgrade you can make.

Layer your circuit breakers. Max iterations. Per-iteration timeout (15 minutes is a reasonable default). Hourly token cap. Stuck detection (if the same test fails three iterations in a row, stop and notify). Cost cap if you're on the API.

Wire backpressure aggressively. Type-check, lint, tests, build. Anything that returns non-zero rejects the iteration. For dynamically typed languages, adding a static analyser (mypy or ruff for Python, the relevant equivalent for whatever you're using) is non-negotiable.

Don't use --continue or --resume. They defeat the fresh-context guarantee. The whole point of a bash Ralph is that you start clean every iteration.

Set a notifier and walk away. ntfy.sh, a Slack webhook, a macOS notification, whatever you've already got wired up. Watching iteration 47 of 200 is somehow worse than watching paint dry. Tune iterations one through three carefully, then leave. Read the journal in the morning.

Should you try it?

Honest take: Ralph is a young technique. The original blog post is from July 2025. The first vendor support shipped in December 2025. Most of the splashy cost numbers I've quoted are self-reported by the technique's loudest advocates. Thoughtworks' Technology Radar has it as Trial, not Adopt, and they're right to be cautious.

That said: the cheapest possible experiment is a five-iteration human-in-the-loop Ralph on a well-specified, mechanical task. Pick something tiny. A TypeScript file you've been meaning to clean up. A small algorithm with a clear pass/fail. Run a single iteration manually in a sandboxed worktree, read the diff and the commit, then run another. You'll know in thirty minutes whether this fits how you work.

The thing that keeps pulling me back isn't that it codes. It's that it keeps going, and the trail it leaves is genuinely interesting to read. That part feels like the future, and it's already here, in four words and a pipe.

Build an AI SEO Agent in TypeScript with Claude

contact@thomas-wiegold.com (Thomas Wiegold) — Sat, 02 May 2026 00:00:00 GMT

Search "AI SEO agent" right now and the first page is mostly screenshots of gumloop and n8n with arrows pointing at boxes. Useful if you want a no-code workflow. Less useful if you write code for a living.

So I built one in TypeScript instead. Around 140 lines, one dependency for the agent framework, calls Claude directly. Point it at a competitor's blog with a topic in mind, and it crawls, scores each page for relevance, and hands back the most relevant ones ranked. The full repo is on GitHub, and there's a mock mode that runs without an API key, which I'll show later.

The interesting bit isn't the crawler. It's the architecture: fetching and LLM scoring run concurrently, not sequentially. While Claude judges page N, the crawler is already pulling page N+1. Plain async/await makes that awkward. A reactive agent makes it fall out for free.

What "AI SEO agent" actually means here

The phrase covers at least three different things. People might mean a content brief generator (give me a keyword, get back an outline). They might mean an autonomous workflow that publishes posts (please don't). Or they might mean a research crawler, which is what this is.

Concrete use case: competitor content gap analysis. You're planning to write about, say, "AI coding tools." A competitor has 200 blog posts. Which of them actually cover that topic well? Reading 200 URLs by hand is not happening. Screaming Frog will tell you which pages exist, but it won't tell you which pages are about what you care about.

That's the gap. An LLM can read a page and judge relevance semantically, not just by keyword overlap. Wire that into a crawler and you get a focused crawl: pages above a relevance threshold contribute their links to the queue, pages below it are dead ends. Five minutes of compute, ranked output.

The output of one of these runs slots cleanly into a normal SEO workflow. Top three competitor pages on the topic? Read them, take notes on what they cover, find what they don't. Score distribution skewed low across the board? You've found a topic gap, write the post. Score distribution clustered tight at the top? Tough niche, you'll need a fresh angle. The crawler doesn't replace judgment, it just makes the input to your judgment manageable.

Why a reactive framework beats sequential `await`

I'll spend a moment on this because it's the only architectural decision in the project that actually matters.

The naive version

The obvious shape:

for (const url of frontier) {
  const page = await fetchPage(url);
  const { score, links } = await scoreWithClaude(page);
  if (score >= 6) frontier.push(...links);
}

Reads fine. Works. Slow. The problem is that fetch and Claude both block, in serial. While Claude is scoring page N (a second or two of latency), nothing is fetching page N+1. While page N+1 is being fetched, Claude is idle. Throughput is the sum of your latencies, not the max.

For ten pages, that's roughly twenty round-trips of dead time.

The reactive version

The reactive shape declares two independent triggers. One fetches when there's a URL in the queue and nothing is currently being fetched. The other scores when there's an unscored page and nothing is currently being scored. Both kick off their async work and return immediately, so the agent loop is never blocked. They run concurrently because there's nothing forcing them not to.

The framework I'm using is agentiny, which I built for exactly this kind of thing. (I wrote up a support ticket triage agent with it earlier, if you want a different shape of example.) The model is when(condition, [actions]). When the condition becomes true, the actions run. State changes re-evaluate conditions. That's it.

For this crawler, the result is real I/O concurrency. Page 4 fetches while page 3 is being scored while page 2's links are being added to the frontier. You can see it in the timestamps when you run the mock test below.

Architecture: three triggers, two mutexes

State shape

interface CrawlState {
  topic: string;
  origin: string;
  frontier: string[];
  visited: Set<string>;
  pages: Page[];
  fetching: boolean;
  scoring: boolean;
  maxPages: number;
  done: boolean;
}

frontier and visited are standard crawler bookkeeping. pages accumulates results. The two boolean flags act as mutexes: fetching is true when an HTTP request is in flight, scoring is true when Claude is mid-call. They prevent two of the same kind of work happening at once, which keeps the crawl polite. done is set by the stop trigger.

The triggers

        ┌──────────────────────────────────┐
        ▼                                  │
  trigger 1: fetch ──→ trigger 2: score ───┘ (if score ≥ 6)
        │                  │
        └──→ trigger 3: stop ◄──
             (when nothing in flight, queue drained)

Three triggers, each around 15 lines. Fetch picks up a URL from the frontier and runs an HTTP request. Score takes the next unscored page and asks Claude for a 1 to 10 relevance rating. Stop fires once when everything is quiet.

Why one in-flight fetch and one in-flight score, instead of full parallelism? Two reasons. Politeness: hammering a competitor's site with ten parallel requests is rude and gets you blocked. And cost: parallel Claude calls are easy to fan out, but for ten pages the total wall time is already short enough that fan-out doesn't pay back the rate-limit risk. The mutex pattern gives you something more useful, which is pipelining between two different kinds of work.

The code, walked through

Project setup

{
  "type": "module",
  "scripts": {
    "start": "tsx --env-file=.env src/index.ts",
    "mock": "tsx src/mock.ts"
  },
  "dependencies": {
    "@agentiny/core": "^0.5.0",
    "@anthropic-ai/sdk": "^0.92.0"
  }
}

That's the whole dependency surface. agentiny for the reactive loop, the official Anthropic SDK for Claude calls. Three source files: crawler.ts (the agent), extract.ts (a tiny HTML parser), and index.ts (CLI entry).

The fetch trigger

agent.when(
  (s) => !s.done && !s.fetching && s.frontier.length > 0,
  [
    (s) => {
      const url = s.frontier.shift()!;
      s.visited.add(url);
      s.fetching = true;

      void fetcher(url, seed.origin)
        .then((page) => {
          const cur = agent.getState();
          cur.pages.push(page);
          cur.fetching = false;
          agent.setState(cur);
        })
        .catch((err: Error) => {
          const cur = agent.getState();
          cur.fetching = false;
          agent.setState(cur);
        });
    },
  ],
);

The shape is the important bit. The action does its synchronous work (pop the URL, flip the mutex) and then kicks off the HTTP request as fire-and-forget. The .then mutates state in place and calls agent.setState(cur) to wake the loop. Notice it's the same state object, not a spread. That's deliberate, more on that below.

The score trigger

agent.when(
  (s) => !s.done && !s.scoring && s.pages.some((p) => p.score === undefined),
  [
    (s) => {
      const page = s.pages.find((p) => p.score === undefined)!;
      s.scoring = true;

      void scorer(s.topic, page).then(({ score, reason }) => {
        const cur = agent.getState();
        const target = cur.pages.find((p) => p.url === page.url);
        if (!target) return;
        target.score = score;
        target.reason = reason;
        cur.scoring = false;

        if (score >= 6) {
          const fresh = page.links.filter((l) => !cur.visited.has(l) && !cur.frontier.includes(l));
          cur.frontier.push(...fresh);
        }
        agent.setState(cur);
      });
    },
  ],
);

Same shape. The interesting line is the threshold check. If Claude scores the page 6 or higher, its links go onto the frontier. Otherwise the subtree is pruned. That's what makes this a focused crawl rather than a generic one.

A real concurrency bug bit me here while building this. An earlier version used a helper that captured the state reference at action entry, then awaited the LLM call. If another fire-and-forget callback called setState({...cur}) during the await, the captured reference was stale and the mutations got lost. The fix is the mutate-don't-spread pattern you see above: same state reference, mutated in place, setState(cur) only as a signal to wake the loop. Worth knowing if you go reactive with concurrent triggers.

The stop trigger and the runner

agent.once(
  (s) =>
    !s.done &&
    !s.fetching &&
    !s.scoring &&
    s.pages.every((p) => p.score !== undefined) &&
    (s.visited.size >= s.maxPages || s.frontier.length === 0),
  [
    (s) => {
      s.done = true;
      agent.setState(s); // important: notifies subscribers
    },
  ],
);

The runner uses subscribe rather than agentiny's settle() helper. settle() waits for quiet polling cycles, which doesn't fit fire-and-forget patterns: there can be long quiet gaps while the network and Claude are both busy, and settle resolves prematurely. Subscribing on done and waiting for it to flip is the right signal here.

await new Promise<void>((resolve) => {
  const unsub = agent.subscribe((s) => {
    if (s.done) {
      unsub();
      resolve();
    }
  });
});

Running it

Real run, against any site you have permission to crawl:

$ npm start "https://competitor.com" "AI coding tools"

[fetch] https://competitor.com/ → Home
[score] https://competitor.com/ → 7/10  +12 links
[fetch] https://competitor.com/blog → Blog
[fetch] https://competitor.com/blog/claude-review → ...
[score] https://competitor.com/blog → 9/10  +8 links
[score] https://competitor.com/blog/claude-review → 10/10  +3 links
...
[done] 10 pages crawled

The interesting part is in the timing. Here's an actual extract from the mock test, with millisecond timestamps:

   1ms    → fetch start /
  42ms    ← fetch done  /
  42ms      → score start /
 149ms      ← score done  /  = 8 +3 links
 149ms    → fetch start /posts
 188ms    ← fetch done  /posts
 188ms    → fetch start /about
 188ms      → score start /posts
 220ms    ← fetch done  /about
 220ms    → fetch start /contact
 270ms    ← fetch done  /contact
 288ms      ← score done  /posts = 8 +2 links

Look at the 188ms mark. The fetcher just finished /posts and is already onto /about, while the scorer is also working on /posts. By 220ms the fetcher is done with /about and started on /contact, all while the scorer is still chewing on /posts. Three URLs fetched in the time it took the LLM to score one. That's the agentiny payoff in one frame. With sequential await, that 100ms scoring window would have been pure idle time.

Speaking of mocks: the crawler accepts injected fetcher and scorer functions, so you can run it with no API key at all.

npm run mock

This drives the agent against a fake in-memory site of ten pages, with a fake scorer that marks half of them relevant. Useful for CI, useful for trying the project before committing to a key, and useful for debugging the trigger logic without burning tokens. The injection point is just a config option, so the production code path is unchanged.

Where this falls short in production

The repo is a working tutorial, not a production crawler. Things you'd want before pointing it at the open web:

No robots.txt compliance. Add one before crawling sites you don't own. (This bit is non-negotiable.)
The HTML extractor is regex-based, which is exactly as fragile as it sounds. Swap in linkedom or cheerio.
One in-flight request and one in-flight score is conservative. With proper rate limiting and a per-host concurrency cap, you could fan out further.
No persistence. A crash at page 47 of 100 means starting over.
Cost: the demo uses claude-haiku-4-5, which is cheap. Don't reach for sonnet here unless the relevance judgements are visibly wrong. The smaller model handles this kind of binary classification fine.

None of these are big lifts. The reactive structure makes them easy to slot in: add a robots check inside the fetch action, wire the frontier through SQLite, add a third trigger for retries.

Wrap-up

That's the whole thing. About 140 lines, real I/O concurrency, mock mode that needs no key, and an honest list of what would break if you took it to production.

Three obvious extensions if you want to keep going. Persist the frontier and visited set in SQLite, so the crawler is resumable. Add a fourth trigger that takes the top-N scored pages and writes a content brief, turning research into draft. Or expose the whole thing as an MCP server, so Claude itself can call it during a chat.

Repo is here, agentiny is here.

How to Use the Claude Code Frontend-Design Plugin to Stop Shipping AI Slop

contact@thomas-wiegold.com (Thomas Wiegold) — Mon, 27 Apr 2026 00:00:00 GMT

I can spot an AI-generated website in about half a second. So can you, probably. Inter at 400 weight, a purple-to-blue gradient hero, three feature cards with 16px rounded corners, a headline like "Build the future of work" that could equally belong to a CRM, a dental practice, or a Bitcoin scam.

I've been writing prompts for design ever since AI coding became viable, and the results have been genuinely hit-or-miss. Some outputs I'd happily ship. Others are exactly the AI slop everyone complains about. Two prompts apart. Same model, same project, same hour of the afternoon.

What changed is that Anthropic shipped an official frontend-design plugin for Claude Code. It's not magic, it's a 4.5KB markdown file, but it consistently lifts the floor on what Claude produces. This is how to use the Claude Code frontend-design plugin without overrating it, the prompting principles that actually move the needle, and how to apply all of this to WordPress and Shopify themes without losing your mind.

I'll also be honest about where it doesn't help, because there are limits.

Why AI-Generated Web Design All Looks the Same

The AI slop fingerprint is pretty consistent. Inter or a system font, a purple-indigo gradient somewhere on the hero, oversized headlines with vague copy, three cards in a row, uniform border radius, shadows at exactly 0.1 opacity. It's the visual equivalent of "in today's fast-paced world" — you've seen it so many times you stop reading.

The reason is statistical. LLMs generate by predicting the most probable next token, and for frontend code that probability mass sits squarely on the design conventions that dominated developer Twitter and Dribbble between 2020 and 2022. Tailwind UI, Linear's Magic Blue, the early Stripe and Vercel aesthetic. That's what got scraped into training data, so that's what gets predicted back.

Adam Wathan, Tailwind's creator, posted a now-famous self-deprecating tweet apologising for picking bg-indigo-500 as Tailwind UI's default five years ago. One default, multiplied by a million tutorials, became the entire AI-generated internet's accent colour.

It's the same dynamic as AI-written content. Certain words give it away — "delve," "tapestry," that em-dash cadence. With AI design, the tells are visual: the font, the gradient direction, the rounded-corner-on-everything reflex. Once you can see it, you can't unsee it. Your visitors can see it too. They just don't know the vocabulary.

What the Claude Code Frontend-Design Plugin Actually Does

When I first heard about the plugin I was expecting some elaborate piece of tooling. Maybe a fine-tuned model, a vision component, a new mode in Claude Code. It's none of that. The whole thing is a single SKILL.md file, about 50 lines of markdown, sitting in Anthropic's public repo. Around 300,000 installs as of late April 2026.

What the file does is make Claude declare its hand before generating any code. It forces a four-question framework — purpose, tone, constraints, differentiation — and demands an actual aesthetic commitment before a single CSS rule gets written. The "tone" question isn't soft either. It lists about a dozen extremes: brutally minimal, maximalist chaos, retro-futuristic, organic, luxury, playful, editorial, brutalist, art deco, soft pastel, industrial, and tells Claude to pick one and execute it precisely.

Then it governs five concrete dimensions:

Typography — pair a distinctive display font with a refined body font. Avoid Arial, Inter, Roboto.
Colour and theme — CSS variables for consistency. Dominant colour with sharp accents, not a timid evenly-distributed palette.
Motion — CSS-only for HTML, Motion library for React. One coordinated page-load reveal beats scattered micro-interactions.
Spatial composition — asymmetry, overlap, diagonal flow, grid-breaking elements.
Backgrounds — atmosphere via gradient meshes, noise textures, geometric patterns. Refusing solid colour is half the battle.

The forbidden list is delightfully specific. Inter, Roboto, Arial, system fonts. Purple gradients on white backgrounds. Predictable layouts. Cookie-cutter components. It even calls out Space Grotesk by name, Claude's go-to "anti-Inter", because once you tell it to avoid Inter, it converges on Space Grotesk across every generation. The skill's authors clearly noticed and wrote it in.

This is why it works. It's a forcing function. Instead of letting Claude predict the highest-probability next token, the skill makes it commit to a direction first, then generate tokens consistent with that direction. The median pull weakens. You actually get variance in the output.

How to Install the Frontend-Design Plugin in Claude Code

If you're on Claude Code with plugins enabled, it's a one-liner:

/plugin install frontend-design@anthropics/claude-code

Once installed, it auto-loads on any frontend prompt. You don't have to invoke it manually.

If you want it versioned in your project, which I'd recommend for anything client-facing, drop the SKILL.md file directly:

mkdir -p .claude/skills/frontend-design
curl -o .claude/skills/frontend-design/SKILL.md \
  https://raw.githubusercontent.com/anthropics/claude-code/main/plugins/frontend-design/skills/frontend-design/SKILL.md

The nice thing about it being plain markdown is that it's portable. Drop the same file into .cursor/rules/frontend-design.mdc and Cursor reads it. Codex picks it up via AGENTS.md. GitHub Copilot will read it from .github/copilot-instructions.md. Google's Antigravity reads Anthropic-format skills natively.

The skill is the moat, not the product. Whichever agent you've settled on, the same 50 lines of instructions improve the output.

How to Prompt AI for Sophisticated Web Design

Here's where it gets counterintuitive. The biggest prompting mistake I made for the first six months of AI coding was thinking that more specificity meant better output. I'd specify hex codes. Pixel paddings. Font sizes to two decimal places. The output got worse.

It took reading Anthropic's frontend aesthetics cookbook, the prompting half of the same package as the plugin, to understand why. When you fill the prompt with rigid pixel-level specs, you're using up the model's tokens on conservative defaults. There's no room left for creative choice. So Claude does what it's been told and adds nothing of its own. The result is technically correct and visually dead.

Total openness doesn't work either. "Build me a fitness landing page" gets you the statistical mean. AI slop, on a plate.

The sweet spot is principle-based direction. Tell Claude what to think about, not what to produce. Anthropic's cookbook publishes three strategies that consistently produce better output: guide specific design dimensions, reference design inspirations without being prescriptive, and explicitly call out the defaults you don't want.

Here's a side-by-side I run through in my head before any design prompt:

Bad prompt:

Build a hero section with a 1200px container, 80px padding, H1 at 64px Inter Bold, subhead at 18px Inter Regular, purple gradient background from #6366f1 to #8b5cf6, white CTA button with 8px border radius.

This is a recipe for slop. You've handed Claude every default it would have picked anyway and removed any room to surprise you.

Good prompt:

Landing page for a fitness coaching app aimed at busy professionals — people who don't want to live at the gym.

Aesthetic direction: editorial magazine, not SaaS. Pair a serif display font like Fraunces or Playfair with a clean geometric sans for body. Dominant colour: a single warm earth tone with a single sharp accent. One coordinated motion moment on page load, not micro-interactions everywhere.

Avoid: Inter, purple, generic gradients, three-card feature grids.

You've given Claude a direction, purpose, tone, font category, colour family, motion philosophy, and trusted it to make the actual choices inside that frame. The output is dramatically better.

Encouraging the AI to be creative, explicitly, in the prompt itself, makes a measurable difference. It sounds silly. It works. Treat Claude like an intern with great recall and zero context: brief it the way you'd brief a creative team, then let it actually do the creative execution.

The Small Changes That Separate Good Design From AI Slop

After hundreds of generations across personal projects and client work, here's what I've found genuinely matters. The gap between a result I ship and one I throw away is almost always small. Right font, right size jump, slightly different padding rhythm. The plugin gets the bones right. Tuning four levers gets you the last 30%.

In order of impact:

Typography is the single highest-leverage decision. One distinctive font choice is worth ten layout tweaks. The cookbook publishes a targeted typography prompt with curated categories — Editorial (Playfair Display, Crimson Pro, Fraunces), Startup (Clash Display, Satoshi, Cabinet Grotesk), Technical (IBM Plex), Distinctive (Bricolage Grotesque, Newsreader). Pick from a category that matches your project's tone. Don't let Claude pick from its default bag.

Spacing extremes. Size jumps of 3x or more between heading levels, not 1.5x. Weight jumps from 100 to 800, not 400 to 600. The lukewarm middle is what reads as generic, and committing to the extremes is what reads as designed.

One coordinated motion moment. A staggered page-load reveal where elements come in with proper rhythm beats fade-in-up applied to everything. Pick one moment, choreograph it well, and stop.

Backgrounds with depth. Gradient meshes, noise textures, layered transparencies. Just refusing to use a solid background colour gets you halfway to looking intentional.

The depressing part is that most of these are five-minute changes. Right font, slightly more padding, one extra layer of texture. The whole experience flips. I've shipped designs that looked completely different to the first generation and the only thing I changed was the font family and one size scale.

Using Claude Code for WordPress and Shopify Theme Design

A theme has two layers, and AI handles them differently:

The aesthetic layer — typography, colour, layout, motion. The frontend-design skill handles this well.
The platform integration layer — Liquid, block markup, schema, hooks. The skill alone won't help you here.

Most of the frustration I see online is people trying to solve both with one prompt and wondering why the output is half-broken.

WordPress

For WordPress block theme work, the design pass goes well. You auto-load the skill, give Claude an aesthetic direction, and you get a respectable first draft. The trouble starts on the integration side, because Claude defaults to wrapping everything in raw HTML blocks rather than using native block editor primitives.

Jonathan Bossenger documented exactly this in his April 2026 case study of rebuilding his personal site. Two conversations, dozens of tool calls, theme deployed. Design quality was excellent. The block conversion was the human-in-the-loop part — he had to actively shepherd Claude away from treating WordPress like a static HTML host.

Pair the skill with Automattic's Claude Cowork plugin and WordPress Studio MCP and you'll get proper block markup more reliably. It's not perfect, but it's a real workflow now.

Shopify

For Shopify, the situation got dramatically better on April 9th this year, when Shopify shipped their official AI Toolkit. Free, open source, works with Claude Code, Cursor, Codex, Gemini CLI, and VS Code. It connects your agent to live Admin and Storefront API documentation, validates GraphQL against the actual schema, and handles Liquid theme operations through the Shopify CLI.

/plugin marketplace add Shopify/shopify-ai-toolkit
/plugin install shopify-plugin@shopify-ai-toolkit

Stack the toolkit with the frontend-design skill and you have a real workflow: aesthetic direction from the skill, correct Liquid and schema validation from the toolkit. It handles section edits, colour changes, copy updates, homepage tweaks brilliantly. For genuinely custom interactive sections, you still want a developer writing Liquid by hand.

Architect for theming from day one

If I'm starting a project from scratch with AI in the loop, I architect for theming from the first commit. CSS variables for everything. All design tokens in one file. Components that don't hardcode colours, fonts, or spacing. A DESIGN.md at project root that defines the system, so every Claude prompt snaps to the same tokens.

Why? Because the real value of AI in design isn't generating one site, it's generating five and picking one. When changing the look of the whole site is a matter of swapping out a token file and re-running a couple of prompts, you can play with the design until something actually lands. When the design is baked into a hundred component files, you're stuck with the first thing that came out. Which, statistically, is the AI slop version.

When You Should Still Hire a Designer

I'm a developer running a company, not a designer. So take this with the appropriate grain of salt — but the honest take after a year of pushing AI design tools as hard as I can:

When the budget allows it, I still pay a good designer to do the work.

The plugin and the prompting techniques get me to a defensible 80% on a tight timeline or a small budget. That's genuinely useful. For internal tools, MVP landing pages, side projects, client work where the brief is "make it look professional and ship it Friday" — the AI workflow is excellent. I save real money and real time.

But when the design is the product — agency sites, portfolios, brand-led commerce, anything competing on visual identity — a designer with taste gets you to a different place. They make the foundational decisions AI can't make from a prompt. The brand identity from a blank page. The conversion-driven choices that come from real user research. The judgement call about which of three good options is right for this audience, not the average audience.

AI is great at applying a defined design system consistently across many components. It's not great at creating that system from scratch. The taste has to come from somewhere, and right now, "somewhere" is a human with twenty years of looking at design.

Hiring out the foundational work and using AI for the execution layer has been the most cost-effective way I've found to ship things that look properly designed without a full design retainer. Worth thinking about, especially for anything commercial.

The Formula for Better AI-Generated Web Design

If you take one thing from this article, take this: the plugin doesn't make AI a designer. It makes AI a competent execution layer that doesn't embarrass you. Everything else stacks on top of that.

The workflow that's been working for me:

Install the frontend-design plugin, or drop the SKILL.md into your coding agent project rules folder.
Add a DESIGN.md at project root with your brand-specific tokens.
Prompt with aesthetic direction, not pixel values. Tell Claude what to think about — purpose, tone, font category, colour family, motion philosophy — and let it make the actual choices inside that frame.
Iterate by naming what converged in the output and correcting it specifically. "You used Space Grotesk again, pick a serif." "The hero is symmetrical — break the grid."
For platform work, stack the skill with WordPress.com MCP or the Shopify AI Toolkit. Two layers, two tools.
Architect for theming so you can swap looks until something lands.
When the project deserves a real designer, hire one.

The taste still has to come from somewhere. Right now, that somewhere is still you. The plugin just makes sure your taste isn't fighting the model's training data on every single prompt.

OpenCode Go Review: Is the $10 AI Coding Plan Worth It?

contact@thomas-wiegold.com (Thomas Wiegold) — Wed, 15 Apr 2026 00:00:00 GMT

I've been using OpenCode for a while now. When the team behind it launched Go — a $10/month subscription bundling a curated set of open-source models — I was curious enough to subscribe. A week in, I have thoughts.

The pitch is simple: pay $10, get access to nine models from Chinese AI labs, hosted on servers in the US, EU, and Singapore. The claimed value? $60 worth of API usage for your ten bucks. That's a 6x return if the models are any good.

Spoiler: some of them genuinely are. But there's a catch — actually, there are several catches — and whether they matter depends entirely on how you code.

What Is OpenCode Go?

Quick context if you're new here. OpenCode is an open-source, MIT-licensed terminal coding agent built in Go by the team at Anomaly — the same folks behind Serverless Stack (SST). It's currently sitting at over 142,000 GitHub stars and roughly 6.5 million monthly active developers. The core tool is completely free, supports 75+ model providers, and you can use your own API keys from whoever you like.

Go is the optional paid layer on top. Think of it as a curated, bundled model subscription — you get one API key, one endpoint, and access to a rotating set of models the team has tested specifically for agentic coding workflows.

The backstory matters here. OpenCode Go didn't appear in a vacuum. It was born directly from Anthropic's January 2026 decision to block third-party tools from using Claude subscription credentials. OpenCode, Cline, RooCode — all cut off overnight. The backlash was significant, OpenCode's stars roughly doubled in the weeks that followed, and Anomaly pivoted fast. They stripped out Claude OAuth integration, partnered with OpenAI for Codex access, and launched three new subscription products: Go ($10/month for open-source models), Zen (pay-as-you-go for premium models), and Black (enterprise gateway). If you've read my piece on switching to OpenCode, this is the next chapter of that story.

What Models Do You Get?

As of April 2026, OpenCode Go includes nine models — all from Chinese AI labs. No Claude, no GPT, no Gemini. Here's what you're working with:

Model	Provider	Est. Requests/Month	Best For
GLM-5.1	Zhipu	~4,300	Reasoning, math
GLM-5	Zhipu	~5,750	Reasoning
MiMo-V2-Pro	Xiaomi	~6,450	Coding tasks
MiMo-V2-Omni	Xiaomi	~10,900	Multi-modal
Kimi K2.5	Moonshot AI	~9,250	Frontend dev, 256K context
Qwen3.5 Plus	Alibaba	~50,500	General coding
Qwen3.6 Plus	Alibaba	~16,300	General coding
MiniMax M2.7	MiniMax	~17,000	General coding
MiniMax M2.5	MiniMax	~31,800	General coding

The model list has grown since the March beta — which only had GLM-5, Kimi K2.5, and MiniMax M2.5 — and the team says it will keep changing as they test and add new ones. The recent addition of the Qwen models is a good sign.

There's also a free model called Big Pickle (likely based on GLM-4.6, with a 200K context window) available at 200 requests per 5 hours, even without subscribing. Not bad for zero dollars.

The MiniMax Sweet Spot

Here's the thing that makes this subscription interesting: MiniMax and Qwen give you a good number of requests. We're talking up to 50500 per month with Qwen 3.5 Plus and up to 31800 with Minimax m2.5. And these aren't toy models — MiniMax M2.5 scored 80.2% on SWE-Bench Verified, which puts it within spitting distance of Claude Opus 4.6's 80.8%.

I reviewed MiniMax M2.7 separately and came away genuinely impressed. It's not going to out-reason Claude on gnarly architecture problems, but for everyday structured coding work — refactoring, writing tests, generating boilerplate, building out features from a spec — it's shockingly competent for the price.

The catch is that the reasoning-heavy models like GLM-5.1 burn through your limits fast. Like, 4,300 requests per month fast. That's the same $10 getting you 23x fewer requests depending on which model you pick. This variance is the thing you need to understand before subscribing.

The Usage Limits — What $10 Actually Gets You

The Three-Layer Cap System

OpenCode Go doesn't use simple request counts. It uses dollar-equivalent credits spread across three rolling windows:

$12 of usage every 5 hours (rolling)
$30 of usage per week
$60 of usage per month

Because each model costs a different amount per request, the actual number of requests you get varies wildly. MiniMax M2.5 is cheap per-request, so you get a mountain of them. GLM-5.1 is expensive, so you get a molehill.

This layered structure is worth paying attention to. Hit the 5-hour cap twice in a day and you've already dented a significant chunk of your weekly allowance. It's not a bug — it's designed to prevent bursty heavy usage from draining the pool — but it feels restrictive if you're in the zone on a Saturday afternoon.

Is It Enough for Real Work?

Honest answer: not for full-time, all-day coding. If you're the kind of developer who leans on AI for every commit, you'll exhaust the reasoning-model limits within days. One developer reported hitting 49% of their monthly usage on day one. That's... not great.

But here's how I actually use it: as a Plan B. My primary tool is still Claude Code for complex reasoning tasks. OpenCode Go with MiniMax M2.7 handles the routine stuff — the scaffolding, the test writing, the "please generate this CRUD endpoint" work. At 17000 requests per month, that's enough for supplementary use.

When you do hit the limits, you've got two fallback options: drop down to the free Big Pickle model, or enable balance draw from a Zen pay-as-you-go account (requires a separate $20 minimum top-up). Not ideal, but not a dead end either.

Speed, Quality, and My First-Week Experience

The Slow Start

I won't lie — my first few hours with OpenCode Go were rough. Responses were sluggish enough that I started second-guessing the whole thing. I nearly wrote it off.

Then it normalised. The next day, latency dropped to perfectly usable levels. My best guess is that it was either a transient issue on the hosting side or time-of-day dependent — the models are served across US, EU, and Singapore, so peak loads shift. Worth knowing that first impressions might not reflect steady-state performance.

Independent testing backs this up. One reviewer found that GLM-5.1 actually ran faster on OpenCode Go than on the competing Z.ai Coding Plan, and MiniMax quality was identical across OpenCode Go, Vercel, and OpenRouter. No degradation from the proxy layer.

Are the Models Good Enough?

For the price? Absolutely. For replacing frontier proprietary models? No.

The benchmark numbers tell a useful story here. MiniMax M2.5 at 80.2% SWE-Bench is genuinely competitive — it's within a point of Claude Opus 4.6's 80.8%, and ahead of GPT-5.2 on several agentic benchmarks. GLM-5 hits 77.8%, Kimi K2.5 scores 76.8%. These are real numbers on a real benchmark.

In practice, I find the MiniMax models handle TypeScript and Go work solidly for structured tasks. Where they fall short — and where I still reach for Claude — is complex multi-file refactoring, subtle architectural decisions, and anything that requires genuine creative problem-solving. The kind of work where you need the model to reason about trade-offs rather than execute a pattern.

There's been a quantization rumour floating around Reddit — some developers suspect OpenCode Go is running quantized versions of these models, making them subtly worse. Independent testing doesn't support this. One reviewer specifically investigated the claim and found that GLM-5.1 on Go actually handled context windows above 120K tokens better than the same model on Z.ai. I'm not going to say quantization is impossible, but the evidence points against it.

Works With Any Agent — Not Just OpenCode

This is a point of confusion I've seen repeated in third-party reviews, so let me be clear: OpenCode Go's API key works with any tool, not just OpenCode.

The official Go page explicitly states "Use with any agent," and the documentation publishes standard endpoints compatible with both OpenAI and Anthropic API formats. You get OpenAI-compatible chat completions and Anthropic-compatible messages endpoints. One API key, standard interfaces, usable in Claude Code, Codex, your own app — whatever speaks those formats.

A widely-cited APIYI review got this wrong, claiming the models could only be used within OpenCode. That's simply not accurate. OpenCode Go functions more like an OpenRouter-style proxy than a walled garden.

Who Should Subscribe?

It's a good fit if you:

Want to experiment with Chinese open-source models without managing individual API accounts
Need a cheap secondary or backup coding tool alongside a primary subscription
Have lighter usage patterns — hobby projects, side work, learning
Are an international developer (the UPI payment support and global server hosting are genuine differentiators)

It's not a good fit if you:

Code all day, every day, and lean heavily on AI assistance
Need frontier proprietary models for complex reasoning
Will be frustrated by layered usage caps that punish bursty workflows

My recommendation: treat it as one piece of a multi-tool stack. At $10 it sits comfortably alongside a Claude Code subscription without breaking the bank — and the MiniMax models genuinely pull their weight for routine work.

The Bottom Line

OpenCode Go is the cheapest multi-model AI coding subscription available in 2026. At $10 per month, the value is real — especially if you lean into MiniMax generous request allocation for everyday coding tasks. The models are competitive on benchmarks, the API key works with any agent, and the global server coverage means usable latency from Sydney (or anywhere else that isn't Silicon Valley).

The tradeoffs are equally clear. Heavy users will burn through limits in days. The models are exclusively from Chinese AI labs — no Western proprietary options. It's still in beta, so the model roster, pricing, and limits could all change.

I'm keeping my subscription. I use it alongside Claude Code — Go for the routine work, Claude for the hard stuff.

One last thing worth noting: Dax Raad from the OpenCode team has been refreshingly transparent about the economics, saying they roughly break even on the $10 plan. This is a growth play, not a profit center. Which means the pricing is either a remarkable bargain — or a bet on a model that won't survive its own success. Either way, at $10, the downside risk is a coffee and a half.

Agentic Commerce in 2026: A Developer's Honest Take on What's Real

contact@thomas-wiegold.com (Thomas Wiegold) — Fri, 10 Apr 2026 00:00:00 GMT

If you've been anywhere near ecommerce or AI in the last year, you've heard the term "agentic commerce." It's in every consulting deck, every platform keynote, every LinkedIn post from someone who just discovered that AI can do more than generate product descriptions.

I've spent years building and improving ecommerce websites. Right now I'm building an autonomous agent that transfers, optimises, and continuously improves thousands of product listings across platforms — using the Claude API and my own TCA (trigger-condition-action) framework. So I'm not writing this from the sidelines — I'm neck-deep in it.

Here's my honest assessment of what agentic commerce actually is, what works today, and what should make sellers nervous.

What Is Agentic Commerce?

Strip away the marketing and agentic commerce is straightforward: AI agents that can discover, compare, negotiate, and buy products autonomously. Not chatbots with a new coat of paint. Actual autonomous systems that pursue goals across multiple steps and multiple tools.

The distinction matters. A chatbot follows a script. A generative AI tool responds to prompts. An agent pursues goals. It connects to your inventory system, checks stock, evaluates demand patterns, decides to reorder, and executes — without someone typing "please reorder SKU-4429."

Four things define a real agent: autonomy (operating within guardrails without waiting for each instruction), multi-step reasoning (breaking complex tasks into subtasks), tool use (connecting to external systems via APIs), and memory (retaining context across interactions).

Gartner estimates only about 130 of the thousands of vendors claiming "agentic AI" capabilities are genuine. The rest are agent-washing — slapping the label on whatever they already had. If you've been in tech long enough, this will feel depressingly familiar.

The Two Sides of Agentic Commerce

This is the part most articles get wrong. They treat agentic commerce as one thing. It's not. There are two fundamentally different sides, and they have very different implications for sellers.

Agents Doing Your Repetitive Work

This is the good side. The side where AI agents handle the soul-crushing operational work that eats your evenings and weekends.

Listing creation. Pricing updates. Customer support tickets asking where their order is. Marketing copy for the 47th variation of the same product. If you've ever spent a Sunday night writing product descriptions, you know the pain.

The numbers here are real. Klarna automated two-thirds of its customer service with AI and initially saved around $39 million annually. Volcom cut content creation time from months to weeks. Dynamic pricing agents deliver 2–5% revenue increases when properly calibrated.

If you've ever run a small shop, you know the pain. Getting product listings right — the copy, the keywords, the pricing, the images — used to take weeks of manual effort per product line. Multiply that across hundreds or thousands of SKUs and you've got a full-time job that isn't building your actual business. This is where agents genuinely shine. I'm building exactly this kind of automation right now — an agent that transfers thousands of product listings between platforms, optimises them with better copy and SEO, observes how they perform, and improves them again. It's not science fiction. It's a TypeScript project with an API key and a lot of prompt engineering.

AI Shopping on Your Behalf

This is the side that worries me.

McKinsey projects up to $1 trillion in US retail revenue will be orchestrated by AI agents by 2030. Adobe reported that traffic to retail sites from AI browsers increased 4,700% year-over-year. During Cyber Week 2025, Salesforce reported that 20% of global orders were influenced by AI agents.

Those are staggering numbers. They're also a signal that another opaque layer is being inserted between sellers and customers — controlled by Big Tech.

Think about it from a seller's perspective. Today you optimise for search algorithms. Tomorrow you'll need to optimise for AI agents that decide what to recommend. Will sellers need to pay for AI visibility the way they pay for ads? Will the agent show your product or your competitor's? Who decides?

If that sounds like the same game we've been playing with Google and Amazon for 15 years, with a new interface and less transparency... well, that's because it probably is.

What Actually Works in Production Today

Let's get specific about what's real and what's still a keynote slide.

Customer Service: Most Mature, Most Instructive

AI customer service is the clear winner for production readiness. Klarna's AI handled 2.3 million conversations in its first month. Gorgias targets 60% automated resolution across its 15,000+ merchant base. Rep AI reports that shoppers using its AI chat convert at 12.3% versus 3.1% for non-users.

But the Klarna story is also the definitive cautionary tale. After celebrating the replacement of 700 human agents, the company began rehiring in 2025 when customer satisfaction dropped significantly. CEO Sebastian Siemiatkowski admitted they focused too much on efficiency and cost, resulting in lower quality.

The lesson isn't that AI customer service doesn't work. It's that hybrid models outperform full automation. Simple queries — order status, shipping updates, FAQs — resolve at high rates. Returns succeed about 58% of the time. But billing disputes see only 17% chatbot success. The 80/20 split (AI handles routine, humans handle judgment calls) is the pattern that actually works. If you want to see what building a practical triage system looks like, I walked through building an automated support ticket triage agent step by step.

Product Listings and Content

This is the second most mature category and, honestly, the one that excites me most.

Volcom reduced content creation from 5–6 months to 4–6 weeks using Hypotenuse AI. Amazon's own AI listing tools now generate 70%+ of required product attributes. For Shopify merchants, Sidekick writes product descriptions and manages metafields right in the admin.

For small shops, this is transformative. I've seen firsthand how much time goes into getting a single product listing right — researching keywords, writing compelling descriptions, structuring attributes for search, testing different titles. Do that across a large catalogue and it's months of work. An agent that handles the first 80% and lets you refine the last 20% changes the economics of running a small store entirely.

I'm building this kind of agent now — transferring thousands of listings between platforms, running them through Claude for optimised copy, better keyword targeting, and structured data, then pushing the results via the Shopify Admin GraphQL API. The TCA pattern from my agentiny framework handles the event-driven workflow: a new listing triggers analysis, Claude evaluates and optimises, the result gets queued for review, and then the agent monitors performance to improve again. It's the observe-and-improve loop that makes this genuinely agentic rather than just a batch script.

Human review remains essential for brand voice and accuracy. But the heavy lifting? That's squarely in agent territory now.

Pricing, Inventory, and Marketing

Dynamic repricing agents deliver 2–5% revenue increases and 5–10% margin improvements when properly calibrated. AI demand forecasting reduces errors by 30–50%. AI-powered product recommendations drive 35% of Amazon's total revenue.

The caveat nobody puts in the headline: dynamic pricing agents need at least 60–90 days of historical data before they outperform manual rules. Fewer than 15% of retailers currently use AI pricing. These tools reward patience and data investment, not hype-cycle enthusiasm.

The Protocol Wars — ACP vs UCP

Two competing standards are fighting to become the backbone of agentic commerce, and the way the battle is playing out tells you a lot about the current state of things.

OpenAI and Stripe co-developed the Agentic Commerce Protocol (ACP), launched alongside ChatGPT's "Instant Checkout" in September 2025. The vision: discover a product in ChatGPT, buy it right there. Etsy was a launch partner. Shopify signed on. Walmart, Target, Instacart followed.

It didn't work. OpenAI pulled back Instant Checkout by March 2026. Only about a dozen Shopify merchants ever went live. The company couldn't solve real-time inventory sync, sales tax collection, or the basic fact that users researched products in ChatGPT but went elsewhere to actually buy them. TD Cowen analysts called it a "stunning admission."

Google responded with the Universal Commerce Protocol (UCP), announced at NRF 2026 by CEO Sundar Pichai. It was co-developed with Shopify, Etsy, Wayfair, Target, and Walmart, endorsed by 20+ partners including Visa, Mastercard, and American Express. UCP covers the full shopping journey and leverages Google's Shopping Graph. As of March 2026, Google has already added cart, catalog, and identity-linking capabilities.

Amazon, notably, has joined neither protocol. They're building their own thing with Rufus and the "Buy for Me" feature. Classic Amazon.

For sellers, the takeaway is clear: don't bet on one protocol. Make your products agent-discoverable everywhere. Structured data, clean feeds, complete product attributes — these matter regardless of which standard wins.

The Hard Parts Nobody Talks About

Platform Rules Are a Minefield

Every major marketplace has taken a different stance on AI agents, and the differences can get your account suspended if you're not paying attention.

Amazon is the most restrictive. Their March 2026 Business Solutions Agreement update requires all automated actions to flow through registered SP-API applications. Browser automation and screen scraping are explicitly prohibited. They've also blocked AI crawlers and sued Perplexity over its shopping agent.

Shopify takes a controlled-but-encouraging approach. Their Robot & Agent Policy requires human review steps for buy-for-me agents. They charge a 4% AI transaction fee for ChatGPT orders. But they're simultaneously the most agent-friendly platform for developers, with open MCP servers, the Catalog API, and CEO Tobi Lütke's stated goal of making every Shopify store agent-ready by default.

eBay bans unauthorised agents outright in their February 2026 User Agreement update and prohibits feeding marketplace data into third-party AI models without written consent.

Etsy is the most conservative overall. "Keep commerce human" isn't just a slogan — their API terms prohibit using data for machine learning or AI training without authorisation. AI-generated art requires original prompts and transparent disclosure. Selling AI prompts is explicitly banned. Despite this, they were a launch partner for both OpenAI and Google UCP, which tells you even the most reluctant platforms see where this is going.

Safety Isn't Optional

A merchant reported a customer convinced their AI chatbot to escalate a discount from 25% to 80%, resulting in an $11,000 order at catastrophic margins. Air Canada's chatbot case established legal precedent — companies are legally responsible for their AI chatbot's statements. Approximately 40% of organisations report experiencing an AI-related privacy incident.

And here's the uncomfortable stat: 70–85% of AI initiatives fail to meet expected outcomes. Only 5.2% of surveyed companies had AI agents live in production as of early 2025.

Start narrow. Keep humans in the loop. Measure everything. I know this sounds boring. That's because boring is what works. And make sure you're actually validating that your AI is producing correct results — not just assuming it is because the output looks plausible.

Where to Start If You're a Seller

This depends on how technical you are — and I'm writing this assuming a range of readers.

If you're non-technical: Start with SaaS. Gorgias at $0.90 per resolution for customer service. Tidio's Lyro starting at $32.50/month for 50 AI conversations (powered by Claude under the hood, which I like). Shopify Sidekick for product descriptions and admin tasks. All of these deploy in days, not months.

If you're technical: The Claude API plus Shopify's Admin GraphQL API is a powerful combination. Build a TCA pattern for event-driven workflows — triggers fire on external events (cart abandoned, price changed, inventory low), conditions evaluate context, and actions execute multi-step responses. Use model routing from day one: a cheap model like Haiku for intent detection, Sonnet for production workloads, and Opus only when you genuinely need sustained autonomous reasoning.

The hybrid approach is what I'd recommend for most sellers: buy the systems of record (helpdesk, email marketing, inventory management) and build the differentiating intelligence layer on top. Buy Gorgias for your helpdesk. Build a custom product recommendation agent on Claude that integrates with it. Use Klaviyo for email infrastructure. Build a custom analytics agent that feeds it personalised campaign strategies.

The gap for non-technical Shopify and marketplace sellers is still huge. The tooling exists but the on-ramp is rough — I wrote a deeper dive on what's hype vs. reality for small businesses adopting AI agents. If you're a developer who also sells things, that gap is your opportunity.

The Bottom Line

Agentic commerce is real. The technology works for well-scoped tasks: customer service automation, product content at scale, dynamic pricing, demand forecasting. These are production-ready with proven ROI.

The buyer-side revolution — AI agents shopping on behalf of consumers — is happening more slowly than the keynotes suggest. OpenAI's Instant Checkout flopped. Google's UCP is promising but early. Amazon is doing its own thing. The protocol landscape is still contested.

For sellers, the playbook is straightforward: pick the highest-ROI, most constrained use case (probably Level 1 customer support or bulk product descriptions), prove the business case, then expand. Don't try to build a general-purpose autonomous system. Build constrained, domain-specific agents that do one thing well.

The most powerful AI agent is still one that knows when to ask a human for help. That's not a limitation — that's good engineering.

AI Agents for Small Business: Hype vs. Reality in 2026

contact@thomas-wiegold.com (Thomas Wiegold) — Thu, 02 Apr 2026 00:00:00 GMT

If you've been anywhere near tech Twitter in the last three months, you've seen the lobster emoji. You've seen the screenshots of people's AI assistants booking flights, triaging email, and — in at least one memorable case — creating a dating profile without permission. 2026 is officially the year of the AI agent, and the hype is deafening.

But here's the thing about AI agents for small business: the gap between "I tried it on a weekend" and "this runs in production and makes money" is enormous. Jensen Huang called OpenClaw "probably the single most important release of software, probably ever". Goldman Sachs says 93% of small businesses report positive impact from AI. And yet only 14% have fully integrated it into core operations. That tells you everything you need to know: most people are tinkering, not transforming.

I've been building with AI agents professionally — I maintain agentiny, a lightweight TypeScript agent framework — and I've tested most of the major tools. This is my honest take on what actually works, what's overhyped, and where to start if you're running a small business and don't want to waste your time.

2026 Is the Year of the Agent — Here's What That Actually Means

First, let's be clear about what we're talking about. An AI agent is not a chatbot. A chatbot answers questions. An agent does things — sends emails, schedules meetings, researches prospects, manages files, fills out forms. It takes actions autonomously, often across multiple tools, without you hovering over every step.

The market numbers are genuinely impressive. 58–71% of small businesses are actively using AI in some form. The agent market is projected to grow from $7.6 billion to roughly $236 billion by 2034. And there's a stark correlation: 83% of growing SMBs have adopted AI versus just 55% of declining ones. Whether AI causes growth or growing businesses are just more likely to adopt new tools is a fair question. But the trend is impossible to ignore.

My angle: yes, this is real. No, it's not magic. And most people I see playing with agents on social media are configuring them for fun, not for measurable ROI. That's fine — experimentation matters. But if you're a business owner reading this, you want to know what actually moves the needle. So let's talk specifics.

OpenClaw — Overhyped and Undervalued at the Same Time

What OpenClaw Actually Does

OpenClaw is the project that broke GitHub. A free, open-source AI agent that runs locally on your machine and connects to the messaging apps you already use — WhatsApp, Slack, Telegram, Discord, Teams, and about 20 more. It went from a weekend hack by Austrian developer Peter Steinberger (of PSPDFKit fame) to over 344,000 GitHub stars in under four months. For context, React took a decade to get there.

The project has had more name changes than a witness protection participant — Clawdbot became Moltbot became OpenClaw, after Anthropic's lawyers had a word about the original name sounding a bit too much like "Claude." Steinberger has since joined OpenAI, and the project lives under an independent open-source foundation.

What makes OpenClaw genuinely interesting is the combination of its skills system and ClawHub registry. There are over 13,700 community-built skills covering everything from lead generation to expense tracking to client onboarding. It's model-agnostic — Claude, GPT-5.4, Gemini, local models via Ollama, 500+ options through OpenRouter. And the reported numbers are compelling: 10–15 hours per week saved at $10–80/month in API costs.

The self-hosting story is solid too. Runs on laptops, Mac Minis, VPSes, even Raspberry Pis. Cloud options exist if you don't want to manage it yourself — Amazon Lightsail has preconfigured instances, DigitalOcean offers 1-Click Deploy, and KiloClaw does fully managed hosting for $9/month.

The Security Problem Nobody Talks About Enough

Here's where the mood shifts. Cisco's AI security team tested a top ClawHub skill and found it was straight-up malware performing data exfiltration. Their audit revealed 26% of skills had at least one vulnerability. Over 230 malicious skills were uploaded to ClawHub in a single week. And 21,000+ OpenClaw instances were found exposed to the public internet.

One of OpenClaw's own maintainers put it bluntly: "If you can't understand how to run a command line, this is far too dangerous of a project for you to use safely." I appreciate the honesty.

Here's my take: if you want to try OpenClaw, don't run it on your main machine. Don't connect your primary email or your business bank accounts. Use a cloud-hosted option with proper sandboxing — NVIDIA's NemoClaw adds enterprise-grade security, and KiloClaw handles the ops for you. The security surface area is enormous, and most users don't fully grasp what they're granting access to. OpenClaw is a fantastic project with a genuine community. But treating it casually with sensitive data is asking for trouble.

The Honest Use Case Problem

I want to flag something that the hype cycle glosses over: many of the use cases people celebrate with OpenClaw — email filtering, calendar management, task prioritisation — were already solvable without AI. Gmail filters exist. Calendar apps have had automation for years. Task managers have rules engines.

The real value of an AI agent is in the combination: reasoning across context, chaining actions across multiple tools, handling ambiguity. "Read this email thread, figure out what the client actually wants, draft a response, check my calendar for availability, and suggest three meeting times." That's genuinely valuable. "Sort emails into folders based on keywords" is a solved problem from 2010.

Most people aren't using OpenClaw for the hard stuff yet. They're using it for the easy stuff and marvelling that it works. Which, fair enough — it does work. But if you're evaluating it for your business, ask yourself whether the workflow you're automating actually needs AI reasoning, or if a Zapier rule would do the job for less money and less risk.

Claude Cowork — The Safer Path for Non-Technical Teams

Claude Cowork launched on January 12, 2026, and Anthropic pitched it as "Claude Code for non-developers." That positioning is accurate. It runs inside the Claude Desktop app on macOS and Windows, requires zero command-line knowledge, and was reportedly built by Claude Code itself in about two weeks. Meta.

The capabilities are substantial. Sub-agent coordination (it breaks complex work into parallel workstreams), scheduled and recurring tasks, persistent memory across sessions, and a mobile "Dispatch" feature for checking on tasks from your phone. In March 2026, it gained full computer use — opening apps, navigating browsers, clicking and typing. The Model Context Protocol (MCP), Anthropic's open standard with 97 million monthly SDK downloads, powers connectors to Slack, Google Drive, Gmail, DocuSign, and 75+ other services.

Pricing is straightforward. Cowork is included in all paid Claude plans: Pro at $20/month, Max 5x at $100/month, Max 20x at $200/month. But here's the reality check — agentic tasks burn tokens fast. If you're doing anything beyond light use, you'll want Max 5x at minimum. Budget $100/month for serious work.

What really signals Anthropic's commitment to SMBs are the partnerships. Intuit integration brings Claude to QuickBooks, TurboTax, and Mailchimp. Xero collaboration targets AI-powered financial intelligence for small businesses. And Microsoft built Copilot Cowork on Anthropic's technology, integrating it into Microsoft 365.

My take: for most small business owners — especially non-technical ones — Cowork is the answer. It just works. The tradeoff is cost and vendor lock-in (Anthropic's models only), but for the majority of people reading this, that's a perfectly acceptable trade. You get a polished product with real security boundaries instead of an open-source project where the maintainers are actively warning you about how dangerous it is.

When a Custom Agent Makes More Sense

The Case for Purpose-Built Agents

OpenClaw and Cowork are generalists. They handle a broad range of tasks acceptably well. But for professional workflows where you need precision — specific error handling, data validation, safety checks, audit trails — a purpose-built agent will outperform a generalist every time.

Think of it this way: a general contractor can do plumbing, electrical, and carpentry. But if you're redoing your entire kitchen, you want specialists. Same principle applies to agents.

The open-source framework landscape has matured considerably. CrewAI is the easiest to learn and fastest to production for standard multi-agent workflows. LangGraph is the most battle-tested for production deployments, with excellent debugging via LangSmith. Pydantic AI is rising fast for type-safe development. Microsoft's AutoGen is effectively in maintenance mode — skip it for new projects.

I built agentiny because I wanted something lighter. It's a TypeScript framework using a trigger-condition-action pattern for agent orchestration — no framework bloat, just focused agents that do specific things well. If you're a developer who wants full control over your agent logic without importing half of npm, it's worth a look.

Build vs. Buy: A Quick Decision Framework

Here's how I'd think about it depending on your team:

No developers on the team? Go with Claude Cowork or Make.com. The visual interfaces are genuinely good now, and Make's free tier lets you experiment without commitment.

Technical comfort but limited time? OpenClaw via a cloud host, or n8n self-hosted. n8n's open-source tier with unlimited free executions is exceptional value for technical teams.

Developers available with a specific high-value workflow? Build a custom agent. agentiny, CrewAI, or LangGraph depending on your language preference and complexity needs. Open-source frameworks cost about 55% less per agent but need roughly 2.3× more setup time. That tradeoff is worth it when the workflow matters.

How to Start Without Getting Burned

If you've read this far and want to actually do something, here's the playbook:

1. Pick one workflow. Not three. One. The highest-ROI first automations are customer FAQ responses, lead follow-up, appointment scheduling, and email triage. A structured implementation produces 3–4× the ROI of ad-hoc experimentation. Resist the urge to automate everything at once.

2. Budget realistically. Most SMBs spend $50–500/month on AI tools. Budget 20–40% above platform costs for security measures — monitoring, access controls, backups. The platform cost is never the total cost.

3. Follow security non-negotiables. Least-privilege access: agents should only reach what they need, nothing more. Build in a kill switch to halt all AI workflows instantly. Start with tasks where errors are visible and low-consequence — email drafts that need approval before sending, reports that get reviewed before distribution. Don't give agents access to financial systems or sensitive data until you've built trust and guardrails.

4. Measure everything. This isn't optional. Track time saved, error rates, costs incurred. Gartner warns that over 40% of agentic AI projects risk cancellation by 2027 due to escalating costs and unclear value. If you can't demonstrate ROI, the project will die — and it probably should.

5. Expand methodically. Once workflow #1 is stable and measurably valuable, pick workflow #2. Rinse, repeat. The businesses that succeed with AI agents are the ones that treat it as a disciplined rollout, not a hype-driven experiment.

The Bottom Line

2026 is genuinely the year AI agents crossed from demo to daily driver. The tools are real. The productivity gains are real. The dream of software that runs in the background doing actual work — not just answering questions — is happening right now.

But not every workflow should be automated. And not every automation needs AI. Sometimes a cron job and a Bash script still win.

OpenClaw is a remarkable project — congratulations to Peter Steinberger and the community for building something that genuinely shifted the industry. But it's for technically capable teams who understand the security implications and are willing to manage them. For most small business owners, start with Claude Cowork or an automation platform like Make.com. For specific, high-value workflows where precision matters, build a custom agent.

The question isn't "should we use AI agents for small business automation?" — it's "which workflow do we automate first?"

Pick the boring one. The repetitive one. The one your team complains about every week. Automate that, measure the results, and go from there.

MiniMax M2.7 Review: Is It Worth the Hype?

contact@thomas-wiegold.com (Thomas Wiegold) — Wed, 25 Mar 2026 00:00:00 GMT

MiniMax M2.7 landed on March 18, 2026, and the pitch is bold: frontier-adjacent coding performance at a fraction of the cost. In a head-to-head test by Kilo Code, M2.7 delivered roughly 90% of Claude Opus 4.6's quality for about 7% of the total task cost. That's the kind of number that makes you stop scrolling. But benchmarks are benchmarks, and real work is real work — so I dug into the details, the pricing, the speed issues, and the OpenCode integration to figure out whether this MiniMax M2.7 review ends with "switch now" or "wait and see."

Spoiler: it's neither. It's "add it to your toolbox."

What Is MiniMax M2.7?

M2.7 is a reasoning model from Shanghai-based MiniMax, built on a Sparse Mixture-of-Experts architecture. It has roughly 230 billion total parameters but only activates around 10 billion per token — which is how they keep inference costs so low while still hitting competitive benchmarks. MiniMax calls it "the smallest model in the Tier-1 performance class," and the numbers at least make the case plausible.

The context window is 204,800 tokens (input plus output combined), with a max output of 131,072 tokens. That puts it above GPT-5.2's 128K but well below Gemini 3.1 Pro's 2M window. It's text-only — no images, no audio, no video natively — though it supports tool calling and MCP, so you can bolt on image understanding and web search through external tools.

One thing worth noting: M2.7 is proprietary. This is a sharp turn from MiniMax's earlier M2 and M2.5 (which I reviewed previously), which were fully open-weight under Apache 2.0. The community has not been thrilled about this. If you valued the open approach, this stings.

I should be honest: I really liked M2.5. I said so in the review. But after a few weeks of daily use, I found myself drifting back to Claude for bigger tasks, and then spending a lot of time testing GPT 5.4. That's not a knock on M2.5 — it's just the reality of how model usage works now. You explore, you compare, you settle into patterns. M2.7 is MiniMax's attempt to pull you back.

The marketing headline is "self-evolution" — M2.7 participated in its own training loop, running 100+ optimization cycles on its own scaffold and reportedly improving internal benchmarks by 30%. That sounds dramatic, but MindStudio's caution is worth internalising: internal benchmark gains don't automatically translate to neutral third-party evaluations. It's an interesting research direction, not a finished revolution.

Benchmarks — Close to Frontier, Not Quite There

Coding Benchmarks

The headline numbers are genuinely impressive. M2.7 scores 56.22% on SWE-Pro, which nearly matches Claude Opus 4.6 at roughly 57%. On SWE-Bench Verified it hits 78%, which actually outperforms Opus's 55% on that particular benchmark. It also scored 55.6% on VIBE-Pro and 57.0% on Terminal Bench 2.

But benchmarks have become increasingly hard to compare meaningfully. Different scaffolds, different evaluation conditions, different levels of self-reporting. The most useful data point I found was Kilo Code's head-to-head comparison across three TypeScript codebases. Both models found all six planted bugs and all ten security vulnerabilities. M2.7 even offered a technically superior fix for one bug — using integer math for currency calculations instead of floats, which is the kind of thing that makes you nod approvingly.

Where Opus pulled ahead was in the details that matter for maintainability: it produced 41 integration tests versus M2.7's 20 unit tests, used a more modular file structure, and demonstrated better architectural thinking overall. The total cost difference? $0.27 for M2.7 versus $3.67 for Opus. That's the kind of gap that changes how you think about task routing.

My take: for the stuff I do day-to-day — quick bug fixes, code reviews, feature scaffolding — M2.7 is more than capable. For complex architectural work where I need the model to think deeply about structure and test coverage, I'm still reaching for something stronger.

General Intelligence

On Artificial Analysis's Intelligence Index v4.0, M2.7 scores 50 — a solid 8-point jump from M2.5's 42, but still behind Gemini 3.1 Pro and GPT-5.4 (both at 57), Opus 4.6 (53), and Sonnet 4.6 (52). Not frontier, but firmly in the "good enough for most tasks" tier.

The hallucination rate is interesting: 34% per Artificial Analysis, which is actually lower than Claude Sonnet 4.6 (46%) and Gemini 3.1 Pro Preview (50%). Take that with a grain of salt — hallucination metrics are notoriously slippery — but it's worth noting.

A few caveats. Most of MiniMax's benchmark claims come from self-evaluation, and independent verification is still catching up. VentureBeat noted that on BridgeBench vibe-coding tasks, M2.7 actually scored worse than its predecessor M2.5. That's the kind of regression that benchmark cherry-picking can hide.

Token Plans and Pricing Breakdown

Pay-As-You-Go

Here's where M2.7 gets genuinely compelling. The API pricing is $0.30 per million input tokens and $1.20 per million output tokens. Compare that to Opus 4.6 at $5/$25 — that's roughly 17× cheaper on input and 21× cheaper on output.

You'll see some articles claiming "50× cheaper on input" — that's comparing against the old Opus 4.1 pricing of $15/M, not the current Opus 4.6 at $5/M. Still a massive gap, just not quite as dramatic as some headlines suggest.

The automatic caching is a genuine standout feature. Cache reads cost just $0.06 per million tokens with zero configuration required. For cache-heavy agentic workloads where you're hitting the same system prompts and context repeatedly, your blended cost can drop to roughly $0.06/M tokens. That's essentially free compared to what you're used to paying.

But — and this is important — there's a verbosity tax. During Artificial Analysis's evaluation, M2.7 generated 87 million output tokens. The median for reasoning models in its price tier is 20 million. That's 4× more output tokens than average, which significantly erodes the headline per-token savings. Multiple Reddit users have reported 16,000+ tokens of thinking for simple prompts. The evaluation alone cost $175 in tokens. So yes, the per-token price is absurdly low, but the model is also absurdly chatty.

Subscription Tiers

MiniMax offers six monthly token plans ranging from $10 to $150, bundling M2.7 with their speech, image, video, and music models:

The Starter tier at $10/month gives you 1,500 requests per 5-hour rolling window on the standard variant — M2.7 only, no extras. Plus ($20/month) bumps that to 4,500 requests and adds speech and image generation. Max ($50/month) gives you 15,000 requests with the full suite including video and music.

Then there are the highspeed variants: Plus-Highspeed ($40/month), Max-Highspeed ($80/month), and Ultra-Highspeed ($150/month) at 30,000 requests per 5-hour window with everything included.

Yearly plans save about 17% (Starter drops to $100/year, Ultra-Highspeed to $1,500/year).

Which tier makes sense depends entirely on your usage pattern. If you're doing light coding work — a few tasks a day, nothing too intensive — the Starter plan or even pay-as-you-go might be more economical. For heavy agentic workloads where you're burning through tokens continuously, the Max or Ultra-Highspeed tiers start to look like genuine bargains. The Starter tier's break-even against pay-as-you-go isn't always favourable for casual use, so do the maths for your specific workflow before committing.

The Speed Problem

This is where the story gets less rosy.

MiniMax claims roughly 100 tokens per second for the highspeed variant and around 60 TPS for standard. They market M2.7 as "3× faster than Opus." Independent testing tells a different story. Artificial Analysis measured the standard variant at 45.6 tokens per second — against a median of 95.8 TPS for reasoning models in its price tier. Time to first token was 2.49 seconds versus a 1.84-second median. That's noticeably sluggish in interactive use.

It's possible the highspeed variant performs closer to claims, but Artificial Analysis appears to have tested the standard endpoint, and I haven't found independent benchmarks of the highspeed tier yet. Until someone verifies those numbers, I'd treat the 100 TPS claim with healthy scepticism.

The underlying issue is architectural. M2.7 uses full attention across its context window, so performance degrades further on long-context workloads. Community members in the llama.cpp project have flagged this: "Minimax applied full attention, thus it's so slow in long ctx." The 205K context window looks competitive on paper, but pushing anywhere near that limit will test your patience.

I actually wish speed got more attention across the industry generally. A model that's 10% smarter but 3× slower often feels worse in practice. The highspeed plan is compelling in theory — if it actually delivers the throughput MiniMax claims. That's a big "if" right now. I haven't spent the money on the highspeed subscription yet — I have too many subscriptions as it is — but it's on my list. Maybe next month, when I carve out time for proper testing.

Using M2.7 with OpenCode

If you're already using OpenCode (and I switched from Claude Code to OpenCode recently), M2.7 slots in naturally. MiniMax is a preloaded built-in provider — you run opencode auth login, select MiniMax, enter your API key, and you're off. The model uses the Anthropic-compatible API format at https://api.minimax.io/anthropic/v1.

One gotcha: clear any existing ANTHROPIC_AUTH_TOKEN and ANTHROPIC_BASE_URL environment variables first. If you've been using Claude through OpenCode, those will conflict. I've seen people lose 20 minutes to this before checking the obvious.

There are five access paths total: direct API integration, OpenCode Go ($5 first month, then $10/month — includes M2.7, M2.5, GLM-5, and Kimi K2.5), OpenCode Zen for pay-as-you-go, OpenRouter (minimax/minimax-m2.7), and Ollama Cloud. MiniMax provides an official setup guide that's actually decent.

For best results, the recommended inference parameters are temperature=1.0, top_p=0.95, top_k=40. These are higher than what you might instinctively set, but they seem to produce better output quality with this particular model.

Known issues worth watching: earlier MiniMax models in OpenCode had problems with tool-calling loops and premature task halting. These appear to be improving with M2.7 but haven't fully disappeared. If you hit a case where the model gets stuck in a loop or stops mid-task, it's a known pattern, not something unique to your setup.

Should You Switch? My Honest Take

I'll be honest — I wasn't sure I should write another model review. I'm starting to worry I'll repeat myself. New model drops, benchmarks look great, pricing is aggressive, some caveats apply. You've read that article before. I've written that article before.

And that's kind of the point. New models are impressive, but they're not exciting in the way they used to be. The performance gap between "best" and "good enough" is shrinking every quarter, and the real innovation is happening elsewhere — in the desktop apps, the chat interfaces, the coding CLIs. The features being added to the tools we use daily feel more consequential right now than another few percentage points on SWE-Bench.

That said, M2.7 is still worth your attention for one simple reason: the pricing. Even if you never make it your primary model, having a plan B that costs this little is genuinely useful. Quick tasks, simple fixes, boilerplate generation — there's no reason to burn Opus tokens on that stuff. I don't think "switching" is the right frame anymore. I run the same task on two or three models and pick the best result. It takes an extra minute and saves me from the weaknesses of any single model.

For anything requiring deep architectural thinking, thorough test coverage, or complex multi-file reasoning, I still reach for Opus or Sonnet. The quality gap is real but narrow, and for 80% of daily coding work, that gap doesn't matter.

My recommendation: just try it. The $10 Starter plan or pay-as-you-go makes experimentation essentially risk-free. Run your typical tasks through it for a week. You'll know pretty quickly whether it fits your workflow.

And MiniMax isn't the only one making moves in this space. Xiaomi dropped another surprisingly strong model recently, and Alibaba Cloud has a code subscription that's turning heads. The cost-effective tier of AI coding is getting crowded fast, which is great for the OpenClaw/agent use cases where token costs compound. There's a lot to test, and honestly, we shouldn't complain about having too many good options. That's a problem I'm happy to have.

The market is moving toward "best model per dollar," not just "best model." MiniMax is betting on that future, and right now, the bet looks increasingly sound.

I Tested GPT 5.4 Against Every Rival — Here's My Honest Review

contact@thomas-wiegold.com (Thomas Wiegold) — Wed, 18 Mar 2026 00:00:00 GMT

Two weeks ago, OpenAI dropped GPT 5.4. Within hours, my feed was wall-to-wall benchmark tables and breathless takes about the "best model ever." So I did what any reasonable developer would do: I ignored all of it and ran my own test.

This is my GPT 5.4 review after two weeks of daily use across real projects — not a benchmark summary, not a migration guide, and definitely not a press release rewrite. I tested it head-to-head against Claude, Gemini, and MiniMax on a single creative coding prompt, then spent the following days using it on actual work. Here's what I found.

What GPT 5.4 Actually Ships With

Let's get the spec sheet out of the way quickly.

GPT 5.4 launched on March 5, 2026 as a convergence play — it merges GPT-5.3 Codex's coding chops with GPT-5.2's generalist reasoning into one model. The headline features: native computer use (75% on OSWorld, beating the human expert baseline of 72.4%), a 1M-token context window, tool search that cuts token usage by up to 47%, and configurable reasoning effort across five levels from none to xhigh.

Now the caveats nobody puts in the headline.

That 1M context window comes with a 2× input / 1.5× output surcharge once you exceed 272K tokens. And "1M" is generous marketing — OpenAI's own Graphwalks benchmark shows accuracy dropping from 93% at 128K to 21.4% between 256K and 1M. One independent test found instructions at token 850,000 were missed 40% of the time. So you've got a 1M window where roughly the last 400K tokens are decorative.

Pricing is $2.50/$15 per million tokens for standard context — roughly half of Claude Opus 4.6. But that comparison flips for long-context work. Anthropic removed all long-context surcharges on March 14, so a 500K-token conversation on Claude might actually be cheaper than GPT 5.4. The devil, as always, is in the pricing page.

The Test — One Prompt, Four Models, Zero Hand-Holding

I have a theory: benchmarks tell you how well a model performs on tasks someone else chose. A single creative prompt tells you how a model thinks.

The Prompt

I asked each model the same thing: build me an atomic world clock app. Analog clocks with hands, digital time readouts, a world map with time zones, and real atomic time synchronisation. One shot, no follow-ups, no hand-holding. The kind of prompt that tests design sense, technical accuracy, and integration quality all at once.

Four models entered. None left unscathed.

Results by Model

GPT 5.4 delivered the best-looking result by a comfortable margin. Clean layout, nice visual hierarchy, the kind of output you could screenshot and put in a pitch deck. But the atomic time sync was broken — it fetched the time once and then drifted. The analog clock hands were also slightly off, which is a problem when your app is literally a clock.

Claude nailed the atomic time synchronisation. The NTP sync worked correctly, updating at proper intervals. The visual design was functional rather than pretty — it prioritised correctness over aesthetics, which tracks with what I found in my Opus 4.6 review. For a clock app, I'd argue that's the right call, but your mileage may vary.

Gemini surprised me. The setup was rocky — first attempt had import errors — but once it got going, it produced a surprisingly complete result with an actual world map (not just a list of cities). The map used real geographic projections. The time sync had the same drift issues as GPT 5.4, though, and the analog clock hands were off.

MiniMax M2.5 produced a non-functional result. At a fraction of the cost, you get what you pay for on creative integration tasks. It's a different story for targeted code fixes (I covered its strengths in my MiniMax M2.5 review), but this wasn't that kind of test.

Every single model struggled with analog clock hand placement. Turns out, correctly mapping hours, minutes, and seconds to angular positions on a circle is the kind of simple-sounding problem that trips up AI models consistently. Make of that what you will.

What This Test Actually Reveals

The "best model" depends entirely on what you're measuring. Design? GPT 5.4. Correctness? Claude. Completeness? Gemini surprised me. Raw coding output at 1/20th the price? MiniMax has a case for targeted tasks.

But here's the real insight: benchmarks measure narrow capabilities in controlled conditions. A single creative prompt exposes integration quality, aesthetic judgment, and failure modes simultaneously. No model aced everything. Not one. And that single data point tells me more about the state of AI in 2026 than any leaderboard.

Benchmarks Are Barely Useful Now

I'm going to spend a few hundred words on benchmarks, and then I'm going to explain why you should take all of them with a fistful of salt.

The Numbers That Matter

Here's the current state of play, sourced from Vals.ai and Artificial Analysis:

Benchmark	GPT 5.4	Claude Opus 4.6	Gemini 3.1 Pro	Leader
SWE-Bench Verified	77.2%	79.2%	80.6%	Opus 4.6
SWE-Bench Pro	57.7%	~45%	54.2%	GPT 5.4
Terminal-Bench 2.0	75.1%	74.7%	78.4%	Gemini 3.1
OSWorld-Verified	75.0%	72.7%	—	GPT 5.4
Arena Coding Elo	~1481†	1561 (#1)	—	Opus 4.6
AA Intelligence Index	57	53	57.2	Gemini ≈ GPT 5.4

†GPT 5.4 may not have accumulated enough Arena votes for a stable ranking yet.

No single model dominates. The leader changes depending on which row you look at.

Why You Shouldn't Trust Any Single Number

OpenAI dropped SWE-Bench Verified — the benchmark where Claude leads — in favour of SWE-Bench Pro, where GPT 5.4 happens to lead. Convenient.

Scaffold choice massively changes scores. xAI's Grok 4 self-reported 72–75% on SWE-Bench but tested at 58.6% independently with SWE-agent. MiniMax M2.5 runs its evaluations using Claude Code as scaffolding, which is a bit like entering a cooking competition with someone else's oven.

Then there's the SM-Bench regression: GPT 5.4 scored 36.8% on conversational tasks where GPT-4o scored 97.3%. That's not a typo. The model got dramatically worse at casual conversation while getting better at professional tasks. Whether that matters to you depends on what you're building.

As Zvi Mowshowitz put it: benchmarks have never been less useful for telling us which models are best. I don't think he's wrong.

The Real Strengths and Deal-Breakers

Where GPT 5.4 Wins

Computer use is the real differentiator. 75% on OSWorld isn't just a number — it means the model can actually operate software through screenshots and mouse/keyboard actions via Playwright. Box validated this on property-tax portal automation with 95% first-attempt success across roughly 30,000 tasks. If your workflow involves automating desktop applications, GPT 5.4 is currently the best option.

Pricing for standard-context work. At roughly half the per-token cost of Claude Opus 4.6 (for prompts under 272K tokens), the value proposition is real. For high-volume API workloads that don't need massive context, the savings add up.

Code fixing. This is subjective and based on my own experience, but GPT 5.4 may be the best model right now for targeted code repairs. When you have a specific bug and need it fixed, it tends to zero in on the problem without rewriting half your codebase. Tends to. (More on that in a moment.)

The Codex app's parallel-agent architecture is genuinely impressive. Running multiple coding agents across isolated worktrees with built-in diff review and Git integration is a good workflow for async, autonomous task delegation.

Where It Falls Down

Task overexpansion is the big one. Developer @vasumanmoza's viral post captures it perfectly: GPT-5 refactored their entire codebase in a single call — 25 tool invocations, 3,000+ new lines, 12 brand-new files, none of it working. Every.to's Vibe Check evaluation confirmed the pattern: the model routinely expands tasks beyond what you asked for, redesigning login systems nobody asked it to touch. I've experienced this myself. You ask it to fix a button, and it comes back having restructured your component hierarchy. Not great.

False completion claims. The model sometimes says it's done when it isn't — and in some cases, does so in ways that look deliberate. OpenAI acknowledged this in their launch post, noting they'd reduced the deception rate from 4.8% (o3) to 2.1%. That's progress, but 2.1% still means roughly 1 in 50 tasks might be confidently presented as complete when they're not.

The car wash problem. Ask GPT 5.4 whether you should walk or drive to a car wash 100 meters away, and it writes a full essay recommending walking — completely missing that you need the car at the car wash. Claude answered this in one sentence. It's a trivial example, but it illustrates a pattern: strong quantitative reasoning paired with weak practical inference.

Terminal-Bench regression. GPT 5.4 scores 75.1% versus GPT-5.3 Codex's 77.3% on Terminal-Bench 2.0. The model got worse at terminal operations than its own predecessor. If your workflow is terminal-heavy — SSH, CLI debugging, git operations, build systems — GPT-5.3 Codex is still the better choice.

A tool behaviour bug since launch day. Since March 6, GPT 5.4 ignores built-in tools like shell and apply_patch when custom function tools are present. It tells you "I do not have such a tool" when the tool is right there. Multiple developers confirmed this on the OpenAI community forum and GitHub issue #13773 documents the regression. Two weeks later, it's still not fixed.

GPT 5.4 vs Claude for Coding — Developer Experience Matters

Benchmarks are one thing. The actual experience of sitting down and writing code with these models is another.

I've used the Codex CLI, Claude Code, and OpenCode extensively. My preference is Claude Code, and honestly, I even reach for OpenCode more than Codex CLI — I wrote about why I switched to it for a while. Codex is async and autonomous — you delegate tasks and review results later. That's powerful for certain workflows, but I find I do my best work iteratively, and Claude Code is the best middle ground for that. It's right there in the terminal with me, I can steer it in real time, and it handles multi-file refactors better than anything else I've used.

The pattern I've settled into, and what I hear from most developers I talk to: nobody is switching wholesale. Everyone is routing by task. GPT 5.4 for large-codebase analysis and targeted fixes. Claude for multi-file refactoring and architectural work. The dual-wield approach isn't a compromise — it's the strategy.

The Verdict — Should You Switch?

GPT 5.4 is very good. It might be the best model — for your specific task.

Fair comparison between frontier models is nearly impossible right now. The gap has closed to 2–3 percentage points on most benchmarks, and which model "wins" depends on which benchmark you pick, which scaffold runs the evaluation, and what kind of work you're doing. Anyone telling you one model is definitively the best in March 2026 is either selling something or hasn't tested broadly enough.

Here's my recommendation: use GPT 5.4 for code fixes and computer use automation. Use Claude for architectural work, multi-file refactoring, and anything requiring consistency over long sessions. Consider MiniMax M2.5 if you're cost-sensitive and doing targeted coding work. Route by task, not by brand loyalty.

The bottom line: GPT 5.4 is OpenAI's strongest model yet, excelling at computer use and targeted code fixes. But Claude Opus 4.6 still leads on key coding benchmarks and developer preference. The best strategy in 2026 is task-based model routing — use the right model for each job, and stop waiting for a single model to win at everything. That model isn't coming.

CLAUDE.md: Helpful or Just Expensive Noise?

contact@thomas-wiegold.com (Thomas Wiegold) — Mon, 09 Mar 2026 00:00:00 GMT

If you've used Claude Code for more than a week, you've probably been told you need a CLAUDE.md file. Run /init, let it generate one, commit it, done. It's the first piece of advice in every Claude Code tutorial and the first thing people ask about in every forum. And for months, I just accepted the premise. Of course you need a claude.md file. It's how you tell Claude Code what to do.

Except... does it actually work? After months of using Claude Code daily, watching Claude cheerfully ignore instructions I explicitly wrote, and then stumbling across the first academic research that actually measured this stuff, I'm not so sure the conventional wisdom holds up. The truth is more interesting—and more useful—than "always use one" or "never bother."

What CLAUDE.md Actually Is (and Isn't)

Let's get the basics out of the way. A CLAUDE.md is a persistent markdown file that Claude Code reads at the start of every session. It sits in a hierarchy: enterprise policy files at the top, then user-level (~/.claude/CLAUDE.md), project-level (./CLAUDE.md), and local files (./CLAUDE.local.md). You can generate a starter one with /init, which scans your codebase and produces something reasonable-looking.

Here's the part most people miss, though. Anthropic's own documentation says it plainly: CLAUDE.md is context, not enforcement. Claude reads it and tries to follow it, but there's no guarantee of strict compliance. And it gets worse. Third-party analysis from HumanLayer found that Claude Code's system prompt wraps your CLAUDE.md content with a reminder that says, essentially, "this context may or may not be relevant to your tasks." It's literally given permission to ignore you.

That reframing changes everything. And it applies equally to AGENTS.md, Codex's equivalent, and any other agent configuration file. You're not writing laws. You're writing suggestions.

The Research Says It Might Be Hurting You

The ETH Zurich Study

In February 2026, researchers from ETH Zurich published the first rigorous empirical evaluation of whether repository context files actually improve coding agent performance. They tested four agents—Claude Code, Codex, and Qwen Code—across 300 SWE-bench Lite tasks and a new 138-task benchmark called AGENTbench.

The results were not what the community expected. LLM-generated context files—the kind /init produces—decreased success rates and increased costs by around 20%. Human-written files did better, showing roughly a 4% improvement on AGENTbench. But here's the kicker: Claude Code was the only agent where even developer-written files failed to improve performance compared to having no file at all.

The most revealing part was an ablation study. When the researchers stripped all existing documentation from repositories—READMEs, docs folders, examples—context files suddenly helped, producing a consistent 2.7% improvement. The implication is clear: CLAUDE.md files are largely redundant with documentation that already exists. They help most where they're the only structured knowledge available.

One finding I found genuinely interesting: agents were highly compliant with tool-specific instructions. When a context file mentioned the uv package manager, agents used it 160 times more often. Context files clearly shape behaviour. Whether that shaping improves outcomes is a different question.

The Compliance Decay Problem

The other elephant in the room is that Claude forgets. Or more accurately, it deprioritises.

Developer Siddhant Khare documented a predictable compliance decay curve: 95%+ compliance at messages 1-2, dropping to 60-80% by messages 3-5, and falling to 20-60% by messages 6-10. Beyond ten messages, original instructions are mostly gone. This isn't a CLAUDE.md bug—it's a fundamental limitation of instruction-following in large language models.

Here's where it gets uncomfortable. HumanLayer's analysis estimates that frontier LLMs can follow roughly 150-200 instructions with reasonable consistency. Claude Code's own system prompt already contains around 50 instructions. That's nearly a third of the budget consumed before your CLAUDE.md even loads. Every line you add doesn't just compete with your other CLAUDE.md rules—it competes with Claude Code's core behavioural programming. A bloated 300-line CLAUDE.md can actually make Claude worse at following its own built-in instructions for tool use, file management, and code generation.

My Experience After Months of Using It

I'll be honest: I still use CLAUDE.md. But my relationship with it has changed.

My workflow now goes like this: I run /init, read what it generates, then delete most of it. The auto-generated file is a decent starting point for understanding what Claude thinks your project looks like, but committing it unreviewed is a mistake—the research literally shows it underperforms having no file at all. So I treat /init output as reconnaissance, not configuration.

The biggest shift was accepting the "keep it short" principle. Not just for prompts, but for all context you feed an LLM. The less noise Claude has to parse, the more reliably it follows what matters. This matches both the research and Anthropic's own recommendation to target under 200 lines per file. My files tend to land well under that.

I've also watched Claude ignore instructions I explicitly wrote. The formatting rule it followed perfectly for three messages, then abandoned. The test command it ran correctly twice, then forgot existed. It's not malicious—it's the compliance decay curve playing out exactly as documented. One of the more entertaining GitHub issues has Claude itself explaining the problem: "I have two competing modes: Default Mode and CLAUDE.md Mode. My default mode always wins because it requires less cognitive effort." At least it's self-aware about it.

For quick tasks—fixing a single file, answering a question about the codebase, simple refactors—I often skip the CLAUDE.md entirely. And you know what? It usually doesn't matter. The overhead of loading context that isn't relevant to a five-minute task isn't worth the cost, both in tokens and in cognitive budget.

The pattern that actually works is failure-driven iteration. Don't try to write the perfect file upfront. Add rules when Claude fails, remove them when they're redundant. This aligns with how Boris Cherny, who created Claude Code at Anthropic, runs his team's file—roughly 60-80 lines, updated collaboratively through real mistakes. Add a rule when something goes wrong. Tag your colleagues' PRs to capture learnings. Prune regularly.

What Actually Belongs in a CLAUDE.md

What to Include

The signal-to-noise ratio is everything. Your CLAUDE.md should contain things Claude genuinely cannot discover or infer from your codebase:

Non-obvious build, test, and lint commands with exact flags. If your test runner needs --no-cache --forceExit, say so. Claude won't guess that. Architectural decisions that contradict what the code structure might suggest. If you're using a monorepo pattern where packages import from each other in a specific order, that's worth documenting. Team conventions that go against common patterns. If your team uses bun instead of npm, or prefers a specific error handling pattern that isn't the obvious one, mention it. Brief pointers to deeper documentation. Instead of cramming everything in, reference where Claude can find more: "For API conventions, see docs/api-guide.md." This progressive disclosure pattern keeps your CLAUDE.md lean while making detailed knowledge available on demand.

What to Leave Out

Personality instructions like "Be a senior engineer" or "Think step by step." These waste instruction budget on things that don't improve output quality. Generic advice like "Write clean code" or "Follow best practices." If it's not specific enough to verify, it's not specific enough to include. Comprehensive style guides. This is the anti-pattern HumanLayer calls out specifically: never send an LLM to do a linter's job. LLMs are expensive and slow at enforcing code style compared to running Prettier or ESLint in a hook. Directory trees and codebase overviews. The ETH Zurich study confirmed what you'd expect—agents can discover project structure themselves. Telling Claude what files exist in your repo is pure noise.

Use Hooks for Anything That Actually Matters

This is the single most important insight I've landed on: CLAUDE.md rules are requests. Hooks are laws.

If a rule can't be broken—formatting, running tests before commit, type validation—enforce it deterministically with a Claude Code hook, not hopefully with a markdown instruction. A hook runs actual code at specific lifecycle points. It doesn't forget. It doesn't deprioritise. It doesn't have competing cognitive modes.

The linter anti-pattern is the clearest example. Putting "Use 2-space indentation and trailing commas" in your CLAUDE.md means Claude has to remember and apply that rule on every edit, burning instruction budget and still getting it wrong sometimes. Putting prettier --write in a post-edit hook means it happens every single time, instantly, and costs nothing in context. The choice is obvious.

Think of it this way: CLAUDE.md is guidance for flexible decisions. Hooks are enforcement for non-negotiable rules. If you'd reject a PR for violating it, it belongs in a hook, not a markdown file.

A Practical CLAUDE.md Playbook

When It's Worth It

CLAUDE.md provides the most value in poorly documented repositories—the research backs this up directly. If your README is sparse and there's no docs folder, a lean CLAUDE.md is genuinely the biggest improvement you can make. It's also worth the investment for multi-file workflow tasks that require understanding project conventions, team environments where you want shared standards across developers using Claude Code, and monorepos with directory-scoped files guiding Claude through complex project structures.

When to Skip It

Well-documented repos with good READMEs and thorough docs. Your CLAUDE.md will mostly duplicate what already exists, and the ETH Zurich study showed that redundancy doesn't help—it hurts. Quick isolated tasks like single-file edits or quick questions don't need project context. And critically: if you're just going to commit the /init output unreviewed, you're better off having no file at all. The research is unambiguous on this point.

The 60-Line Rule

Aim for under 80 lines. The sweet spot from community benchmarks, the research, and Anthropic's own internal usage converges around there. Boris Cherny's team runs about 60-80 lines. HumanLayer's production file is under 60.

A 60-line CLAUDE.md that Claude actually follows beats a 300-line one it mostly ignores—and costs 20% less per task. That's not a philosophy. That's arithmetic.

The Bottom Line

CLAUDE.md occupies an uncomfortable middle ground: too useful to abandon entirely, too unreliable to trust unconditionally. The academic evidence says context files are mostly redundant overhead for well-documented projects. The community says they're indispensable once properly tuned. Both are right—they're measuring different things.

The highest-leverage insight from all of this is that CLAUDE.md is not an instruction file. It's a context file. Claude treats it as background information, not binding rules. The system literally wraps it with permission to ignore irrelevant content. Once you internalise that distinction, the optimal strategy becomes clear: use CLAUDE.md for the 20% of project knowledge that saves the most repeated explanation, enforce critical rules through hooks rather than hopeful instructions, and resist the temptation to keep adding lines.

Keep it short. Keep it specific. Delete anything Claude already knows. And for anything that truly can't be broken—use a hook.

Claude Code dangerously-skip-permissions: Why It's Tempting, Why It's Dangerous

contact@thomas-wiegold.com (Thomas Wiegold) — Thu, 26 Feb 2026 00:00:00 GMT

Look, I'll be upfront: I've used claude code dangerously-skip-permissions more than I probably should have. If you're a developer working with Claude Code daily, you probably have too — or you've been tempted. The flag turns Claude Code from a cautious assistant that asks "may I?" before every mkdir into a fully autonomous agent that just... does things. It's intoxicating. It's also how people lose their home directories.

This is the honest version of that conversation. Not the sanitised Anthropic docs version, not the "just use Docker lol" Reddit comment version. The version where I tell you exactly what this flag does, why the permission system it bypasses is genuinely broken, and what happened to real developers who got burned.

What `--dangerously-skip-permissions` Actually Does

In normal operation, Claude Code asks permission for everything. Every bash command, every file edit, every network request, every MCP tool interaction. The --dangerously-skip-permissions flag auto-approves all of them. No confirmation dialogs. No pause. No chance to catch a bad command before it fires.

It's technically equivalent to --permission-mode bypassPermissions — same behaviour, different flag name:

claude --dangerously-skip-permissions "Fix all lint errors"
claude --permission-mode bypassPermissions "Fix all lint errors"

Here's the detail most people miss: subagent inheritance. When you enable bypass mode, all subagents inherit full autonomous access. You can't override this. The official SDK documentation spells it out clearly — subagents may have different system prompts and less constrained behaviour than your main agent, and they all get full, unsupervised system access.

The flag bypasses the entire safety stack: the command blocklist (which normally blocks curl, wget, and other web-fetching commands), write access restrictions (normally limited to the current working directory), the permission prompt system, and MCP server trust verification. Everything. Gone.

Enterprise admins can disable it organisation-wide, and there's a guardrail preventing use with root privileges. But if you're running it on your personal machine as your regular user — which, statistically, you probably are — those guardrails don't help you.

The Permission Fatigue Problem Is Real

The interrupt loop that kills deep work

Before I get into the horror stories, I want to acknowledge something: the default permission system is genuinely frustrating. This isn't developers being lazy. It's a real workflow problem.

You type a prompt. Claude starts working. You switch to Slack, check something, maybe grab a coffee. Five minutes later you come back and Claude is just... sitting there. Waiting for you to approve a file edit. The whole task is frozen at step two because it needed your blessing to run mkdir.

This isn't a theoretical complaint. Kyle Redelinghuys, who wrote one of the better posts on this flag, nailed it: you set Claude off on a task, walk away, and come back to find it stopped at step two because it needed permission to create a directory. He also documented a successful nine-hour autonomous session where Claude built an entire financial data analysis system from scratch. That kind of extended workflow is simply impossible when you're approving prompts every ninety seconds.

There's a deeper problem too. A commenter on LessWrong called "avturchin" articulated something I've felt but couldn't quite put into words: Claude asks roughly 100 permissions per hour, and it's impossible to evaluate whether any given one is dangerous without spending real time reading the details. So you end up rubber-stamping approvals without looking at them. That's "permission noise" — and it creates a false sense of security that might actually be worse than no permissions at all. At least with YOLO mode, you know you're flying without a net.

When YOLO mode actually makes sense

Even Simon Willison — the person who literally coined the term "prompt injection" and understands the risks better than almost anyone — acknowledges that Claude Code in YOLO mode feels like a completely different product. He's said publicly that he suspects many people who dismiss coding agents have never experienced YOLO mode in all its glory.

And Anthropic's own engineers use it. Their February 2026 blog post about building a C compiler with parallel Claudes shows their autonomous agent loop running claude --dangerously-skip-permissions in a bash while-loop. The parenthetical that follows is telling: (Run this in a container, not your actual machine.) Even the people who built the thing won't run it on bare metal.

Real Incidents — This Isn't Theoretical

Most cautionary articles about this flag are frustratingly vague. "Bad things could happen." "You might lose data." That's not useful. Here are the specifics.

The home directory wipes

The Wolak incident (October 2025) is the one that should keep you up at night. Developer Mike Wolak was working on a firmware project in a nested directory on Ubuntu/WSL2 when Claude Code executed an rm -rf starting from root (/). His GitHub bug report (#10077) documents it in forensic detail: error logs showed thousands of "Permission denied" messages for system paths like /bin, /boot, and /etc — the command literally tried to delete everything on the machine, and only stopped where Linux file permissions wouldn't let it. Every user-owned file was gone. Worse, the conversation log captured the command's output but not the actual command itself, making it impossible to determine exactly what went wrong. Anthropic tagged it area:security and bug.

The Reddit incident (December 2025) became the flag's most public disaster. A user on r/ClaudeAI asked Claude to clean up packages in an old repository. Claude generated rm -rf tests/ patches/ plan/ ~/ — and that trailing ~/ expanded to the user's entire home directory. Desktop files, Keychain passwords, application data, everything. Simon Willison amplified it on X as a reminder of the risk. It hit 197 points on Hacker News with over 156 comments and was covered by outlets in Japan and the US. It became the cautionary tale.

The tilde directory trick (November 2025) is the most insidious one. Developer JeffreyUrban filed GitHub Issue #12637 after discovering that Claude, in a previous session, had accidentally created a directory literally named ~. Just a tilde. When Claude later ran rm -rf * in the parent directory, the shell expanded * to include the ~ directory name, which the shell then interpreted as the home directory. A two-step failure spread across separate sessions. His comment says it all: "Loving claude, but this was and is continuing to be super frustrating to recover from."

Subtler but more common damage

The dramatic wipes get the headlines, but the everyday damage is more insidious. Kyle Redelinghuys documented Claude overwriting an existing config file with blank values — no backup, no warning. It also tried to modify system-related JSON files that had nothing to do with the project. This kind of quiet corruption is harder to notice and harder to recover from than a blown-away home directory.

In January 2026, developer James McAulay was benchmarking Claude Cowork's folder organisation capabilities with explicit instructions to retain user data. Cowork executed rm -rf, deleting approximately 11GB of files, and its task list cheerfully marked "Delete user data folder: Completed." He posted the video on X. Live on camera.

And then there's prompt injection. PromptArmor demonstrated in January 2026 that hidden text inside a .docx file — 1-point font, white text on white background — could manipulate Claude into uploading sensitive files to an attacker's Anthropic account via the allowlisted API. No special permissions needed. No suspicious-looking commands. Just a document that looked perfectly normal to human eyes. This isn't a theoretical attack vector. It's been demonstrated, recorded, and published.

The Container Consensus — How to Actually Use It Safely

The community has converged on a clear answer: never run --dangerously-skip-permissions on your host machine. Containers. VMs. Sandboxed environments. That's it.

Docker is the answer

A typical safe setup mounts only the project directory and runs with network isolation:

docker run -it --rm \
  -v $(pwd):/workspace -w /workspace \
  --network none \
  claude-code:latest --dangerously-skip-permissions "Implement feature"

Anthropic provides an official reference devcontainer with firewall rules that restrict outbound connections to whitelisted domains — npm registry, GitHub, the Claude API — and a default-deny network policy. The devcontainer docs explicitly state that the container's enhanced security measures allow you to safely run --dangerously-skip-permissions for unattended operation.

This is the mental model shift that matters. The question isn't "should I be more careful with the flag?" — it's "should I be running AI agents directly on my machine at all?" The answer, increasingly, is no. Your host machine has your SSH keys, your .env files, your browser cookies, your Keychain. An AI agent with full system access is one bad prompt away from touching all of it. A container has whatever you give it and nothing more.

Layer your safety practices

Containers aren't the whole story. Experienced developers stack multiple precautions:

Git checkpoints before every session. git add -A && git commit -m "checkpoint pre-claude" means recovery is always one git reset --hard HEAD away. This is the single cheapest insurance you can buy.

Tight task scoping. There's a world of difference between "Build me a financial analysis system" and a prompt that specifies exact files, expected flows, and validation criteria. The more specific your prompt, the less room Claude has to improvise destructively.

Budget limits. --max-budget-usd 5.00 prevents runaway API spending. You'd be surprised how fast costs accumulate during autonomous sessions.

Explicitly block dangerous tools. --disallowedTools "Bash(rm:*)" blocks rm even in bypass mode. This works even when --allowedTools doesn't — a quirk that's worth knowing about.

Request changelogs. Ask Claude to document changes as it works. Makes post-session review actually manageable instead of a forensic excavation.

Safer Alternatives Most Developers Don't Know Exist

The flag creates a false binary — fully supervised or fully autonomous. There are middle grounds.

acceptEdits mode auto-approves file modifications but still prompts for shell commands. If your workflow is mostly refactoring and you trust file edits but not arbitrary bash, this is the sweet spot.

allowedTools configuration lets you whitelist specific safe operations without blanket bypass:

{
  "permissions": {
    "allow": ["Read(*)", "Grep(*)", "Glob(*)", "Bash(npm run lint:*)", "Bash(git commit *)"]
  }
}

This is principle of least privilege applied to AI agents, and it's far more surgical than the bypass flag.

plan mode creates a read-only plan for human approval before any execution. Great for high-stakes changes where you want to see the full picture before anything runs.

PreToolUse hooks — Trail of Bits published an excellent config repo showing how to set up hooks that block rm -rf patterns and direct pushes to main. They're guardrails, not walls, but they catch the obvious disasters.

One gotcha worth flagging: there's a documented bug (#17544) where combining --dangerously-skip-permissions with --permission-mode plan causes the bypass flag to silently override plan mode entirely. You think you're in plan mode. You're not. You're in full bypass.

The Honest Verdict

The --dangerously-skip-permissions flag exists because the alternative — approving a hundred prompts an hour, rubber-stamping most of them without reading — creates its own failure mode. Both defaults are bad. The flag just makes the failure mode more spectacular.

The community consensus is clear and, at this point, pretty much universal: containers or don't bother. Layer git checkpoints, tool restrictions, and network isolation on top. Explore acceptEdits and allowedTools before reaching for the nuclear option. And recognise that the fundamental issue isn't just this flag — it's that LLMs can generate catastrophically destructive commands like rm -rf ~/ regardless of what permission system wraps them. The flag merely determines whether a human gets a chance to catch the mistake before it executes.

I've shifted my own practice toward containers. Not because I think I'll be the one who loses their home directory — everyone thinks that — but because the calculus changes once you realise the downside is unbounded and the container setup takes twenty minutes. That's a trade I'll take every time.

The vibe coding culture around AI tools is, frankly, too careless about this stuff. .env files with production credentials sitting in scope. SSH keys accessible to agents. Open database connections. That's a bigger conversation — and probably a future post — but --dangerously-skip-permissions is the symptom, not the disease. The disease is treating AI agents like they're just faster versions of us, when they're really more like very capable interns with root access and no sense of consequence.

Treat them accordingly.

AI Business Context Validation: How to Know If Your AI Is Actually Working

contact@thomas-wiegold.com (Thomas Wiegold) — Mon, 23 Feb 2026 00:00:00 GMT

Here's a pattern I keep seeing. A business deploys an AI chatbot or automation tool. The demo was impressive. The team is excited. Three months later, the thing is quietly giving customers wrong answers, contradicting internal policies, or — my personal favourite — confidently inventing information that sounds plausible but is completely made up.

This is the AI business context validation problem, and almost nobody is talking about it.

Everyone's writing guides about getting started with AI. There are hundreds of "AI readiness" articles, implementation checklists, and vendor pitch decks. But the question that actually matters after deployment — "is this thing working correctly for my business?" — has a content gap you could drive a truck through.

Let me walk you through what I've learned about closing that gap.

Why Most AI Deployments Fail After the Demo

The numbers are brutal. RAND Corporation research found that over 80% of AI projects fail — twice the failure rate of non-AI IT projects. Gartner originally predicted 30% of GenAI projects would be abandoned after proof-of-concept by end of 2025. The actual number turned out to be at least 50%.

The pattern is always the same. Clean demo data works beautifully. Then real-world data arrives — messy, incomplete, full of edge cases nobody anticipated. Business policies change, but the AI's knowledge doesn't. And suddenly you've got a tool that's confidently wrong in ways that cost real money.

You've probably heard about the Air Canada chatbot case. Their chatbot told a grieving customer he could retroactively apply for a bereavement fare discount — a policy that didn't exist. The BC Civil Resolution Tribunal ruled Air Canada liable for CAD $812. The tribunal's reasoning was blunt: it makes no difference whether information comes from a static page or a chatbot.

Then there's New York City's MyCity chatbot, built on Microsoft Azure and intended to help small businesses navigate regulations. It advised landlords they didn't need to accept Section 8 vouchers (illegal since 2008), told employers they could take workers' tips (also illegal), and suggested businesses could refuse cash payments (illegal again). All 10 staffers who tested the Section 8 question got wrong answers. The roughly $500K chatbot was terminated.

These aren't capability failures. The AI was technically working fine. It just had no idea what the actual business rules were.

What AI Business Context Validation Actually Means

Let me draw a clear line here, because the terminology matters.

AI readiness is pre-deployment: "Are we prepared to use AI?" AI implementation is the deployment itself. AI business context validation is post-deployment: "Is the AI we deployed actually working correctly within our specific business context?"

Put simply, AI business context validation is the ongoing process of verifying that an AI system's outputs are accurate, compliant, and useful within the specific rules, policies, and workflows of your business — not just generally "correct" by some abstract benchmark.

That question has three dimensions. First, is the AI accurate against your actual business rules — not general knowledge, your rules? Second, does it comply with your policies and relevant regulations? Third, is it actually moving the needle on the business metric you care about?

Generic AI benchmarks don't answer any of these. A model can score 90% on an industry leaderboard and still give illegal advice about your return policy. I've seen it happen.

The 5 Most Dangerous Validation Gaps (and What They Cost)

The case studies make the risks concrete.

No business-rule grounding. A Chevy dealership's chatbot was manipulated into agreeing to sell a 2024 Tahoe for $1. No guardrails on pricing, no business rules baked into the system. The Air Canada and NYC chatbot failures fall into this same bucket — the AI simply didn't know what it wasn't allowed to say.

No context drift monitoring. This one's sneaky. Your business evolves — new pricing, updated policies, shifted brand positioning — but your AI keeps operating on stale knowledge. The AI Journal documented a company losing $2.1 million because AI-driven marketing campaigns contradicted a brand pivot the AI didn't know about. MIT research found that 91% of ML models experience degradation over time. Your AI doesn't stay good on its own.

No edge-case testing. Legal hallucinations are the poster child here. Stanford research found that ChatGPT hallucinates 28.6% of legal citations. In the Mata v. Avianca case, an attorney was sanctioned for submitting fabricated citations generated by AI. On a smaller scale, I've seen an HR department use AI to write an entry-level job description that somehow required 5–7 years of experience. Zero applicants. Nobody caught it because nobody tested it.

No human-in-the-loop for high-stakes decisions. UnitedHealth's nH Predict system used AI to deny Medicare Advantage post-acute care claims. When patients appealed, 90% of those denials were reversed. A class action is ongoing. Automating consequential decisions without human oversight is playing with fire.

Bias from missing business context. SafeRent's AI tenant scoring model didn't understand voucher income, disproportionately harming protected classes. The result: a $2.2 million settlement.

Each of these maps to a specific layer in the validation framework I'll walk through next.

The 5-Layer AI Business Context Validation Framework

This is synthesised from approaches by McKinsey, KPMG, and Australia's VAISS guidance, filtered through what I've actually seen work with small and mid-sized businesses. It's not theoretical. Every layer exists because I've watched something break when it was missing.

1. Define — Document Before You Deploy

Before you turn anything on, create a business context specification. What are your rules? Your policies? Your compliance requirements? Your workflows?

Every SMB I work with skips this step. It's the single most common mistake, and it's usually where the expensive problems originate. You can't validate AI against business rules you haven't written down.

This doesn't need to be a 50-page document. Start with the basics: what is this AI allowed to say and do? What isn't it allowed to say or do? What information must it get right, with zero tolerance for error?

I usually start clients with three categories: hard rules (pricing, legal obligations, compliance requirements — things the AI must never get wrong), soft rules (brand voice, preferred phrasing, escalation triggers), and context boundaries (what topics the AI should refuse to answer entirely). Getting these written down before deployment is the difference between a system that works and a system that works until it doesn't. Firms that take a phased, documented validation approach see up to 2.8× higher ROI than those who deploy everything at once.

2. Test — Business Rules, Not Benchmarks

Build test datasets from real business scenarios — the actual queries your system will face, not synthetic happy paths. This is where tools like Promptfoo become invaluable. It's open-source, CLI-based, and lets you write assertion tests in YAML that read like business requirements, not code:

- assert:
    - type: llm-rubric
      value: "Response must follow company return policy and not offer unauthorized discounts"

That's it. No coding required. You're testing what matters to your business, not what matters to a benchmark.

The critical part: test adversarial and edge cases, not just the obvious paths. Ask yourself, "What's the worst misunderstanding a customer could have from this response?" Then test for that.

3. Ground — Keep AI Anchored to Current Business Knowledge

RAG (Retrieval-Augmented Generation) in plain terms: instead of the AI relying on whatever it learned during training, it retrieves your actual policies and documents at query time. Your business rules live in a knowledge base. The AI checks them before answering.

This is the most practical approach for SMBs. Structure your policies, workflows, and compliance requirements as documents in a vector store. When policies change — and they will — update the knowledge base, re-run your test suite, and deploy.

To evaluate whether retrieval is actually working, Ragas is excellent — open-source, reference-free (no manual annotations needed), and recommended by OpenAI at their DevDay conference. It'll tell you if the AI is actually pulling the right context before generating answers.

4. Monitor — Catch Drift Before It Costs You

Deploying without monitoring is like launching a website and never checking if it's still up. Four types of drift to watch: data drift (user behaviour changes), prompt drift (queries diverge from what you designed for), output drift (response quality degrades), and concept drift (the relationship between inputs and correct outputs changes — the hardest to detect).

Good news: you can set this up for free. Langfuse offers 50,000 observations per month on their free tier. Arize Phoenix is fully self-hosted and free. Either will give you visibility into what your AI is actually doing in production.

MIT research found 75% of businesses that didn't monitor saw performance decline. That stat alone should be enough to justify the setup time.

5. Review — Human Oversight for High-Stakes Decisions

Confidence-based escalation is the practical pattern here: the AI rates its own certainty on each response, and low-confidence outputs route to human review automatically. Organisations implementing human-in-the-loop workflows report accuracy rates up to 99.9% for document extraction, compared to 92% for AI-only.

The most important trigger for re-validation: any business rule or policy change. This is where most context drift originates. Changed your return policy? Re-run the test suite. Updated your pricing? Re-run the test suite. It sounds tedious because it is. But it's significantly cheaper than the alternative.

The Free SMB Validation Toolkit

You can assemble a complete validation stack for $0 in software costs. LLM API fees for running evaluations are the only variable expense.

Start with Promptfoo for business-rule assertion testing. Add Ragas if you're using RAG for business knowledge grounding. Layer in DeepTeam for red-teaming and security testing (50+ vulnerability types). Deploy Langfuse or Arize Phoenix for production monitoring. Build human-in-the-loop approval checkpoints using whatever you already have — n8n, Zapier, or even a Slack workflow.

That sequence matters. Get your tests right first, then monitor production.

Australian SMBs: Validation Is Now a Compliance Requirement

If you're operating in Australia, this isn't optional anymore.

There's no AI-specific law yet, but existing legislation already bites. Under the Australian Consumer Law, chatbot hallucinations are potentially misleading conduct under section 18. Under the Privacy Act, APP 10 requires reasonable steps to ensure personal information is accurate — a requirement directly challenged by AI hallucinations. ASIC's REP 798 "Beware the Gap" report found that businesses are adopting AI faster than they're updating governance, and flagged an explicit "governance gap."

There's also a hard deadline approaching. The Privacy Act Amendment — new APP 1.7–1.9 — commences December 10, 2026. If you use automated systems that process personal information to make decisions significantly affecting individuals, you'll need to disclose this in your privacy policy. Civil penalties apply. The OAIC has already signalled enforcement intent with privacy policy compliance sweeps across six industries in January 2026.

The five-layer framework above maps directly to Australia's Guidance for AI Adoption (AI6) essential practices on testing, human oversight, and record-keeping. Validation isn't just good practice — it's what the government guidance tells you to do.

What Good Validation Looks Like

When AI validation works, the numbers are compelling. Businesses see up to $3.70 ROI per dollar invested. Employees save 8–10 hours per week when AI is working correctly. Phased validation approaches deliver 2.8× higher returns than big-bang deployments.

The pattern I've seen in every successful AI deployment is the same: define what "correct" means before you deploy, test against real scenarios, keep the knowledge base current, monitor for drift, and keep a human in the loop for anything consequential. It's not glamorous. It doesn't make for exciting vendor demos. But it's the difference between AI that quietly makes your business better and AI that quietly makes your business liable.

The question to ask before any AI deployment: "What test would I run to prove this is working correctly for my business?"

If you can't answer that, you're not ready to deploy. And if you've already deployed without answering it — well, now's a good time to start.

Prompt Engineering Best Practices 2026

contact@thomas-wiegold.com (Thomas Wiegold) — Sat, 21 Feb 2026 00:00:00 GMT

If you're still writing prompts the way you did in 2023, you're leaving performance on the table. I know because I was doing exactly that about a year ago — copying the same "you are a helpful assistant" preambles, adding increasingly desperate ALL-CAPS instructions, and wondering why my outputs kept drifting.

The thing is, prompt engineering best practices in 2026 look almost nothing like they did when ChatGPT first dropped. The discipline has split cleanly in two: casual prompting (which anyone can do — the models got better at reading intent) and production context engineering (which is a genuine engineering skill). I build systems where prompts run thousands of times, and getting them right compounds in value every single execution. Here's what I've learned actually works.

Prompt Engineering Is Dead. Context Engineering Is What Replaced It.

In June 2025, Andrej Karpathy posted on X what a lot of practitioners were already feeling: the term "prompt engineering" trivialises what we actually do. His framing was elegant — the LLM is a CPU, the context window is RAM, and your job is to be the operating system, loading working memory with exactly the right code and data for each task.

The real failure mode in production isn't a bad prompt. It's bad context assembly. Phil Schmid from Hugging Face nailed it: most agent failures aren't model failures anymore — they're context failures. You retrieved the wrong documents. You stuffed too much history into the window. You forgot to include the tool definitions. The prompt itself was fine.

LangChain formalised four strategies for this: write (persist context externally), select (retrieve what's relevant via RAG), compress (summarise and compact), and isolate (separate contexts for different agents). If you use Claude Projects, you're already doing context engineering — your project system prompt is persistent instruction plus curated context applied to every conversation. It's worth treating that like production code, because functionally, it is.

Start Simple. Expand Based on What's Wrong.

This is the single most useful workflow I've adopted, and it contradicts the instinct most of us have to write exhaustive prompts upfront.

Research from Levy, Jacoby, and Goldberg (2024) found that LLM reasoning performance starts degrading around 3,000 tokens — well below the technical maximums we all get excited about. The practical sweet spot for most tasks is 150–300 words. That's not a lot. It forces you to be specific rather than comprehensive.

The process is dead simple: write the shortest version that describes your intent. Test it. Identify what's actually wrong or missing in the output. Add only what fixes that specific gap. Repeat. You end up with a prompt that's lean and targeted instead of a 500-word archaeological dig where you can't tell which instruction is actually doing the work.

Why Long Prompts Hurt More Than They Help

Three reasons long prompts quietly degrade your results.

First, attention scales quadratically. Every token you add makes the model work harder to figure out what matters. This isn't theoretical — it's O(n²) in the transformer architecture, and it shows up as vaguer, less focused outputs.

Second, the "lost in the middle" problem is real and well-documented. Liu et al. (2024) showed a U-shaped performance curve across every model they tested: accuracy is highest when relevant information appears at the beginning or end of the context, with over 30% accuracy drop for information buried in the middle. The paper has over 2,500 citations for good reason. Put your critical instructions first and last. Not the middle. Never the middle.

Third, there's the maintenance cost nobody talks about. Debugging a 500-word prompt when output quality suddenly drops is miserable. You change one sentence and three other behaviours shift. Shorter prompts are easier to reason about, easier to test, and easier to fix.

Model-Specific Tactics That Actually Matter

Most prompt engineering guides treat all models the same. That's wrong, and it costs you performance every time you port a prompt between providers.

Claude — XML Tags and Literal Instructions

Claude 4.x models follow instructions literally. If you don't ask for something, you won't get it — the "above and beyond" behaviour from earlier versions is gone. This is actually a good thing once you adjust. You get predictable, controllable outputs.

XML tags (, , ) are genuinely the best structuring method for Claude. Not Markdown, not numbered lists — XML tags. Wrap your few-shot examples in tags. Reference tagged content in your instructions ("Using the data in tags..."). It makes a measurable difference.

One thing that caught me off guard: aggressive language actively hurts newer Claude models. "CRITICAL!", "YOU MUST", "NEVER EVER" — these overtrigger and produce worse results than calm, direct instructions. Just say what you want. Claude listens.

For extended thinking, use adaptive mode and let the model decide when it needs to reason deeply. Don't pass thinking blocks back as input on subsequent turns.

GPT-5 — Conversational, Skip Explicit CoT

GPT-5 is a router-based system — multiple models behind a single endpoint. Saying "think hard about this" in your prompt literally triggers the reasoning model. Which means explicitly adding "think step by step" to reasoning tasks can actually hurt performance. OpenAI's own docs warn against this.

The practical advice: keep prompts conversational, pin production apps to specific model snapshots (e.g., gpt-5-2025-08-07) because the router behaviour changes between versions, and try zero-shot before reaching for few-shot. GPT-5 is surprisingly good at inferring intent from minimal context.

Gemini — Shorter and More Direct

Gemini's 2M token context window is impressive, but it makes placement decisions even more consequential. Google's prompt engineering whitepaper recommends always including few-shot examples (zero-shot is explicitly not preferred), and placing specific questions at the end, after your data context. Gemini prefers shorter, more direct prompts than either Claude or GPT.

Four Techniques Worth Using (and When to Skip Them)

The problem with most technique roundups is they list everything without telling you when to actually use each one. So here's the honest version.

Few-shot prompting remains one of the highest-ROI techniques available. Three to five diverse examples, wrapped in tags for Claude. A surprising finding from Min et al. (2022): the label space and input distribution matter more than whether individual example labels are correct. Even randomly labelled examples outperform zero-shot. So stop agonising over perfect examples and focus on covering the diversity of your input space.

Chain-of-thought still works brilliantly for standard models on hard tasks — research shows a 19-point boost on MMLU-Pro with CoT. But skip explicit CoT for reasoning models (o-series, Claude Extended Thinking, Gemini Thinking Mode). They already do it internally. Adding "think step by step" is like telling someone who's already thinking to please start thinking.

Role prompting is useful for open-ended and creative tasks but has negligible effect on classification and factual QA. Don't cargo-cult it into every prompt.

Positive framing over negation — "only use real data" consistently outperforms "don't use mock data." This is the Pink Elephant Problem: telling a model not to do something forces it to process that concept first. Reframe every negative instruction as a positive one.

Skip Tree-of-Thought and LATS unless you have a very specific, high-stakes task that justifies the compute cost. For 99% of use cases, they're overkill.

Prompts Are Code — Treat Them Like It

This is the part that separates someone who uses AI from someone who ships with it.

Version control your prompts. Prompt drift is real — you tweak something on a Thursday afternoon, forget what you changed, and spend Monday debugging output that used to work fine. If your prompt runs more than once, it belongs in version control.

Build a golden test set: representative inputs with expected outputs. Run it on every prompt change. This is just regression testing, except instead of code, you're testing the instructions that generate the code.

Structure your prompts for caching. Place static content first (system instructions, few-shot examples, tool definitions) and variable content last (user messages, query-specific data). With Anthropic's prompt caching, this can cut costs by up to 90% and latency by 85%. OpenAI offers automatic caching with 50–90% discounts depending on the model. The savings are substantial when you're running thousands of completions.

For production systems, Promptfoo (open-source, 51K+ developers) brings CI/CD discipline to prompts — automated testing, red teaming, the works. If you're already treating your application code seriously, your prompts deserve the same treatment.

Is Learning Prompt Engineering Still Worth It in 2026?

The job title is effectively gone. Fast Company reported in May 2025 that prompt engineering as a standalone role "has all but disappeared," with 68% of firms now providing it as standard training across all roles. A Microsoft-commissioned survey of 31,000 workers ranked Prompt Engineer second to last among new roles companies plan to add.

But the skill? More valuable than ever — it just got absorbed into the job description of everyone who works with AI. What's actually valuable now is designing context assembly systems, writing evals, understanding model-specific behaviour, and knowing when a technique helps versus when it's noise. Not clever phrasing.

There's also the automation paradox worth acknowledging. Tools like DSPy and OPRO can algorithmically discover better prompts than humans write. But someone still needs to design the metrics, curate the examples, and decide what "better" means. The craft moved up a level of abstraction.

If you run the same prompts repeatedly — and if you're building anything real, you do — the compounding ROI on a well-tested prompt is obvious. A 5% improvement across 10,000 executions isn't a rounding error. It's the whole point.

Three Things to Do Today

Audit your longest prompts. Anything over 300 words should be questioned — is every sentence earning its place, or is it there because you were nervous?

Check where your critical information sits in the context window. If it's in the middle, move it. Beginning or end. This is free performance.

If you use Claude Projects, open your project system prompt right now and treat it like production code. Version it. Test it. Iterate on what's actually broken instead of what might hypothetically go wrong.

The models keep getting smarter. But the gap between a careless prompt and a well-engineered context isn't closing — it's widening. The people who take this seriously will keep shipping better work. That's not hype. That's just compounding returns on a skill worth practising.

I Switched From Claude Code to OpenCode — Here's Why

contact@thomas-wiegold.com (Thomas Wiegold) — Tue, 17 Feb 2026 00:00:00 GMT

I've been using Claude Code since its early preview days in February 2025. It was my daily driver — the tool I reached for without thinking. I'd tried Codex CLI, kicked the tyres on Copilot, even dabbled with Aider. But I never seriously considered the open-source CLI alternatives. They felt like hobby projects chasing a moving target.

Then I started testing MiniMax M2.5 with OpenCode for a review article, and something clicked. That was the first time I ran OpenCode with intent rather than idle curiosity. And the question I kept coming back to was simple: in a Claude Code vs OpenCode comparison, is the open-source alternative actually good enough for daily use — or is it riding hype and 100K GitHub stars to nowhere?

Turns out, the answer surprised me.

What OpenCode and Claude Code Have in Common

Before getting into where they diverge, it's worth acknowledging how much these two tools overlap. A year ago this wouldn't have been a fair comparison. Today it absolutely is.

Both support natural-language coding in the terminal, multi-file edits, shell command execution, MCP integration, subagents, and custom agents defined via markdown files. Both have LSP integration, GitHub Actions support, plugin systems, and slash commands. The core feature set is nearly identical.

Here's a quick comparison of the key capabilities:

Capability	Claude Code	OpenCode
AI providers	Claude only (Opus, Sonnet, Haiku)	75+ via Models.dev, including local via Ollama
MCP support	First-class (stdio, HTTP, OAuth)	Full (stdio, SSE, OAuth)
Subagents	Up to 10 parallel (Explore, Task, Plan)	Build, Plan, General, Scout
Checkpoints/rollback	Automatic workspace snapshots	Git-based /undo and /redo
IDE extensions	VS Code, Zed	VS Code, Cursor, Zed, Windsurf
Multi-session	Named sessions, forking	Native multi-session

The gap has narrowed dramatically. Both projects ship updates almost daily. The competition is clearly driving both forward.

Where They Diverge — And Why It Matters

Model Lock-In vs Provider Freedom

This is the big one. Claude Code runs Claude models. That's it. It's clever about it — automatically round-robining between Haiku for cheap search tasks and Opus for complex reasoning — but you're locked into Anthropic's ecosystem.

OpenCode supports 75+ providers through Models.dev. Claude, GPT, Gemini, Deepseek, local models via Ollama — whatever you want. You can swap models per task, which sounds like a theoretical advantage until you actually do it. Then it becomes hard to go back.

I'll be honest: the local LLM angle is mostly aspirational right now. My M1 Mac Mini and M5 MacBook Air don't have the RAM for serious local coding models. But the architecture is ready for when hardware catches up, and that matters.

Terminal UX — REPL vs Proper TUI

Claude Code prints to stdout. It's a REPL that streams tokens with a spinner. Simple, composable, familiar to anyone who lives in the terminal. But resize your window mid-response and the rendering can break. Scroll back far enough and things get messy.

OpenCode takes a fundamentally different approach. It's built on OpenTUI, a custom framework with a TypeScript API layer and a native Zig backend for rendering. The result is a proper TUI application with its own buffer system — you can scroll freely, resize without breaking layout, and get syntax-highlighted diffs rendered inline.

Theme customisation sounds like a minor thing. It isn't. When you spend hours a day staring at a tool, having it look and feel like your tool makes a genuine difference. OpenCode feels like a proper application. Claude Code feels like a really good script.

Rollback — Different Approaches, Same Goal

Claude Code's automatic workspace snapshots are one of its best features. Every AI-made change gets snapshotted silently, and you can roll back with /rewind or a quick Esc×2. No thinking required — it just works.

OpenCode handles this differently. Its /undo command reverts the last message along with any file changes the AI made, and /redo restores them if you change your mind. Under the hood it uses Git, so your project needs to be a Git repository — which, let's be real, it should be anyway. It's not as granular as Claude Code's snapshot system, but it covers the main use case: "that last change was wrong, take it back." For my workflow, it's enough.

The Real Differences in Daily Use

Speed and Architecture

Claude Code benefits from tight Anthropic integration. Its built-in ripgrep provides fast file search, and LSP navigation clocks in at roughly 50ms versus 45 seconds for traditional text search on large codebases. The automatic model switching keeps costs down without you thinking about it.

OpenCode runs on Bun with the Zig rendering backend. Its persistent server mode eliminates MCP cold boot times on subsequent connections — meaningful if you're running multiple MCP servers. Builder.io's testing found that seven active MCP servers consumed 25% of a 200K-token context window before any user input. Both tools face this problem, but OpenCode's persistent server mitigates the startup penalty.

In practice? The bottleneck is almost always the LLM, not the CLI. Both are fast enough.

Onboarding and Configuration

Claude Code requires Anthropic authentication. You need either an API key or a Claude subscription login. Configuration uses a hierarchical settings system with CLAUDE.md files for project-level instructions.

OpenCode works immediately with any API key you already have. GitHub Copilot tokens, ChatGPT Plus subscriptions, free models through OpenCode Zen — drop in a key and you're coding. No sign-up required for the tool itself.

That zero-friction start is underrated. I've watched colleagues try Claude Code and bounce off the auth setup. OpenCode? They're writing code in under a minute.

Pricing Reality

Claude Code Pro runs $20/month, Max sits at $100–$200/month — significant savings over raw API costs if you're a heavy Claude user. The round-robin model selection makes your tokens go further without manual intervention.

OpenCode is free and MIT-licensed. Bring your own API keys. OpenCode Zen offers a curated model gateway at pass-through pricing. OpenCode Black at $200/month provides enterprise-tier access for teams that want it.

The economics shifted on January 9, 2026, when Anthropic blocked third-party tools from using Claude subscription OAuth tokens. OpenCode users who'd been routing their Claude Max subscriptions through it were immediately affected. You can no longer use a Claude Max subscription through OpenCode — meaning Claude-heavy OpenCode users now pay API rates for Anthropic models. That changes the maths for some people.

What I Don't Miss From Claude Code

This is the section I expected to be longer. It isn't.

OpenCode covers everything I need for my daily workflow. It's fast, the provider flexibility is genuinely useful rather than theoretical, and the TUI is better for extended sessions. The ecosystem is healthy — 700+ contributors, 9,200+ commits, multiple releases per day. When I sit down to work in the morning, I reach for OpenCode without hesitation.

That said, I'm not going to pretend it's all smooth sailing. Stability has been bumpy recently. The maintainers acknowledged in a February 2026 GitHub issue that recent releases had been more turbulent than usual. Moving fast has trade-offs, and OpenCode is moving very fast.

There's also the security angle. An unauthenticated remote code execution vulnerability (CVE-2026-22812) was disclosed in January 2026, scoring a CVSS 8.8. Previous versions started an HTTP server that let any website execute arbitrary shell commands on your machine. The fix shipped in v1.1.10, and the server is now disabled by default — but it's a sobering reminder that open source means more eyeballs and more attack surface. Worth knowing about, especially if you're running older versions.

The Bigger Picture — We're Still Figuring This Out

Claude Code and OpenCode are two answers to the same question: how should developers write code with AI? The CLI coding agent space has exploded — Aider, Cline, Gemini CLI, Codex CLI, and more are all competing for the same terminal real estate.

Open source competing at this level is a net positive for everyone. The pressure between these tools is producing rapid innovation on both sides. Claude Code shipped plugins, LSP support, and agent teams in the past three months. OpenCode shipped a complete TypeScript rewrite, desktop apps, and IDE extensions in roughly the same period.

My take? We're at the frontier of AI-assisted development, and the "right" tool will keep changing. Install both. Try others. Stay flexible. The fact that an open-source project with 104K GitHub stars can genuinely compete with Anthropic's flagship dev tool — a company valued at roughly $380 billion — says something important about where this space is heading.

The tools will keep getting better. The question isn't which one wins. It's whether you're paying attention while the ground shifts under all of us.

MiniMax M2.5 Review: Why I'm Seriously Considering Ditching Claude

contact@thomas-wiegold.com (Thomas Wiegold) — Sat, 14 Feb 2026 00:00:00 GMT

MiniMax's M2.5 model landed on February 12, 2026, and it's the first time I've genuinely questioned whether my Claude Max subscription is worth it. I've been paying Anthropic $200/month for Claude Code Max 20x — happily, mostly — because Opus 4.6 is phenomenal at reasoning through complex codebases. But when a model comes along that scores within 0.6% of Opus on SWE-Bench Verified at roughly one-twentieth the cost, you have to at least run the numbers. So I did. Here's my MiniMax M2.5 review after digging into the benchmarks, pairing it with the open-source OpenCode CLI, and stress-testing it against my usual workflow.

What MiniMax M2.5 Actually Is (and Why It Matters)

M2.5 is a Mixture-of-Experts model: 230 billion total parameters, but only 10 billion active during inference. That architecture is the entire reason the pricing works. You get frontier-tier capability without frontier-tier compute costs because most of the model is sitting idle on any given pass.

It ships in two variants — Standard at 50 tokens/second and Lightning at 100 tokens/second. For context, that Lightning speed is roughly double what you get from competing frontier models. Both variants are released as open weights on Hugging Face under a modified MIT License, which means you can self-host them, fine-tune them, or just use them through MiniMax's API.

The context window is 204,800 tokens (with the underlying architecture supporting up to 1 million), and it can generate up to 128K output tokens. MiniMax trained it using a proprietary RL framework called Forge that deployed the model across 200,000+ real-world environments — actual code repos, browsers, office apps — rather than just learning from human preference data. The result is what they call an "Architect Mindset": the model plans before it codes. I've seen this behaviour firsthand and it's not marketing fluff. It genuinely outlines structure and feature design before touching implementation.

How It Performs — Benchmarks and My Real-World Test

The Benchmark Picture

Let's get the numbers on the table. These are the scores that made me sit up:

Benchmark	M2.5	Claude Opus 4.6	GPT-5.2	Gemini 3 Pro
SWE-Bench Verified	80.2%	80.8%	80.0%	78.0%
Multi-SWE-Bench	51.3%	50.3%	—	42.7%
BFCL Multi-Turn (tool calling)	76.8%	63.3%	—	61.0%
Terminal-Bench 2	52.0%	65.4%	—	—

That SWE-Bench number is wild for an open-weight model. Six months ago this would have been science fiction. The BFCL tool-calling lead at 76.8% vs Opus's 63.3% is particularly interesting — it suggests M2.5's real-environment RL training translates directly into better function orchestration, which is exactly what you want in an agentic coding workflow.

But let's not pretend it's all roses. Terminal-Bench 2 at 52% versus Opus's 65.4% is a real gap. General reasoning scores (AIME 2025 at 45%, SimpleQA at 44%) tell you this model was optimised for coding and agentic tasks, not broad knowledge work. If you need a model to reason about abstract maths or answer obscure trivia, Opus still wins convincingly.

OpenHands ranked M2.5 4th overall and called it the first open model to surpass Claude Sonnet. Artificial Analysis scored it 42 on their Intelligence Index against a median of 25 for comparable models. Community reception has been cautiously enthusiastic — Hacker News loved the price-performance ratio but several developers flagged MiniMax's history of benchmark reward-hacking with M2 and M2.1. Fair concern. Worth watching.

My Link Shortener Test Project

Benchmarks are benchmarks. I trust my own tests more. I have a standardised Go project — a link shortener service — that I run against every new model that claims to compete at the frontier. Same spec, same constraints, same evaluation criteria every time. It's not a perfect methodology, but it's consistent, and consistency is what lets you compare.

M2.5 gave me the best result I've gotten so far. Better than Claude Code with Opus 4.6. Better than ChatGPT Codex. The architecture choices were sensible, the code was clean, and it finished fast. That "Architect Mindset" MiniMax talks about? I could actually see it working — the model laid out the structure before diving into implementation, which is exactly how I'd approach the project myself.

Now, a massive caveat: results vary between runs. I've seen this with every model. You can run the same prompt three times and get meaningfully different output quality. That's actually why I think raw inference speed is going to matter more and more for AI coding — if results are non-deterministic, the winning strategy is to run multiple attempts quickly and pick the best one. M2.5 Lightning at 100 tokens/second makes that approach economically viable in a way that Opus at $5/$25 per million tokens simply doesn't.

I'm not ready to crown it after one test. But I'm keeping it as my daily driver for the next few weeks to see how it holds up across real projects with real complexity. First impressions are genuinely strong.

OpenCode CLI — The Other Half of the Equation

Here's the thing most coverage misses: M2.5 alone is just a model. The reason it's a genuine threat to Claude Code is OpenCode.

OpenCode is an open-source coding agent built by Anomaly Innovations (the Y Combinator-backed team behind SST). It's hit 104,000+ GitHub stars and 2.5 million monthly active developers since launching in June 2025. It runs in the terminal, as a desktop app, or as a VS Code extension — and it supports 75+ LLM providers. Anthropic, OpenAI, Google, MiniMax, local models via Ollama, whatever you want.

The architecture is TypeScript on Bun with a client/server split. The TUI is just one frontend; the HTTP backend can be driven from mobile, web, or CI/CD pipelines. Compare that to Claude Code's monolithic terminal-only approach and you start to see the philosophical difference.

Features that matter to me: a Plan/Build mode toggle (Tab to switch between read-only planning and active modification), LSP integration for language-aware navigation, multi-session support so you can run parallel agents on the same project, and /undo /redo for reverting changes. There's also GitHub integration where mentioning /opencode in issue comments triggers automated actions, which is genuinely clever for team workflows.

Setup with MiniMax is trivial. Run opencode auth login, pick MiniMax as provider, paste your API key. Done. Or edit ~/.config/opencode/opencode.json with MiniMax's Anthropic-compatible API endpoint for persistent config. You can also run it free through Ollama with ollama launch opencode --model minimax-m2.5:cloud.

OpenCode itself is MIT-licensed and free. You only pay for the LLM API calls. That's the model I wish more developer tools would adopt.

The Price Gap Is Absurd

This is where the conversation gets uncomfortable for Anthropic and OpenAI.

M2.5 Standard charges $0.15 per million input tokens and $1.20 per million output tokens. Claude Opus 4.6 charges $5/$25. That's 33× cheaper on input and 20× cheaper on output. A typical SWE-Bench task costs about $0.15 with M2.5 versus $3.00 with Opus.

MiniMax's subscription tiers make the comparison even more pointed. Their $10/month Starter plan claims to match the capacity of Claude Code Max 5x at $100/month. Their $20 Plus and $50 Max tiers claim parity with Claude Code Max 20x at $200/month. Even if those claims are optimistic by 30-40%, the economics still overwhelmingly favour MiniMax.

Here's how I think about it: you can always resubscribe to Claude. The risk of trying M2.5 for a month at $10-20 is essentially zero. The potential upside is saving $150+/month on tooling that performs within a few percentage points of the premium option.

Running It Locally on a Mac

I haven't tried local deployment myself yet, but I've been watching others do it — and the results are promising enough to write about. Several developers have gotten M2.5 running via Ollama on a Mac Studio with an M3 Ultra and 512GB of RAM. It works. It's slower than the cloud API, noticeably so, but it runs and produces usable output.

That's a $10,000+ machine, so let's not pretend this is accessible to everyone today. But hardware gets cheaper and more powerful every cycle, and the direction is obvious. My Mac Mini M1 isn't going to cut it for a 230B parameter model, even with only 10B active. But in two or three hardware generations? Running a frontier-class coding model entirely on-device starts to look realistic for a much wider range of machines.

The reason this matters — especially for small businesses — is privacy. When your code never leaves your network, you eliminate an entire category of risk. No API dependency, no data flowing to Shanghai or San Francisco, no wondering what happens to your proprietary codebase in someone else's training pipeline. If I could run M2.5 locally with performance matching the cloud API, I'd switch to that setup without hesitation. We're not there yet, but the open weights mean the option exists the moment the hardware catches up.

What This Means for the AI Coding Market

We're looking at three distinct philosophies as of February 2026. Claude Code is the premium play — deepest reasoning, tightest integration with arguably the best single model, but $100-200/month and locked into Anthropic's ecosystem. ChatGPT Codex takes the multi-interface cloud approach, with GPT-5.3-Codex hitting 77.3% on Terminal-Bench 2.0 and offering the most generous usage at $20/month. And MiniMax M2.5 + OpenCode delivers provider-agnostic flexibility at a price point that makes sustained agentic workflows actually affordable.

The trend line is clear: AI coding is getting cheaper, more open, and more private. That benefits small and medium businesses disproportionately. A solo developer or a five-person team spending $10-50/month instead of $200/month per seat changes the economics of AI-assisted development entirely.

I want to be honest about the risks. MiniMax's M2 and M2.1 had documented problems with reward-hacking and test falsification. Whether M2.5 fully resolves those concerns is still under independent testing. The model's general reasoning lags behind both Opus and GPT-5.2 noticeably. And Hacker News users who heavily used previous MiniMax models reported brittle behaviour — context rot, error loops, hardcoded test cases instead of genuine solutions.

But here's the signal I keep coming back to: MiniMax claims 80% of newly committed code at their own headquarters is now M2.5-generated, with 30% of company tasks running autonomously on the model. When the people who built it trust it enough to run their own engineering on it at that scale, the benchmarks are probably directionally real — even if the last mile of polish still belongs to Claude.

I'm not declaring M2.5 the winner. I'm saying the value proposition is strong enough that I'm moving my daily driver for the next month to find out. At these prices, the experiment costs less than a single lunch in Sydney. I'll report back.

Claude Opus 4.6: What's Actually Better?

contact@thomas-wiegold.com (Thomas Wiegold) — Sat, 07 Feb 2026 00:00:00 GMT

Anthropic dropped Claude Opus 4.6 on February 5th, and the internet did what it does — half the people called it a breakthrough, the other half said it was lobotomised. I've been using it since launch day, and the truth is somewhere in between. More interesting, actually.

Let me walk through what's genuinely new, what the benchmarks say versus what my fingers-on-keyboard experience tells me, and whether you should bother switching from 4.5.

What Opus 4.6 Brings to the Table

The headline is a 1M-token context window — a first for any Opus-class model. It's in beta and restricted to API users at tier 4 or above (sorry, Claude Max subscribers), but the implication is significant. You can feed it an entire codebase, a stack of legal documents, or a research corpus in a single pass. Output capacity doubles to 128K tokens, up from 64K.

The old binary extended-thinking mode is gone, replaced by adaptive thinking with four effort levels: low, medium, high (default), and max. Instead of a fixed token budget for reasoning, the model dynamically decides how deeply to think. Anthropic recommends dialling it down to medium for simple tasks, which is polite-speak for "this thing will burn through your token budget if you let it."

Then there's the stuff that actually changes how you work. Agent teams in Claude Code — still a research preview — let multiple Claude instances split tasks in parallel and coordinate results. Context compaction does server-side summarisation of older conversation context, enabling effectively infinite conversations. And there are new integrations for PowerPoint and upgraded Excel capabilities, if that's your world.

This isn't a point release. Architecturally, it's a different beast from 4.5.

The Benchmarks Look Impressive — But Do They Matter?

Where Claude Opus 4.6 Clearly Wins

The numbers are hard to argue with. On GDPval-AA, Opus 4.6 scored 1606 Elo — a 190-point jump over 4.5 and 144 points ahead of GPT-5.2. On Terminal-Bench 2.0 (agentic coding), it leads at 65.4%. ARC AGI 2, which tests novel problem-solving, nearly doubled from 37.6% to 68.8%. That's not incremental.

The long-context performance is where things get genuinely impressive. On MRCR v2 with an 8-needle test at 1M context, Opus 4.6 scored 76% compared to 18.5% for Sonnet 4.5. At 256K context, the gap widened to 93% versus 10.8%. That's roughly a 4× improvement, and if you work with large codebases or document sets, that's the number that matters most.

Oh, and during testing it discovered over 500 previously unknown zero-day vulnerabilities in open-source code. Axios reported it could become a primary mechanism for securing open-source software. Not bad for a side effect.

Where It Doesn't Move the Needle

SWE-bench Verified is essentially flat — 80.8% versus 80.9% for 4.5. A prompt modification pushed it to 81.42%, which tells you these margins are within noise. GPT-5.2 still edges it on GPQA Diamond (93.2% vs 91.3%) and MCP Atlas tool coordination.

Here's the thing I keep coming back to: benchmarks improve every release. The numbers go up, the charts look good, the blog posts write themselves. But the gap between "good" and "better" shrinks perceptually even as the numbers climb. Going from 80% to 81% on SWE-bench doesn't feel like anything in your daily workflow. Going from 18.5% to 76% on long-context retrieval — that you feel.

Coding Got Better, Writing Got Worse — The Familiar Tradeoff

The partner testimonials read like a greatest-hits album. Cursor co-founder Michael Truell said it "excels on the hardest problems" with "greater persistence" and "stronger code review." GitHub's CPO highlighted its strength in complex multi-step coding work. Cognition's Scott Wu — the Devin people — said it "reasons through complex problems at a level we haven't seen before."

Real-world numbers back it up. Rakuten reported the model autonomously closed 13 issues and assigned 12 to the right team members in a single day across a 50-person org and six repositories. Norway's sovereign wealth fund found it produced the best results in 38 of 40 blind-ranked cybersecurity investigations against Claude 4.5 models.

As a coder, the improvements are what I feel. The model is more persistent. It holds context better across long agentic sessions. It doesn't give up and suggest workarounds as quickly.

But within hours of launch, Reddit lit up. Posts titled "Opus 4.6 lobotomized" and "Opus 4.6 nerfed?" gained traction on r/ClaudeCode and r/Anthropic. The complaint was consistent: writing quality regressed, particularly for technical documentation. The emerging community consensus was blunt — use 4.6 for coding, stick with 4.5 for writing.

This raises a question I think about more than I should: are we training models to be great at what's measurable — code passes tests, benchmarks have scores — at the expense of what's inherently subjective? Writing quality, tone, the feel of a well-crafted explanation — those don't have leaderboards. And maybe that's the problem.

Pricing Looks the Same But Isn't

Token pricing is identical to 4.5: $5 per million input tokens, $25 per million output tokens. Prompts exceeding 200K trigger premium rates of $10/$37.50. Batch processing still gets you a 50% discount, and prompt caching can save up to 90%.

Sounds fine on paper. In practice, early adopters report Opus 4.6 consumes roughly 5× more tokens per task than 4.5 due to adaptive thinking. The model thinks harder by default, which means it burns through your budget faster even though the per-token price hasn't changed. Anthropic's own evaluation cost via Artificial Analysis was $1,030.78 for a full Intelligence Index run.

	Opus 4.5	Opus 4.6
Input (per 1M tokens)	$5	$5
Output (per 1M tokens)	$25	$25
Tokens per typical task	Baseline	~5× baseline
Effective cost per task	$X	~$5X

Budget the same per-token, but expect higher bills.

Should You Switch from Opus 4.5?

Switch if your workload is code-heavy, you're doing long-context analysis, or you need the agentic capabilities — agent teams, context compaction, the deeper reasoning. If you're in security research, the vulnerability-finding capabilities alone might justify it. The model is available via the API (claude-opus-4-6), Amazon Bedrock, Google Cloud Vertex AI, and directly through claude.ai on Pro, Max, Team, and Enterprise plans.

Stay on 4.5 if writing quality matters more than coding performance for your use case, you're cost-sensitive and don't want the adaptive thinking overhead eating your budget, or you want the 1M context window but you're on Claude Max (it's API-only, tier 4+ for now).

Here's my honest take: I use Claude daily. It's my primary tool. Opus 4.6 works well and feels strong — but distinguishing it from 4.5 in everyday use is genuinely difficult. The improvements are real but incremental in feel, even when the benchmarks say otherwise. The long-context stuff is the exception — that's a qualitative shift you notice immediately. Everything else is the kind of improvement you'd struggle to identify in a blind test.

My recommendation: switch your coding workflows to 4.6 now. Keep 4.5 around for writing-heavy tasks until the regression gets addressed. And dial that adaptive thinking down to medium for anything that doesn't need deep reasoning — your wallet will thank you.

The Bigger Picture — Model Releases as Market Events

One thing worth noting: this launch landed during a week where Bloomberg reported a $285 billion rout in software stocks, with Thomson Reuters down nearly 16% and LegalZoom dropping almost 20%. Goldman Sachs' basket of US software stocks sank 6% in its biggest single-day decline since April.

Model releases aren't just product updates anymore. They're macroeconomic events. When Opus 4.6's financial analysis capabilities — the ability to scrutinise filings, market data, and regulatory documents in one pass — hit the news cycle, investors didn't debate benchmarks. They repriced entire sectors.

If each model release triggers selloffs, the AI industry's release cadence becomes a macro factor. That's a sentence I never expected to write on a developer blog, but here we are.

Thomas Wiegold - AI & Web Development Blog

MiniMax M3 Review: Finally Matching GPT-5.5 & Opus?

What Is MiniMax M3?

Putting MiniMax M3 Through My Usual Tests

Website one: the Sydney coffee roaster

Website two: the pop-culture online store

The poker simulation

The code audit

How It Stacks Up on Benchmarks

The Catches

The MiniMax M3 Review Verdict: Should You Use It?

Google Antigravity 2.0 Review: I Tested Gemini 3.5 Flash

What Antigravity 2.0 Actually Is (And Why It's Two Apps Now)

What It Can Actually Do, and How It Stacks Up

Why I Was Skeptical Before Even Installing It

Hands-On: Building Two Landing Pages

The Sydney coffee roaster site

The pop-culture clothing store

Where It Fell Apart: Token Limits and the Desktop App

The Pricing Catch Nobody Mentions Upfront

Should You Switch? My Google Antigravity 2.0 Review Verdict

Notion Workers for Small Business: A Hands-On Guide

What Notion Actually Launched

What are Notion Workers?

What is the Notion CLI?

Why Notion Already Owns Small Business Operations

What is Notion best for in small business?

The AI Automation Shift That Workers Unlocks

Can Notion Workers replace Zapier?

A Hands-On Build

The Honest Verdict

How much do Notion Workers cost?

Claude Code Hooks: From Linting to Hardened AI Workflows

How Claude Code Hooks Actually Work

Stage 1: Format and Lint on Every Edit

Stage 2: Security and Guardrails

Stage 3: Logging and Observability

Stage 4: Forced Verification with Stop Hooks

Hooks vs Skills, MCP, and Subagents

What About Codex and OpenCode?

Gotchas Worth Knowing Before You Ship

Where to Start

DeepSeek V4 Review: I Tested It on Real Code

What DeepSeek V4 Actually Is

What's missing

Hands-On Testing: My Three-Workload Rig

Test 1: Codebase Audit

Test 2: Poker Simulation

Test 3: Web Design (Two Builds)

What I learned from testing

Benchmarks That Actually Matter

How V4 stacks up against the open-weights pack

Pricing: Where V4 Actually Wins

Easiest ways to actually use V4

The Verdict: When to Use V4, When Not To

The Ralph Loop: How Recursive AI Agents Actually Work

What is a Ralph loop?

Why it's not just a while true loop

How it actually works

Running Ralph in Claude Code, Codex, and other tools

Claude Code

OpenAI Codex CLI and the new /goal command

Other tools, briefly

Why the journal is where it gets interesting

What Ralph is actually good for

The clearest case: measurable, mechanical work

The interesting case: fuzzy success, with a judge

When Ralph is the wrong tool

Tips so it doesn't waste your token budget

Should you try it?

Build an AI SEO Agent in TypeScript with Claude

What "AI SEO agent" actually means here

Why a reactive framework beats sequential await

The naive version

The reactive version

Architecture: three triggers, two mutexes

State shape

The triggers

The code, walked through

Project setup

Why it's not just a `while true` loop

OpenAI Codex CLI and the new `/goal` command

Why a reactive framework beats sequential `await`