Tsvetan Tsvetanov

Defining Slop

Tsvetan Tsvetanov — Fri, 24 Apr 2026 18:50:46 GMT

Today we have a guest post from Feifan.

Feifan is a cofounder of Tanagram, where he’s working on tools to scale expertise across AI-native engineering teams. He previously helped launch Stripe Issuing, and writes about organization design and effective results using AI.

“There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies.” – C. A. R. Hoare

Not too long ago, one of our users was lamenting the dramatic increase in slop that he was seeing. We tried to put some boundaries around what he considered slop, but at the time we struggled to pin down a definition.

That sent me down a research rabbit hole: talking to other founders building in AI, hearing about day-to-day struggles from tech leads, and reading research papers.

Here’s what I found.

Quantitative

A group of researchers quantified slop in two metrics:

Erosion: the fraction of the codebase’s total complexity mass that resides in high-complexity functions — i.e. a codebase is more eroded if most of the code is concentrated in a few, very-dense functions. An example from the researchers’ output:

def find_matches_in_file(text, path, language, rules):
    ...
    for rule in rules:
        if kind == "exact":
            ...
        elif kind == "regex":
            ...
        elif kind == "pattern":
            ...
        else: ...

        if iterable is not None:
            for match in iterable:
                ...
        if kind == "pattern":
            for match in iter_pattern_matches(...):
                ...
        for node in iter_tree_nodes(source_root):
            if not node_matches_selector(selector, node):
                ...
    return matches

This function ended up 117 lines long, with the majority of decision-point code concentrated at this level of abstraction.
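To make the erosion metric concrete, here is a minimal sketch. It assumes per-function complexity scores are already available (for example, cyclomatic complexity from a tool of your choice). The function name `erosion`, the threshold of 10, and the sample scores are illustrative assumptions, not the researchers' exact definition or data.

```python
def erosion(complexities, threshold=10):
    """Share of total complexity that sits in functions above `threshold`.

    `complexities` maps function name -> complexity score.
    The threshold is an assumed cut-off, not a value from the research.
    """
    total = sum(complexities.values())
    if total == 0:
        return 0.0
    heavy = sum(score for score in complexities.values() if score > threshold)
    return heavy / total


# Made-up scores: one dense function dominates the codebase.
scores = {"find_matches_in_file": 30, "load_rules": 4, "main": 6}
print(erosion(scores))  # 30 / 40 = 0.75
```

A value close to 1 means almost all decision logic lives in a handful of dense functions; a value close to 0 means complexity is spread thinly.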

Verbosity: duplicated or problematic lines (as defined through targeted AST-Grep rules), as a percentage of total lines. An example from the researchers’ output:

for posix_path, full_path, language in source_files:
    applicable_rules = [
        r for r in all_compiled_rules
        if language in r["languages"]
    ]
    if not applicable_rules:
        continue
    with open(full_path, "r", encoding=encoding) as f:
        content = f.read()
    matches = find_matches_in_content(
        content, applicable_rules, language
    )
    if not matches:
        continue
    match_list.extend(matches)
all_matches = deduplicate_matches(match_list)
return all_matches

They called out the identity list comprehension instead of filter, empty checks instead of building around iteration, and single-use variables.
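As a rough illustration of how the verbosity percentage could be computed once the rule hits are known, here is a small sketch. The function name, the set-of-line-numbers representation, and the sample numbers are assumptions made for this example; the researchers' tooling may count things differently.

```python
def verbosity(total_lines, flagged_lines, duplicated_lines):
    """Percentage of lines that are flagged by a rule or duplicated.

    Both collections hold 1-based line numbers. A line that is both
    flagged and duplicated is counted once.
    """
    if total_lines == 0:
        return 0.0
    problem_lines = set(flagged_lines) | set(duplicated_lines)
    return 100 * len(problem_lines) / total_lines


# Made-up input: a 20-line file with two flagged and two duplicated lines,
# one of which overlaps.
print(verbosity(20, {1, 2}, {2, 3}))  # 3 problem lines out of 20 = 15.0
```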

The full list of checks that the researchers used is here.

One additional way to quantify slop is through the amount of dead code that a change leaves behind — LLMs are strongly biased towards adding code, instead of removing existing code¹. Changes that leave behind more dead code are sloppier.
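A crude way to approximate the dead-code signal for a single Python file is sketched below using the standard `ast` module. It only looks for functions that are defined but never referenced in the same file, so treat it as a heuristic: it will miss dynamic dispatch and will wrongly flag functions that are only used from other modules. The function name and sample source are made up for illustration.

```python
import ast


def unreferenced_functions(source):
    """Names of functions defined in `source` but never referenced in it."""
    tree = ast.parse(source)
    defined = {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    referenced = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            referenced.add(node.id)       # plain calls and references
        elif isinstance(node, ast.Attribute):
            referenced.add(node.attr)     # method-style references
    return sorted(defined - referenced)


example = """
def parse(text):
    return text.split()

def parse_v2(text):
    return text.split(",")

print(parse_v2("a,b"))
"""
print(unreferenced_functions(example))  # ['parse']
```

Comparing this list before and after a change gives a rough count of the dead code the change left behind.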

Qualitative

A few other attributes are also hallmarks of AI-generated slop, although they’re harder to measure:

Locally-optimal code that doesn’t account for other aspects of the codebase/system. The code might fail to reuse existing constants and functions, database tables, or whole existing services. The code might reimplement functionality already provided by an existing library or framework, or implement things in an old, possibly-deprecated way. The code might try to work around an existing constraint without understanding why that constraint exists in the first place.
No clear organization/domain boundaries — code might be broken up into modules, but there’s no logical rationale behind the modules.
Code that diverges from present-day conventions, especially given multiple differing patterns that have evolved over time.

Why it Matters

Individual engineers may care about code quality (especially the staff/principal engineers whose mandate includes thinking about the long-term health of day-to-day engineering work), but businesses generally only care when tech debt manifests as a reliability or compliance problem.

The problem is that slop is a self-reinforcing loop: the more you vibe-code, the harder it is to work in a codebase. The harder the codebase, the more you want to vibe-code. At some point, there’s an abdication of responsibility: when the codebase is no longer human-understandable, engineers take themselves out of the loop and give up on trying to keep things maintainable.

People stop being responsible, they stop being diligent, and that’s how outages and incidents arise.

What to Do About It

If you haven’t already, start using modern coding agents: not Copilot, and probably not Cursor, but Claude, Codex, Amp, or OpenCode. Really play with them and get a feel for their strengths and weaknesses, like you would with a new coworker. You’re trying to figure out the right amount of detail you need to specify: when you can be broad and it’ll “get it”, and when you have to specify more constraints. Then apply your engineering judgment to the results; don’t blindly accept the output as good or done, even if it looks like things work.

This process is now a continuous part of being an engineer. You’ll have to relearn this every few months as new models come out and harnesses evolve.

Finally, remember that in a system, when one bottleneck is alleviated, a different step becomes the next bottleneck. In software, now that generating code is no longer the bottleneck, design review and code review become the bottleneck. Invest in infrastructure that helps you scale your team’s unique expertise, patterns, and lessons.

Footnotes

1. This is, at least in part, inherent in post-training: in RLHF, reviewers are asked to compare which output is better, instead of judging whether output should be removed.

Notes from working with multiple agents

Tsvetan Tsvetanov — Fri, 03 Apr 2026 07:17:58 GMT

At the beginning of the year I started seriously exploring working on multiple things at once. With a multitude of agents. My harness is Claude Code. My framework of choice is nWave but before that I relied on Paul Hammond’s Claude config. Here are some learnings from this period shared in no particular order.

You don’t have to reinvent your framework

I see some folks building their skills, commands, agents and whatnot from the ground up. That’s fine as an experiment. If you have the time, definitely do it. But don’t get discouraged if you don’t. You don’t have to know all about skills, agents and all the other cutting edge harness developments. There are already pretty good frameworks you can use out of the box and be really productive. Keep your focus on orchestrating the harness instances and thinking about the product, instead of worrying about how a particular framework + harness pair performs.

Quality is important

When choosing a framework, pick one that has built-in quality controls, especially ones that align with eXtreme Programming techniques. Otherwise, I’m afraid you’ll end up in a pile of technical debt faster than you can imagine. Ideally, these practices aren’t only encoded in skill files but actually enforced through something like nWave’s Deterministic Execution System.

YOLO mode, but don’t forget the sandbox

If you want to be able to orchestrate multiple instances of Claude Code (or whatever harness) at once, you must use YOLO mode, that is, --dangerously-skip-permissions. Anything else will drive you insane by having you constantly approve details. It’s like reviewing hundreds of PRs per day. So, try to build a framework in which you have enough confidence that it’ll make the right decisions in terms of quality, and in which those decisions can be verified. Then run in YOLO mode. Don’t forget to sandbox it, however. There are already several proper sandboxing solutions out there. Recently I’ve been happy enough with Claude Code’s sandbox mode.

Code reviews aren’t the way

You basically cannot ensure quality manually for the amount of code the harness spits out. You should automate stuff. Rely on high quality tests, linters, architecture tests and all other kinds of verification systems you can implement and find. The only thing you have to review is the framework and the end result from a user standpoint. All else should be automated. And your framework should be reliable enough to create high quality software without constant oversight.
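As one example of the kind of automated check meant here, an architecture test can fail the build when a module imports from a layer it should not know about. The sketch below uses Python's standard `ast` module; the layer name `infrastructure` and the sample module are hypothetical, and a real project would run this over every file in the protected layer.

```python
import ast


def forbidden_imports(source, forbidden_prefix):
    """Return imported module names in `source` that fall under `forbidden_prefix`."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]  # module is None for "from . import x"
        else:
            continue
        hits.extend(
            name for name in names
            if name == forbidden_prefix or name.startswith(forbidden_prefix + ".")
        )
    return hits


# A domain module that wrongly reaches into the infrastructure layer.
domain_module = """
import math
from infrastructure.db import Session
"""
print(forbidden_imports(domain_module, "infrastructure"))  # ['infrastructure.db']
```

Wrapped in a test that asserts the list is empty, this turns an architectural rule into something the agent cannot silently violate.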

Terminals + IDE

The most effective way for me to work is to be mostly in the Terminal. Four terminals actually. Each terminal runs an agent in a sandboxed YOLO mode. Each terminal is either within the same project or in another project. I have an IDE open for some of the projects if I want to double check something. Maybe it’s just an old habit that I find hard to get rid of. My belief is that we’ll need an IDE less and less as we go.

Team work limits the parallelisation

Most days I run just one or two agents in parallel. On rare days I run all four. My days are occupied by talking with humans, understanding stuff and thinking while the agents chug away at code. I guess I’m in a slightly different position as a tech lead, and a big chunk of my job is actually making sure we’re aligned as a team. If you’re a software engineer you might be able to regularly spin up 4 or more agents.

The bottleneck is the discussion

Agents optimise the flow after the productive discussion has happened. After you, as a team of humans, have decided what to work on and what it’s going to look like. Of course, you can shoot in the dark and implement stuff with your army of agents. See what sticks and iterate. But I find it counterproductive to start working on something that isn’t clear enough in terms of an end goal.

Refactoring is cheap

Recently I refactored (or rewrote?) a system that had started to accumulate some slop along the way. At first I started by analysing what to refactor and thinking about how to split the tasks so other teammates could handle them. But in the process of analysis I realised that I could just spin up a couple of agents and leave them to it. All the identified opportunities for improvement were done within a day.

Multiple agents means controlling conflicts

The issue becomes merging work done in parallel. The same issue we have with humans. When I spin up several agents in the same repo I must create worktrees. But even if I do that, some agents mess things up. Or find it hard to integrate previously done work in the worktrees. It’s a tricky business. Sometimes it’s easier to just spin up agents serially instead of dealing with the merges.
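For reference, giving each agent its own worktree boils down to one `git worktree add` call per task. The helper below only builds the commands; the `agent/<task>` branch naming and the `../worktrees` directory are arbitrary choices for this sketch.

```python
def worktree_commands(task_names, base_dir="../worktrees"):
    """Build one `git worktree add` command per task.

    Each task gets its own branch and directory, so agents never share
    a working tree. To execute, pass each command to
    subprocess.run(cmd, check=True) from inside the repository.
    """
    return [
        ["git", "worktree", "add", "-b", f"agent/{name}", f"{base_dir}/{name}"]
        for name in task_names
    ]


for cmd in worktree_commands(["fix-login", "add-export"]):
    print(" ".join(cmd))
```

This isolates the working directories but not the merge: the branches still have to be integrated afterwards, which is where the conflicts described above show up.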

Conclusion

There might be uber-productive folks who work 12-15 hours a day spinning up tens of agents. But I think these are in the minority. And I’m not sure whether the end result isn’t a lot of waste and a bit of value. Even if it isn’t, it doesn’t have to be your goal. As an ordinary engineer, you can easily be a lot more productive by just having the proper, simple harness + framework and learning how to orchestrate a few agents at once. In my experience, that’s an easy and down-to-earth thing to do. And it doesn’t leave you drained at the end of the day.

AI software engineering doesn’t have to be an impossible objective you have to achieve. It can be simple.

Hey Agent! Test Drive!

Tsvetan Tsvetanov — Fri, 27 Mar 2026 09:11:53 GMT

I’m continuing my exploration of nWave’s Deterministic Execution System (DES). This time I’ll dive into a “simple” part of it. How it makes Claude Code follow test-driven development. I bet most of you who’ve actually tried thoroughly instructing an agent to follow a specific set of steps each and every time have struggled to enforce this. The beauty of nWave is that it solves this problem through its DES.

This is the 4th article in my series on nWave.

The Flow

Everything begins with a step. A step is a discrete part of the implementation that will result in a fully passing test suite. And it’s specified within a steps/XX-XX.json file for the current feature you’d like to implement. This step contains the self-reporting log of the agent where it describes which phase of the TDD cycle it has completed.

After this file is created, a subagent starts following nWave’s TDD loop:

PREPARE - verify that the log is properly created and decide with which acceptance test to start
RED_ACCEPTANCE - unskip the acceptance test and make sure it fails
RED_UNIT - write failing unit tests
GREEN - write the minimum amount of code to make them pass
COMMIT - type check, lint, refactor, verify tests still pass and commit

Each phase’s outcome is logged in the audit log. The agent self-reports whether the step has been executed successfully through the DES CLI, which updates the execution log. Before the next phase starts, the execution log is checked to make sure the agent has followed the steps as intended. The pre-write handler also verifies that no source files are written during phases where tests are being written.
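The "check the execution log before the next phase starts" idea can be sketched as a small gate function. This is not nWave's actual code: the log shape (a list of `{"phase", "status"}` entries) and the `PASSED` status value are assumptions made up for this illustration; only the phase names come from the list above.

```python
PHASES = ["PREPARE", "RED_ACCEPTANCE", "RED_UNIT", "GREEN", "COMMIT"]


def next_phase(execution_log):
    """Return the only phase allowed to start next, or None when the step is done.

    Raises ValueError if the log skips, repeats, or reorders phases,
    or if it records a phase that did not pass.
    """
    if len(execution_log) > len(PHASES):
        raise ValueError("log has more entries than phases")
    for expected, entry in zip(PHASES, execution_log):
        if entry["phase"] != expected:
            raise ValueError(f"expected {expected}, log says {entry['phase']}")
        if entry["status"] != "PASSED":
            raise ValueError(f"{expected} did not pass")
    if len(execution_log) == len(PHASES):
        return None
    return PHASES[len(execution_log)]


log = [
    {"phase": "PREPARE", "status": "PASSED"},
    {"phase": "RED_ACCEPTANCE", "status": "PASSED"},
]
print(next_phase(log))  # RED_UNIT
```

Because the gate works on structured entries rather than prose, an agent that jumps straight to GREEN is rejected mechanically instead of being argued with.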

Another crucial thing is that at the end of the execution phase there’s the refactor phase. Which makes sure the agent reviews the code for refactoring opportunities and improves the design. This is crucial for maintainability of the software you’re building.

At the end of each /nw-deliver wave, the DES also does a review on testing antipatterns and does mutation testing. This enforces high quality tests that actually test the system.
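For readers new to mutation testing, the sketch below shows the core idea on a toy function: change the code slightly (here `>=` becomes `>`) and see whether the tests notice. It uses only Python's standard `ast` module. The function, the two test suites, and the single mutation operator are invented for this example and say nothing about which mutation tool nWave actually runs.

```python
import ast

SOURCE = """
def is_adult(age):
    return age >= 18
"""


class FlipGreaterEqual(ast.NodeTransformer):
    """Mutation operator: replace every >= with >."""

    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node


def load(tree):
    """Compile a module AST and return its is_adult function."""
    namespace = {}
    exec(compile(tree, "<sketch>", "exec"), namespace)
    return namespace["is_adult"]


def weak_suite(fn):
    # Never probes the boundary value, so it cannot tell >= from >.
    return fn(30) and not fn(5)


def strong_suite(fn):
    # Adds the boundary case that distinguishes >= from >.
    return weak_suite(fn) and fn(18)


original = load(ast.parse(SOURCE))
mutant = load(ast.fix_missing_locations(FlipGreaterEqual().visit(ast.parse(SOURCE))))

print("weak suite kills mutant:", not weak_suite(mutant))      # False
print("strong suite kills mutant:", not strong_suite(mutant))  # True
```

A surviving mutant, as with the weak suite here, is the signal that the tests pass without really pinning down the behaviour.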

The Parts

This flow relies on a few crucial mechanisms which we’re going to explore.

Hooks

Claude Code’s hooks are a critical component. They are a way to execute user code during different lifecycle events. The most important ones for nWave’s TDD enforcement are:

PreToolUse - to validate whether all conditions have been met for the current task to start
SubagentStop - to validate whether the subagent has properly updated the logs, changed expected files and committed properly. It also cleans up after the subagent
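To give a feel for what a PreToolUse check can look like, here is a hedged sketch of a hook script's decision logic. It is not nWave's implementation. It relies on my understanding of Claude Code's hook contract: the event arrives as JSON on stdin with `tool_name` and `tool_input` fields, and exiting with code 2 blocks the tool call and feeds stderr back to the agent. The `tests/` directory convention and the way the current phase is passed in are assumptions for this example.

```python
import json
import sys

# Phases in which only test files may be written (names from the TDD loop).
TEST_PHASES = {"RED_ACCEPTANCE", "RED_UNIT"}


def decide(event, current_phase):
    """Return (exit_code, message) for a PreToolUse event.

    Assumes file-writing tools expose the target path as
    tool_input["file_path"] and that test files live under "tests/".
    """
    if event.get("tool_name") not in {"Write", "Edit"}:
        return 0, ""
    path = event.get("tool_input", {}).get("file_path", "")
    is_test_file = path.startswith("tests/") or "/tests/" in path
    if current_phase in TEST_PHASES and not is_test_file:
        return 2, f"{current_phase}: refusing to write non-test file {path}"
    return 0, ""


def main(current_phase):
    """Entry point for the hook script: read the event from stdin and exit."""
    code, message = decide(json.load(sys.stdin), current_phase)
    if message:
        print(message, file=sys.stderr)
    sys.exit(code)
```

Keeping `decide` free of I/O makes the rule itself easy to unit test, while `main` stays a thin adapter around stdin and the exit code.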

Execution Log

The execution log gives structure to the agent’s output. It provides a way for the scripts that implement DES to verify that what the agent says it has done has actually been done. If it weren’t for the execution log, DES would have to rely on natural language output, which would be far harder to check. The execution log also provides an audit trail.

Agent

The agent is the doer. It tries to follow instructions to the best of its abilities based on training, prompts and context. Of course, the agent lies. So DES corrects it along the way until it produces what’s actually expected from the specification.

The strength of DES comes from the interception of the prompt for the subagent. Each subagent gets started with a very specific prompt that is intercepted and checked by the DES hook to ensure the subagent will follow the steps.

Current Limitations and What’s Cooking

The main current limitation of DES is that it relies on the agent’s self-reporting in the execution log. And verifies against this. Despite this it’s doing a marvellous job in making agents follow instructions and implement what’s required by following high quality software engineering practices.

However, I have a hint that a new version of nWave is on the way. Which will provide a way to enforce any workflow you have. It’ll be zero trust and dynamic. Meaning it won’t trust the agent but the workflow rules you’ve encoded into the system.

I’m looking forward to it!

Making Agentic Flows Deterministic

Tsvetan Tsvetanov — Fri, 20 Mar 2026 08:10:57 GMT

I’ve had a lot of work recently so I haven’t been able to find ample time to write. So this will probably be a short post hinting at the power of nWave and giving a brief overview of its deterministic execution system: the engine that drives 50% of this framework’s reliability.

This is the 3rd article in my series on nWave.

The Key Idea

What if we could solve the problem of context drift?

Or, more specifically, the problem that AI left unsupervised skips phases, fabricates results, and bypasses discipline.

By intercepting Claude Code’s hook lifecycle events, the deterministic execution system (DES) verifies that each step of a plan is followed to the last detail. And it does it on two fronts:

To ensure the specification is translated into a verifiable plan
To ensure that the specification is delivered by strictly following double-loop TDD

Ensuring Specification Is Followed

nWave guides the developer (and vice versa) through several distinct waves. For greenfield work these are:

(discover) -> discuss -> design -> distill -> (devops) -> deliver

Each phase after the discuss wave creates verifiable artefacts. Acceptance tests, roadmap encoded in YAML, strict rules on step naming. Each of these artefacts should follow a specific schema that enables the DES to verify that the implementation:

Does each step one after the other
Does exactly what’s required of each step
Provides a business oriented test harness to verify the implementation
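Schema rules like these are what make mechanical verification possible. As a toy example, the check below validates that step identifiers follow a fixed naming pattern and appear in strictly increasing order. The two-digit `NN-NN` pattern is my assumption, inferred from nWave's `XX-XX` step file names; the real schema may differ.

```python
import re

# Assumed step id shape: two digits, a dash, two digits (e.g. "01-02").
STEP_ID = re.compile(r"\d{2}-\d{2}")


def check_roadmap(step_ids):
    """Return a list of problems found in an ordered list of step ids."""
    problems = []
    previous = None
    for step_id in step_ids:
        if not STEP_ID.fullmatch(step_id):
            problems.append(f"{step_id}: does not match NN-NN")
            continue
        current = tuple(int(part) for part in step_id.split("-"))
        if previous is not None and current <= previous:
            problems.append(f"{step_id}: out of order or duplicated")
        previous = current
    return problems


print(check_roadmap(["01-01", "01-02", "02-01"]))  # []
print(check_roadmap(["01-02", "01-01"]))           # ['01-01: out of order or duplicated']
```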

Ensuring Double Loop TDD Is Followed

Later on, during the deliver wave, DES makes sure that:

Acceptance tests are tackled one by one
Each feature described by an acceptance test is implemented by following strict TDD
The TDD loop always goes through
1. Write a single failing test
2. Make it pass
3. Refactor

What Does This Mean for Agentic Engineering

This is just scratching the surface of DES. But it’s the most powerful concept. We finally have a way to enforce our best practices, and we are only limited by the quality of our specification.

Building a Bird Flocking Simulator with nWave

Tsvetan Tsvetanov — Fri, 06 Mar 2026 16:34:49 GMT

This week I wanted to show nWave to a friend of mine. He was getting into working with agents for software development and was curious to understand my flow. We sat down and decided to implement a bird flocking simulator. It was a fairly complicated task with a known solution. The goal wasn’t to innovate. It was to show the nWave greenfield workflow and see whether this framework is capable of creating an end-to-end solution to a problem that runs on the first try. Just for the fun of it. I forgot to save all the Claude Code conversations but will recreate them to the best of my ability based on the logs I have.

This is the 2nd article in my series on nWave.

Everything Starts With a Discussion

First, we sat down together in front of the same computer. We were pairing. He knew what he wanted to see. I knew how to use the framework and what I wanted to show. So we started with the discuss wave which was something along the lines of:

/nw:discuss I want to build a web-based 3D bird flocking simulator.

This kicked off the product owner agent Luna which asked a few questions (removed some parts for brevity):

What type of feature is this 3D bird flocking simulator?

> 1. User-facing (Recommended)
  2. Backend/API
  3. Cross-cutting
  4. Other

I chose user-facing as we wanted a UI thing. However, nWave is fully capable of full stack work and infrastructure.

Then it asked whether it should start with a walking skeleton to validate the architecture end-to-end:

Should we start with a walking skeleton to validate the architecture end-to-end?

> 1. Yes (Recommended)
  2. Depends
  3. No

The “Depends” answer might be chosen when you already have an existing setup and want to adapt the new solution to it. “No” is for when you’re confident that the architecture will work on the first try. All in all, nWave tries to minimise the risk of failure during implementation by following proven practices.

After we decide on the walking skeleton we should consider the UI from a user perspective:

How deep should the UX research and journey mapping be?

> 1. Lightweight
  2. Comprehensive (Recommended)
  3. Deep-dive

This provides a thorough foundation for UX choices so our application is not only sound from an engineering perspective but is also actually delivering business value to our customers in a delightful way. I chose the second option, which was recommended. The agent then asks more clarifying questions related to the functionality. After you answer all of them, nWave proceeds with creating the proper foundation of specifications which it will deterministically follow throughout the implementation because of its Deterministic Execution System (which I’ll write about in a follow-up post).

Luna has started the DISCUSS wave discovery process. She's asking comprehensive
questions to understand your vision for the 3D bird flocking simulator before
creating any journey maps or requirements.

Note: Luna is emphasizing that she's not producing mockups or artifacts yet - she
needs to fully understand your mental model first through these discovery questions.
The more detail you provide, the better she can design the UX journey and craft
requirements that match your vision.

Please answer Luna's questions about:
- Core vision: Educational? Entertainment? Research tool?
- Target audience: Who will use this?
- Emotional journey: What should users feel in the first 10 seconds? First minute?
- Interaction model: Observation only? Parameter tweaking? Full sandbox?
- Visual style: Realistic? Abstract? Minimalist?
- Performance targets: How many birds? What devices?
- Features: What excites you most beyond basic flocking?

Take your time answering - this discovery phase sets the foundation for the entire
project. Luna will use your answers to create visual journey maps, YAML schemas,
Gherkin scenarios, and ultimately user stories with acceptance criteria.

Based on the answers I provided, it created a bunch of files for initial review. You can see the result in the requirements docs. These are meant for review by the engineer to make sure that what they want is properly captured within the specifications. Usually I review them thoroughly before proceeding to the next wave. But in that case I just moved forward.

The Gathered Requirements then Undergo Technical Design

Based on all of this data, the agent goes through the Design wave. This basically takes all of the input during the requirements gathering phase and creates thorough technical specifications.

/nw:design

You can either pass an existing feature name as a parameter here or just let it proceed from the current context. This kicks off the solution architect agent Morgan which will ask you a bunch of questions so it can do proper technical design.

Discovery Questions from Morgan

Please answer these questions to guide the architecture design:

Performance & Compatibility

1. Degradation strategy: If hardware can't sustain 1000 birds at 60fps,
should the system auto-optimize (reduce bird count), reduce visual fidelity
(LOD), hard cap at limits, or let it run slow with warnings?
2. Browser compatibility: Mac M-series only, or all modern desktop browsers?
Mobile support needed now or future?

Development Approach

3. Maintainability: One-time build, actively developed with new features,
or learning project optimizing for code clarity?
4. Testability: What level of automated testing? Unit tests for flocking
logic? Integration tests? Visual regression? Or manual testing acceptable?
5. Extensibility: Beyond obstacles/predators, plans for new behaviors,
plugin system, user-contributed presets, or 3rd-party integrations?
6. Time-to-market: MVP in weeks, polished product in months, or flexible
academic timeline?

Team & Technical Context

7. Team structure: Solo developer or small team? If team, how do they split work?
8. Technical experience: Familiar with 3D engines (Three.js, Babylon.js),
WebGL/WebGPU, TypeScript strict mode, TDD workflows?
9. Timeline: Target dates for Walking Skeleton (Feature 0), MVP (Features 0-3),
and full release?

Operations & Deployment

10. Operational requirements: CI/CD needed? Monitoring/error tracking? Analytics?
Or simple static hosting?
11. Deployment target: CDN/static hosting (Vercel, Netlify), self-hosted, or offline
capability needed?
12. Budget/licensing: Open source only, commercial libraries acceptable, hosting budget?

Your answers will help Morgan recommend architecture that fits your actual business
drivers.

Some of these questions were not so relevant. Like “timeline”: I expect it to be done in a few hours, so why ask for that at all? But as a software engineer I’m used to disregarding timelines. This is what I answered:

1. No need to care for that at the moment;
2. Chrome only;
3. Highly maintainable;
4. Highly automated testing;
5. No 3rd party extensibility for now;
6. MVP but high quality;
7. Solo dev;
8. Highly experienced;
9. Right now;
10. No deployment for now;
11. No deployment;
12. Open source

And here’s one of the things I really like about this approach. It makes you think. You have to make sure you know what you feed it and it’s what’s actually needed. The bigger the scope, the more thinking you have to put in and the more risk exposure you create. And you’ll see that reflected straight away in your cognitive abilities. Which might be a good guide for what batch size you should be working with. The design results were reflected inside the architecture design directory.

The Technical Design is Distilled Into Acceptance Tests

After all of this we start setting up our executable foundation for making sure what we want to implement is what’s going to be implemented. This will be done through acceptance tests implemented during the Distill wave and skipped initially. The Deliver agent will enable them one by one and implement them via Outside-In TDD.

/nw:distill

This kicks off the process and as always the agent asks you a few clarifying questions beforehand.

What is the scope of this feature?

> 1. Core feature (Recommended)
  2. Extension
  3. Bug fix

Which test framework should be used for acceptance tests?

> 1. Vitest + Playwright (Recommended)
  2. Cucumber.js
  3. pytest-bdd
  4. Jest + Testing Library

How should integration tests connect to services?

> 1. Test environment (Recommended)
  2. Mock Three.js rendering
  3. Real browser with visual regression

Should acceptance tests cover infrastructure concerns?

> 1. No (Recommended)
  2. Yes

We answered with the recommended defaults as they were quite on point. This invoked the acceptance designer agent Quinn. After a bit of token crunching, Quinn created the acceptance tests (which are now unskipped). When I reviewed them after the whole thing was implemented I saw that some of them remained skipped for some reason, but I didn’t get to fixing that. A reviewer agent would probably have caught this. And I guess the Deterministic Execution System could easily be upgraded to prevent such tests from staying in the code. They should either be made to pass or deleted.
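A check for leftover skipped tests does not need to be sophisticated. The sketch below scans test file contents for Vitest-style `.skip` and `.todo` markers and reports where they occur. It is a plain text scan written for this post, not part of nWave, and the sample file name and test titles are made up.

```python
import re

# Matches Vitest-style markers such as it.skip( or describe.todo(
SKIP_MARKER = re.compile(r"\b(?:it|test|describe)\.(?:skip|todo)\s*\(")


def find_skipped(files):
    """Return (path, line_number) for each line that contains a skip marker.

    `files` maps a path to that file's text.
    """
    return [
        (path, number)
        for path, text in files.items()
        for number, line in enumerate(text.splitlines(), start=1)
        if SKIP_MARKER.search(line)
    ]


sample = {
    "flocking.test.ts": (
        'it("renders birds", () => {});\n'
        'it.skip("avoids obstacles", () => {});\n'
    )
}
print(find_skipped(sample))  # [('flocking.test.ts', 2)]
```

Run at the end of a delivery, a non-empty result could fail the build until each skipped test is either made to pass or deleted.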

One thing I loved about the process was this summary.

🎯 Key Principles Validated

1. Hexagonal Boundary Enforcement ✅

Tests exercise driving ports only (ISimulationEngine, IRenderEngine, IUIController) -
no internal implementation details

2. Business Language Purity ✅

Zero technical jargon in Given-When-Then steps - all scenarios readable by non-technical
stakeholders

3. Walking Skeleton First ✅

Feature 0 (3 active scenarios) must pass before implementing US-001 through US-003

4. Performance Assertions ✅

Every scenario validates 60fps + <16ms render time

Then the Whole Solution is Delivered

What remains is the Deliver phase which is invoked through /nw:deliver. During it the software crafter agent Crafty delivers the whole solution following Outside-In TDD, Hexagonal Architecture, proper refactoring principles. The Deterministic Execution System makes sure that the steps outlined in the previous phases are followed exactly as planned. It also makes sure TDD is always followed strictly. You’ll see the end solution shortly, but before that a small comment on the process.

Using the Right Process

The flow I described above is for greenfield projects. The whole thing took about 2-3 hours, and our input was only required for the initial specification and review. That’s about 30 minutes of human time. For production critical software this might jump a bit as one has to make sure that the specs are exactly what’s needed. nWave ensures, to a great extent, that a spec is followed to the letter. But nWave can’t prevent a spec from being wrong.

For other cases, you can use a simpler process. A few ideas:

Use /nw:root-why followed by /nw:execute to dive deep into the root cause behind a bug;
Use /nw:research to understand part of the codebase, a feature, a technology or an idea;
Use /nw:mikado followed by /nw:refactor to do a safe refactoring;
Use /nw:deliver for brownfield projects where you know exactly what has to be accomplished.

All in all, the process can be tailored to your needs. The key technological principles for high quality software are baked in. And in my opinion they shouldn’t be negotiable (unless we uncover better ways of working). The Deterministic Execution System also ensures, to a great extent, that the agents will do what they’re asked to do. However, you can mix and match everything else and see what works best for your case.

Conclusion

The end result looks like this.

All tests pass. No linter or TypeScript errors. Exactly what I needed based on the requirements. The code is well factored, easy to understand, navigate and maintain. You can inspect it for yourself here: https://github.com/emdeha/flocking-birds-simulator.

I’m already expecting a possible protest though. And I’m kind of thinking this myself. Isn’t this waterfall? We specified an application to the utmost detail. A bird flocking simulator like this might take 1-2 weeks for an engineer to implement. It’s thousands of lines of code. A really big batch. We needed a lot of time before getting feedback on our specification. Frankly, I don’t have the answer here. I guess people way more experienced in software delivery than me would have it. But my initial thoughts are: tailor it to your process and experiment. For production systems I wouldn’t implement such a big thing in one go. I’d probably split things into smaller steps. Especially right now, while I’m still understanding the framework and its limitations. But if I gain confidence and I see more and more that it delivers what I want it to deliver, I might increase the step size and see what happens then. It probably comes down to the reality inside your team. Another interesting perspective is that the batch size might be big in terms of lines of code and functionality. But if we take a button on our screen as an example and decompose it down to assembly, won’t it contain thousands of lines of assembly code as well?

The other interesting question is: will this allow me not to care about the code? Again, I think it’s a matter of experience. Play it safe at first and review things. See whether it actually produces code up to your standards. See when it has failed to do so. Or has delivered a bug or something that’s not how you want it to be. Improve the system until you reach a moment where you can trust it. Another point is that reading the code isn’t necessarily the best way to understand the system. What if you could generate views of the system that help you grasp it way better than that? Basically what Moldable Development advocates for.

Anyway, that’s all I have for now. I’m definitely seeing a lot of value in nWave and plan to continue exploring it in the next couple of weeks. More articles will come. Next week I plan to finally dive into the Deterministic Execution System and see what it’s made of. And what it can do.

First Steps into nWave

Tsvetan Tsvetanov — Fri, 27 Feb 2026 08:26:01 GMT

I first heard about nWave from a webinar hosted by its creators. What impressed me at first was the background of the people who talked about it. They were teaching eXtreme Programming practices. Call it an appeal to authority but that’s usually my first filter to decide whether something is worth diving deeper into or not. However, nWave also embodied and built upon practices that I’ve found valuable for software engineering. Practices that I’ve applied to AI augmented coding for the past 3-4 months and which led me to be productive without sacrificing quality. These practices are nothing new. They worked really well 20 years ago. Now they’re “just” amplified by AI.

Test-driven development. Working in small steps. Modular architecture.

However, nWave seems to be waaay more than that. In today’s article I’m going to share my first learnings and what I managed to do.

This is the 1st article in my series on nWave. Have a look at the other ones:

What is nWave, really?

At this point in time I understand it as a system for AI-augmentation built upon solid research in behavioural science and software engineering practices. It guides you from an idea to a high quality implementation in waves. The most common set of waves being:

Discuss /nw:discuss where you gather requirements
Design /nw:design where you determine the software design necessary to implement them
Distill /nw:distill where you create a proper acceptance test harness
Deliver /nw:deliver where you implement the feature using test-driven development

After each step you have a checkpoint. After the discussion happens, you can review the exact product requirements and make sure all is as you imagine it. Then the design will build upon this and generate an architecture. You review the architecture and make sure it’s what you need and it’s sound. Then the distill phase will build upon it by creating proper acceptance tests. You review these—if they’re good these are the actual specification that defines your feature in a way that’s both business-centric and automatically verifiable. Then you “deliver”. And what I loved about the last part (quoting their docs): “Run the tests yourself to confirm”.

I loved it because the folks behind nWave understand the futility of implementation reviews. Eyeing the implementation, especially now that AI generates it, won’t help you find issues—be it bugs or design flaws. Even more, you just won’t be able to handle the sheer amount of code you have to review. Implementation issues should be caught by automations: tests, linters, mutations. Not by manual labor.

Deterministic Execution System

This framework probably wouldn’t be as valuable if it weren’t for its Deterministic Execution System. Or DES for short. I’m still figuring out the details. It’s implemented in this set of source files. But as far as I understand it, this system ensures each wave does what we want it to do. It makes sure the TDD process is followed properly each and every time. It detects abandoned phases and suggests recovery. It analyses the agent’s logs and detects fabrications in them. It validates whether the roadmap artefact is structurally and semantically valid.

Basically, it’s the way to enforce determinism in a hallucination-prone system. And it’s something that I’ve been convinced for a while that each system we develop should have. However, my thoughts were kind of primitive. Let the team evolve their deterministic harness over time by adding more and more scripts and verifications that encode rules that must be followed.
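To make the "scripts and verifications" idea concrete, here is a minimal sketch of one such team-grown check. It is not nWave's DES and says nothing about how DES is implemented; the phase names, the PhaseEvent record and the log format are all assumptions made up for this example. The script checks that a recorded TDD cycle went RED, GREEN, REFACTOR in order and flags a cycle that was abandoned halfway.

```python
from dataclasses import dataclass

# Hypothetical phase names; a real harness would define its own.
EXPECTED_CYCLE = ("RED", "GREEN", "REFACTOR")


@dataclass(frozen=True)
class PhaseEvent:
    phase: str          # one of EXPECTED_CYCLE
    tests_passed: bool  # outcome of the test run recorded for this phase


def validate_tdd_log(events: list[PhaseEvent]) -> list[str]:
    """Return a list of rule violations; an empty list means the log is valid."""
    problems = []
    for index, event in enumerate(events):
        expected = EXPECTED_CYCLE[index % len(EXPECTED_CYCLE)]
        if event.phase != expected:
            problems.append(f"step {index}: expected {expected}, got {event.phase}")
            break  # later steps cannot be judged once the order is broken
        # RED must end with a failing test; GREEN and REFACTOR must end green.
        should_pass = event.phase != "RED"
        if event.tests_passed != should_pass:
            problems.append(
                f"step {index}: {event.phase} ended with tests_passed={event.tests_passed}"
            )
    # A log that stops in the middle of a cycle counts as an abandoned phase.
    if not problems and len(events) % len(EXPECTED_CYCLE) != 0:
        problems.append("cycle abandoned before REFACTOR")
    return problems
```

The point of a check like this is that it either passes or fails, no matter how convincing the agent's own summary of its work sounds.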

First Use Cases

I’ve been using nWave for the past couple of days. Michele and Alessandro were really kind to connect with me on LinkedIn so I can share my excitement with them. My first use cases were a bit rough. I was still learning how to ride the waves.

A first hurdle for me was to find a way to “save” my current Claude configuration. For that I used bridle. It basically lets you have different Claude Code configs and switch between them. Like a package manager.

The first thing I used nWave for was Terraform. I had a fairly big repo of Terraform configurations which were meant for a specific AWS account. Now, I wanted to deploy the same system to a different AWS account. But things weren’t parameterised well. First I tried with my regular Claude Code setup. It output a monstrous refactoring plan that got me scared. I had to migrate the system in a day. This plan seemed like something I’d debug and fix for a week. Running things through nWave’s loop generated a much saner result. Something that was done in minutes and was actually what I expected. It missed some details though. During the process of applying the Terraform I found out that some important configuration was still hard-coded and wasn’t provided as tfvars. But for now I assume that’s an error on my part because I got confused at one moment and stopped following the phases.

The next thing was a small feature. I had to provide an API endpoint for setting environment variables in Coolify programmatically from my application, which used it as a control plane. It designed quite a good API for my needs, one where people could update environment variables in bulk. I deployed it to production as soon as all the phases passed. Then did a sanity check by hand. Things worked the first time.

The other thing I used it for was a bug. There I just relied on the “Improve Existing System” job. I forgot the details but again it worked quite well. And in general I find their jobs-to-be-done guide really useful.

Now, I’ll be utilising it for a quite big feature which I want to deliver end-to-end. Not in small steps. It basically requires some rearchitecture of an existing system where:

It should call a 3rd party database
Merge data from its own API and a 3rd party API
Have a mix of long-polling and webhook events that update specific objects in the UI

I just went through the Design phase. After I review it, I’ll continue. And most probably share how things went in my next article. Frankly, I expect things to work on the first try. It’s pretty ambitious but we should expect nothing less from AI systems for software engineering. As long as we instruct them properly.

Conclusion

So far I’m still pretty excited for nWave. I see the value. What remains is to prove it via experiments. I’m also pretty keen on understanding the details. Because I suspect that the methods used for nWave are also applicable for developing AI systems at large. My next articles will most probably be dedicated to that.

AI Builder Brief #16

Tsvetan Tsvetanov — Wed, 25 Feb 2026 16:55:43 GMT

This is the last edition of the builder brief. Keeping up with all of the things happening in the AI-Native Engineering world is quite taxing. And few things are worth the deep dive I need to do to write a good summary. I’d like to focus my time on deeper things and just keep the Friday opinion pieces running.

Stop vibe coding blind. Start using Test Driven Navigation

How is it useful? Provides a less error-prone method of using AI to write code for us.

Commentary: I’ve known this method for quite a while and it’s at the foundation of how I use AI to develop software. However, I see with enough experience folks can start going beyond it. Mutation testing can add a lot to the mix as the agent can make the tests pass by implementing stuff that the tests didn’t ask for.

The Context Development Lifecycle: Optimizing Context for AI Coding Agents

How is it useful? A good read on the importance of managing context for agents.

Commentary: I really enjoyed the idea of using TDD for our context. Before we codify rules we should think of how to evaluate them and store those evaluations. This will make sure that when we move to a different model, the coding harness will continue behaving as expected. Personally, I’ll start thinking more about this and figuring out how to provide proper project context for the things I’m working on.

How I Came to Understand the 100x Claim

How is it useful? Showcases a person’s learning journey on how to use agents effectively for coding.

Commentary: I’m quite aligned with his approach. Looking forward to the day where I’ll finally get to implementing my own orchestration framework that fits my needs.

Small Teams Guiding Agents

Tsvetan Tsvetanov — Fri, 20 Feb 2026 08:11:33 GMT

There’s an image circulating the Internet recently. It shows how 2-pizza teams are going to be replaced with 2 engineers with 1 business person and AI.

This is a compelling vision. What it evokes in me is nimbleness, speed, quick exchange of ideas and results. I can’t deny that I like it. Imagine sitting in front of your computer in the morning entirely focused on implementing the next feature the business wants and shipping it to users in a matter of a few hours. No useless meetings, no distractions with Jira workflows, no waiting for someone to review your 1000-line pull request for the fifth time in a week. Maybe both devs and the business user pair the whole time. So you have the social interactions you were craving so much since COVID. And whenever you’re stuck you just prompt the AI better and it helps you move forward. You establish proper automated checks and rules for the AI. You provide a thorough subagent setup and written instructions. You’re blazing through your backlog. Before you know it, your team has made the next OpenClaw exit.

Half of this fantasy has nothing to do with AI.

High performing teams collaborate first. They’re not focused on whether a certain individual performs better than the rest. They’re focused on how the team can operate at 10x or 100x the pace of a team of 10x individuals. Humans are a cooperative species. Team performance can’t be reduced to the sum of individual performance. Having a few people who have organised around a process that facilitates collaboration will always result in better performance. Sadly, most engineers just live in teams that have no say in how they should work. Which leads to the false presumption that having fewer humans will just make things easier.

High performing teams are product minded. There are smart enough engineers who can learn and understand the user. However, our industry still assumes that an engineering team needs someone “business minded” to tell them what the user wants. Yes, there are the programmers who just want to fiddle with their technology. But you hardly want to work with such people anyway.

High performing teams take quality to heart. They don’t just sign off something to a QA who gives feedback and then push it over to a release manager. They build in quality and know that each quality metric should be automatically verifiable. They create proper behavior-oriented tests, use linters, security analysis tools and other means of automated verification. They don’t trust that the human-review process will catch the most important stuff. They try to automate this out as much as possible. With AI this skill becomes just more important. You can no longer review the code. You must enforce guardrails that come in effect as the code is being written. They deploy early and often to keep batches small and limit the blast radius.

Few vs. Several

Given the above, I think the few-people argument is appealing because it’s easier to keep these practices in a small group of people. You find it easy to trust each other. You can find the right people more easily. Even if you mess up, the cost won’t be that high. After all, a 3-person team with AI costs less than half of what an 8-person team with a PM costs. With the right plan you can split your product into modules properly owned by such small teams. Given that these teams don’t have to interact with each other, you can remove a ton of the friction of having to manage multiple streams of work at once. Or you can spin up separate product offerings and see what sticks.

Having a few people might also lead to issues, however. You start lacking redundancy. These few people are expected to work at a really fast pace. Which can lead to pressure. Which can lead to burnout. And now you’re left with a 0-person team. Or some of the folks on the team might get sick or go on vacation. What do you do now? How do you transfer the knowledge after the member comes back? What if both engineers get sick at the same time? Frankly, these seem like easily solvable problems. The chance of both engineers being out at the same time might be pretty minor. Unless both of them have kids who are just getting into kindergarten. Or one of them has to take maternity leave during flu season.

Few people also exchange far fewer ideas than several. And in knowledge work cross-pollination of ideas is crucial. Hardly any business knows what feature or product would be the next killer. A 3-person team with AI might be able to spit out features and products at a really fast pace. Thus increasing your plane of opportunity. However, there’s a thing called a local maximum that can’t be overcome by just pushing more stuff out the door. At one point you need a “mutation” to come out and get you out of the local maximum. The more people you have on the team, the higher the chance for that. Of course, if you’re operating in a feature factory model where some omniscient entity knows that this exact set of features is a must to make things work, then this argument falls apart. You just need engineers who can churn this out for a long-enough period of time. Despite what most business owners think though, I’d argue that this is rare.

Where AI Fits Into This Picture

From all of the above, it might sound like I’m totally anti-AI. I’m not. I can’t deny its value and I embrace it as much as I can. I think it’s a game changer for our industry. And the industry is no longer the same. With all its flaws, hypes and delusions, AI is here to stay and we must learn how to use it properly.

But you can’t just trim 4-5 people out, replace them with AI and expect results. You have to be prepared for the consequences of having less human input. Lack of redundancy and local maximums are just some of the things you might face. Also, a 3-person team with the wrong practices and technical expertise won’t do the job. Even if you force them to use the best AI model or agentic framework out there.

AI fits into the picture when engineers with the right expertise can use it to actually multiply the good practices they already have instead of diminishing them. If they already know how to ideate, build in quality, understand customers, experiment and extract value, AI will just help them do more of this. Which will be good.

8-person teams with the right practices and AI might be 10x more productive than a 3-person team with the right practices and AI.

8-person teams with the right practices without AI might not be more productive than a 3-person team with the right practices and AI.

8-person teams with the right practices without AI might be more productive than a 3-person team with bad practices and AI.

Take your context into account.

AI Builder Brief #15

Tsvetan Tsvetanov — Wed, 18 Feb 2026 08:00:16 GMT

Leading an Agentic Development Team

How is it useful? A level-headed and effective approach to working with agents for software development.

Commentary: I really love how Bryan has moved from skeptic to a supporter. And I find similar practices work great for me as well. It’s a really valuable read and every person serious about AI-Native Engineering should read it.

Manifesto for Vibe Coding

How is it useful? An interesting set of values for the AI-Native Engineering era.

Commentary: The only thing I’m failing to see is quality of the technical solutions. From my experience, coding agents are still far from being able to create high quality by themselves.

Build Your Own Team of Agents

How is it useful? A showcase of another agentic orchestration setup.

Commentary: Valuable insights. The main things I disagree with are the granularity of tasks, which leads to micromanagement, and the quality gates, which seem to be set up through Claude Code hooks instead of deterministic mechanisms like husky. Otherwise, I agree with the project constitution document, using teams of agents and encoding procedures.

If There Is Nothing Inside — Why Build a Cage?

How is it useful? An interesting argument against disregarding consciousness in AI.

Commentary: I find the article pretty interesting. The rhetoric is a bit too much for me. The logic seems reasonable. Anyway, I find no practical use of debating whether AI has consciousness or not. If it can do the job that’s enough for me. If it reaches human-level capacity and awareness that’s fine. Whether we know it or not.

Test Design Reviewer

How is it useful? A Claude Code agent that makes sure you’re writing high quality tests.

Commentary: I think this is a really valuable agent. Plan to incorporate it into my workflow soon. Tests are the main deterministic tool we have to make sure AI does what we want it to do. The higher quality they are, the better.

nWave

How is it useful? A structured approach to generating high quality software with the help of AI.

Commentary: Last week I finally had the time to watch the webinar for nWave. Now I’m glad they have released the framework. Can’t wait to try it and I expect really good results.

The Bottleneck That Wasn’t

How is it useful? Understand that optimising outside of the bottleneck won’t help overall delivery speed.

Commentary: A useful article by Dragan that should make us think about how the speed with which we can ship code thanks to AI affects other parts of our software engineering organizations.

Guardians of the Agents

How is it useful? A way to formally verify tool calling workflows generated by LLMs.

Commentary: It’s more about the context of security. However, I already see how it can be useful for grounding agentic systems in general. Where you verify workflows before they can be run.

Steve Yegge on AI Agents and the Future of Software Engineering

How is it useful? Outlines some interesting and a bit worrisome predictions about the industry.

Commentary: I’m a bit worried whether people will get laid off, drained and burnt out. Personally, I think I’ll get through this, but temporarily there might be a lot of people that haven’t managed to adapt. However, I believe the Jevons paradox will kick in at some point and there will be even more jobs in the market. I’m excited about the possibilities that are opened and what we’ll do with so much new technology.

Effective Use of AI in Legacy Systems

Tsvetan Tsvetanov — Fri, 13 Feb 2026 07:48:35 GMT

By now people have seen that AI and legacy systems do not make a good pair. Huge codebases with bloated code and a lack of tests are as hard for AI to find its way around as for humans. If you don’t have a proper test harness, you won’t be able to ground AI in the actual requirements of the system. It’ll just make change after change by guessing what should happen based on a prompt. If the system is bloated and hard to reason about, AI won’t know where to change what. Even if it does, there’s a high chance that it’ll miss a critical dependency that’ll make it go haywire. Lastly, a huge codebase hardly fits into context. And when on top of this it’s hard to navigate, AI will pull as much of it as possible into context when trying to solve a specific problem.

Not that all legacy systems are like that. But a lot might have some of these problems. Which makes the case that AI hates legacy code.

However, not all is bleak. With the right strategy and tactics a team can make AI quite useful when dealing with legacy systems.

Characterisation tests and refactoring

One of the main tenets of working with software systems is “if it works, don’t fix it.” This is reasonable, but limiting. It signifies dysfunctions and probably lack of skill. It might also show how constrained a team is in their actions because of things they can’t or don’t know how to control.

Controversially, with AI I think this should no longer hold.

A system can be put into a safe enough harness that prevents issues when we change it. And actually allows us to make the system better. In the literature, such a harness is made of characterisation tests: tests that describe the current behavior of a piece of the system. These tests should be of high enough quality to prevent regressions. Of course, the details matter here. Some system parts might be really difficult to put into a test harness. Some parts might be easier. As an engineer you have to think hard about how to make every piece of the system automatically tested. This is the only way you’ll be able to safely change it.

Another ideal set of properties of these tests is that they should be behavior-oriented and ignorant of implementation. The first thing means that they should test stuff from the user’s point of view. Be it a user of the system or a user of a specific module or library. The second thing means that these tests shouldn’t depend on specific internal modules. In short, they shouldn’t rely on mocking internal components. Only things at the edges of a system.
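A minimal sketch of what such tests can look like. The legacy_shipping_cost function is a made-up stand-in for a piece of legacy code, not something from a real system. The tests only go through the public entry point and pin down whatever the code returns today, quirks included.

```python
# Hypothetical legacy function standing in for code we do not fully understand.
def legacy_shipping_cost(weight_kg, country):
    cost = 5.0 if country == "BG" else 12.0
    if weight_kg > 10:
        cost += (weight_kg - 10) * 0.5
    if weight_kg <= 0:
        cost = 0.0  # odd, but this is what the code does today
    return round(cost, 2)


# Characterisation tests: the expected values are recorded from the current
# behaviour, not derived from what we think the behaviour should be.
def test_domestic_parcel_under_threshold():
    assert legacy_shipping_cost(2, "BG") == 5.0


def test_international_parcel_over_threshold():
    assert legacy_shipping_cost(14, "DE") == 14.0


def test_zero_weight_is_free_today():
    # Possibly a bug, but we pin it so a refactoring cannot change it silently.
    assert legacy_shipping_cost(0, "DE") == 0.0
```

None of the tests know how the cost is computed internally, so the function body can be rewritten freely as long as the observable results stay the same.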

Once you have these tests in place and are pretty confident in their quality, you can begin refactoring. Or adding new functionality without worrying whether this will break the system.

In the whole of this process AI helps us with understanding and implementation. We’ll talk about understanding in the next section. After we have it, implementation is just about telling AI to write all of these characterisation tests, cross check their quality and think of more edge cases to add. Combined with mutation testing, you can be sure that these tests will catch a significant part of the edge cases in the system. Afterwards, you can just instruct AI to refactor the parts that are of the lowest quality. Or start adding features that are driven by tests.
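To show why mutation testing adds confidence, here is a hand-rolled sketch of the idea. Real mutation testing tools generate the mutants automatically; the function, the mutant and both suites below are invented for illustration. A mutant that "survives" means the tests did not notice a change in behaviour.

```python
def price_with_discount(total):
    """Original behaviour: orders of 100 or more get 10% off (integer amounts)."""
    return total - total // 10 if total >= 100 else total


def mutant_price_with_discount(total):
    """Hand-written mutant: '>=' replaced with '>'."""
    return total - total // 10 if total > 100 else total


def weak_suite(fn):
    # Never exercises the boundary value of exactly 100.
    return fn(50) == 50 and fn(200) == 180


def strong_suite(fn):
    # Adds the boundary case, which is where the mutant differs.
    return weak_suite(fn) and fn(100) == 90


def mutant_survives(suite):
    """A mutant survives when the suite still passes against the mutated code."""
    return suite(mutant_price_with_discount)
```

Here mutant_survives(weak_suite) is True and mutant_survives(strong_suite) is False: both suites pass on the original code, but only the second one would catch a regression at the boundary.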

Micro-tools that aid understanding

When faced with hundreds of thousands of lines of code understanding by reading becomes impractical. That’s the main premise of the folks behind Moldable Development. In that situation you should start relying on context-specific micro-tools that generate different representations of your system. These representations are guided by questions. Some questions are:

Why is request X so slow?
Where in my architecture do we directly call the DB instead of using a repository?
Which parts of the code change the most?
Which parts of the system are the most critical?
Where do we have the most bugs?
How should I fit feature X into the system?

Usually we rely on generic tools or no tools at all to answer such questions. However, what if we had the power and skill to create tools that fit into each specific question? Or follow-up questions that arise? Being able to create such tools on demand will help us deterministically address such questions way faster than we might’ve if we’re just relying on reading the system.
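As an illustration, the question "Which parts of the code change the most?" can be answered by a throwaway tool of a few lines. This is a sketch under the assumption that the system lives in a git repository; the function names are made up for the example.

```python
import subprocess
from collections import Counter


def count_changes(git_log_output: str) -> Counter:
    """Count how often each path appears in `git log --name-only` output."""
    paths = (line.strip() for line in git_log_output.splitlines())
    return Counter(path for path in paths if path)


def most_changed_files(repo_dir: str, limit: int = 10) -> list[tuple[str, int]]:
    """Run git in repo_dir and return the most frequently changed files."""
    output = subprocess.run(
        # An empty pretty format leaves only the file names of each commit.
        ["git", "log", "--name-only", "--pretty=format:"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return count_changes(output).most_common(limit)
```

Calling most_changed_files(".") inside a repository returns pairs of path and change count, which is already enough to see where the churn concentrates.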

Before AI this was hardly possible. Systems like the Glamorous Toolkit addressed this issue by providing an IDE tailored towards the creation of context-specific micro-tools. However, the fact that it relied on a not well-known language, Pharo, and had a steep learning curve made it hard to introduce into teams and organizations.

However, now we don’t need such a system for most use cases. Knowing it will greatly help and I think it’s far superior, but few people have the weeks or months necessary to invest in learning how to use this system properly. Now we have AI and AI can be prompted to generate throwaway tools that answer our questions. We would’ve done this in weeks or months. AI generates such a tool in minutes. This approach can greatly speed up our understanding and help us rely on objective analysis of our systems instead of subjective hand-drawn diagrams.

As an example, this week I used this approach to generate 4 different tools in less than an hour that helped me understand the relationships between 3 entirely different systems that made up a software solution. It was a NextJS frontend which relied on Supabase and N8N. One of the things I wanted to know was “Which components call Supabase directly?” I just instructed Claude Code to create a tool which deterministically found all components that make such invocations and created a Mermaid diagram out of it. Another thing I wanted to know was “Which Supabase webhooks trigger N8N workflows?” I provided the necessary API keys and Claude Code generated a tool that retrieved all webhooks and workflows, mapped them and generated another Mermaid diagram.
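A rough sketch of what the first kind of tool might look like. This is not the tool Claude Code generated for me. It assumes that direct calls follow the supabase.from("table") form used by the Supabase JavaScript client and that components live in .tsx files; both are assumptions, and the function names are invented for the example.

```python
import re
from pathlib import Path

# Assumption: direct calls look like `supabase.from("table")`.
SUPABASE_CALL = re.compile(r"""supabase\s*\.\s*from\(\s*["']([\w.]+)["']""")


def tables_used(source: str) -> set[str]:
    """Return the table names a component queries directly."""
    return set(SUPABASE_CALL.findall(source))


def scan_components(root: Path) -> dict[str, set[str]]:
    """Map each .tsx file under root to the tables it calls directly."""
    usage = {}
    for path in sorted(root.rglob("*.tsx")):
        tables = tables_used(path.read_text(encoding="utf-8"))
        if tables:
            usage[path.stem] = tables  # note: files with the same name would collide
    return usage


def to_mermaid(usage: dict[str, set[str]]) -> str:
    """Render component -> table edges as a Mermaid flowchart."""
    lines = ["graph LR"]
    for component, tables in sorted(usage.items()):
        for table in sorted(tables):
            lines.append(f"    {component} --> {table}[({table})]")
    return "\n".join(lines)
```

Running print(to_mermaid(scan_components(Path("src")))) prints a diagram definition that can be pasted into any Mermaid renderer. Because the result comes from a plain text search, it is the same every time it runs.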

This approach is mostly about changing your mindset from being focused on reading or AI analysis to being focused on generating tools that deterministically answer questions.

The modernisation roadmap

Lastly, with the power of AI now you can more boldly go towards modernising your legacy system. You have characterisation tests which enable safe refactoring and feature delivery. You have micro-tools that enable faster and more reliable understanding. What remains is to start building your modernisation roadmap. This is an alignment between the technical needs and the business ones. It’s not a huge multi-month, freeze everything effort. It’s a way to gradually improve your system while still delivering value for your users. Before it was hard to do that because you didn’t have the safety net and the wider understanding. However, you no longer have an excuse to ignore it. Set your system on the right track. Make yourself and your users happy.

AI Builder Brief #14

Tsvetan Tsvetanov — Tue, 10 Feb 2026 12:48:30 GMT

The third golden age of software engineering – thanks to AI, with Grady Booch

How is it useful? Understand why demand for software engineering is actually going to increase.

Commentary: From a historical perspective, it was really interesting to hear Grady’s thoughts on previous golden ages of software engineering. It felt really motivational and inspiring to be part of such an age. It also presents a sober view on why engineers will be even more in demand. Even though how we work might change as we go.

The creator of Clawd: “I ship code I don’t read”

How is it useful? It showcases the thought process of a person behind a really successful product in the AI era.

Commentary: First of all, this calmed me down a bit. I was wondering—how so many folks are managing to ship so many products and stuff in the era of AI? And from this podcast I found out that the creator of Clawd was basically a millionaire who’s fully focused on this. I also liked that he doesn’t see value in the Ralph Wiggum loop idea. Instead, he values small steps and focus on controlling the agents. Also, it’s really good that he focuses on automated verification to make things work.

The AI-Coding Revolution: Spec-Driven Development

How is it useful? Showcases the nWave framework—a methodology for developing software with AI based on Behavioral Engineering.

Commentary: The basic principles are nothing new to me. I liked the idea of giving your agent a personality. What I’d really like to see is the prompts. The way this is organized. I guess they’ll release it soon.

addt - AI Don’t Do That

How is it useful? Run all kinds of agents in secure docker containers.

Commentary: Pretty useful. I managed to configure and run it for me. I’m quite happy with the results. It definitely has some rough edges but I’m really surprised how easy it is to extend with new agents. I managed to adapt it for running Claude Code inside a corporate environment.

Agent-Native Engineering

How is it useful? Has good points on how to transform your organisation to use AI more effectively.

Commentary: Really support the move away from PR reviews as a requirement. We should automate that. Also, the focus on using verification even though they seem to focus a bit too much on rulesets. I also liked the idea of splitting your tasks based on capabilities required to implement it. And this way making it easy to delegate stuff to agents.

Eight trends defining how software gets built in 2026

How is it useful? Provides a prediction for what will be valuable in 2026 for engineering organizations that want to leverage agentic coding.

Commentary: Largely agree on this. Multi-agent orchestration, more focus on automated verification, scaling agents beyond engineering. These seem obvious now. Wondering what 2027 would be like?

The Anthropic Hive Mind

How is it useful? It showcases how a new form of increasingly innovative and successful organisations form.

Commentary: Really impressed by the hive mind style of working. It looks quite “agile” if I may. Probably that was the initial intention of the people behind the Agile Manifesto. An organization so fluid and adaptable. It might seem like chaos at first but is actually in its golden age. What’s important though is putting the proper guardrails in place to prevent huge losses from happening.

Everyone’s Wrong About AI Programming — Except Maybe Anthropic

How is it useful? A mathematical overview on why current agentic systems drift towards errors.

Commentary: For a non-mathematician like me this explanation seems quite simple and important in consequences. Wondering how it can be resolved and whether it should be resolved on the level of training models. Or can it be resolved through a layer above the LLM—e.g. modifying the way agents interact.

What do you mean by agent orchestration?

Tsvetan Tsvetanov — Fri, 06 Feb 2026 07:30:21 GMT

Agent orchestration is all the rage this year. However, I’m noticing that people mean different things by it. And in order to approach this challenge meaningfully, we have to clarify what it is.

What I’m seeing out there are several types of orchestration approaches:

Agent orchestration within a single task. Built in;
Agent orchestration within a single task. DIY;
Agent orchestration for multiple tasks.

I’m going to explore each one and share first impressions on limits, tooling and capabilities.

Agent orchestration within a single task. Built in.

This is basically what the current “Agent” modes of coding tools provide. Claude Code’s calling of subagents. Cursor’s subagents. And so on. These are provided by the vendors because sometimes tools aren’t enough to complete a single task. You can delegate more complex workflows to subagents. And a common pattern is when Claude Code spawns an analyser, architect and coder agent. The analyser does an initial analysis of the problem at hand to understand what actually needs to be done or refine the user’s query. The architect understands the current design of the codebase and plans a meaningful end state that’s going to support the necessary functionality. The coder implements all of this.

These subagents are “orchestrated” by the main agent. I’ve no idea what the patterns in Claude Code itself are. But in general the main agent decides what subagents it needs, when to use them, and whether it should run them in parallel or serially. Another cool property of subagents is that they work with their own context window which leads to less token consumption.
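The serial versus parallel decision can be sketched in a few lines. This is not how Claude Code or Cursor implement it; run_subagent is a placeholder for whatever actually calls a model, and the roles simply mirror the analyser, architect and coder pattern described above.

```python
import asyncio


async def run_subagent(role: str, task: str) -> str:
    """Stand-in for a real subagent call; each call would own a fresh context."""
    await asyncio.sleep(0)  # placeholder for model latency
    return f"{role} finished: {task}"


async def orchestrate(task: str) -> list[str]:
    # Serial: each step depends on the result of the previous one.
    analysis = await run_subagent("analyser", task)
    design = await run_subagent("architect", analysis)
    # Parallel: independent pieces of implementation work run side by side.
    code = await asyncio.gather(
        run_subagent("coder", f"backend for {design}"),
        run_subagent("coder", f"frontend for {design}"),
    )
    return [analysis, design, *code]
```

Calling asyncio.run(orchestrate("add login")) returns four results: the analysis, the design and the two coder outputs. Only the return values travel back to the main agent, which is the part that mirrors subagents keeping their own context.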

All in all, this seems to have increased the capabilities of AI-assisted coding tools. And even though the orchestration pipeline is built-in you have a lot of power in defining your own subagents and using them.

DIY agent orchestration within a task.

This is basically the approach where you define your agent orchestration harness and the agents themselves independently of the coding tool. I’m not sure this can be done within Claude Code or Cursor. But I imagine using a system like OpenHands and creating your agent workflow from scratch.

Given the early stages we’re in this is a viable approach. We still don’t know what can work better and by the end of 2026 I think we’ll find out immensely better approaches to AI-assisted software engineering than what we have today. So we as engineers have to experiment and invent new tools and approaches that will make the use of AI more productive and safe. It’s definitely not easy to balance productive work with work required to create your own tooling, but with the aid of AI I see this becoming more normal than it was in the past. I see how moldable development will rise in popularity even though some folks might not be aware of it.

So, DIY agent orchestration is useful when you do research or you have a fairly obscure use case where the mass market tools like Claude Code can’t cut it.

Working on multiple things at once

Now comes the part that I think most engineers imagine when they hear agent orchestration. Basically, the part where we’re lured into the fantasy that we can work on 10 things at once thanks to our AI slaves. This is a workflow that practically says: I’ll spin up a separate Claude Code instance for each feature/bug/investigation I’d like to do. And I’ll spin up 10s of them so I can complete stuff in parallel. I’ll also spin up a separate instance that manages merge conflicts. And my role will be that of the master orchestrator.

This was my vision as well. And I see folks starting to do it with Gas Town. However, I had a really interesting conversation with someone who challenged this assumption quite well. Even if we automate code review, a human still has to verify whether the assumptions they made before the feature was built actually make sense from a human standpoint. Do some exploratory testing, in a way. So there’s hardly any value in working on 10 features at once if each feature is completed within minutes. You just won’t have time to review all of it.

Despite all of this, I think there are nuances to this approach and we’re still too early to conclude we can’t do it. As I said, we also see tools like Gas Town, Claude Flow and other attempts at orchestration. People seem to see some value in it. And it’s also important to clarify what those multiple things are. Because you can have cases where a human doesn’t have to come in at the end of the process:

With all of the AI proliferation we might end up writing a lot of software that’s actually meant for non-human consumers. In this case, the AI is in the best position to review and own this software end-to-end. With reasonable automated guardrails put by a human so we don’t end up with slop;
A lot of software work (maybe the bigger part) is keeping quality high. Especially with agents, we have to refactor constantly and improve the codebase to counteract the chaos that LLMs create bit by bit. This work is also enforceable by automated guardrails and might not need human supervision;
Then there might be features that can be automatically validated with enough confidence through acceptance tests. Human insight and feedback might still be needed but maybe on the scale of multiple features or a whole module. Which will take time to complete to a meaningful state. Or maybe human validation logic can be encoded well enough within specs (if acceptance tests aren’t applicable) so the AI can properly self-verify it.

Personally, despite all of these arguments I’m still not sure how much sense orchestrating 10s of agents would make for an engineer. Maybe some engineers would find it beneficial and useful. Maybe most won’t. I definitely think we’ll have the technology to do it. Maybe to even orchestrate 100s of agents at a time. And maybe we’ll orchestrate 2-3 features at a time but on each feature there would be 10s of agents working. Maybe by the end of 2026 that’s what we’re going to mean by agent orchestration:

Work on multiple things at a time with multiple agents collaborating on the same task.

It’s still too early to tell. But I’m beyond excited to see where this will go!

AI Builder Brief #13

Tsvetan Tsvetanov — Mon, 02 Feb 2026 08:18:11 GMT

Unpacking the ‘unpossible’ AI coding logic of Ralph Wiggum

How is it useful? Helps you understand the Ralph Wiggum loop better.

Commentary: Largely on point:

Context compaction can limit accuracy in long-running tasks
Token consumption might skyrocket

Claude Code - Swarm Mode

How is it useful? You can test the hidden swarm mode feature of Claude Code.

Commentary: I’m wondering if this will make orchestration platforms obsolete. Probably not, as people still have no idea what proper orchestration means. It’s still cheaper to build your own solution than to settle on an existing one. Also, Claude Sneakpeak is really interesting as it lets you unlock hidden Claude Code features.

DClaude - Containerized Claude Code

How is it useful? Lets you run separate Claude Code instances in independent containers.

Commentary: I’ll try it out for my orchestration approaches. It seems quite valuable. The interesting part might be: what if you want the separate containerized instances to communicate with each other? Will this be a use case at all? Anyway, I’ve had issues with non-containerized Claude Code instances and I’ve been trying to resolve them only with git worktrees, which might pose issues. So, I’ll give it a try.

RuVector is not a text predictor

How is it useful? An overview of RuVector’s capabilities by its author.

Commentary: There are some really bold statements there. Like it maintains an internally coherent model of the world as it changes over time. Which seems like a way to prevent or greatly reduce hallucinations. This is done through the interplay of vectors, graphs and a coherence engine. I’ve been exploring how this all works for some time and still learning. But seems like a powerful concept for the agentic engineer. Even if only because of the fact that you can now have a free vector DB running alongside your application.

Open Coding Agents

How is it useful? A framework for easily training open LLMs on your codebase.

Commentary: Seems like a really useful framework for companies that are sensitive to data leakage or have tight financial constraints. With this framework you can take an open source model and train it on your data. It automates most of the training. They claim the costs to be at around $400. I don’t have the use case for that but might try it out at some point.

Software Survival 3.0

How is it useful? Shares views on what software might survive the decreasing costs of building vs. buying.

Commentary: It’s kind of depressing, even though the author gives hope that we’ll actually write way more software in the future, and even though I believe in the Jevons paradox. It’s a fundamental shift and I’m not sure how many people will adapt. And how many errors will emerge through the use of AI by untrained people who don’t understand how to put proper deterministic guardrails in place. Apart from that, I agree with the direction.

The End of Human Code Review

How is it useful? Shares insights on the future of code reviews.

Commentary: Well, for me, code review was always a waste of time. Unless you have a high trust team environment that learns how to automate the guardrails that code reviews are meant to impose, you’ll suffer from a significant slowdown in review and approval speed. And my approach to coding with agents is the same. I hardly review the code. I focus on finding the proper guardrails that will let the agents deterministically produce high quality code with a minimum amount of bugs. These guardrails aren’t agents themselves. They’re code. Like automated tests and linters but also other specialized tools.

The 80% Problem in Agentic Coding

How is it useful? Addy Osmani analyzes the issues with the 20% of coding we still have to figure out for ourselves.

Commentary: It shares how our mistakes have changed: they might be more conceptual and harder to find. Because we don’t understand the software we produce. Even if we review it, it’s not like we’ve written the code. It also highlights how, despite increased throughput, some teams are bogged down in code reviews. Finally, it ends with some patterns on how to escape this trap. For me, the most important pattern is automated verification of everything. Also, learning to declare rather than micromanage is quite important.

How AI assistance impacts the formation of coding skills

How is it useful? Showcases that using AI to grasp new coding skills might lead to less human ability in exchange for small time savings.

Commentary: I’m not surprised as you can’t reliably learn a new tool if it doesn’t get into your hands. The study also shows how individuals that used AI only on a conceptual level scored higher on forming the coding skills than those who delegated the implementation in full. However, we’re looking into acquiring skills that are hardly relevant anymore. You can understand the tool in a learning session with Claude. And then just rely on Claude Code to actually use it and save you a ton of errors. You get the best of both worlds this way.

Automatic Programming

How is it useful? Distinguishes vibe coding from AI-Assisted development.

Commentary: Largely agree. Even though I don’t see a reason for yet another term. However, I find value in the author’s observations that if you’re not vibe coding you’ve essentially produced the code and it’s yours and you can be proud of it.

Dolt is the Database for AI

How is it useful? A version-controlled DB.

Commentary: This can be quite useful when you let AI take free rein over your production DB as well. If you rely on DoltDB you can easily revert your DB to previous states if something breaks. Thus it makes reversibility easier. At least in concept. I have yet to try it and dive deeper to see how useful it is.

AI multiplies your tech debt. But not in the way you thought

Tsvetan Tsvetanov — Fri, 30 Jan 2026 07:44:26 GMT

I’m seeing several kinds of software professionals around me.

The first kind are totally embracing it. Letting it write heaps of bad code and deploying that to production. No tests, no refactoring, just feature after feature. Amplifying their bad practices with AI.

The second kind are really wary of it. They think it has no place in software development. They’ve tried a few prompts here and there, saw that it didn’t generate one-shot results and gave up on it. They’re proponents of software development as a craft and think that unless the code is hand-written it won’t stand up to scrutiny.

The third kind are the ones that are embracing it but heaping tons and tons of reviews on the code. They scrutinize every line, try to understand it thoroughly and take utmost caution.

The fourth kind are the orchestrators. This is a new breed of developers who have unlimited access to tokens. They learn how to spawn a multitude of agents to work on different tasks. Either on the same project or different ones. They create their own orchestration harnesses as there’s nothing proven yet. It’s uncharted territory.

And lastly, I’d say, there are the engineers. They might have parts of the other kinds as well. But they guide AI through tests, develop context-specific tools to understand their systems better, focus on proper architecture and refactoring. They embrace AI through practices that have been proven to yield high quality results in a fast manner over the past decades.

I can’t say which kind of software professional is right or wrong. But I’m seeing that a crucial part of the approaches embracing AI is understanding the technical debt cadence.

Using AI makes you fast. And you’ve probably already found out that with this speed it amplifies your practices. If they’re good, you’re just doing better stuff at a higher pace. If they’re bad, you’re accumulating tech debt faster. But I don’t think this is the issue we should focus on. By basically saying that bad software organizations will become worse through AI we’re limiting our views to a static world without feedback. However, the world is fairly dynamic and there are feedback loops all around us. And humans change their actions through those feedback loops.

Which brings me to the point that the fact that AI amplifies bad practices might actually be for the good of the industry. Before, it might’ve taken years for you to see how a system degrades to a big ball of mud. Now, especially if you’re orchestrating agents, you can see that in a matter of weeks. Even days maybe. It’s an unprecedented learning opportunity where one can understand how engineering decisions shape the future of a system. And then rewrite the system with better engineering decisions at hand.

Imagine the following scenarios. Or better—try them with your own toy projects.

Implement a system using AI without any tests. Just plow your way through feature after feature. See when it starts to break too much. Maybe simpler systems can be modified indefinitely. Maybe frontend-heavy systems are harder to maintain in that way. Or maybe certain LLM models and agentic frameworks fare better without tests for certain kinds of systems than others.

Implement a system test-first. Either through behavior-driven tests or tests tightly coupled to your implementation. See how the system evolves then. Maybe AI is “smart” enough to maintain a bad test harness and thus prevent regressions. Or maybe if you don’t care about the tests at all, AI makes them pass through mocks. And the system under test hasn’t been tested at all.

Focus really hard on caring about proper architecture. With tests or without them. What happens then? Can the system evolve in the long term even if there’s no regression harness? Which kinds of architecture are best for this system? Maybe you can spawn multiple Claude Code instances each betting on a specific design direction and see which goes the farthest. You can try setting this up for a fairly complex system and see when things start to break. You couldn’t have done this before as coding just a single architectural direction by hand would’ve required weeks or months of work.
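The test-first scenario above mentions tests that pass through mocks while the system under test isn’t tested at all. Here’s a small sketch of what that can look like. `apply_discount` is a hypothetical function with a deliberate bug, and both “tests” are plain functions returning a boolean to keep the example self-contained.

```python
from unittest.mock import Mock


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test, with a deliberate bug: it adds instead of subtracting."""
    return price + price * percent / 100


def vacuous_test() -> bool:
    # The real function is replaced by a mock, so the check only
    # compares the mock's canned return value with itself.
    fake_apply_discount = Mock(return_value=90.0)
    return fake_apply_discount(100.0, 10.0) == 90.0


def behavioral_test() -> bool:
    # Calls the real implementation, so the bug is noticed.
    return apply_discount(100.0, 10.0) == 90.0
```

The vacuous test stays green no matter what `apply_discount` does, while the behavioral one fails on the buggy implementation. A suite full of the first kind looks healthy and protects nothing.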

Now, what if you apply these practices on the level of agent orchestration? Can you find a way to work on 5-10 features at once across several projects without system quality degrading? How? Maybe for each 1 hour of work you have to dedicate 4 hours of refactoring. If so, does that lead to better outcomes overall? When do such systems degrade? What happens to your stress levels having to manage so many contexts at once? Or maybe technical debt won’t matter as each system that eventually crumbles can be converted to a spec from which a new, high quality system can be replicated and thus you can continue from a point of quality.

By now you should probably get my point. If you don’t know how to write high quality code or the quality diminishes because traditional practices can’t keep pace with the throughput of LLMs, you can still benefit immensely from AI. As long as you keep learning, you have much faster ways of gaining feedback. Ways that weren’t available to us before. So at some point, sooner rather than later, you’ll learn which practices actually amplify throughput without risk.

Maybe the only question that remains is: Who is going to suffer from this fast-paced feedback loop? Because people will write safety-critical systems that will fail. And customers and businesses will be immensely affected because people have applied practices they don’t yet understand to problems that are of high importance. So, if you want to ethically embrace AI, try it first on toy projects and use the fast feedback loop to scale these toy projects to the calibre of real-life production systems. See where they break and why. If you do this enough, you’ll understand how to apply AI properly to production-grade systems as well.

AI Builder Brief #12

Tsvetan Tsvetanov — Mon, 26 Jan 2026 08:40:01 GMT

Just Talk To It - the no-bs Way of Agentic Engineering

How is it useful? Mostly outlines the benefits of Codex over Claude Code. And adds some other tips and tricks for agentic engineering.

Commentary: I might be sold on trying Codex and seeing how it compares to Claude Code. I’m mostly an Anthropic fan but there were some compelling arguments for Codex. I found it strange that the author doesn’t rely on TDD that much and meanwhile shared some issues he had that are easily solvable by test-first approaches. Another thing that’s on my radar to try is Wispr Flow.

Everything Claude Code

How is it useful? A collection of Claude Code skills, plugins, etc.

Commentary: Well, you should just play with it and see if it’s useful.

Trust, Accountability & AI Coding Swarms

How is it useful? Provides a framework for thinking about the trade-offs between gradients of supervision of AI-generated software.

Commentary: I resonate with it. The key message is that you basically trade off speed of delivery vs. risk of failure. If you’re writing systems that have a limited blast radius or impact in case of an incident, it’s okay to mostly vibe code them (as long as they’re architected for change). However, if the systems you’re implementing will lead to a lot of reputational damage or money loss, then you have to be more careful in utilizing AI in developing them.

Ralph Wiggum, Abundance and Software Engineering

How is it useful? An essay on how you should change your mindset about typing code by hand in the age of AI for coding.

Commentary: Mostly agree with this article. So I don’t have much to add.

Scaling AI Without Scaling Entropy: The CTO’s Real Challenge

How is it useful? It outlines a framework for tech leaders on how to scale AI engineering efforts in their teams.

Commentary: The focus is mostly on sharing people’s decision-making process with AI through context. In other words, getting decision logic out of people’s heads and into AI. In this way we can be more sure AI follows our intention and scale our systems without unnecessary chaos. It seems reasonable, but I’m not sure it’s the right time for that. Like, you can start encoding decision making in context files. But this doesn’t guarantee AI will follow the decisions there. Also, you’ll have to make a considerable effort in encoding a common decision-making framework across the org. This might be doable on a team level, but it would also have to be done at every level above.

Fast Feedback, Fast Features: My AI Development Experiment

How is it useful? A field report on why quality is important when working with AI.

Commentary: I was really happy when I read that this guy spends 4 hours on making things high quality for every 1 hour of code changes. It might sound ludicrous but this is something that enables fast changes in the long run. Otherwise, your system gets messy and harder to change. You might be spending 4 hours adding features and 1 hour refactoring for weeks and slowly get into a place where you have to spend days just to untangle the mess and add 1 low-value bug fix. I also liked the observation that now we don’t have excuses not to keep quality high. “Time” is not of the essence since AI can do things fast for us.

AgentiCorp

How is it useful? AgentiCorp is a lightweight AI coding agent orchestration system that manages workflows, handles agent lifecycle, and provides real-time event streaming for monitoring and coordination.

Commentary: Haven’t tried it yet (there are so many tools out there…) But I think we should explore more and more such platforms as 2026 will be the year of agent orchestration.

Experiment with AI, don't just hate it

Tsvetan Tsvetanov — Fri, 23 Jan 2026 07:54:48 GMT

The current state of AI for software engineering is a state of war. Companies, teams and individuals are figuring out which practices, technologies and configurations yield results. And war happens on multiple levels.

On the level of foundational models, OpenAI, Anthropic, DeepSeek, Google, open source models and all the other players are trying to find the limits of LLMs or maybe even experiment with neuro-symbolic approaches. They find out that there’s hardly any moat in this foundational technology so they’re exploring different ways to stay in business. We’ve no idea who is going to survive or not.

On top of that we’re starting to have agentic systems that seem to multiply the level of capability LLMs have. Claude Code and Cursor started to show really good results in the software engineering landscape. People are now experimenting with agent orchestration, where the most notable tools seem to be Gas Town and Claude Flow. Fields outside of coding seem to find it harder to utilize agentic systems but I believe 2026 will start showing the first promising results. In the coming years we might have some big breakthroughs in the application of neuro-symbolic AI which will also lead to solving LLMs’ biggest problem: hallucination.

It’s too early to have any idea who is going to win and what is actually achievable with AI at this point. Especially in the long run. But one thing is clear: it brings benefits. Especially for a software engineer.

What should an individual do?

Assume whoever says they’ve found the perfect way to use AI for individual productivity is lying. This might be unconscious. Or it might work pretty well for them. But still, it might not be the best or most useful way. It’s just still too early to come to such conclusions. We won’t know for sure what works well and what doesn’t until we see the first big failure because of AI. And this failure will come sooner or later. As Simon Wardley says, it’ll be because someone trusted AI to make so many changes to a system that they no longer understand it. So, until a few years pass we can’t be sure of the winning practices in the field.

You, as an individual, should verify and experiment with what comes. At first, this was just using LLMs to help you code and copy-pasting it in your editor. Later on came Claude Code and other agentic coding tools. Now, we’re seeing agent orchestration platforms come into the mix. This is on the technological side.

We also have practices. Experiment writing augmented code with TDD, with test-after, with no tests. With formal verification. In an obscure language. Within a pair. Within a mob. Without any supervision. By reviewing each line of code. By writing tests by hand and letting AI write the code. With writing code by hand and letting AI write your tests. With letting a Ralph Wiggum loop implement a feature for you overnight. With letting AI write your infrastructure in Terraform. With working on 2 things at once with different Claude Codes. Or 5 things at once with Gas Town. Do spec driven development.

Experiment with whatever crazy idea you can think of. And see where it leads you. This is the only way to see what really works and what doesn’t. And separate marketing lies from reality. Sure, it’ll cost you money and time but I’m afraid otherwise you might be too late to adapt.

What should teams and organisations do?

First of all, don’t constrain your engineers to a specific best practice of agentic coding. There’s no such practice right now. And what you’ve chosen today may well be obsolete and counter-productive tomorrow. You don’t want that to hinder your progress. Instead, give access to as many tools as possible and as much money as possible to your engineers and let them figure it out. They’re in the trenches and they’re facing the everyday problem of evolving your software systems so that the business can profit from them. And don’t conclude too early that an experiment is a failure or not.

Second, try to think above the individual level. Currently we’re focused on how to make an engineer more productive. How to teach them to use Claude Code. Or how to orchestrate multiple agents. However, if we really want productivity we should start looking into team-wide practices. If now a single engineer can churn out 10x the features, how can we ensure these can be processed by the team at the same pace? Can the team 10x its output as well? And will this 10x output lead to 10x outcomes? What practices, technologies and techniques will enable you to utilise this productivity?

And if the teams become much more productive, will the organization be able to handle this? Will your customers be able to? If not, what bottlenecks have you got in place that prevent this productivity from spreading? And finally, does AI really give you 10x the productivity? Or 2x? Or maybe even 1x, as you’re fast to churn stuff out but then waste a ton more time maintaining that slop.

These are all questions that should be taken into account inside an engineering organization that wants to benefit from AI. Otherwise, you’ll get mired in vanity metrics and illusions of high performance.

Conclusion

It’s too early to commit to a certain AI practice or technology. Experiment as an individual. Experiment as a team. Experiment as an organization. This is the only way to find out what works. And to come out as a winner or at least not a complete loser.

AI Builder Brief #11

Tsvetan Tsvetanov — Mon, 19 Jan 2026 08:08:47 GMT

LLM predictions for 2026, shared with Oxide and Friends

How is it useful? An overview of what to be prepared for in 2026.

Commentary: I can’t say anything about the Kākāpō parrots piece. For the other parts:

I’m not sure if LLMs are going to write code we can trust without reviewing or proper formal verification. If they write good code, it means that by the end of 2026 we’ll be able to vibe code everything. I still think we’re far away from that;
On the Jevons paradox, I see no way around it. So for me it’s a solved matter;
I agree with all the rest. Especially the “Challenger disaster” part about coding agent security. I might even frame it as “another Knight Capital” because of vibe coding

Introducing: React Best Practices

How is it useful? Prevents most issues when you let an agent write React code for you.

Commentary: I think this is in the right direction. Curious to see how consuming such a big library will affect context windows.

Code as Commodity

How is it useful? It outlines a possible future of roles in software engineering.

Commentary: It’s quite broad and seems to focus on vibe coding, or at least prematurely removes care for the implementation and puts our trust in the output of the agent. Without considering guardrails for that trust. While his general idea aligns with my vision of how we’re going to move as a profession, things still feel quite imaginary.

Why We Built Our Own Background Agent

How is it useful? An interesting background agent coming with sandboxing and verification.

Commentary: I like that they started with verifiability: running tests, looking at telemetry, inspecting visual output. I wonder how many of the verification steps are set up by a human and how many the agent is trusted to create. They seem to also have good sandboxing support.

Untethered: Vibe Coding Anywhere

How is it useful? Another agent orchestration platform.

Commentary: These platforms seem to be springing up everywhere. I think people are realising the power of orchestrating agents. And since LLMs like Claude have become fairly decent at writing code it’s becoming easier and easier to produce software with less supervision. The only thing that I’m concerned about is what happens with production-grade systems. For now, we see this applied to hobby projects. But will the quality be good enough for releasing software to millions of users?

Agent-native Architectures

How is it useful? Outlines good practices for building agentic systems.

Commentary: Quite detailed and extensive. I think whoever builds such systems should take a deeper look at this article and learn from it.

Open Work

How is it useful? The open-source Claude Work alternative.

Commentary: A GUI for OpenCode. However, I tried to install it on my Mac and it just exited with code “1”. Didn’t have time to debug it properly.

The AI-Coding Revolution: Spec-Driven Development

How is it useful? Have a look at the nWave framework for multi-agent orchestration.

Commentary: This is still on my watch list but I’m curious to see their approach. Mostly because they were recommended to me by a person I respect. I’m a bit wary of the spec-driven development part. It seems that focusing on providing written specifications upfront is quite waterfallish. The part with “using cost-efficient shadow models to peer-review code and catch hallucinations before they compound” seems like a plausible way to tackle a big issue I’ve seen with my specification-based approaches.

AI makes learning any tech easy

Tsvetan Tsvetanov — Fri, 16 Jan 2026 05:28:19 GMT

A while ago I had to learn Terraform. I have fair knowledge of infrastructure and what should be used to solve which problem in AWS. However, I had never used Terraform before. Whenever I worked on a software project, it was either simple or things were already set up.

Now I was in a position where I had to understand a fairly complicated platform which let users spin up their own tech stack in AWS. It had a ton of specific services and concepts: EC2s, security groups, VPCs, S3s, init scripts, a bunch of apps, load balancers, custom proxies, you name it. And a fairly big Terraform project to set everything up. So I had to understand this and be able to build upon it. Fast.

What does an AI-Native engineer do in that case? They open their favourite AI. And prompt it:

You are my Terraform teacher. Do a crash course for me so I can understand this and apply it. I'm looking into AWS infra setups.

Make it complex and fast paced. I have a lot of experience and can deal with it.

Split the teaching into chunks. Give me exercises after each chunk, so I can make sure I understand it. With the first chunk, share a playground I can use to see how changes in Terraform affect the deployment.

This generated the mother of all crash courses. Brief, to the point, and too crashy for my taste. It instructed me how to connect with my AWS account. Then split everything into chunks as instructed. After each chunk there was a set of exercises. It only proceeded to the next chunk when I told it to. However, being a cheap Bulgarian engineer, I decided that I wanted a proper environment to play in without an AWS account. So, before I continued with the next chunk I instructed it to let me set up localstack:

Okay, how can I link this to a localstack instance that simulates aws?

Then it proceeded to provide a complicated setup with a bunch of AWS services. Instead of the simplest thing that I can use to test whether my setup works. So, I had some back and forth with it until we made the barest setup work.

Key insight: When learning, start with the minimum necessary demo environment.

Unfortunately, the initial prompt had set the course to be too fast paced for me. So I had to start over. I opened a new Claude session and typed:

You are my Terraform teacher. Do a crash course for me so I can understand this and apply it. I'm looking into AWS infra setups.

Split the teaching into chunks. Give me exercises after each chunk, so I can make sure I understand it. 

I already have a simple localstack setup and this is my first TF file:

...

This generated a way better course which contained smaller chunks of information, delivered piece by piece. Each chunk had a few paragraphs of explanations + examples that I could read through and understand. Then, it was followed by a set of simple exercises which I could try on my machine to make the knowledge run through my fingers. Of course, as we dived deeper into Terraform, the sections grew larger and the exercises became more complicated. But over 6 such modules I was able to go through all the Terraform theory and apply it, all within a few hours.

Key insight: You can prompt the AI to generate a course as fast paced as you want it to.

What does this mean for learning tech?

You no longer have to wade through tutorials or books for popular technologies. If it’s in the AI’s training data, then you can prompt your way through a personalised training. It’s way better than reading a book or an article as this training provides exactly the exercises you need and the knowledge in a format you can digest easily. And if it doesn’t provide that for some reason, you can prompt your way to it.

Why did you have to learn it, though?

Why bother understanding Terraform at all? Or Rust? Or AWS? Or Vitest? Just delegate it to AI.

I still feel uneasy about not understanding what happens under the hood of AI-generated software. Basically, AI generates all of my Terraform right now. But I review it. I don’t know of any automated tests for infrastructure. So I had to wade through the details and make sure I know what happens under the hood. It’s still faster than having to type it in myself and debug obscure configuration issues. I’m focused on the overall form and function, the AI fills in the details. If I hadn’t learned Terraform, I wouldn’t know what good looks like and what I should strive for.

Also, I think infrastructure work is a bit more critical than writing the software itself. After all, a software change can first be tested locally and verified to work properly, since you have automated verifications and all that. However, with infra, you have to actually apply the changes to see what happens. And once that reaches production, you can cause an outage easily if you don’t know what you’re doing.

Conclusion

I’ve shown you how you can learn any non-obscure tech you like with AI. I believe this makes learning easier by tailoring the material to your needs. You can make the course faster, longer, more complex or simpler, with just a few prompts. You can add exercises, reduce them, or even introduce real-world projects to do at the end of a section. For me, this greatly helped me understand a piece of technology that I hadn’t used before. And now I’m applying it successfully to production-grade systems. What’s not to like about that?

AI Builder Brief #10

Tsvetan Tsvetanov — Mon, 12 Jan 2026 08:00:00 GMT

AI and the Next Economy

How is it useful? It makes the case that AI currently centralizes power instead of distributing prosperity, thus making society poorer.

Commentary: It seems logical that as long as resources are concentrated in the AI giants, society will become poorer and poorer, which in turn will starve the giants of resources. What if we could end this process earlier? How?

Paul Hammond’s Mutation Testing Claude Skill

How is it useful? Mutation testing is key to validating that the AI has written an extensive test suite and hasn’t implemented something you don’t want. However, it’s expensive. This skill attempts to make it cheaper.

Commentary: A while ago I experimented with this approach on a codebase because I found it hard to set up Stryker. However, back then it didn’t run fast enough. What Paul proposes seems quite good because it specifies the mutation possibilities. I definitely plan to try it out.
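To make the idea concrete, here is a minimal sketch of what “killing a mutant” means. It is not Paul’s skill and not Stryker: the shipping rule, the hand-written mutant and every name in it are made up for illustration, and a test suite is modelled as a plain function that returns true when all of its checks pass.

```typescript
// Hypothetical function under test: free shipping for orders of 50 or more.
type Impl = (total: number) => boolean;

const original: Impl = (total) => total >= 50;

// A hand-written "mutant": the boundary operator >= is changed to >.
const mutant: Impl = (total) => total > 50;

// A suite returns true when all of its checks pass for the given implementation.
type Suite = (impl: Impl) => boolean;

// Weak suite: never probes the boundary value.
const weakSuite: Suite = (impl) => impl(100) === true && impl(10) === false;

// Stronger suite: same checks plus the boundary case.
const strongSuite: Suite = (impl) => weakSuite(impl) && impl(50) === true;

// A mutant is "killed" when the suite passes on the original but fails on the mutant.
function killsMutant(suite: Suite, orig: Impl, mut: Impl): boolean {
  return suite(orig) && !suite(mut);
}

console.log(killsMutant(weakSuite, original, mutant));   // false: the mutant survives
console.log(killsMutant(strongSuite, original, mutant)); // true: the mutant is killed
```

A surviving mutant points at a missing test (here, the boundary value 50). The cost comes from re-running the suite once per mutant, which is why narrowing the set of mutations matters.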

Six New Tips for Better Coding With Agents

How is it useful? A treasure trove of advice on more effective coding with agents.

Commentary: Most of the tips are quite valuable and detailed. I’ve used them with success and they align with my experience. Maybe the scariest is the first one: “Software is now throwaway”. Something in me is telling me that this is a really risky and wasteful assumption. However, I’m still quite open-minded, as we’re in the war stage of AI adoption. We have to experiment until things settle and the winning approaches start to appear. But we shouldn’t forget to make this experimentation safe.

From Prompts to AGENTS.md: What Survives Across Thousands of Runs

How is it useful? Valuable ideas for utilizing AGENTS.md better.

Commentary: A really interesting set of ideas for AGENTS.md. I’m thinking about trying them out sometime.

What AI Engineering Looks Like at Meta, Coinbase, ServiceTitan and ThoughtWorks

How is it useful? Get an insider look at how some companies do AI Engineering.

Commentary: I was a bit disappointed by the over-reliance on spec-driven development. People talked more about technology than about meaningful practices that are technology-agnostic. Personally, I’ve seen that principles matter more than choosing a specific technology. But I guess that’s what the market wants.

Ralph Wiggum - AI Loop Technique

How is it useful? Let Claude Code run until it finishes a task.

Commentary: This tool exists because Claude Code can’t handle long-running tasks well. However, I’m really skeptical about the need for such functionality. Running Claude Code overnight with specs for a really complicated application seems to delay the feedback loop unacceptably. What if at the end you find out that this wasn’t what you wanted? You would’ve wasted hundreds of dollars on tokens and would have to start over.
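If I were to run such a loop, I’d want guardrails on it. The sketch below is not the Ralph Wiggum tool itself, just a rough illustration of a “run until done” loop with an iteration cap and a spending cap. The step function, the cost numbers and all names are hypothetical, and the loop is kept synchronous for simplicity, whereas a real agent call would be asynchronous.

```typescript
// Result of one (hypothetical) agent run.
interface StepResult {
  done: boolean;    // did the agent report the task as finished?
  costUsd: number;  // what this run cost
}

interface LoopOutcome {
  finished: boolean;
  iterations: number;
  spentUsd: number;
}

// Keep calling the agent until it is done, or until a cap is reached.
function runUntilDone(
  step: (iteration: number) => StepResult,
  maxIterations: number,
  budgetUsd: number,
): LoopOutcome {
  let iterations = 0;
  let spentUsd = 0;
  while (iterations < maxIterations && spentUsd < budgetUsd) {
    iterations += 1;
    const result = step(iterations);
    spentUsd += result.costUsd;
    if (result.done) {
      return { finished: true, iterations, spentUsd };
    }
  }
  // A cap was reached before the agent finished.
  return { finished: false, iterations, spentUsd };
}

// Fake agent for illustration: costs 2 per run and finishes on the 4th run.
const fakeAgent = (iteration: number): StepResult => ({
  done: iteration >= 4,
  costUsd: 2,
});

console.log(runUntilDone(fakeAgent, 10, 100)); // finishes after 4 runs
console.log(runUntilDone(fakeAgent, 10, 5));   // stops early: budget exhausted
```

The caps limit how much you can lose, but they don’t shorten the feedback loop: you still only find out at the end whether the result is what you wanted.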

My AI-Native Setup for TypeScript

Tsvetan Tsvetanov — Fri, 09 Jan 2026 07:26:35 GMT

This is a short piece on how I set up a project for proper AI-Native development.

First of all, everything here comes from, and evolves with, the following principles:

Proper AI-Native Engineering needs deterministic verification loops;
Agents are the future and we should aim toward orchestration of agents.

Deterministic Verification Loops

What do I mean by “deterministic verification loops”? Basically, automated test cases, linters, type checkers, mutation tests, etc. Everything where I can use code to prescribe what quality looks like and let the agent always respect it when it implements the thing I want.

Vitest + Playwright/Cypress for automated testing. Playwright/Cypress are the foundation for my acceptance tests. Vitest is the foundation for lower-level tests. I mostly follow an inner-outer-loop TDD approach where acceptance tests are written first, followed by implementation through lower-level tests;
ESLint for sane linting defaults;
TypeScript strict mode;
Stryker Mutator for mutation testing. This ensures my tests have caught all edge cases and there’s no unwanted functionality implemented;
Husky for pre-commit hooks that make sure all checks pass, so we don’t commit bad code;
Knip.dev to prune any unused dependencies, files and exports.

This enables my flow of:

Let the AI write tests based on my specification;
Verify the tests;
Let the AI implement the necessary functionality until all tests pass;
Let the AI run mutation tests to expand on the test completeness;
Commit, which triggers a hook to make sure all checks pass;
Let AI fix anything that has remained;
Push and deploy.

All of this runs in conjunction with Paul Hammond’s Claude Code configuration.
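The “all checks pass” gate in the flow above can be sketched roughly like this. It is not my actual hook configuration: the check names and their results are hypothetical stand-ins, where a real setup would shell out to the type checker, linter and test runners.

```typescript
// Each check is a named function that reports success or failure.
// Hypothetical stand-ins keep the sketch self-contained.
interface Check {
  name: string;
  run: () => boolean;
}

interface GateResult {
  passed: boolean;
  ran: string[];      // names of the checks that were executed, in order
  failedAt?: string;  // name of the first failing check, if any
}

// Run the checks in order and stop at the first failure,
// so the agent gets one clear thing to fix at a time.
function runGate(checks: Check[]): GateResult {
  const ran: string[] = [];
  for (const check of checks) {
    ran.push(check.name);
    if (!check.run()) {
      return { passed: false, ran, failedAt: check.name };
    }
  }
  return { passed: true, ran };
}

const exampleChecks: Check[] = [
  { name: "typecheck", run: () => true },
  { name: "lint", run: () => true },
  { name: "unit-tests", run: () => false }, // simulate a failing test
  { name: "mutation-tests", run: () => true },
];

const gateResult = runGate(exampleChecks);
console.log(gateResult.passed, gateResult.failedAt, gateResult.ran);
```

Ordering the cheap checks first is a design choice: type and lint errors surface in seconds, while mutation tests only run once everything else is green.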

How This Enables Orchestration of Agents

This flow greatly reduces my cognitive load. Such checks let me mostly care about proper tests and high-level structure, which leaves some bandwidth for working on multiple projects at once by following the same approach in another project. I’m still pretty new to this, so despite the reduction in cognitive load I find it hard to juggle more than 3 agents at once. However, I believe there’s some skill to it which will come with practice and enough trial and error.