title: I Built a Compiler with AI Engineering Over a Weekend. These are 3 Core Strategies for Scalable AI Development
published: true
description: I Built a Compiler with AI Engineering Over a Weekend. These are 3 Core Strategies for Scalable AI Development
tags: rust,python
cover_image: https://dev-to-uploads.s3.amazonaws.com/uploads/articles/baf7czsdzf9hih9ah909.png


published_at: 2026-02-24 18:24 +0000

You know that feeling when you've been doing something for years, and then someone comes along and says "nah, throw all that away"? That is exactly how I felt reading Cursor's blog post about self-driving codebases.

Don't get me wrong, I do believe this is impressive. 3M+ lines of code. Approximately 1,000 commits per hour. Thousands of agents working together to build a web browser. But something about it bugged me.

It ignores everything we have learned about software engineering.

Wait, what is wrong with 1,000 commits per hour?

Throughput is not progress. Ten meaningful commits that target the goals we want to achieve would probably be more helpful.

The Cursor approach optimizes for raw output. More agents, more commits, more lines of code. But after years of building software the "right way," here is what I know matters:

Agile development, meaning time-bounded sprints with scoped work, not infinite agent swarms.
Meaningful changes over large volume. 10 thoughtful PRs might beat 1,000 commits.
Strong feedback loops like tests, CI, and code review, rather than just hoping the agents figure it out.
Architecture decisions and interface contracts backed by documented reasoning, not emergent chaos.

So when I set out to build Sifr, a compiled programming language that uses Python syntax and compiles to Rust, I decided to do it with AI engineering. But I wanted disciplined agents. The kind that follow a process. The kind that write PRDs, create tickets, review each other's code, and do not merge without passing tests.

And let me tell you, it works. Really well. 😄

The project: a whole programming language

Before we get into the workflow, let me give you a taste of what we are building. Sifr is a compiled language with:

Python syntax plus static typing
Compilation to Rust for native binaries
A borrow-by-default ownership model
TypeScript-style union types, type narrowing, and protocols
Over 45 standard library modules with zero-panic guarantees
21 planned phases, with 11 completed and over 80 milestones

This is not a toy. It is a full compiler pipeline (lexer, parser, AST, binder, type checker, HIR, Rust codegen, rustc, and finally binary) with a roadmap stretching from language foundations all the way to a web framework, package manager, and ecosystem.

And it was built almost entirely using AI engineering following the workflow I am about to describe.

Sponsorship note: This project was initially sponsored by CDON, a leading marketplace in the Nordics (Sweden, Norway, Denmark, Finland).

The basic workflow: implementing a feature

Let's start small. Here is how a single Task moves through the board, from an idea to merged code.

Basic Workflow

The Task moves across board columns as it progresses: Backlog -> Ready -> In Progress -> Review -> Done. Each step maps to a real action that an AI agent can execute.

Draft the Task. The agent writes a Task with the current situation, desired situation, and a suggested solution. This is scoped to a small number of changes.
Add to the board. The agent creates a GitHub issue and adds it to the project board. The Task lands in the Backlog.
Refine & prioritize. The agent assesses effort versus value and moves the highest-priority Tasks to Ready.
Work on the Task. The agent picks up the highest-priority Ready Task, creates a branch, implements the changes, runs tests locally, and creates a PR. This PR uses a template that requires an issue link, bullet-point changes, and deployment considerations. The Task moves to Review.
Review the PR. A different agent, preferably a different model, reviews the PR for logic bugs, unnecessary complexity, test coverage, style, and architecture alignment.
Adjust. The implementing agent addresses review comments.
Merge. The PR merges, and the Task moves to Done. Ship it.
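
To make the board flow concrete, here is a minimal sketch of the columns as a small state machine. The class and function names, the 1-5 value and effort scores, and the rule that a Task in Review can bounce back to In Progress during the Adjust step are all assumptions for illustration, not code from the Sifr repo.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Column(Enum):
    BACKLOG = "Backlog"
    READY = "Ready"
    IN_PROGRESS = "In Progress"
    REVIEW = "Review"
    DONE = "Done"


# Forward moves follow the board order. Review may go back to In Progress
# when the reviewer requests changes (assumed, see the Adjust step).
ALLOWED = {
    Column.BACKLOG: {Column.READY},
    Column.READY: {Column.IN_PROGRESS},
    Column.IN_PROGRESS: {Column.REVIEW},
    Column.REVIEW: {Column.IN_PROGRESS, Column.DONE},
    Column.DONE: set(),
}


@dataclass
class Task:
    title: str
    value: int   # hypothetical score, 1 (low) to 5 (high)
    effort: int  # hypothetical score, 1 (small) to 5 (large)
    column: Column = Column.BACKLOG

    def move(self, target: Column) -> None:
        """Move the Task to another column, rejecting skipped steps."""
        if target not in ALLOWED[self.column]:
            raise ValueError(
                f"illegal move: {self.column.value} -> {target.value}"
            )
        self.column = target


def next_ready(tasks: List[Task]) -> Optional[Task]:
    """Pick the Ready Task with the best value-to-effort ratio."""
    ready = [t for t in tasks if t.column is Column.READY]
    return max(ready, key=lambda t: t.value / t.effort, default=None)
```

Encoding the allowed moves as data means an agent command such as "move Task to Done" fails loudly when the Task has not been through Review, instead of silently skipping the step.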

What does a Task actually look like?

Let me show you a real example from Sifr. Here is Task #100: Expand Built-in Functions:

Current Situation: max(a, b) and min(a, b) with two arguments are not supported (only the list form max([1, 2]) works).
Desired Situation: All common Python built-in function signatures should work.
Suggested Solution: Update the compiler's lowering phase to handle 2-argument max/min.
Acceptance Criteria: max(1, 2) returns 2

That is it. Small, focused, concrete. The agent implemented this in one PR.
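
Since Sifr uses Python syntax, the acceptance criterion translates directly into a tiny check. The sketch below runs under plain CPython, where both call forms already work. How Sifr's own test runner is invoked is not shown here, and the min assertions are my extension of the single criterion listed in the Task.

```python
def check_max_min_signatures() -> None:
    # Two-argument form: the behavior Task #100 adds.
    assert max(1, 2) == 2
    assert min(1, 2) == 1  # assumed counterpart, not listed in the Task

    # List form: supported before the Task and must keep working.
    assert max([1, 2]) == 2
    assert min([1, 2]) == 1


check_max_min_signatures()
```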

🚨 Gotcha: You do not want to become the bottleneck. Make sure that shipping does not require you to be in the middle. That includes manual testing, manual clicking, manual deployment, all of it. If you are the human doing QA on every PR, you have defeated the purpose.

But what about bigger features?

A single Task is great for "add a len() method to strings." But what about "implement a borrow checker"? That is where Epics come in.

Epic Workflow

Here is the critical insight. Every Epic starts with a PRDS, a combined PRD and Solution Design document. The agent does not just jump in and start coding. It uses a specific tool to write a single structured document covering both sides:

Product requirements: problem statement, goals, scope, constraints, acceptance criteria (with Given/When/Then).
Solution design: architecture, data model, API design, error handling, testing strategy, trade-offs.
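
As a rough illustration, the two halves of a PRDS can be modeled as one structured record so that an empty section is caught before human review. This is a sketch under assumptions: the field names are a subset I picked from the lists above, and the tool the agents actually use is not shown in this post.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AcceptanceCriterion:
    given: str
    when: str
    then: str

    def render(self) -> str:
        return f"Given {self.given}, when {self.when}, then {self.then}."


@dataclass
class PRDS:
    title: str
    # Product requirements side
    problem_statement: str = ""
    goals: List[str] = field(default_factory=list)
    acceptance_criteria: List[AcceptanceCriterion] = field(default_factory=list)
    # Solution design side
    architecture: str = ""
    testing_strategy: str = ""

    def missing_sections(self) -> List[str]:
        """Names of sections that are still empty, in declaration order."""
        return [name for name, value in vars(self).items() if not value]
```

A document with a non-empty `missing_sections()` result is not ready for the human checkpoint yet.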

The PRDS gets added to the board as an Epic, refined, and then comes the part that matters most. Step 4 is a human reviewing the PRDS. This is the human-in-the-loop checkpoint. You are not reviewing 50 PRs. You are reviewing one document that shapes all of them.

Once approved, the Epic gets broken down into smaller Tasks, and those Tasks follow the basic workflow above.

And here is a step the blog posts never talk about: the Epic demo. Before marking an Epic as Done, you create a working demo that showcases all major features delivered. In Sifr, these live in a ./demos folder, each named after the Epic. If the demo does not work, the Epic is not done. Simple as that.
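
A gate like this is easy to automate. The sketch below only checks that a demo exists. It assumes each Epic gets its own directory under demos, and the helper name and slug format are hypothetical. Actually running the demo depends on the Sifr toolchain, so that part is left out.

```python
from pathlib import Path


def epic_demo_exists(epic_slug: str, demos_root: Path = Path("demos")) -> bool:
    """True when demos/<epic_slug> is a directory containing at least one file."""
    demo_dir = demos_root / epic_slug
    return demo_dir.is_dir() and any(p.is_file() for p in demo_dir.rglob("*"))
```

Existence is a weaker check than "the demo works", so treat this as the first half of the gate only.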

What does an Epic look like?

Here is Epic: Add collections.Counter. This epic was about adding the first class-based API to the standard library.

Objective: Users need to count hashable objects easily. Implement collections.Counter.
Scope:
- Define class Counter in lib/sifr/collections.sifr.
- Implement methods: __init__, most_common, total, update, keys, values.
- Add the necessary Rust implementation to support these methods.
Solution Design:
- Data Structure: Wrap a Rust HashMap but expose it as a Python class.
- API: Match Python's Counter API exactly.
- Testing: Verify counting works, most_common returns sorted results, and empty counters behave correctly.
Acceptance Criteria: from sifr.collections import Counter works, and Counter("hello").most_common(1) returns [('l', 2)].

The agents broke this down into tasks: implement the intrinsics, implement the Sifr class, add tests, and create a demo.
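
Because the API is meant to match Python's Counter exactly, CPython's own collections.Counter works as the reference behavior for the tests. This sketch runs on plain Python, not on Sifr. The total is computed with sum() because Counter.total() only exists from Python 3.10 onward.

```python
from collections import Counter

counts = Counter("hello")

# Acceptance criterion from the Epic.
assert counts.most_common(1) == [("l", 2)]

# most_common() without an argument returns every element,
# ordered by descending count.
assert [n for _, n in counts.most_common()] == [2, 1, 1, 1]

# update() adds to the existing counts in place.
counts.update("hell")
assert counts["l"] == 4

# Empty counters behave sensibly: no entries, zero for unknown keys.
empty = Counter()
assert empty.most_common(1) == []
assert empty["missing"] == 0
assert sum(empty.values()) == 0
```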

🚨 Gotcha: Without reviewing the PRDS, you cannot guarantee the results. The agent might build the completely wrong thing, beautifully. I have seen it happen.

In Sifr, every Epic has a PRDS document. The borrow-by-default Phase? It started with a PRDS that defined parameter conventions, escape analysis rules, and codegen patterns, before a single line of code was written.

Scaling the workflow: many Epics, many Phases

Okay, so you can ship a feature. You can ship a big feature. But what about building an entire programming language with 21 Phases?

This is where things get interesting.

Phase Workflow

Step 1: Plan multiple Phases top-down

The key word here is top-down. You plan the high-level Phases first, then drill into Epics within each Phase. And critically, avoid parallel Epics.

In Sifr's roadmap, each Phase has a clear ordering rationale. For example:

Type System Power comes before Standard Library, because the stdlib needs generics and closures for proper type signatures.
Error Safety comes before Stdlib Safety Remediation, because you cannot make intrinsics return Result types if the compiler does not enforce error class hierarchies yet.
Borrow-by-Default comes before Stdlib Deepening, so new stdlib functions are written with the final ownership model from day one.

Every ordering decision is documented. Not in someone's head, but in the codebase, in the roadmap, with explicit rationale for why Phase N depends on Phase N-1.
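
The three ordering rules above form a small dependency graph, and a topological sort turns that graph into a valid sequential order. This is a minimal sketch using Python's standard graphlib module (Python 3.9+). The dictionary only holds the three example dependencies from this post, not the full Sifr roadmap.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Maps each Phase to the set of Phases it depends on.
DEPENDS_ON: Dict[str, Set[str]] = {
    "Standard Library": {"Type System Power"},
    "Stdlib Safety Remediation": {"Error Safety"},
    "Stdlib Deepening": {"Borrow-by-Default"},
}


def execution_order(depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return one valid sequential order of Phases.

    graphlib raises CycleError if the plan contains a circular dependency.
    """
    return list(TopologicalSorter(depends_on).static_order())


order = execution_order(DEPENDS_ON)

# Every prerequisite appears before the Phase that needs it.
for phase, prerequisites in DEPENDS_ON.items():
    for prerequisite in prerequisites:
        assert order.index(prerequisite) < order.index(phase)
```

A circular dependency surfaces as an exception at planning time, which is far cheaper than discovering it halfway through a Phase.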

Step 2: Execute Phase by Phase

Each Epic within a Phase follows the Epic workflow: PRDS (PRD plus solution design) -> human review -> Task breakdown -> execute. The agents pick up Tasks, implement them, create PRs, and get them reviewed.

🚨 Gotcha: Don't execute too many Phases at once. I tried. The agents start creating workarounds for dependencies that haven't been implemented yet, and you end up with spaghetti. Sequential execution with clear Phase boundaries is the way.

Step 3: Review with a different model

This is one of my favorite tricks. After a Phase of execution, I use a different agent session (and often a different model) to review the work. The reviewer has fresh context, with no sunk cost bias or "I already wrote this so it must be right" mentality.

The reviewer runs in a feedback loop: review -> fix -> review -> fix -> review. Three iterations is the sweet spot.
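
The loop can be sketched as a bounded function: three reviews with at most two fixes in between, stopping early when the reviewer has nothing left to flag. The review and fix callables stand in for agent sessions, and their signatures are assumptions for illustration.

```python
from typing import Callable, List, Tuple

MAX_REVIEWS = 3  # review -> fix -> review -> fix -> review


def review_loop(
    review: Callable[[str], List[str]],
    fix: Callable[[str, List[str]], str],
    work: str,
    max_reviews: int = MAX_REVIEWS,
) -> Tuple[str, List[str]]:
    """Alternate review and fix, always ending on a review.

    Returns the final work and the findings still open after the last review.
    """
    findings: List[str] = []
    for round_number in range(1, max_reviews + 1):
        findings = review(work)
        if not findings or round_number == max_reviews:
            break
        work = fix(work, findings)
    return work, findings
```

Ending on a review rather than a fix matters: whatever comes out of the loop has been looked at by the reviewer, and any findings that remain are handed over explicitly.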

Step 4: Re-planning with the judge

After review cycles, a "judge" (the smartest model you have access to) evaluates whether the plan needs to be steered. Maybe the type system completion phase revealed that the codegen architecture should be restructured first. Maybe a new constraint emerged.

The judge decides whether to continue as planned or adjust. If adjustment is needed, the plan is updated and execution continues. Multiple reviewer agents can also weigh in during this phase.

🚨 Gotcha #1: Parallel work might not be the best idea. There could be unidentified dependencies between Epics. Agents will make workarounds and create sloppy solutions instead of waiting for the right foundation to be in place.

🚨 Gotcha #2: It is good to plan for the future, but don't get stuck with too many details about later Phases. The first few Phases will teach you things that change your assumptions about later Phases (you can also update the plan and insert new phases midway). Plan the current Phase in detail, and keep future Phases as rough outlines.

What does a Phase Plan look like?

This is a snippet of how we structure high-level planning. We track all phases, and for each one, we define exactly what capabilities it unlocks:

| # | Phase | Milestones | Status | What it unlocks |
|---|---|---|---|---|
| 1 | Language Foundations | 6 (built-ins → codegen_quality) | completed | Single-file programs with classes, error handling, safe indexing, imports |
| 2 | Type System | 6 (narrowing → ...) | completed | Generics, closures, generators, decorators, operator overloading |
| ... | ... | ... | ... | ... |
| 13 | Type System Completion | 6 (stdlib_generic_rewrite → generics_in_stdlib) | pending | Auto-init, user-facing generics, pattern matching, enums, bigint, generic stdlib |

Notice the "What it unlocks" column. We don't just list technical tasks; we list capabilities. Phase 1 unlocks single-file programs. Phase 2 unlocks generics. This helps the AI (and me) understand the purpose of the phase, not just the code.

The real results

Sifr has completed 11 Phases and over 80 Epics using this workflow. The compiler handles:

A full type system with generics, protocols, union types, and type narrowing.
Over 45 stdlib modules.
Borrow-by-default ownership semantics.
Error handling with compiler-enforced exhaustiveness checking.

All of this was built with AI engineering following the structured workflow described above. Not thousands of agents racing to commit, but a disciplined process where every feature starts with a plan, gets implemented incrementally, and gets reviewed before merge.

One result I am especially proud of: the first working version of the compiler for the core language was built over a single weekend, literally a Saturday and a Sunday.

You can find the repo here: Sifr.

Try it yourself

If you want to adopt this workflow, here is the TL;DR:

Small Tasks: Draft -> Board -> Refine -> Work -> Review -> Merge.
Epics: PRDS -> Board -> Refine -> Human Review -> Break Down into Tasks -> Execute -> Epic Demo -> Done.
Phases: Plan top-down -> Execute sequentially -> Review with different model -> Re-plan with judge.
Automate the boring stuff: Ticket creation, PR templates, review checklists, board management. Make them commands the agent can run.
Don't be the bottleneck: If shipping requires you in the loop for every PR, you have lost.

The agents are the hands and the architect is YOU.

What do you think? Have you tried applying AI engineering on a real project? I would love to hear about your workflow, so drop a comment or reach out!
