Three Pivots in AI-Driven Development

Solutions architecture leader working across cloud architecture, distributed systems, and AI. As much in the people and processes as in the technology. OSS author & maintainer. I write about architecture lessons from real-world systems at scale. Currently at AWS. All opinions are my own.

An AI coding agent does not bend a productivity curve on its own. A team does.

Coding agents, models, and IDE plugins ship new versions every few months, and most of what gets called "best practice" in this space has a shelf life shorter than a quarter. After watching AI-driven development play out across several large organizations, I am convinced that what actually lasts is duller and more durable than any tool: three pivots in how a team works. They are redesigning the workflow around the agent, pulling a plan-then-clarify-then-implement rhythm upstream into requirements and design, and agreeing on an output metric so the organization can tell whether any of it is paying off. None of the three is something you install.

Where I'm coming from. I work alongside engineering organizations as they adopt AI-driven development. AWS publishes the AI-Driven Development Life Cycle (AI-DLC) as one practice among several, and I treat it that way: a well-shaped option, not the answer. My vantage point is a practitioner's: I have helped roll it out, and I am not an author grading it. Two of the three pivots below I have watched and felt directly. The third, measurement, is the one I am least done thinking about. I share it the way I share it with customers in briefings: as Amazon's hard-won lesson and as the problem the teams I work with are wrestling with right now, rather than as something I have fully closed out myself.

TL;DR

  1. Pivot 1: Workflow redesign comes first. Tools do not move the line. The slope changes when a team rebuilds its workflow around the agent, not before. AI amplifies whatever pipeline it lands on.

  2. Pivot 2: Move the control rhythm upstream. Apply Plan → Clarify → Implement to requirements and design, not just implementation. AWS's AI-DLC is one of the clearest off-the-shelf implementations, and you can try it today.

  3. Pivot 3: Agree on an output metric and start measuring. Cost, release count, release cycle, anything that lets you justify the return. Pick one the organization agrees on, start measuring, and mechanize only what proves durable across teams.

Pivot 1: Workflow redesign comes first

The first weeks of an AI rollout are quieter than anyone expects. A team turns on an agent, the co-author rate on commits climbs, and the metrics that leadership actually watches, deployment frequency and lead time, move far less than the excitement around the tool would predict. I have seen this lull in a digital enterprise with thousands of developers, and the public record shows the same shape inside Amazon.

In Jeff Barr's JAWS Days 2026 talk on building a GenAI-driven developer organization, the team behind a Bedrock inference engine called Mantle described a cumulative-commit curve that stayed nearly flat at first, then changed slope. The headline numbers are striking once the slope turns: two commits a week became forty, output rose ten to twenty times, and the system launched in 76 days with a small team instead of adding thirty-plus engineers. What I want to draw attention to is the flat part at the start, not the steep part later. The agent was present the whole time. The curve did not care.

I first heard the cleanest framing for why from Amazon's agentic AI leadership: AI is an amplifier. It rewards a team with a healthy pipeline and it punishes a team with a broken one, because coding was rarely the bottleneck in the first place. The bottleneck is the wait for review, the deploy approval, the test environment you cannot get, the security consult booked for next week. Point an amplifier at those and you hit the same constraints, only faster.

So the slope responds to the team, not the install date. It turns once the team stops asking "how do we apply AI to the way we already work" and starts asking "how would we work if the agent were the default author." Concretely, in the teams where I watched the line finally move, three things changed in the workflow itself:

  • PR granularity shifted to units the agent could produce and a human could review in one sitting, rather than the units a human used to write over three days.

  • Review focus moved from line-by-line correctness toward conformance to the spec, because the agent rarely makes the typo a human reviewer was trained to catch, and often makes the design-intent mistake they were not.

  • Task decomposition was rewritten into parallelizable pieces, so several agent-driven changes could be in flight at once instead of queued behind one developer's working memory.

The Mantle account lines up with this. The same talk lists a monorepo holding code, tests, and docs together, extensive AI-generated documentation, self-contained local testing, thousands of unit tests, and code deliberately structured to be legible to an agent. Those are workflow decisions, not tool decisions, and together they give the amplifier a clean signal.

My own field notes say the same thing from both directions. One organization rolled out GitHub Copilot company-wide, then a few months later rolled out Cline so it could reach Claude's models, then a few months after that rolled out Claude Code. Three tool changes in about a year, and the process around them moved far less than the tooling did. The curve behaved accordingly. Another organization sat at the opposite end. After a major security incident forced it to rebuild large parts of its business from scratch, it adopted AI-DLC as the method for the rebuild. Under that kind of time pressure there was no incumbent workflow left to protect, so the team rewrote how it worked and how it shipped at the same time, and it reached its inflection point earlier than peers who had the luxury of changing nothing. Necessity did for that team what deliberate redesign has to do for everyone else.

There is an order hiding in here. A team that has not done this rewrite cannot get much from a sharper control rhythm or a better metric. Get the workflow right before you amplify it.

Pivot 2: Move the control rhythm upstream

Most organizations let AI into exactly one phase: implementation. Requirements, design, and review keep running the way they always did. The arithmetic of that choice turns against you quickly. When implementation gets ten to twenty times faster and the work in front of it does not, requirements and design quietly grow to most of the calendar, product and business stakeholders move at their old pace, and the end-to-end time to ship barely changes. You have optimized the one stretch that was not the constraint.
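The arithmetic above is essentially Amdahl's law applied to a delivery cycle, and a small sketch makes it concrete. The phase names and day counts below are hypothetical, chosen only to illustrate a cycle in which coding is a minority of the calendar; they are not measurements from any team described here.

```python
def end_to_end_speedup(phase_days, speedups):
    """Return the overall speedup when only some phases get faster.

    phase_days: mapping of phase name -> days the phase takes today
    speedups:   mapping of phase name -> speedup factor (missing means 1.0)
    """
    before = sum(phase_days.values())
    after = sum(days / speedups.get(phase, 1.0) for phase, days in phase_days.items())
    return before / after


# Hypothetical 40-day cycle in which implementation is 6 of the 40 days.
phases = {
    "requirements": 12,
    "design": 10,
    "implementation": 6,
    "review_and_release": 12,
}

for factor in (10, 20):
    overall = end_to_end_speedup(phases, {"implementation": factor})
    print(f"implementation {factor}x faster -> end to end {overall:.2f}x faster")
# implementation 10x faster -> end to end 1.16x faster
# implementation 20x faster -> end to end 1.17x faster
```

Under these assumed numbers, doubling the implementation speedup from 10x to 20x moves the end-to-end figure by about one hundredth. The remaining gain has to come from the phases the agent was not allowed into.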

One practice that addresses this directly is AWS's AI-DLC. It frames two common adoption styles as opposite extremes: AI-assisted, where the human drives and the agent helps with isolated tasks, and AI-autonomous, where the agent generates everything end to end. It proposes a middle path where the AI plans and executes while the human keeps the judgment and the decisions. The mental model it repeats at every step is Plan → Clarify → Implement: the agent proposes a plan, asks the questions it needs answered, and proceeds only after a human approves. The AWS method paper goes deeper if you want the full definition.

The point that took me longest to internalize is that this rhythm earns most of its keep before implementation. Requirements and design are the stages where ambiguity is cheapest to remove and most expensive to leave in. Werner Vogels made the same case in his final re:Invent keynote, where Clare Liguori demonstrated Kiro's spec-driven development on stage. Her line stuck with me: "Natural language doesn't have to mean high ambiguity." The control rhythm is how you act on that.

AI-DLC is open source, not just a concept on a slide. AWS publishes it as awslabs/aidlc-workflows, a set of steering rules for the major coding agents (Kiro, Claude Code, Codex, Cursor, Cline, Copilot), meant to work on any IDE and any model. The control rhythm shows up as explicit approval gates between phases, something like this:

# INCEPTION PHASE: decide WHAT to build and WHY
  Requirements Analysis
    - analyze the request (intent analysis)
    - ask clarifying questions (if needed)
    - generate the requirements document
    - Wait for Explicit Approval: DO NOT PROCEED until the user confirms

# CONSTRUCTION PHASE: decide HOW to build it
  Code Generation (per unit)
    - Part 1: produce a plan, get approval
    - Part 2: execute the approved plan

The "DO NOT PROCEED until the user confirms" gate is the whole idea in one line. Plan, clarify, then implement, applied to the spec before it is ever applied to the code. The repository is evolving fast (a v2 is already in development that reorganizes these stages into composable skills), so treat the block above as the shape of the idea rather than a literal file to copy.
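To make the gate tangible, here is a minimal sketch of the same rule expressed as code. It is not how awslabs/aidlc-workflows implements it, since that repository ships steering rules for agents rather than a Python library. The `Gate` class, its methods, and the example question are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class ApprovalRequired(Exception):
    """Raised when a phase tries to move on without explicit human approval."""


@dataclass
class Gate:
    phase: str
    plan: str
    open_questions: List[str] = field(default_factory=list)
    approved_by: Optional[str] = None

    def answer(self, question: str) -> None:
        # Clarify: a human has answered, so the question is no longer open.
        self.open_questions.remove(question)

    def approve(self, approver: str) -> None:
        # Approval is refused while clarifying questions are still unanswered.
        if self.open_questions:
            raise ApprovalRequired(f"{self.phase}: unanswered questions remain")
        self.approved_by = approver

    def proceed(self) -> str:
        # Implement: only reachable after an explicit approval.
        if self.approved_by is None:
            raise ApprovalRequired(f"{self.phase}: DO NOT PROCEED until the user confirms")
        return f"{self.phase} approved by {self.approved_by}; executing plan"


gate = Gate(
    phase="requirements",
    plan="draft the requirements document",
    open_questions=["Who approves refunds?"],
)
try:
    gate.proceed()
except ApprovalRequired as blocked:
    print(blocked)

gate.answer("Who approves refunds?")
gate.approve("product owner")
print(gate.proceed())
```

The design choice worth noticing is that the default state is blocked: nothing proceeds unless a named human has approved, and approval itself is refused while questions are still open.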

In practice, the moment it pays off is visible. In one part of a digital enterprise with thousands of developers, requirements sessions that used to sprawl across days compressed to half a day to a day. Misalignments that normally surfaced in code review surfaced in the requirements conversation instead, where they cost a sentence rather than a sprint, and missing user-experience cases got caught on the spot because the agent kept asking. Where it stalled was never the rhythm itself but the room: when the product owner who could actually decide was not present, the clarify step just queued questions no one in the room was authorized to answer. What made it work was a mobbing-style ritual: product, engineering, and architecture at one table, feeding judgment into the agent's output in parallel rather than in a serial queue of handoffs.

One correction I had to make to my own mental model: the rhythm is not a single straight pass. Inception is meant to be run more than once, rhythmically, not as a one-shot gate at the front. The teams that got the most out of it separated a coarse first pass, just enough to size the work, from a detailed second pass once the shape was clear, and they re-entered Inception whenever the requirements moved instead of treating the spec as frozen.

None of this makes implementation faster. That is the part people miss. The control rhythm exists so that requirements and review do not become the new bottleneck once implementation stops being one.

Pivot 3: Agree on an output metric and start measuring

This is the pivot I am least finished with. I have shared it with customers in executive briefings as Amazon's lesson, and I am watching the organizations I work with begin to take it seriously, but I will not claim there is a correct metric. The claim is narrower and, I think, more durable: pick one output-side metric that lets you justify the return on the investment, get the organization to agree on it, and start measuring. Cost per delivery unit, releases per cycle, cycle time, it almost does not matter which. What matters is that you choose one and begin.

The failure mode is familiar. The tool goes in first and the measurement is deferred, and six months later someone asks whether it is actually working and no one can answer. The teams I work with usually start with co-author rate, PR counts, and a daily readiness checklist. Those are fine as an on-ramp. They are not what you bring to a leadership review when you are asked to defend the spend.

Amazon's answer is one arrival point worth studying precisely because they committed to it and reported a result. An internal team built a measure called Cost to Serve Software and used it to show a 15.9% year-over-year improvement. The shape of it is deliberately blunt: total cost of delivery, people plus infrastructure, divided by the units of software that actually reached customers. They explicitly rejected activity-based costing as too complex to sustain, and worked backwards from input cost and output units instead. I show it to customers not as the metric they should adopt but as proof that an organization can agree on one and move it.
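The shape described above, total delivery cost divided by units that reached customers, fits in a few lines. This is a sketch under stated assumptions, not Amazon's actual calculation: the function names are mine, what counts as a "unit" is a decision each organization has to make, and the figures are invented for illustration and deliberately unrelated to the 15.9% result.

```python
def cost_to_serve(people_cost: float, infra_cost: float, units_delivered: int) -> float:
    """Total delivery cost (people plus infrastructure) per unit that reached customers."""
    if units_delivered <= 0:
        raise ValueError("units_delivered must be positive")
    return (people_cost + infra_cost) / units_delivered


def yoy_improvement(previous: float, current: float) -> float:
    """Fractional reduction in cost per unit; positive means it got cheaper."""
    return (previous - current) / previous


# Hypothetical figures: spend went up, but delivered units went up faster.
last_year = cost_to_serve(people_cost=9_000_000, infra_cost=1_000_000, units_delivered=400)
this_year = cost_to_serve(people_cost=9_500_000, infra_cost=1_300_000, units_delivered=480)

print(f"{last_year:,.0f} -> {this_year:,.0f} per unit, "
      f"{yoy_improvement(last_year, this_year):.1%} improvement")
# 25,000 -> 22,500 per unit, 10.0% improvement
```

In this example total spend rises while the metric still improves, which is the property that makes an output-side ratio easier to defend in a leadership review than a raw cost line.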

Why the choice of metric is not cosmetic comes through in another Mantle observation: the bug rate per line of code stays roughly constant, while the commit rate runs ten to twenty times higher. Measure commits and the story is triumphant. Measure defects reaching production and the same data says you now need new approaches to testing. The metric you pick is the direction you point the organization, which is why agreeing on what to measure matters more than measuring more. DORA and SPACE remain useful at the process layer; the output metric sits a layer above them, where the conversation with the business actually happens.
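The two readings of the same data can be worked through in a short sketch. The baseline volume and defect rate are hypothetical, and the code assumes shipped volume scales with the commit rate, which is a simplification of my own rather than a figure from the talk.

```python
def expected_defects(defects_per_kloc: float, kloc_shipped: float) -> float:
    """Defects expected to ship if the per-line defect rate stays constant."""
    return defects_per_kloc * kloc_shipped


# Hypothetical baseline: 5 KLOC shipped per week at 2 defects per KLOC.
BASELINE_KLOC_PER_WEEK = 5.0
DEFECTS_PER_KLOC = 2.0

for multiplier in (1, 10, 20):
    # Assumption: shipped volume tracks the commit rate one to one.
    shipped = BASELINE_KLOC_PER_WEEK * multiplier
    defects = expected_defects(DEFECTS_PER_KLOC, shipped)
    print(f"{multiplier:>2}x commits -> {defects:.0f} expected defects per week")
```

With the rate held constant, the assumed baseline of 10 defects a week becomes 100 at ten times the commit rate and 200 at twenty times. The quality per line has not changed, but the absolute load on testing and triage has.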

There is an operational half to this pivot that I treat as a structural observation rather than a recipe, because the specifics differ by organization and the numbers do not generalize. Tools, models, and plugins churn on a timescale of months, so standardizing one agent centrally and mandating it tends not to survive contact with the next release cycle. The pattern I have seen hold up is a budget cap with free tool choice inside it: the organization sets a spend ceiling per team or per person and lets engineers choose their own agent within it. Editor choice has always had a religious streak, Vim versus Emacs being the classic holy war, but history is fairly clear that letting engineers use the editor they like makes them more productive, not less. Coding agents are the same kind of choice, and they churn faster than editors ever did. The cost of that freedom is absorbed by the cap; the alternative, locking everyone to one tool, saves you the cap design but trades away the ability to absorb a fast-moving market.

The cap is harder to run than it sounds. In a digital enterprise with thousands of developers, a cap can exist at the organizational level and still be slow to adjust: changing a team's ceiling means touching the billing, access, and tooling integrations that actually enforce it, and those were usually built to set a limit once, not to raise or lower it on a regular cadence. The economics of where you enforce it, and how you account for spend across teams, I will leave to a later article; here it is enough to say the cap is a design problem, not a config flag.

The same churn is why I keep tips and durable practices in separate buckets when I share knowledge. A prompt that sped things up three months ago can stop working after a model update; a rules-file setting can go from recommended to deprecated in a point release. Those are worth passing around, but they are not worth building an organization on. It is the same durable-versus-disposable split I drew for agent evaluation on this blog. What is worth rooting are the invariants that keep showing up across teams and across tools: AI-friendly PR granularity, the plan-clarify-implement rhythm, a continuously evaluated output metric. When one of those proves itself, you promote it into a mechanism: a template, a guideline, a CI check, or a shared dashboard. Werner Vogels put the reasoning better than I can, on why mechanisms matter: "They convert good intentions into consistent outcomes."

Finding those invariants is slower than it sounds, and this is where I will be honest about a limit of my own experience. One organization I work with was the fastest of any to trial AI-DLC, in a single division, and saw real results there. The company-wide event to spread what that division learned came roughly a year later, and a common best practice that transfers cleanly across divisions has still not emerged. Promoting a local success into an organization-wide mechanism is harder than producing the local success, and I do not have a tidy answer for it yet. Tips stay tips. Invariants become mechanisms, when you can find them.

So the pivot is less about adopting Cost to Serve Software and more about agreeing on one output metric and starting to measure it, with the honest footnote that working only to make the number look good is its own failure. Agreeing on the metric, and beginning to watch it, is itself the turn.

Closing

The three pivots have an order. Redesign the workflow so the amplifier has a clean signal, pull the control rhythm upstream so requirements and review do not become the new bottleneck, then agree on a metric so you can tell whether either worked. AI-DLC is a good off-the-shelf implementation of the middle pivot, and you can run it from awslabs/aidlc-workflows today. It will not, on its own, redesign your workflow or choose your metric. Those are still yours.

If tools, models, and plugins really do churn every few months, the tool of the quarter is the wrong thing to build on. The durable layer is the three pivots, a budget cap that preserves freedom underneath them, and the habit of mechanizing whatever invariant survives the churn. Running alongside the organization, there is an individual version of the same durability. Werner Vogels closed his final re:Invent keynote on what he called the Renaissance Developer, and the qualities he put forward, curiosity, systems thinking, communication, ownership, and breadth among them, are the ones that outlast any single tool. The line I have not been able to shake is shorter: "The work is yours, not the tools."

This article is the first in a series, and it deliberately stays inside the part of the picture where the methodology works. The next one goes to the edges: the gaps AI-DLC does not yet cover, from discovery through brownfield code to organization-wide visibility, and the realities of pushing any of this across a large enterprise, where the hardest problems may turn out to be less about the methodology than about the organization around it. That is the part I am still working through.

So I will end with the question I keep asking the teams I work with. Which of the three pivots has your organization actually made, and which one is the one you keep deferring? If you have an output metric your leadership genuinely agreed on, or a tool-choice model that survived more than a couple of release cycles, I would like to hear how it has held up.

AI-Driven Development

Part 1 of 1

How software development changes when an AI agent is the default author, across the workflow, the team, and the way an organization measures output. This is a moving target with no settled best practices yet, so each post stays honest about the trade-offs, the parts that hold up across teams, and the parts still being worked out. Practitioner notes from helping engineering organizations adopt AI-driven development at scale.
