AI agents in your sprint: how copilots are changing what a story point means

[Image: developer pair working with an AI coding assistant, one screen showing code being generated while the other shows a sprint board with story point estimates, conveying the tension between AI speed and traditional estimation]

A backend engineer on your team picks up a 5-point story. Historically, that's a day and a half of work. With Cursor and Claude, they ship it in an hour. The next sprint, the team looks at a similar ticket and someone asks: "Is this still a 5?" Nobody has a good answer. This is happening on teams everywhere right now, and most agile content hasn't caught up.

The problem isn't velocity inflation

The obvious reaction is "great, we're faster now." But the issue runs deeper than bigger velocity numbers. Story points are supposed to measure relative complexity: a 5 should represent roughly the same amount of effort regardless of who picks it up. AI breaks that assumption in two ways.

First, the acceleration is uneven across people. A junior developer using Copilot might see a 3x speed boost on a routine CRUD endpoint. A senior developer working on a complex integration with unfamiliar APIs might see no improvement, or actually slow down. The METR study from mid-2025 found that experienced open-source developers were 19% slower with AI assistance on real-world tasks, while believing they were 20% faster.

Second, the same person is faster on some tasks but not others. AI agents handle boilerplate and well-documented patterns well. They struggle with architectural decisions and anything requiring deep context about your specific codebase. A developer might blow through three 3-point stories in a morning, then spend two days on a single 5-pointer where the AI kept generating plausible but wrong solutions.

This makes estimation wildly inconsistent. Your velocity chart starts looking like a seismograph.

What teams are actually experiencing

Talk to engineering managers running sprints with AI-augmented teams and you hear the same patterns.

Velocity numbers that don't mean anything. One team reported going from a steady 45-60 points per sprint to 55-65, but the work getting done didn't feel proportionally different. Faster coding wasn't translating into faster delivery because code review, QA, and deployment timelines stayed the same.

The review bottleneck. GitHub data shows monthly code pushes crossed 82 million by late 2025, with 41% of new code AI-assisted. Pull requests are piling up, and teams report waiting 4+ days for reviews. The developer writes the code in an hour, then it sits in the review queue for a week.

Technical debt that compounds differently. AI-generated code tends to work but lacks architectural awareness. Studies show 4x more code duplication and 60% less refactoring in AI-assisted codebases. The 3-point story ships fast; six months later you're paying for it in maintenance.

[Image: sprint board showing a mix of completed and stuck tickets, with some marked as done very quickly and others blocked in code review, illustrating the uneven flow of AI-assisted development]

Junior developers who look like seniors on paper. A developer with two years of experience can now generate the same output as someone with ten. But they spend 1.2 minutes reviewing each AI suggestion compared to a senior's 4.3 minutes. That quality gap won't show up until production.

Story points aren't broken, but they need recalibration

The instinct to throw out story points entirely is premature. Points still work for team alignment and sprint planning conversations. What needs to change is how teams calibrate them.

Separate "coding effort" from "delivery effort"

The part AI accelerates (writing code) is only one piece of a story's lifecycle. A more honest breakdown:
| Phase | AI impact |
| --- | --- |
| Understanding requirements | Minimal |
| Writing code | High (2-10x faster on routine work) |
| Code review | Negative (more code to review, often lower quality) |
| Testing | Moderate (AI can generate tests, but someone has to verify them) |
| Integration and deployment | Minimal |
If your team estimated stories primarily based on coding time, your points are now miscalibrated. If you estimated based on total delivery effort, they're probably closer to accurate.
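To make the phase breakdown concrete, here's a minimal sketch that applies hypothetical per-phase speed multipliers to an example story. The hours and multipliers are illustrative assumptions, not measurements; the point is that a big coding speedup translates into a much smaller change in total delivery effort.

```python
# Illustrative only: hypothetical hours and AI speed multipliers per phase.
phases = {
    # phase: (hours without AI, AI speed multiplier)
    "understand requirements": (2.0, 1.0),   # minimal impact
    "write code":              (8.0, 3.0),   # big speedup on routine work
    "code review":             (3.0, 0.8),   # more code to review, slower
    "testing":                 (3.0, 1.5),   # AI drafts tests, humans verify
    "integrate and deploy":    (2.0, 1.0),   # minimal impact
}

before = sum(hours for hours, _ in phases.values())
after = sum(hours / speedup for hours, speedup in phases.values())

print(f"Delivery effort without AI: {before:.1f}h")
print(f"Delivery effort with AI:    {after:.1f}h")
print(f"Overall speedup: {before / after:.2f}x (coding alone was 3x)")
```

With these made-up numbers, an 18-hour story drops to roughly 12.4 hours: about a 1.4x overall gain despite a 3x gain on the coding phase.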

Recalibrate your reference stories

Every team has anchor stories: "Remember that payment integration? That was an 8." Those anchors were set before AI. Update them. Run a recalibration session where you re-estimate 5-10 completed stories from the past two sprints. Use your planning poker tool and ask: "Knowing what we know now about how AI affects this type of work, what would we estimate this at?" The gaps between old and new estimates will tell you exactly where your calibration is off.
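If it helps to structure that session, here's a rough sketch of the comparison, assuming you pull the old points and the new re-estimates into a small script. The stories and numbers below are hypothetical.

```python
# Hypothetical recalibration data: (story, pre-AI estimate, re-estimate).
completed = [
    ("CRUD endpoint for orders",     5, 2),
    ("Payment provider integration", 8, 8),
    ("Admin report export",          3, 1),
    ("Legacy auth refactor",         5, 5),
]

for story, old, new in completed:
    status = "recalibrate" if old != new else "still accurate"
    print(f"{story}: {old} -> {new} ({status})")
```

The stories whose estimates drop the most are usually the boilerplate-heavy ones; the ones that don't move become your new anchors.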

Other ways to measure

Some teams are moving away from story points entirely, and AI adoption is speeding up that shift.

Cycle time measures how long a ticket takes from start to done. It includes the review bottleneck and the deployment pipeline, not just how fast someone wrote the code. Teams tracking cycle time are finding that AI barely moves the needle on overall delivery speed, even when coding speed doubles.

Throughput counts how many items the team completes per sprint, regardless of size. If your team shipped 12 stories last sprint and 14 this sprint, that's useful information without debating what a "point" means (a short sketch of both calculations follows below).

DORA metrics (deployment frequency, lead time, change failure rate, time to restore) focus on outcomes instead of output. When AI makes coding faster but change failure rates climb because of less careful review, DORA shows the trade-off that velocity numbers hide.

[Image: dashboard showing a traditional velocity chart next to cycle time and throughput charts, each revealing a different pattern in team performance]

Ron Jeffries, who is often credited with inventing story points, has said: "I may have invented story points, and if I did, I'm sorry now." His concern was that points get misused as a productivity measure rather than a planning tool. AI makes that misuse worse.
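Here's that sketch of the cycle time and throughput calculations, assuming your tracker can export start and done dates per ticket. Ticket IDs and dates are made up for illustration.

```python
from datetime import date

# Hypothetical export: one row per completed ticket in the sprint.
tickets = [
    {"id": "ENG-101", "started": date(2026, 1, 5), "done": date(2026, 1, 6)},
    {"id": "ENG-102", "started": date(2026, 1, 5), "done": date(2026, 1, 13)},
    {"id": "ENG-103", "started": date(2026, 1, 7), "done": date(2026, 1, 14)},
]

# Cycle time: calendar days from start to done, per ticket.
cycle_times = [(t["done"] - t["started"]).days for t in tickets]

print(f"Throughput: {len(tickets)} tickets")
print(f"Average cycle time: {sum(cycle_times) / len(cycle_times):.1f} days")
```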

What actually works right now

Based on what teams are reporting in early 2026:

Keep planning poker, but change the conversation. The estimation meeting is still valuable. The discussion about what's involved in a story is more important than the number you assign. But update the questions: "Will AI help with this?" should be part of the conversation. If a story is mostly boilerplate, say so. If it's a complex integration where AI will generate misleading solutions, flag that too.

Treat AI-generated code as a first draft, not finished work. Build review time into your estimates. A story where AI wrote 80% of the code might need more review time than one where a developer wrote every line, because the reviewer has to verify intent rather than just quality.

Watch for the perception gap. Remember that METR finding: developers believed they were 20% faster while actually being 19% slower. Using AI feels more productive because generating code is cognitively lighter than writing from scratch. That feeling doesn't always match the clock. Don't let it inflate your sprint commitments.

Measure what you ship, not what you code. If your team's cycle time hasn't improved despite faster coding, the bottleneck is somewhere else (the sketch below shows one way to check). Fix that before worrying about story points.

For teams looking to experiment with different estimation scales as they recalibrate, Kollabe supports Fibonacci, T-shirt sizes, and custom scales that you can adapt as your team figures out what works in an AI-assisted workflow.
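To illustrate the "measure what you ship" point, here's a minimal sketch that splits each story's elapsed time into coding and review wait, assuming you can pull those two durations from your PR history. All values are invented for the example.

```python
# Hypothetical data: (story, hours until the PR was opened, hours from PR open to merge).
stories = [
    ("ENG-201", 1.0, 96.0),   # coded in an hour, sat in review for four days
    ("ENG-202", 6.0, 24.0),
    ("ENG-203", 2.0, 72.0),
]

coding = sum(c for _, c, _ in stories)
review = sum(r for _, _, r in stories)

print(f"Hours spent coding:            {coding:.0f}")
print(f"Hours waiting in review/merge: {review:.0f}")
if review > coding:
    print("Review is the bottleneck; faster coding won't speed up delivery.")
```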

This is a process problem

The teams handling this well didn't buy an AI estimation tool. They recognized that AI changed how their work gets done and adjusted their process. That starts with honest conversations about where AI helps and where it doesn't. Your velocity from six months ago isn't a useful baseline anymore. And the gap between coding output and actual delivery has never been wider, so focus on the delivery side.

Frequently asked questions

Should your team stop using story points now that AI is in the mix?

Not necessarily. Story points still work as a relative sizing tool for team discussion and sprint planning. What you should do is recalibrate your reference stories and make sure your team accounts for AI's uneven impact when estimating.

How do you estimate a story when AI writes most of the code?

Estimate total delivery effort, not just coding time. A story that takes 10 minutes to code but 2 hours to review and test is still a meaningful chunk of work. Split your thinking between implementation and verification.

Does AI erase the gap between junior and senior developers?

On routine, well-defined tasks, the gap narrows significantly. On complex work requiring architectural judgment, seniors still outperform because they know when AI suggestions are wrong. The real risk is juniors shipping AI-generated code without the experience to catch subtle issues.

What should teams track instead of velocity?

Cycle time and throughput give a clearer picture of delivery performance without the calibration headaches. DORA metrics (deployment frequency, lead time, change failure rate) capture quality alongside speed. Most teams benefit from tracking all three alongside velocity rather than replacing it outright.
Last Updated on 10/02/2026