Research · 2026

The State of Rework 2026

Half of pull requests merge clean. The other half enter a fix-and-review loop that eats most of review capacity. This report tracks that split: where the loop starts, what it costs, and why the feedback driving it is usually right, just late.

Talk to sales Run the rework calculator

Aggregated from twelve weeks of pull request activity across engineering organizations on Baz, spring 2026. Organizations stay anonymous throughout. Findings are reported as shares and multiples of a clean-PR baseline, never as raw or per-customer numbers.

The split

Half of pull requests pay for the other half.

Rework is not a rare event. Half of PRs complete without a single rework cycle and need little review effort. The other half absorbs about four fifths of review activity, and the deepest tiers concentrate it further. The severe band is the smallest group on the chart, yet it consumes the most review executions.

Figure: Where review activity goes. By rework tier, share of total. Series: share of pull requests, share of review activity.

Each tier's share of the PR population against its share of all review executions, grouped by measured rework cycles. Clean PRs make up half the volume and a sixth of the review work. The deepest tier is the smallest population and the largest single consumer.

Most of that spend is repeat work. Once a PR gets reviewed, every later execution re-reviews code the team already paid to review once. Across the dataset, roughly two thirds of review compute happens after a first review. For the typical organization, repeat reviews make up most of its review load.

The cost curve

Each tier deeper is a step change, not an increment.

A clean PR merges in minutes at the median. One rework cycle multiplies that several times over, and the deepest tier runs hundreds of times slower, which means days on the calendar. Change size climbs the same staircase: even a single cycle comes with a change several times the clean median.

Figure: Time to merge. Median, as a multiple of clean.

Median time to merge, indexed to the clean-PR median. We use medians rather than averages, so the climb reflects the typical PR's experience, not a handful of outliers.
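Indexing to the clean-PR median is a one-line division once the data is grouped by tier. The sketch below assumes a mapping from tier name to a list of merge times in minutes. The tier names and values are invented for illustration.

```python
from statistics import median


def indexed_medians(values_by_tier, baseline="clean"):
    """Return each tier's median as a multiple of the baseline tier's median."""
    base = median(values_by_tier[baseline])
    if base == 0:
        raise ValueError("baseline median is zero; cannot index against it")
    return {t: median(v) / base for t, v in values_by_tier.items()}


# Made-up merge times in minutes.
merge_minutes = {
    "clean": [5, 10, 15],
    "one_cycle": [30, 50, 70],
    "severe": [2000, 3000, 4000],
}
multiples = indexed_medians(merge_minutes)
```

The same function works for lines changed, since it only needs a list of numbers per tier.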

Figure: Change size. Median lines changed, as a multiple of clean.

Median lines changed, indexed to the clean-PR median. Size does not prove cause, but it is the clearest early sign of a deep loop.

None of this shows up as failed delivery. Merge rates stay level across tiers, and reworked PRs get abandoned no more often than clean ones. The cost hides inside cycle time and review load, not in dropped work. That is how a tax this large stays unnoticed.

The developer loop

Rework behaves like an interruption, not a queue.

The gap between review feedback landing and the next fix starting is short. The median sits around half an hour, most cycles begin within the hour, and the large majority start within a working day. Developers are not calmly returning to this work on their own schedule. The loop pulls them back.

Figure: Time from feedback to fix. Share of all measured rework cycles.

Share of rework cycles by the gap between a review finishing and the next fix starting. Most cycles begin within the hour, so the distribution front-loads heavily.
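A distribution like this one can be built by bucketing the gap between each review finishing and the next fix starting. The bucket edges below are assumptions: the report speaks of "within the hour" and "a working day", and this sketch treats a working day as eight hours.

```python
from datetime import timedelta

# Hypothetical bucket edges; each bucket holds gaps up to and including its edge.
BUCKETS = [
    ("under 1 hour", timedelta(hours=1)),
    ("1 to 8 hours", timedelta(hours=8)),
]


def gap_distribution(gaps):
    """Return the share of rework cycles falling into each gap bucket."""
    gaps = list(gaps)
    counts = {label: 0 for label, _ in BUCKETS}
    counts["later"] = 0
    for gap in gaps:
        for label, upper in BUCKETS:
            if gap <= upper:
                counts[label] += 1
                break
        else:
            # No bucket matched: the fix started after the last edge.
            counts["later"] += 1
    n = len(gaps)
    return {label: (c / n if n else 0.0) for label, c in counts.items()}


# Made-up gaps between feedback landing and the next fix starting.
sample_gaps = [
    timedelta(minutes=10),
    timedelta(minutes=25),
    timedelta(minutes=40),
    timedelta(hours=3),
    timedelta(hours=30),
]
dist = gap_distribution(sample_gaps)
```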

So the tax is not only elapsed time. Every cycle breaks focus: write, get pulled back, fix, get re-reviewed, repeat. The calendar cost shows up on delivery dashboards. The attention cost shows up in the work that never happened while the loop ran.

The signal

The feedback is right. The timing is late.

Here is the nuance that makes the tax hard to remove with blunt policy: rework works. The share of review comments developers act on climbs steadily as rework deepens, from a few percent on clean PRs to a clear majority at the deep end. The pattern holds across every major finding category.

Figure: Feedback acceptance. Comments addressed, by category and rework depth. Series: logical bugs, breaking changes, type inconsistency, basic security, code deduplication.

Share of review comments developers act on, by finding category and rework depth. Every category climbs. Correctness findings climb steepest, deduplication climbs least.

Correctness findings climb steepest. Logical bugs, breaking changes, and type inconsistencies all reach their highest acceptance in the deepest tiers, with security findings close behind. Those are the strongest candidates for earlier, blocking, pre-PR feedback. Deduplication rises too, but more slowly, so it works better as advisory guidance than as another loop.

The baseline

This is the industry’s resting state, not a bad quarter.

Week over week, the split barely moves. The share of PRs entering rework holds near half, and the share of review activity attached to them holds near four fifths, no matter how much volume swings. Most organizations sit in a majority-rework band, and scale does not help: the largest organizations carry the longest severe tails in the data.

Figure: Week over week. Share of PRs and of review activity. Series: PRs entering rework, review activity on reworked PRs.

Weekly shares across the twelve-week window. PR volume moved a lot from week to week, but both shares barely moved.

Figure: Where organizations sit. Share of profiles by rework-rate band.

Share of organization profiles by rework-rate band. Most organizations fall in the majority-rework band; staying under 30% rework is the exception, not the norm.

Because the rate stays stable, the burden scales with volume. Growing teams do not grow out of rework; they just buy more of it. Repeat reviews become the majority of review activity well before most PRs are reworked, so even the moderate band already pays a structural tax.

Implications

Measure the loop, then move the feedback upstream.

The data does not argue for less review feedback. It argues for the same feedback, prioritized better and delivered earlier, so it stops creating repeated PR-phase loops. Five operating changes follow directly from these trends.

Measure the loop directly

PR counts, merge rates, and average cycle time can all look healthy while the loop runs. Track rework rate, repeat-review share, and cycle time by rework tier instead, and the tax becomes visible.

Treat size as a risk signal

Median change size climbs with every tier. A change that grows from tens of lines into the hundreds deserves decomposition, a draft review, or earlier feedback before the PR opens.

Route findings by acceptance

Correctness, compatibility, type safety, and security findings deserve blocking, pre-PR delivery. Maintainability suggestions get lower acceptance and work better as advisory guidance than as another loop.
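One way to express this routing is a lookup from finding category to delivery mode. The sketch below uses the five categories from the acceptance chart. The identifier spellings, the mode names, and the choice to default unknown categories to advisory are all assumptions made for illustration.

```python
# Delivery mode per finding category. Category keys and mode names are hypothetical.
ROUTES = {
    "logical_bug": "blocking_pre_pr",
    "breaking_change": "blocking_pre_pr",
    "type_inconsistency": "blocking_pre_pr",
    "basic_security": "blocking_pre_pr",
    "code_deduplication": "advisory",
}


def route_finding(category: str) -> str:
    """Return how a finding should be delivered.

    Unknown categories fall back to advisory so they never block by accident.
    """
    return ROUTES.get(category, "advisory")


def split_findings(categories):
    """Group finding categories into blocking and advisory lists."""
    blocking, advisory = [], []
    for category in categories:
        if route_finding(category) == "blocking_pre_pr":
            blocking.append(category)
        else:
            advisory.append(category)
    return blocking, advisory


blocking, advisory = split_findings(
    ["logical_bug", "code_deduplication", "basic_security", "naming_style"]
)
```

Defaulting to advisory is a conservative design choice: a new or misspelled category adds guidance rather than another blocking loop.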

Escalate deep rework

A PR entering its fourth cycle is an operational problem, not routine review. Split it, assign a domain owner, or settle the architecture question outside the loop.
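As a minimal sketch, the escalation rule can be a threshold check on the cycle a PR is entering. The threshold of four comes from the text above. The function name, the 1-based cycle numbering, and the action strings are illustrative assumptions.

```python
ESCALATION_CYCLE = 4  # "entering its fourth cycle", per the guidance above

# Options named in the text; which one applies is a human decision.
ESCALATION_OPTIONS = [
    "split the PR",
    "assign a domain owner",
    "settle the architecture question outside the loop",
]


def escalation_options(entering_cycle: int):
    """Return escalation options if the PR is entering cycle 4 or later.

    entering_cycle: 1-based number of the rework cycle the PR is about to start.
    Returns an empty list when no escalation is needed.
    """
    if entering_cycle >= ESCALATION_CYCLE:
        return list(ESCALATION_OPTIONS)
    return []
```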

Count attention, not just calendar

Most cycles start within an hour of feedback landing. That interruption cost is real, even when every merge-rate dashboard looks fine.

This is the model behind Baz. Planner reviews the plan before the code exists, and purpose-built agents give the PR a final contextual check, so review confirms the approach instead of correcting it.

Put your own numbers on the loop.

Benchmark your organization against these trends. Model the recovery with the rework calculator, or bring us your PR volume and we will measure your actual rework distribution and what moving feedback upstream would return.
