Beyond the Story Point
Fed up with constantly trying to digest story points and their stifling ineffectiveness? There’s another way!
When determining how big a piece of work is, it is common to use story points, often assigned to backlog items using a method called “Planning Poker”. Typically, the story points aim to indicate effort or even just complexity (depending on who you ask!), and provide a very useful abstraction from an estimate of time. Scalar scoring provides built-in uncertainty, but businesses often struggle to digest this uncertainty, and teams often get the scores wrong.
In this article I will look at an alternative approach that my team and I have dubbed diSCUVery Scoring. I believe it is a simple evolution of the classic story point, and I will show how it can help teams and stakeholders alike. I do not claim to have invented this approach; it is simply the product of my journey, with vast amounts of reading, learning and adapting along the way.
Time Estimates Are Bad
Before going beyond story points, looking back at how they became so useful might help provide some context.
Time-based estimates are very often wrong. There are several key reasons for this, for example:
- It’s very difficult to know everything that needs to be done upfront, or how long problem solving may take. Software development is deeply complex; developers are frequently asked to do work they haven’t done before, and various influences require them to tackle familiar problems in unfamiliar ways.
- You are asking the developer(s) for their experience of how long something might take. This does not reflect how long it might take another developer. Those estimating are often not those doing the work, though it is frequently (and incorrectly) assumed that they will be.
- Asking developers to commit to time often comes with some degree of pressure to deliver quickly; sometimes this is applied purposefully, and sometimes just by accident. This pressure will likely reduce the quality of work, which may lead to more defects being found or even failure to deliver a suitable solution. In the end, the whole delivery may well take far longer than it should have, or than was estimated.
- Historic failure to achieve goals within timed estimates will likely cause inflated estimates.
- The last two points are in direct conflict with each other and, in my experience, this tends to produce wildly varying results, which makes it very difficult to reliably predict the degree of inaccuracy.
It can be argued that developers should be able to commit to timings and deliver on time. This is usually the opinion of individuals who just aren’t familiar with the work of development teams, and who may compare it to the work of teams in other functions, or of other professionals who deliver in similar ways. But software delivery is often very complex, and for the reasons listed above (among many others), committing to a fixed time estimate is wildly impractical.
Story points provide a useful abstraction from time. A numerical indicator implies the amount of effort that will be required but does not commit to the time taken to expend that effort. This allows teams to provide useful information back to product owners and stakeholders that can be used for strategy and backlog prioritisation.
The problem with points
“But how much will it cost?”
The need to understand cost is often met with rolled eyes and sarcastic retorts about string length. But for many businesses, measuring this is vital.
Thus, product owners, account managers and various other stakeholders want to know what it will cost to have a unit of work delivered. Equally, teams want to understand their velocity (how many points they can deliver in a sprint) so that they can plan for future sprints and work efficiently. The typical story point captures the estimated effort of a backlog item. It is relative to other backlog items, and as such we may make an educated guess as to how much time it will take to deliver, given how long other items took to deliver.
But when using scalar story points (larger gaps between larger numbers) such as 1, 2, 3, 5, 8, 13, 21 etc., larger scores only indicate a range. For example, using the Fibonacci sequence shown above, 13 is preceded by 8 and followed by 21. The lower bound for a 13 therefore is 10.5, halfway between 8 and 13. The upper bound is 17, halfway between 13 and 21.
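That implied range can be computed mechanically. A small sketch of the idea follows; the `score_bounds` helper and the cut-off behaviour at the ends of the sequence are my own illustrative assumptions, not part of any standard:

```python
# Sketch: the implied range behind a scalar (Fibonacci) story point.
FIB = [1, 2, 3, 5, 8, 13, 21, 34]

def score_bounds(score):
    """Return the (lower, upper) range a Fibonacci score really represents:
    halfway to each neighbouring value in the sequence."""
    i = FIB.index(score)
    lower = (FIB[i - 1] + score) / 2 if i > 0 else score
    upper = (score + FIB[i + 1]) / 2 if i < len(FIB) - 1 else score
    return lower, upper

print(score_bounds(13))  # (10.5, 17.0)
```

Note how quickly the range widens: a 13 already spans six and a half points of genuine ambiguity, which is precisely the uncertainty businesses struggle to digest.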
If we look only at the historical time taken to deliver backlog items to create our average hours-per-point (our point velocity), we will probably get some wildly inaccurate predictions for the future, and being able to understand the cost to deliver an item would still involve a fair amount of guesswork and a large margin of error.
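To see how coarse that naive model gets, here is a toy sketch of averaging historical hours-per-point; the history figures are invented purely for illustration:

```python
from statistics import mean, stdev

# Hypothetical history: (story points, actual hours) per delivered item.
history = [(3, 10), (5, 22), (3, 16), (8, 30), (5, 14), (13, 80)]

hours_per_point = [hours / points for points, hours in history]
velocity = mean(hours_per_point)   # average hours per point
spread = stdev(hours_per_point)    # how unreliable that average is

print(f"{velocity:.1f} +/- {spread:.1f} hours/point")
```

Even in this tiny invented sample the spread is a sizeable fraction of the average, so any cost prediction built on the single average carries a large margin of error.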
What we need is a much richer data set from which to model our predictions.
DiSCUVery — Scale, Complexity, Uncertainty and Vagueness
Complexity indicates the difficulty of doing a piece of work, but something very simple can still take a long time to do. Building a wall, for example, is less complex than assembling a car, and building a small wall may take less time than assembling a car. Building a large wall, however, may take considerably longer even though it is only as complex as building the small one. Here we see a clear difference between scale (how big) and complexity (how difficult), the first two factors that affect how much effort will be needed to deliver.
The third factor, uncertainty, i.e. “we’re not sure how we would implement this solution” or “we’re not sure how big or how difficult it is”, may include technical learning or the unknown impact on security, accessibility, compliance etc. Uncertainty is, quite obviously, a common cause of inflated effort.
Vagueness, “it’s not clear what exactly is required”, is the final factor and, like uncertainty, is a cause of inflated effort. Vagueness often leads to ambiguity, which if left unrefined will usually lead to back-and-forth questions or, quite likely, to rework.
So, where before we asked a team to estimate effort with a single score, we can instead ask them to estimate effort in two ways (scale and complexity), as well as estimating the likelihood of that effort inflating with two more indicators (uncertainty and vagueness).
Asking a team to discuss and score a backlog item’s scale, complexity, uncertainty and vagueness has several primary benefits:
- It produces a richer picture of how big the job is, which helps estimate velocity and cost more accurately.
- The team are encouraged to consider early on the factors that affect the time to deliver, which I’ve seen increase the accuracy of scoring.
- High scale and complexity scores can indicate that an item may need to be dissected into smaller units of work.
- High vagueness and uncertainty scores are good indicators that a backlog item needs to be refined or reworked, or that more risk is being taken on.
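To make the captured data concrete, here is a minimal sketch of a record for one scored item. The `DiscuveryScore` name, field layout and refinement threshold are my own illustrative assumptions, not a prescribed format:

```python
from dataclasses import dataclass, field

# Hypothetical record for one diSCUVery-scored backlog item: the four
# scores plus the rationale notes the team captured alongside them.
@dataclass
class DiscuveryScore:
    item_id: str
    scale: int          # Fibonacci value (1, 2, 3, 5, 8, 13, 21, ...)
    complexity: int     # 1-5
    uncertainty: int    # 1-5
    vagueness: int      # 1-5
    notes: list[str] = field(default_factory=list)

    def needs_refinement(self, threshold: int = 4) -> bool:
        """High uncertainty or vagueness suggests refining before committing."""
        return self.uncertainty >= threshold or self.vagueness >= threshold

item = DiscuveryScore("PROJ-42", scale=8, complexity=3, uncertainty=4,
                      vagueness=2, notes=["Auth flow needs a spike"])
print(item.needs_refinement())  # True
```

Keeping the notes alongside the numbers matters: the scores alone tell you that an item is risky, but only the rationale tells you what to refine.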
DiSCUVery Sessions
The diSCUVery session replaces (or perhaps more accurately expands upon) the Planning Poker session. The way in which the scoring is performed, and what scores (or data) are captured, is very important to creating useful and accurate data from which analysis can be performed. My approach is detailed below and, while it worked for my team, it may need to be adjusted and evolved for others.
Each team member is given two sets of cards, one with the Fibonacci sequence integers on (or your preferred scalar sequence) and another with a simple 1 to 5 on (or you can just use your fists and fingers, rock-paper-scissors style).
The backlog item to be scored is introduced with a brief discussion. At least some of the team should already be familiar with the item and have had a chance to form their own opinions of it. First, the team discuss the vagueness and uncertainty of the issue. They each pick a score card from 1–5 for uncertainty and present them simultaneously; a brief further discussion should help determine whether the modal (most common) number is correct, or find consensus on another number. This is then repeated for vagueness. Notes must be captured that explain the numbers, such as areas needing clarification or further research, and the rationale for each score.
Next, the team discuss the scale and complexity of the issue. They repeat the scoring process above using the 1–5 cards to determine the complexity, then use the Fibonacci cards to determine the scale. Again, the team will need to find the modal number or reach a consensus, and the rationale for these scores should be captured.
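The reveal-and-converge step above can be sketched as a tiny helper; this is my own minimal illustration (it assumes Python 3.8+ for `statistics.multimode`), not a required tool:

```python
from statistics import multimode

# Sketch: find the modal score from simultaneously revealed cards.
# If several scores tie for most common, there is no clear mode and
# the team should keep discussing until they converge on one number.
def modal_score(cards):
    modes = multimode(cards)
    return modes[0] if len(modes) == 1 else None  # None = keep discussing

print(modal_score([3, 3, 2, 3, 4]))  # 3
print(modal_score([2, 2, 5, 5]))     # None
```

In practice the tied case is the interesting one: a split vote usually signals exactly the uncertainty or vagueness the session is designed to surface.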
Providing all four scores for a backlog item should be timeboxed, perhaps to just a few minutes per item. More complex, vague and uncertain items will naturally take longer to score, but getting bogged down in minutiae and detail is unnecessary, so encouraging swift progress is vital. The timebox you require will largely depend on your backlog refinement processes and how involved the team have been before the diSCUVery session. You should work to improve these areas as a team if you find the sessions overrun.
Gobble the Numbers
I have covered the problems with using a single-dimensional scoring mechanism and introduced an approach I have successfully used with my teams to create a richer data set. Analysing this data will help us better predict the effort that will be required to deliver backlog items. This can help to predict the cost of delivery, model historic velocities and perform targeted refinement, as well as to create product roadmaps.
The data can be used in several different ways to achieve these goals, and the correct approach will ultimately depend on what the team needs to learn. In a follow-up article, I’ll go through my preferred model for analysing this data, what I learnt along the way, and how to spot factors that might affect the model early on.
If you’ve enjoyed this article please do follow me and throw some claps my way. I’d love to hear any thoughts and feedback too so comment away!