
Best Practices for Managing Estimates


“Tell me how you measure me and I will tell you how I will behave. If you measure me in an illogical way… do not complain about illogical behavior…” — Eliahu Goldratt

Careful thought needs to be put into the metrics that are created, especially into how they are used at the strategic level. Though the remit of this P & P may initially appear to be just tactical-level estimates of effort, such estimates inevitably interplay with the strategic level through the changes in behavior they encourage.

With the implementation of DevOps using Azure to create an end-to-end, seamless product development process, ever tighter cooperation and integration will require corresponding changes in behavior.

Eliahu Goldratt is mentioned because, about twenty years ago, Medtronic was having great difficulty getting work done as a coordinated entity – a team of stars does not a star team make. By many industry-wide measures of institutional maturity, Medtronic ranked low. It seemed that every team/department/group behaved as if it were alone in the world. There was little Situational Awareness: nobody knew what was going on, so nobody knew what to do.

After a long search, the teachings of Eliahu Goldratt’s Theory of Constraints (TOC) were seen as a solution, and a consultancy was brought in to promulgate them throughout Medtronic. The consultancy also brought in Concerto, a scheduling tool based on TOC.

Whereas a typical project scheduling process-tool-environment let each task owner “pad” their estimates, hopefully improving the chances that the task would be done within that padded time frame, the Theory of Constraints dispensed with individual padding altogether. Each task owner was expected to enter the most optimistic estimate for completion. The padding was consolidated into a single project-wide pool, the “buffer”. This was the first indicator to newcomers that a “we are all in this together” philosophy was at work. If a task fell behind, it would start “eating into the buffer”, and the whole project would rush to the aid of that task owner. The right metric therefore encouraged the right behavior.

Though that Theory of Constraints effort is now part of history, judging from recent town hall meetings, the lessons were learned and behavior has changed for the better.

Though estimates happen on the tactical level, they will inevitably interplay with the strategic level

Top-Down Waterfall will clash with bottom-up Agile

The purpose of effort estimation is ultimately to minimize surprise when executing a plan or, put another way, to maximize predictability. The hope is that as one gets better at estimating tasks, one gets better at scheduling projects. The ultimate state is reached when a project’s effort is estimated perfectly and execution goes such that the project completes on time and on budget.

That is the ideal. It is also a Waterfall approach, with all of its issues.

In contrast, the Agile approach's goals are subject to change. This flexibility is gained by breaking the project into mini-projects that can be used to learn more (often about product definition or new technologies) and thus change the final goal in response. This approach is also more team focused. In an Agile context, estimates are used to determine a given team’s velocity, to improve estimates of how much effort tasks take, and to determine whether a team is ahead of or behind its baseline velocity. This awareness allows the team to adjust its goals. Perfect estimation and execution is not the aim.

Thus “rolling up” the estimates of different teams into one main schedule will inadvertently create a clash between the top-down Waterfall and the bottom-up Agile approaches.

This clash is not necessarily a bad thing: in software development especially, those “devils in the detail” are notorious for being able to upset the greatest of plans. With the right kind of estimation metric, emergent trouble can often be spotted early on the strategic level.

For example: roll-ups of estimates typically carry statistical variability measures. Whereas removing variability is a good thing for the deliberately routinized work of manufacturing, it signals a bad thing in creative, problem-solving work like R&D. Thus, paradoxically, when R&D projects predictably meet their milestones within a narrow band of variability, that is often a signal of too little risk-taking and thus too little innovating. DevOps work involves researching ever-new processes, tools, and environments; thus it, too, will at times take much longer than planned.

The “apples-to-oranges” problem when rolling up performance metrics to the strategic level

Note! Standardization of the metrics to be gathered organization-wide needs to happen quickly: projects are already working in Azure DevOps and forging ahead with their own metrics, which may prove to be un-roll-up-able.

1) One Use Case for leadership has been to roll up the teams’ Velocity measures into an overall project-level Velocity. Since each team is free to use its own scoring system, such roll-ups add “apples to oranges” and are unusable. The solution to this problem may take two forms:

    a) Tell all projects to use, say, Fibonacci scoring constrained to 1 to 21. The problem with this is that teams already working in Azure DevOps with their own scoring may find their historical scores unusable.

    b) Or, leadership should ask themselves a variant of the “five whys” to find the root purpose of this metric. One such deep use is to know whether the strategic-level effort is ahead, behind, or okay. Then each team should vote on whether its local effort is ahead, behind, or okay. The roll-up would then be a histogram of how many teams fall into each category (see the sketch below).
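
As a minimal sketch of this histogram roll-up, assuming each team self-reports a single status (the team names and votes below are hypothetical):

```python
from collections import Counter

# Hypothetical team status votes: each team self-reports whether its
# local effort is "ahead", "okay", or "behind" -- no cross-team score
# normalization is needed for this roll-up.
team_votes = {
    "Team Alpha": "behind",
    "Team Beta": "okay",
    "Team Gamma": "ahead",
    "Team Delta": "behind",
}

# The strategic-level roll-up is simply a histogram of the votes.
histogram = Counter(team_votes.values())
for status in ("ahead", "okay", "behind"):
    print(f"{status:>6}: {histogram.get(status, 0)} team(s)")
```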

2) Another Use Case for Velocity is scheduling, e.g. how long will the strategic-level effort take and when will it be done? Using both estimated and actual hours instead of Fibonacci scores can be more intuitive and lends itself to easier probability-distribution-based total-effort estimates (a Beta distribution, for example, as sketched below). It does impose a cost and a risk of too much time being spent estimating how much time will be spent, a problem that Fibonacci scoring avoids.
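
As a minimal sketch of such a probability-distribution-based estimate, here is the classic PERT (Beta-distribution) roll-up from three-point hour estimates; the task hours are hypothetical:

```python
import math

# Hypothetical three-point estimates per task:
# (optimistic, most_likely, pessimistic) hours.
tasks = [
    (4, 8, 20),
    (10, 16, 40),
    (2, 3, 6),
]

total_mean = 0.0
total_var = 0.0
for o, m, p in tasks:
    total_mean += (o + 4 * m + p) / 6   # classic PERT (Beta) mean
    total_var += ((p - o) / 6) ** 2     # classic PERT variance

print(f"Expected total: {total_mean:.1f} h "
      f"(std dev {math.sqrt(total_var):.1f} h)")
```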

3) Care must be taken to avoid another kind of “apples-to-oranges” problem on the strategic level, i.e. self-contradictions in how leadership uses the metrics:

“Don’t ask people to collaborate if they know that, in the end, there will be a winner and a loser.” – Eliahu Goldratt

Goldratt's quote highlights one way the use of metrics can produce adverse results when inherent contradictions are not removed. There are others; to name just a few:

a) For example, a team investigating a new technology, which is inherently impossible to estimate, is always late. Another team tackles safe stories and thus is always on time. In the roll-ups to the strategic level, the former team is seen as a perpetual laggard, whereas the latter is seen in a positive light. Thus the riskiness of a task needs to be a metric.

b) For example, a team is generous in its help to other teams because it has the expertise, and as a result the whole organization's productivity improves. The helpful team itself is likely to be perceived as less productive. Thus reports of how a team's members helped other teams need to be tracked.

c) For example, a team needs IT services to be able to do its work. The IT department is outside of the team's organization and doesn't fulfill the need because it operates under different incentives. The team is penalized for IT's behavior. Thus it is important to track qualitative reports of collaboration issues.

The metrics that aren't there can be more important than the metrics that are measured

"You don't put the armor on the bombers where the bullet holes are. You put the armor where the bullet holes aren't because those bombers never came back to have their bullet holes counted." – Abraham Wald on Survivorship Bias

In WW2, the statistician Abraham Wald was asked by the US government to help improve the survival rates of bombers. Previous researchers had counted and mapped the bullet holes of bombers that had returned and wanted to add armor plating in the areas of greatest bullet-hole density. Wald had a radical insight and insisted that the armor be placed where the bullet-hole density was least, because any bomber that received damage in those areas was fatally damaged and never made it back to base. Thus armor was added over the pilot section, tail, and engines.

The lesson for metrics use at the strategic level is that the best metrics, the ones that really tell one what is going on, are often the hardest to collect. For example, because Velocity is easily available as a metric, it becomes the basis for leadership to intervene to improve the "laggard" team. Yet velocity may be dropping for positive reasons (see (3) above), and uncovering those requires investigation into the "ecosystem".

Current Practices & Recommendations

Time / Effort Measures

In this approach, users don’t put down, say, hours to estimate task effort; instead they score tasks using Fibonacci numbers: 1, 2, 3, 5, 8… The higher the number, the more effort a task takes relative to others, in jumps of ~50-60%.

Consistency in scoring, both within a team and across teams, is important. "Reference" stories and tasks, with recommended Fibonacci scores, are needed to set a standard of consistency and avoid an "apples-to-oranges" problem when the data is rolled up (see the sketch below).
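
As a minimal sketch of how a shared reference table might anchor scoring (the stories and the snapping helper are hypothetical):

```python
# Hypothetical shared "reference story" anchors, agreed across teams.
REFERENCE_STORIES = {
    1:  "Change a config value and redeploy",
    3:  "Add a field to an existing report",
    8:  "Integrate a new internal REST endpoint",
    21: "Prototype an unfamiliar third-party service",
}

FIBONACCI_SCALE = [1, 2, 3, 5, 8, 13, 21]  # capped at 21

def snap_to_scale(raw_effort_guess: float) -> int:
    """Snap a raw relative-effort guess to the allowed Fibonacci scale."""
    return min(FIBONACCI_SCALE, key=lambda s: abs(s - raw_effort_guess))

print(snap_to_scale(18))  # -> 21
```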

Best practice suggests that there be a maximum Fibonacci score, and that this maximum be standardized across all teams.

Pros
  • This avoids wasting the team’s time tweaking an estimate to be, say, “20” vs “21” vs “22”. Instead, the big jumps make it easier for the team to categorize the effort needed for a task – “just pick a T-shirt size” rather than obsess about precision. It makes transparent the fact that humans aren’t very precise.
  • It also encourages consistency, as users concern themselves with relative, not absolute, levels of effort.
  • Velocity trends quickly reveal the team’s estimating ability, as over/under-estimations are amplified
Cons
  • Larger tasks’ scores underestimate uncertainty; as DeMarco observed, “An estimate is the most optimistic prediction that has a non-zero probability of coming true”, which in practice means roughly a 10% chance of being met.
  • Estimates are hard to translate into enterprise-wide schedule terms because Fibonacci scores weigh intra-team relative effort, not absolute cross-team effort. “We are doing Agile on the team level, but Waterfall on the company level.”
  • Scores are hard to compare across teams because each team has its own idiosyncratic use of the Fibonacci scale

Story Points Based on Hours (Deprecated)

Story points map to a normally distributed range of time, say 1 point = 4-12 hours with a mean of 8, and multiple points are assigned to tasks (see the sketch below).
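
A minimal sketch of this mapping, assuming the 4-12 hour range above:

```python
# Deprecated hour-backed story points: 1 point = 4-12 hours, mean 8.
POINT_HOURS = {"low": 4, "mean": 8, "high": 12}

def points_to_hours(points: int) -> tuple[int, int, int]:
    """Translate a point total into (low, mean, high) hours."""
    return (points * POINT_HOURS["low"],
            points * POINT_HOURS["mean"],
            points * POINT_HOURS["high"])

print(points_to_hours(10))  # -> (40, 80, 120), as in the Pros example
```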

Pros
  • Since points are associated with hours, they are intuitive to understand
  • The budget for points for a team can be easily calculated
  • Velocity calculations, say 10 points, can be translated into schedule terms of 40-120 hours with a mean of 80 hours
Cons
  • Hard to compare cross-team scores because each team has its own idiosyncratic mapping of Story Points to hours
  • It ties up time in the estimation process
  • It diverts focus
  • Runs counter to Agile philosophy

Hours (Project++, a third-party cross-project analytics dashboard app considered for use by Simplified DevOps Arch., requires hours for Tasks)

Hours are estimated directly.

Pros
  • Intuitive to do and to understand
  • It can be rolled up to the strategic level and be statistically analyzed
  • It can be easily used for schedule estimates.
  • Actual hours can be compared to Estimated hours to analyze for estimation error patterns (see the sketch after this list)
Cons
  • Bigger tasks are harder to estimate
  • Easy for a team to spend too much time on estimation
  • Runs counter to the Agile philosophy
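
As a minimal sketch of such an estimation-error analysis (the task names and hours are hypothetical):

```python
# Hypothetical tasks with estimated vs. actual hours.
tasks = [
    {"name": "API auth",  "estimated": 8, "actual": 13},
    {"name": "DB schema", "estimated": 5, "actual": 4},
    {"name": "CI config", "estimated": 3, "actual": 9},
]

for t in tasks:
    ratio = t["actual"] / t["estimated"]
    print(f"{t['name']:<10} off by {ratio:.1f}x")

# A mean ratio persistently above 1.0 suggests systematic underestimation.
mean_ratio = sum(t["actual"] / t["estimated"] for t in tasks) / len(tasks)
print(f"mean ratio: {mean_ratio:.2f}")
```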

Risk Measures (New!)

A key score to know is the amount of risk a given activity (aka Work Item) represents to the project as a whole. This lets one know the "risk profile" of a project, as well as how much risk has been reduced (aka "burnt down") as tasks are completed.

Thus risk assessment for a given Work Item requires awareness of the overall project, and requires knowing the risks and effort for all of a project's Work Items in advance. This makes risk assessment inherently non-Agile, i.e. Waterfall.

The "risk profile" shows the shape of how project risk is expected to diminish as each step is completed.

  • In the best of worlds, a "risk profile" for a project takes on the shape of an "L", as the first step eliminates all project risk and the rest of the steps carry no risk at all.
  • In a typical project, the "risk profile" looks more like a descending staircase, as each step's completion removes a proportional amount of project risk.
  • A "bad project" would, in the extreme, have a "risk profile" that resembles a "¬", where the risk remains at 100% until the last step. (This is the risk profile of a rocket launch: not until the satellite is successfully deployed is the risk at 0.)
  • Projects that have more risk at the end ("convex" projects) signal to the project runners to be additionally pro-active in removing blocks.
  • Projects that have more risk at the beginning ("concave" projects) signal to the project runners that this is a more forgiving situation, like a POC, where one will get an answer quickly. (See the sketch after this list.)
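
As a minimal sketch, one hypothetical way to classify a profile as convex or concave is to compare how much risk the first half of the plan removes:

```python
def classify_profile(risk_removed_per_step: list[float]) -> str:
    """Label a risk profile by where most of the risk is removed."""
    half = len(risk_removed_per_step) // 2
    first_half = sum(risk_removed_per_step[:half])
    total = sum(risk_removed_per_step)
    if first_half > total / 2:
        return "concave (forgiving, answers come early)"
    return "convex (back-loaded, be pro-active about blocks)"

print(classify_profile([0.5, 0.2, 0.2, 0.1]))  # risk removed early
print(classify_profile([0.1, 0.1, 0.2, 0.6]))  # risk held to the end
```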

The "risk burndown" shows how much progress is being made in reducing overall risks.

Fibonacci Risk Score

$$ \text{RiskScoreForProject}=\frac{\sum_{i=1}^{N}\text{FibonacciScoreForActivity}_i}{\text{MaximumFibonacciValue}\times N} $$

$$ \text{RiskBurndownOfIndividualActivity}_X=\frac{\text{FibonacciScoreForActivity}_X}{\text{MaximumFibonacciValue}\times N} $$

where $N$ is the total number of activities.
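
A minimal sketch implementing the two formulas above; the activity scores and the maximum of 21 are hypothetical:

```python
MAX_FIB = 21                         # standardized maximum Fibonacci value
activity_scores = [2, 5, 21, 8, 3]   # hypothetical scores per Work Item

n = len(activity_scores)
denominator = MAX_FIB * n

project_risk = sum(activity_scores) / denominator
print(f"Initial project risk score: {project_risk:.2f}")

# "Burn down" risk as each activity completes, in plan order.
completed = 0
for i, score in enumerate(activity_scores, start=1):
    completed += score
    remaining = (sum(activity_scores) - completed) / denominator
    print(f"after activity {i}: {remaining:.2f} risk remaining")
```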


Priority

Priorities usually stay within a team, and it doesn't make sense to roll them up to higher levels, though one can do so as long as everyone agrees on a common ranking score.

Dewey Decimal Ranks

Using floating-point ranks allows one to insert new activities between two existing ones without renumbering (see the sketch below).
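
As a minimal sketch of such fractional ranking (the activity names are hypothetical):

```python
# Floating-point ("Dewey decimal") ranks: a new activity can take the
# midpoint rank between two neighbors, so nothing else is renumbered.
activities = {"Design review": 1.0, "Implement API": 2.0, "Ship": 3.0}

def insert_between(ranks: dict, name: str, lower: str, upper: str) -> None:
    """Rank `name` at the midpoint between two existing activities."""
    ranks[name] = (ranks[lower] + ranks[upper]) / 2

insert_between(activities, "Security audit", "Implement API", "Ship")
for name, rank in sorted(activities.items(), key=lambda kv: kv[1]):
    print(f"{rank:>4}: {name}")  # Security audit lands at 2.5
```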