So what? Even though engineers realize that, as long as you're talking abstract points and not holding me to a deadline, I have no reason to modify my estimate. And even if I -do- modify my estimate, as long as I continue with that new estimation mechanism for some period of time, velocity will change accordingly and the PO can -still- make accurate determinations of when something will be delivered. If I consistently under- or overestimate things, I am predictable. The more consistent I am, the less time it takes for my estimates to become uniform, and thus predictable. By not pinning it to time, the PO can take the average velocity and the current estimates, and have a largely accurate idea of when things will be done. The only time this ever breaks is when I have a reason to change my estimation strategy, which only happens when you start applying incentives for me to change it. Every time my incentives change, my estimates will change; don't change my incentives. Don't give me a date or a deadline.
We could do this with hours, except hours actually translate directly to time. From them I can determine when you think I should be done, and thus the implied goal; from points I can't. Because of that, I have an incentive to overestimate when it comes to hours, to give myself extra time 'just in case'. With points I don't; they don't imply an end time. Even if the PO's estimates are wrong, they can't get onto me, because I never had a due date. Because of that, my estimates tend to be consistent (even if inaccurate), and consistency = predictability. That's the goal: making it predictable when things will be finished (within a given tolerance; hours never give us that). Whether points are big or small, the PO can determine "This team has an average velocity of 20 points, whatever size those points may be. That means I can expect around 20 points per sprint in terms of stories. I have 50 points I need to get done for this next release, so I can expect it in three more sprints". All done without the engineers ever having a goal they are trying to make, with no incentive to change their estimates, and with literally the best predictability (not accuracy; you don't need accuracy) you can achieve, which, it turns out, is actually pretty damn good in practice.
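To make the PO's arithmetic concrete, here is a minimal sketch of that forecast. The function name and the sprint history are made up for illustration; the only numbers taken from the example above are the 20-point average velocity and the 50-point release.

```python
import math
from statistics import mean


def sprints_needed(backlog_points, past_velocities):
    """Whole sprints needed to finish a backlog at the team's average velocity."""
    avg_velocity = mean(past_velocities)
    if avg_velocity <= 0:
        raise ValueError("average velocity must be positive")
    # Round up: a partially used sprint still has to happen.
    return math.ceil(backlog_points / avg_velocity)


# Hypothetical sprint history that averages out to 20 points per sprint.
history = [18, 22, 19, 21, 20]

print(sprints_needed(50, history))  # 50 / 20 = 2.5, rounded up to 3
```

Note that nothing in here knows how big a point is. It only needs the team to keep estimating the same way it did while the history was being collected.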
I think you're missing the fact that points only matter relative to each other. There is no absolute meaning. The author hints at that, but I don't think he is explicit about it; he assumes you already know it. As such, a team tends to be consistent with them, even if not objectively accurate. And their mistakes average out into something predictable, even if not accurate, because there is no incentive, conscious or unconscious, to massage it.
Some people -do- decide to use t-shirt sizes for sizing, rather than points. It still works. This is generally equivalent to 3, 5, and 8 using points (Fibonacci). Most people who use points say anything less than a 3 is wrong (make it a 3), and anything larger than an 8 should be broken up because otherwise it's difficult to estimate (with -maybe- a 13). So use whichever you like: S, M, L, and maybe XL, if you want to map to four options, as you have with points. Though you need a way to aggregate them to determine a velocity (how many S = 1 M, etc.). That's why people tend to use points.
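If you do go with t-shirt sizes, the aggregation step can be as small as a lookup table. This is only a sketch assuming the S/M/L/XL to 3/5/8/13 mapping described above; the names `SIZE_TO_POINTS` and `sprint_velocity` are hypothetical, and a team could agree on different weights.

```python
# Assumed mapping from t-shirt sizes to points (S/M/L as 3/5/8, XL as the "maybe" 13).
SIZE_TO_POINTS = {"S": 3, "M": 5, "L": 8, "XL": 13}


def sprint_velocity(completed_sizes):
    """Sum the point value of every story completed in a sprint."""
    return sum(SIZE_TO_POINTS[size] for size in completed_sizes)


# Hypothetical sprint: two smalls, one medium, one large.
print(sprint_velocity(["S", "S", "M", "L"]))  # 3 + 3 + 5 + 8 = 19
```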
Basically, the author said, flat out, it's a psychology game. AND IT IS. What a PO needs from a dev is consistency in how they estimate, not accuracy. From that you can measure the actual work completed over time, and get a measure of velocity, which can be used to accurately predict the delivery of future stories, within a pretty good tolerance. Ensure the psychology for that consistency is there. Points are part of it. No goals, milestones, deadlines, etc, are part of it.
My point is that no story points need to exist. If you want to average the amount of work done over a period of time, then just count bugs and stories completed. The Central Limit Theorem applies equally well to stories over the long term, so just collect data. Automatic.
We add story points, presumably because we don't want to wait to gather enough data on stories, but the neutral position is to not use them because they incur a cost (meetings, training to get consistency right, etc). So why have them? I have not seen a convincing argument.
Sure, we could treat all stories as being of a single size and rely on the central limit theorem, but that takes far, far longer to converge. On the order of years, I suspect, not weeks or months, which is what you get out of pointing. "We should have this done sometime between now and 2022" is not a useful metric for a PO.
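A toy illustration of why treating every story as the same size hurts in the short run. All numbers here are invented: a team that historically closes 4 stories and 20 points per sprint, and two backlogs with the same story count but very different sizes. It is a sketch of the argument, not evidence for it.

```python
import math


def forecast_by_count(backlog, stories_per_sprint):
    """Forecast sprints by counting stories, ignoring their size."""
    return math.ceil(len(backlog) / stories_per_sprint)


def forecast_by_points(backlog, points_per_sprint):
    """Forecast sprints from the summed point estimates."""
    return math.ceil(sum(backlog) / points_per_sprint)


# Two hypothetical backlogs, each with eight stories.
small_backlog = [3] * 8    # 24 points in total
large_backlog = [13] * 8   # 104 points in total

print(forecast_by_count(small_backlog, 4), forecast_by_count(large_backlog, 4))     # 2 2
print(forecast_by_points(small_backlog, 20), forecast_by_points(large_backlog, 20)) # 2 6
```

Counting gives the same answer for both backlogs. Over enough sprints the mix of sizes averages out and counting catches up, but for any single release the mix matters.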
We add story points because, given consistent incentives, estimates tend to be consistent. Maybe consistently under or consistently over, and obviously there's a bit of wiggle room from estimate to estimate, but they tend to converge quite quickly, to within a week or two's uncertainty about what we'll have done at any given point within the next six months, and to within 6 weeks over the next year. Quite a far cry from not using them.
Meetings? Some, sure; you'd have them anyway just defining what it is you're doing. The additional burden of assigning points takes up maybe half an hour per person every week or two. Training to get consistency right? There's no training involved. The hardest thing is to get people accustomed to picking a number, relative to the others. But that's not that hard either; we basically just took the first sprint's stories, organized them into a line (much like his rectangles) going from least to most complex, then discussed where to draw three vertical lines, separating them into four distinct sizes: 3s, 5s, 8s, and 13s. We then made sure the largest didn't feel too large compared with the other 13s (or else it might actually be a 21 compared to what we had agreed a 13 was, and so we had to split it up), and from there we always had 'reference stories' to decide whether something felt more like a 5 or an 8, say. And while we sometimes differed, we could always hash out why we differed and come to an agreement on exactly how large the story was. Again, per the OP, so long as you are consistent with how you address those discrepancies, your overall estimates will be consistent, and give you predictability.
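The line-up-and-draw-three-lines exercise can be sketched in a few lines of code, if that helps picture it. The story names and cut positions below are invented, and the ordering itself still has to come from the team's discussion; the code only does the bucketing.

```python
from bisect import bisect_right

SIZES = (3, 5, 8, 13)


def assign_points(ordered_stories, cut_indices):
    """Bucket stories (already sorted from least to most complex) into sizes.

    cut_indices holds the three positions where the team drew its lines;
    each value is the index of the first story in the next bucket.
    """
    if len(cut_indices) != len(SIZES) - 1:
        raise ValueError("need exactly three cut positions for four sizes")
    return {
        story: SIZES[bisect_right(cut_indices, position)]
        for position, story in enumerate(ordered_stories)
    }


# Hypothetical first-sprint stories, ordered by the team from simplest to hardest.
stories = ["fix typo", "add field", "new report", "export csv",
           "sso login", "audit log", "billing rewrite"]

reference = assign_points(stories, cut_indices=[2, 4, 6])
print(reference["fix typo"], reference["export csv"], reference["billing rewrite"])  # 3 5 13
```

The resulting dictionary plays the role of the 'reference stories': a new story gets compared against those, not against a clock.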
One half hour per week or two is hardly a huge cost to pay when it gives the business the ability to accurately predict when we'll have something delivered.
I don't care if the argument is convincing. I've seen it work. You're free to do whatever you want; I know what I've seen work, and what I've seen not work. I've yet to find something that works so well.
Your claims are outrageous with no supporting evidence.
If you just want to throw out anecdotes, I'll give you my own. I collected data on all the pointed stories at one previous job for three years and found a negative correlation between story points and the time from a story being started to being completed.
Sure, you might claim that it all averages out in the end, except that our velocities were wildly in flux for that entire span as well. But we were just "doing it wrong" right?
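For anyone who wants to run the same kind of check on their own tracker export, here is a minimal sketch of a Pearson correlation between story points and cycle time. The sample data is made up purely to show the mechanics; it is not the three-year dataset mentioned above, and `pearson` is just a hand-rolled helper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return covariance / (spread_x * spread_y)


# Invented sample: points assigned vs. days from start to completion.
points = [3, 5, 8, 13, 3, 5, 8, 13]
days = [2, 4, 6, 11, 3, 3, 7, 9]

r = pearson(points, days)
print(f"r = {r:.2f}")  # positive here; a negative r is what the comment above reports
```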
I doubt that you did, if I understand you. Because it sounds like you're saying that, overall, 3-point stories took the most time, and 13s (or whatever your max was) took the least.
But let's say it did. I'd look to see why your estimates fluctuated all over the place. Or whether stories were being closed when they were actually finished (i.e., being accurately reported). Did you have deadlines? Did you have delivery pressures? Because that right there is a good reason; as you near a deadline you start padding estimates more. Did you keep having things come up that broke the sprint? Etc. All manner of things can cause estimates to be wrong. But not negatively correlated, -especially- with velocities constantly in flux (and I mean seriously in flux; you take an average because it can and will vary, especially if there's unexpected stuff, like someone getting sick, that you didn't account for when planning); that to me, yes, definitely sounds like you were doing something very, very wrong.
If you look at the first Central Limit Theorem slide in the original article, it compares story precision vs precision of unestimated bugs. The short answer is that if your stories are big, not estimating them causes more error (weeks or months) than most people are willing to accept. Not so with small tasks (bugs) which average out, as you suggest.
However, it’s hard to make long-range estimates using only many tiny stories/bugs, because you don’t want to break the job down with such granularity for months or years into the future - plans will change by then, and all that design work will have been wasted. That’s what makes big stories useful; you can estimate months of work in a few minutes. But because they’re so big, you can’t treat them all as equal sized.
Except you can't, as I've said, engineers are not good at estimating. They get significantly worse when you start estimating months out instead of days out.
The problem is not that they're engineers; no one is good at estimating work they've never done before. This is a well-known problem in pretty much every single software shop I've ever been in. Teams never deliver what was planned on time; only functional teams cut features for releases. This is not a "win" for estimation.
> Except you can't, as I've said, engineers are not good at estimating. They get significantly worse when you start estimating months out instead of days out.
Because you keep focusing on time estimates instead of point estimates. People have intrinsic biases related to time and their productivity. Like how most people implicitly assume they're above average in intelligence, looks, etc.