Archive for May, 2007

Brooks’s Law and intercommunication: you talkin’ to me?

May 29, 2007

Surely everyone involved in developing software has heard of Brooks’s Law. First presented in the eponymous chapter of Frederick P. Brooks, Jr.’s classic The Mythical Man Month, it states: “Adding manpower to a late software project makes it later.” This “law” is much beloved by software developers as a handy bucket of cold water with which to cool the ardor of overly enthusiastic managers and executives. Lately, however, I’ve been thinking about Brooks’s Law and rereading The Mythical Man Month and I’m no longer as impressed with Brooks’s analysis as I once was. This is the third in a series of posts discussing some of the reasons why. The first post in the series discussed training costs and the second talked about sequential constraints.

In addition to training costs and sequential constraints, part of the justification for Brooks’s Law is the claim that as a team expands the cost of communication between team members grows faster than the total productivity of the team. As with the issue of training costs, Brooks has a point — the number of possible pairwise communication paths among n people is n(n – 1)/2; that’s a simple fact of math. If you then assume that every possible pair will need to spend a certain amount of time communicating or, equivalently, that the whole team will have to get together for meetings whose total length is determined by the number of people on the team, then it’s true that the number of person-hours spent communicating will grow as the square of the number of people while total productivity, again measured in person-hours, will only increase linearly. We can all see where that’s heading — pretty soon the amount of time spent on communication will be greater than the total amount of time available to work on anything at all and nothing else will get done. But how soon?

To take a concrete example, suppose we’ve got a six person team that we’re thinking of expanding to eight; should we be concerned that the increasing communication costs will eat up any additional productivity we might get from the two extra people? We can figure it out. Suppose that each pair on our team gets together for a pure-overhead, one-hour tête à tête every week. Assuming a week is five eight-hour work days, the whole team spends 30 person-hours on communication per week out of 240 person-hours worked, leaving 210 person-hours of productive work. What happens if we expand the team to eight? Each person will now spend seven hours a week in pairwise communication and the team as a whole will spend a total 56 person-hours a week communicating. But the team will also now be able to do a total of 320 person-hours of work, leaving 264 person-hours to be spent on productive work, or 54 more hours. This calculation does demonstrate Brooks’s larger point, that a “man month” is not a useful measure of productivity — if it were, then expanding a team from six to eight, a 33% increase in size, would likewise increase productivity by 33%, not the approximately 26% we actually get. But this example doesn’t justify, by itself anyway, a blanket claim that adding people to a project will always slow it down — the eight person team can, in fact, get more done than the six person team and therefore should finish the same amount of work sooner, all other things being equal. Of course all other things are not necessarily equal — training costs can reduce the initial productivity of new team members and it’s conceivable the sequential constraints introduce a long leg that can’t be reduced by adding people. In a later post I’ll discuss whether these three factors together might be enough to justify Brooks’s Law.
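
Here’s the same arithmetic as a little Python sketch, in case you want to replay it with different assumptions (the function and its defaults are mine; the 40-hour weeks and one-hour pairwise meetings are the numbers from the example above):

    def productive_hours(team_size, hours_per_week=40, meeting_hours_per_pair=1):
        # Each of the n(n - 1)/2 pairs meets for meeting_hours_per_pair hours,
        # which costs 2 * meeting_hours_per_pair person-hours since both
        # members of the pair attend.
        pairs = team_size * (team_size - 1) // 2
        communication = 2 * meeting_hours_per_pair * pairs
        return team_size * hours_per_week - communication

    for n in (6, 8):
        print(n, productive_hours(n))
    # 6 -> 210 productive person-hours (240 worked, 30 spent communicating)
    # 8 -> 264 productive person-hours (320 worked, 56 spent communicating)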

So, for this team, a jump from six to eight people probably won’t add more communication costs than the productivity of the new people. But obviously Brooks is right that eventually quadratic growth will outpace linear growth.1 With a bit of math, we can figure out exactly when the costs of communication start outpacing gains in the ability to do useful work for a given amount of per-pair communication overhead and a given number of hours worked per week. If we have a team of n people that work h-hour weeks and in which each pair spends c hours per week on communication overhead, then the cost to the existing team members of adding one person to the team is n×c because each of the n current team members will now be part of a pair with the newcomer and will have to spend an extra c hours per week tending to that pair’s communication needs. Meanwhile the newcomer will also spend n×c hours per week on pairwise communication which, when subtracted from the h total hours they’ll work each week, gives us the amount of non-communication work per week they’ll add to the team’s total capacity. When the amount of productivity lost from the existing team members is the same as the amount of productivity added by the newcomer there’s no point in expanding the team. So we can solve this equation for n:

    n×c = h − n×c

to get this:

    n = h / (2×c)

With this formula we can determine that for a team with one hour per week of per-pair overhead, working 40-hour weeks, the biggest the team can get without losing more productivity than it gains is 20.
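
And a quick sanity check of that break-even point in Python (again, the function is mine; the inputs are the 40-hour week and one hour of per-pair overhead from above):

    def breakeven_team_size(hours_per_week, hours_per_pair):
        # Adding a person costs the existing n members n * c hours per week and
        # adds h - n * c productive hours, so the two balance when n = h / (2c).
        return hours_per_week / (2 * hours_per_pair)

    print(breakeven_team_size(40, 1))  # 20.0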

But all of these computations may be beside the point, as they’re based on the assumption that communication has to be overhead. What if we had a team of six people that spend not one, but eight hours per day on pairwise communication because they spend all their time pair programming? If we add two people to that team, there is no change in time spent per person on pairwise communication — the only change is, assuming the team rotates partners, that each person will pair with each other person less often. But the communication time obviously can’t be all overhead — the only time anything gets done is when two people are communicating.

So how can this possibly square with Brooks’s analysis? One possibility is to join the ranks of the XP skeptics and simply deny that the pair programming team could possibly get anything done. I’ve had good experiences with pair programming though so I can’t buy that. I think the problem is with Brooks’s underlying assumptions. As I’ve mentioned previously, Brooks assumes that an n-person team will partition the task of writing whatever software they need to write into n pieces, each to be written by one person. To the extent that those pieces of software need to talk to each other, so do the people writing them and this communication is extra work on top of the base amount of work required to write the software. His arguments about training costs, intercommunication, and sequential constraints are all aimed at demonstrating that a task that a single developer could complete in x months will take n developers more than x/n months because the amount of work required for n developers is no longer simply x but x plus overhead.

But there’s another possibility. What if, as I suggested in my post about Sisyphus, n people don’t have more than x work to do but less than x because n people working together and communicating a lot are much more likely to discover a better solution than any one of them working alone? In that case, time spent communicating is not extra work but a way of reducing the total amount of work done.


1. One thing to note about the growth of intercommunication costs is that it is quadratic, not — as some writers have described it — exponential. Quadratic growth is faster than linear, for sure, but nowhere near as fast as exponential. Populations with no limits on their growth — bacteria in Petri dishes or rabbits in Australia — grow exponentially. If communication costs did grow exponentially with the size of the team, then a team would go from spending just slightly over half its time on communication to being able to do nothing but communicate, just by adding one person. One author who should certainly know better is Steve McConnell, who described the growth of communication paths as “exponential” in Software Estimation (p. 57). In fact he did know better — in his earlier book, Rapid Development, he described the growth, correctly, as “multiplicative” (p. 311).

A little person with a sense of humor

May 28, 2007

Today my wife was downstairs playing with our eight-month-old daughter, Amelia, and all I could hear was the sound of Amelia laughing, laughing, laughing. I mean, really cracking up then settling down a bit and cracking up all over again. I’m sure I’m far from the first person to have had this feeling but it gives me some small measure of hope for the human race that this little person who barely knows her own name and doesn’t know enough not to crawl off the edge of the bed, has, if nothing else, a sense of humor.

Practical Common Lisp going into 3rd printing

May 26, 2007

I just found out that Apress has decided it’s time for a third printing of Practical Common Lisp. If I recall correctly, the first printing was 5,000 copies, the second 3,000 more. New printings are called for when the publisher thinks they’re going to run out of copies to sell to distributors so this must mean I’m not crazy to dream of someday having a 10k-copies-sold party.

This also means now would be a good time, if you’ve read the book and noticed any errors that you’ve not emailed me about, to send a note. If you put “pcl errata” in the subject it’ll make my life a bit easier. Note, however, that this is just a new printing not a new edition. For a new printing we just fix minor typos and so forth so now is not the time to tell me that there should really be a chapter about how to connect to RDBMSes or what have you.

Brooks’s Law and sequential constraints: one damn thing after another

May 22, 2007

Surely everyone involved in developing software has heard of Brooks’s Law. First presented in the eponymous chapter of Frederick P. Brooks, Jr.’s classic The Mythical Man Month, it states: “Adding manpower to a late software project makes it later.” This “law” is much beloved by software developers as a handy bucket of cold water with which to cool the ardor of overly enthusiastic managers and executives. Lately, however, I’ve been thinking about Brooks’s Law and rereading The Mythical Man Month and I’m no longer as impressed with Brooks’s analysis as I once was. This is the second in what I expect will be a series of posts discussing some of the reasons why. The first post in the series discussed training costs.

As part of his “demythologizing of the man-month” (p. 25) Brooks points out that developing software is subject to what he calls “sequential constraints”. Brooks actually makes two points about sequential constraints, but he doesn’t draw a particularly clear distinction between them in his exposition, so I’ll start by teasing them apart. They are:

  1. All tasks, including software development, have some sequential constraints on their subtasks that determine the minimum time in which the whole task can be completed.
  2. Even if the rest of the work can be parallelized, communication needed to coordinate work can act as a sequential constraint, putting a lower bound on the time needed to complete a task.

Point one is a simple matter of logic. Virtually every task, from harvesting a field of crops, to having a baby, has some sequential constraints on some of its subtasks that determine that certain subtasks can only be done after other subtasks are complete. It’s impossible to complete the whole task in less time than the time it takes to do the longest sequential chain. By definition, subtasks that are not sequentially constrained can be done in parallel and so more people working on them at the same time will get them done sooner than fewer people.

Tasks vary in both the nature and the extent of the sequential constraints that apply to their subtasks. Brooks gives harvesting crops as an example of a task with very few sequential constraints and bearing a child as one that is nothing but a long sequentially constrained chain. It’s worth noting, however, that all real-world tasks are sequentially constrained at some level — even harvesting a field of crops, for instance, requires that someone get to the farthest corner of the field, harvest what’s there, and bring it back. No matter how many field hands you hire and no matter how minuscule the part of the field each is responsible for, there’s still no way to harvest the whole crop faster than that long leg can be completed.

For practical purposes, however, Brooks is right: harvesting crops is almost entirely parallelizable and bearing a child is almost entirely not. Before we get to how communication itself can act as a sequential constraint, let’s consider another task which is more sequentially constrained than harvesting crops but less so than having a baby, namely baking a cake. Consider for instance this simple cake recipe:

  1. Pre-heat the oven to 350°.
  2. Prepare the cake pans — greasing, lining, and flouring.
  3. Sift together flour, baking soda, and salt.
  4. Cream the butter, shortening, and sugar until light and fluffy.
  5. Add dry ingredients to butter/shortening/sugar mix.
  6. Mix in three eggs.
  7. Pour batter into cake pans.
  8. Bake for 25 to 30 minutes.
  9. Cool on racks for 10 minutes.
  10. Remove from pans and continue cooling.

As with most recipes, there are both opportunities for parallelism and unavoidable sequential constraints. If you had three cooks in the kitchen, one of them could prepare the cake pans while another sifts together the dry ingredients and a third creams the butter, shortening, and sugar. After that, the next three steps, up to pouring the batter into the cake pans, while sequentially constrained relative to each other, could be done in parallel with the oven heating. Thereafter, everything is sequentially constrained. No matter how many cooks you have, you have to heat the oven before you bake the cake and bake the cake before it can cool. Thus there’s no way to decrease the total time it takes to make a cake below about 45 minutes.

Now let’s consider how software development is sequentially constrained. As anyone who has written software knows, there are sequential constraints but where do they come from? There are certainly no physical constraints such as the one that keeps a baker from pouring batter into cake pans before the batter has been made. If, somehow, we knew at the beginning of a development project all the lines of code that needed to be written, we could type them in any order we wanted — the software would work just as well in the end. But the notion that we could know in advance all the code that needs to be written and type it like we were taking dictation is just crazy. Programming isn’t primarily a typing problem, it’s a thinking problem. And thoughts need to be thought in the proper order.

In fact, the only way to figure out how a software system ultimately fits together is to build it. In order to know, in detail, how part X is going to work we need to know how part Y, with which it interacts, is going to work. And the only way to know how Y is going to work is to build it. It may be that we can completely build X and then build Y or we may need to alternate — build a bit of X in order to develop enough information to build a bit of Y from which we learn enough to build another bit of X, and so on. It might also be equally possible to start by building X and then build Y or to start with Y and then build X. But however we do it, the flow of information about the system we are building defines the inherent sequential constraints we operate under. Note that this has nothing to do with communication — even if the system were being built by a single developer these constraints would still constrain the order in which various parts of the system could be built.

Now, keeping in mind these inherent sequential constraints, let’s consider Brooks’s second point, that the need to communicate can itself act as a sequential constraint. This is the software development equivalent of Amdahl’s Law from parallel computing, which says that no matter how much of a computation you can parallelize, it’ll never complete faster than the part that must be done serially, such as combining the results of all the parallel computations. In both software development and parallel computing this is because communication is inherently sequential. If ten people — or ten CPUs — each have six minutes worth of information to convey to each other, it’s going to take at least an hour of elapsed time no matter how you slice it; ultimately each person is going to have to spend six minutes “transmitting” their information and fifty-four minutes “receiving” information from the other nine people.

To see how this effect plays out, imagine we have an idealized software development task whose coding can be partitioned among however many developers we like but for every ten hours a developer is going to spend coding, they need to spend one hour writing down what they’re going to do for the benefit of the rest of the team and everybody has to read everyone else’s notes. In other words, before each ten hours’ worth of coding, a developer spends an hour writing an email about what they are about to implement and sends it to all the other developers. Then they have to read the other developers’ emails, spending an hour to absorb each one. After all that communication, the developers can each code for ten hours. For simplicity, we’ll assume that even a developer working alone would spend the hour writing notes for themself documenting what they plan to do in the next ten hours.

Suppose the total coding time needed to develop the system is 100 person-hours. A single developer could do it in 110 hours, ten chunks of an hour of note writing followed by ten hours of coding. Two developers could do it in five chunks of work with each chunk consisting of twelve hours of work: an hour writing notes, an hour reading the other developer’s notes, and ten hours coding. Thus for the team of two, the total elapsed time would be 60 hours, of which 10 would have been spent on communication. Five developers could complete the project in only two chunks but each chunk would be fifteen hours: one hour writing, four reading, and ten coding. Thus their elapsed time would be 30 hours with 10 hours spent on communication. Ten developers would be done in 20 hours elapsed time — ten hours of development after ten hours of communication.

Even if we could scale down the communication cost proportionally, so less than ten hours of individual work requires a proportionally smaller amount of email writing and reading, the elapsed communication time still stays at ten hours: twenty developers would spend a half-hour writing their emails and nine and a half hours reading nineteen emails before coding for five hours. Indeed, as the number of developers approaches infinity, the amount of time spent coding approaches zero as does the amount of time each developer spends writing their own email while the amount of time spent reading the infinite number of infinitesimally short emails from other developers approaches ten hours and the project as a whole still takes a minimum of ten hours to complete. Thus even when there are no other sequential constraints — when we assume that an infinite number of developers can each be given an infinitesimally small part of the project to work on in isolation — communication remains the one activity that must be performed sequentially.
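
For the curious, here’s a small Python model of that idealized, lockstep project (my own sketch, using the assumptions above: 100 person-hours of coding split evenly, and an hour of writing plus an hour of reading per ten hours of coding, shrinking proportionally as the coding chunks shrink):

    def elapsed_hours(developers, total_coding=100, coding_per_chunk=10):
        # Per chunk: write one note (coding_per_chunk / 10 hours), read every
        # other developer's note, then code. The communication portion always
        # adds up to ten hours of elapsed time, whatever the team size.
        chunks = total_coding / (developers * coding_per_chunk)
        writing = coding_per_chunk / 10
        reading = (developers - 1) * writing
        return chunks * (writing + reading + coding_per_chunk)

    for d in (1, 2, 5, 10, 20):
        print(d, elapsed_hours(d))
    # 1 -> 110.0, 2 -> 60.0, 5 -> 30.0, 10 -> 20.0, 20 -> 15.0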

In real software projects, of course, things are more complicated. The inherent sequential constraints — those that would affect a single developer working alone — interact with communication-induced constraints in all sorts of complicated ways. For one thing, if we assume — as Brooks seems to — that the overall task is partitioned into subtasks, each to be developed by a single developer, then the way we do the partitioning can have dramatic effects on the amount of communication needed. If we split the system at its natural joints, then communication will be minimized — if subsystems are naturally decoupled then developers can work on their bit for a while, developing lots of information about how their part of the system works, which only they need to know, and just a little bit of information that they need to share with other developers. On the other hand, if the partitioning is poor, each developer’s part of the system will depend on many details of other parts and the developers will either need to communicate much more often or, more likely, will all go off in their own directions for a bit too long and then discover, when they compare notes, that they need to backtrack and rework things in order to make everything fit together.1

Another issue, which Brooks doesn’t mention, is that the need to communicate can stall productive work. One of the idealized aspects of the hypothetical project above is that the developers work in perfect lockstep — everyone communicates and then works for exactly ten hours and the cycle repeats. At no point is anyone stalled waiting for someone else. In real projects, some subtasks will be bigger than others, leaving developers whose pieces happen to be smaller waiting, after they’ve finished their work, to communicate with developers whose pieces are larger. Every hour that they spend waiting is an hour that gets added to the total number of person-hours it takes to complete the project.

That all said, there’s nothing that says the only way to divide up a task is by partitioning it into pieces that are each implemented by a single developer. In fact there are all sorts of reasons, which I’ll talk about in a later post, why that might be a bad idea. For now, let’s just note that if we could avoid a strict upfront partitioning, and could let developers share ownership of the system as a whole, working together frequently and sharing ideas about how it all fits together, they could probably much more closely emulate the order of development that we would see if we watched a single developer build the whole system, constrained only by the inherent constraints of needing to build enough of X in order to know enough to build Y and discovering, as they go along, enough bits that can be naturally carved off and done in parallel to keep everybody busy.2

So how does all this relate to Brooks’s Law? In the concluding paragraph of the chapter, right after he has stated his Law, Brooks goes on to say:

The number of months of a project depends upon its sequential constraints. The maximum number of men depends upon the number of independent subtasks. From these two quantities one can derive schedules using fewer men and more months. … One cannot, however, get workable schedules using more men and fewer months. (pp. 25-6)

The last sentence is only true if the number of workers currently on the project is sufficient to take advantage of all the opportunities for parallelism. For instance, suppose we have a project consisting of forty individual tasks, each of which will take a week’s worth of work by one person. Now suppose ten of those tasks are inherently sequentially constrained while the other thirty tasks can be done at any time, in any order. Because of the ten sequentially constrained tasks, the project can’t be completed in any less than ten calendar weeks. But suppose the project has been assigned to a two-person team. It will take them twenty weeks to do the whole project, ten weeks longer than the minimum. Clearly in this case, we can get “workable schedules using more men and fewer months” by adding one or two people to the team. A team of three would finish in a bit over thirteen weeks and four would finish in the minimum time of ten weeks. To say that we can’t reduce calendar time because of sequential constraints would only be correct if we had originally assigned the project to a four person team.
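
Here’s that example as a couple of lines of Python (my own back-of-the-envelope model, with the idealizing assumption that the thirty parallelizable tasks can always be spread evenly so nobody sits idle):

    def calendar_weeks(team_size, total_tasks=40, chain_weeks=10):
        # The schedule can't beat the ten-week sequential chain, and it can't
        # beat the total work divided evenly among the team.
        return max(chain_weeks, total_tasks / team_size)

    for n in (2, 3, 4, 5):
        print(n, calendar_weeks(n))
    # 2 -> 20.0, 3 -> 13.33..., 4 -> 10.0, 5 -> 10.0 (a fifth person buys nothing)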

In general, given that Brooks’s Law is talking about late projects, that is, ones we badly underestimated in the first place, what’s the likelihood that our estimate of how many people we needed was exactly right? The real question, if we’re concerned about sequential constraints, is whether or not there’s work that could be done in parallel. Sometimes there is and sometimes there isn’t, and assuming that there never is is just as foolish as assuming that there always is.


1. In other words, the only thing worse than paying the costs of communication is not paying the costs of communication. Because we will pay them eventually, with interest.

2. Obviously if the team pair programs then the partitioning problem is made quite a bit easier as n people need only n/2 tasks to keep everyone busy, rather than n.

Brooks’s Law: training costs, but not as much as you might think

May 17, 2007

Surely everyone involved in developing software has heard of Brooks’s Law. First presented in the eponymous chapter of Frederick P. Brooks, Jr.’s classic The Mythical Man Month, it states: “Adding manpower to a late software project makes it later.” This “law” is much beloved by software developers as a handy bucket of cold water with which to cool the ardor of overly enthusiastic managers and executives. Lately, however, I’ve been thinking about Brooks’s Law and rereading The Mythical Man Month and I’m no longer as impressed with Brooks’s analysis as I once was. This is the first in what I expect will be a series of posts discussing some of the reasons why.

When Brooks says that adding manpower makes a late project later, he doesn’t specify what he means by later. Later than it already is? Almost certainly, but so what? Later than your new wildly optimistic estimate? Probably, but again not all that interesting. The slightly paradoxical interpretation that makes Brooks’s Law such a perennial on amusing quotation lists is: later than it would have been if you had just left well enough alone.

Of the various reasons Brooks gives in the chapter “The Mythical Man Month” for projects running out of calendar time, the only one that has specifically to do with adding staff to an existing project is the cost of training the added staff. There are other costs associated with having a bigger team, such as potentially increased intercommunication costs and the need to repartition tasks. I’ll discuss those costs in later posts but for now I’m concerned only with whether Brooks’s own analysis of the costs of training holds water.

If we were to take Brooks’s Law as literally true, then we would have to believe that the costs of training new staff will always be higher than any capacity for productive work they might eventually develop. That seems unlikely. However, Brooks’s Law only refers to “late” projects so perhaps there’s something about being late that makes it true. Unfortunately, he doesn’t define “late” any more than he defines “later” so if we want to apply Brooks’s Law wisely we’re on our own — we need to ask, when can we get more done by adding staff than by not?

Much more often than Brooks lets on, it seems. In the section “Regenerative Schedule Disaster” Brooks uses a hypothetical project, originally estimated to be twelve person-months of effort and assigned to a three person team, to demonstrate how training costs affect our ability to speed up a project. In his scenario the project has been divided into four milestones, each of which should be completed in one calendar month by the team of three, i.e. three person-months per milestone. Unfortunately it takes the team two calendar months, or six person-months, to finish the first milestone, so there are only two months left to complete the remaining three milestones. Brooks then considers two sub-scenarios — one where only the first milestone was mis-estimated, in which case there are nine person-months worth of work left and two months in which to do it, and another where the underestimation was systematic so the three remaining milestones are all, like the first, six person-months of work leaving eighteen person-months of work. The question he then poses is, what happens if a manager attempts to get the project finished in the remaining two calendar months by adding staff.

In the first sub-scenario, a manager who ignores training costs would calculate that they need four and a half people to do nine person-months of work in two months. Rounding up to five and subtracting the three they’ve already got, they add two people. In the second sub-scenario, eighteen divided by two is nine, subtract the three they’ve got, and they’d need to grow by six. Brooks then analyzes the first sub-scenario, making the rather conservative assumption that it’ll take one month of full-time work by one of the existing team members to train the two newcomers before they’ll be able to do any work. Under that assumption, only two people will do productive work during the third month so only two more person-months of actual work will be done, leaving seven. In the fourth, and final, month, the new people will start contributing and the trainer can get back to real work but it’s too late — they’ll get five person-months worth of work done but with seven left to do, the schedule is blown.

But there’s another way to look at it. With the two newcomers, the team managed to complete a total of thirteen person-months worth of actual (non-training) work, or almost 87% of the originally planned functionality (assuming the revised estimate of fifteen person-months for the whole project is correct). What would have happened if they had heeded Brooks’s Law and just kept going with the original three-person team? They’d have completed only twelve person-months, or 80% of the originally planned effort. Or, if it’s more important to deliver 100% of the functionality as soon as possible, the original team would have needed another month, blowing the schedule by 25%, while the augmented team would only need an additional two-fifths of a month, or about 10% over the original schedule.

In Brooks’s second sub-scenario, where the actual project size is assumed to be twenty-four person-months, the benefits of adding staff are even more pronounced. Assuming the same one-month of full-time training, the augmented team finishes almost 71% of the originally planned effort in four months compared to only 50% by the original team. Or they can finish the whole project in a bit less than five months total, extending the original schedule by about 20%, compared to the 100% by which the original team would blow the original schedule.
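
If you want to check the arithmetic in the last few paragraphs, here’s a rough Python model of Brooks’s scenario (my sketch, with the same assumptions as above: new people arrive at the start of the third month, one existing developer spends that month doing nothing but training them, and the newcomers do no productive work until the fourth month):

    def real_work_done(months, original_team=3, added=0, trainers=1):
        # Person-months of actual (non-training) work completed.
        done = 0
        for month in range(1, months + 1):
            if month <= 2 or added == 0:
                done += original_team              # original team only
            elif month == 3:
                done += original_team - trainers   # trainer diverted, newcomers not yet productive
            else:
                done += original_team + added      # everyone productive
        return done

    print(real_work_done(4))           # 12 -- the original team of three
    print(real_work_done(4, added=2))  # 13 -- first sub-scenario (13/15 is almost 87%)
    print(real_work_done(4, added=6))  # 17 -- second sub-scenario (17/24 is almost 71%)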

The problem is not that adding staff to the project didn’t help; it’s that it didn’t help quite enough. You might ask, why not account for the training costs when figuring out how many new staff are needed? Brooks briefly considers that idea and rejects it on the grounds that the seven person team needed in the first sub-scenario to finish the remaining seven person-months worth of work after training would be too different in kind from a five person team for it to be feasible. That may be true but the question remains, what’s the alternative? Brooks considers attempts to finish the project on the original four-month schedule “disastrous” and recommends that we should instead “reschedule” or “trim the task”. Both of those are probably wise strategies but even with Brooks’s conservative assumptions about training time, the expanded team would still get more done needing either less of a schedule slip or less trimming of functionality.

At any rate, it’s not the case, in either sub-scenario, that training costs on their own would cause the project to finish later with additional staff than it would have without. Brooks does, however, make one important point when he says:

Notice by the end of the third month, things look very black. The [second] milestone has not been reached in spite of all the managerial effort. The temptation is very strong to repeat the cycle, adding yet more manpower. Therein lies madness. (pp. 24-5)

It is important not to lose one’s nerve. If you’ve already used up two months of a four-month schedule, it’s going to be queasy-making to reduce your productivity by a third for another whole month. If you do, you’ve got to stick with it to reap the benefits as your new workers get up to speed. It also suggests two bits of tactics. First: make sure you add enough new staff. If you’re going to take the hit of losing the output of one or more of your currently productive workers to training, you want to make sure you get as big a return on that investment as possible — add as many people as you can afford and as you think can be trained in a reasonable amount of time. Second: make sure you invest enough in training. In his own reappraisal of Brooks’s Law, Steve McConnell called Brooks’s assumptions about training costs “absurdly conservative” and they may be. But notice that even with those conservative assumptions the investment can still pay off quickly. It can be tempting to try to cheat, adding staff without explicit training, hoping they’ll somehow get up to speed on their own. If it works, great, but more likely they’ll just nibble away at the productivity of the current staff without ever becoming productive enough to offset the cost. Better to plan conservatively and then end the training ahead of schedule if they’re ready to get to fully productive work sooner than planned.

One woman can’t have a baby in nine months

May 10, 2007

As all software developers know, nine women can’t have a baby in a month. Or in Fred Brooks’s more elegant phrasing: “The bearing of a child takes nine months, no matter how many women are assigned.” (The Mythical Man Month, p. 17) The point, of course, is that some tasks are, as Brooks would say, “sequentially constrained”. They’re going to take a certain amount of time no matter what — the time can’t be reduced by having more people work on them.

On the other hand, is it actually the case that one woman can have a baby in nine months? Suppose we have just been put in charge of Project New Baby, which must produce a brand new baby in nine months. How should we staff the project? Easy enough — nine women can’t have a baby in a month, right? No point in overstaffing so we’ve just got to find a couple, make sure they’re both fertile and interested in having a kid, and we’re good to go. But wait a sec’, what’s the chance they’ll miss our nine-month deadline by more than a month? Pretty high, it turns out.

Typically a couple trying to get pregnant has about a 16% chance in any given month. Once they’ve conceived, there’s, sadly, about a 15-20% chance of miscarriage, usually within the first three months. So the chance our couple will produce a baby nine months from now is only .16 × .85 or 13.6%. If we wanted to we could compute the average time we should expect it to take for one couple to have a baby, using math similar to that in an earlier post. But suppose the deadline is hard — we really, really need to finish Project New Baby in nine months — is there anything we can do?

Sure. Throw bodies at it. While a single couple has an 86.4% chance of missing our deadline, if we had two couples, the chance that they’d both miss it is only .864² or about 74.6%. With three couples, the chance of blowing it is down to 64.4%. To figure out how many couples we need to have a P chance of hitting our deadline, just plug P into this formula:

    number of couples = log(1 − P) / log(.864), rounded up to the next whole number

Of course, this could get expensive if we need to be really certain of hitting that deadline — to have a 90% chance of hitting it we’d need sixteen couples. But depending on how important Project New Baby and its deadline are, it might be worth it.
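
Here’s that calculation in Python (my sketch; the 13.6% per-couple chance is the one worked out above):

    import math

    def couples_needed(p_wanted, p_per_couple=0.136):
        # Each couple independently misses the deadline with probability
        # 1 - p_per_couple, so n couples all miss with (1 - p_per_couple) ** n.
        return math.ceil(math.log(1 - p_wanted) / math.log(1 - p_per_couple))

    print(couples_needed(0.90))  # 16 couples for a 90% chance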

So what in software is like making babies? Let’s take a look at how Brooks himself tied making babies to making software:

The bearing of a child takes nine months, no matter how many women are assigned. Many software tasks have this characteristic because of the sequential nature of debugging. (p. 17)

Unfortunately, I haven’t been able to find anywhere where he explains what he means by “the sequential nature of debugging” but I can see how debugging is like having a baby. And not just that they both can be incredibly painful and that you have a great feeling of relief when you’re done. The similarity that I see is that the time it takes to find a bug has a large random component, like trying to conceive a child. Basically when you’re looking for a bug, there’s some probability p that you’ll find the bug for each unit of time that you spend looking, just like a couple has a 16% chance of getting pregnant for each month they spend trying. If you’re a skillful debugger and know your code really well, p will be higher but there’s always a random element — if you go down the wrong path it can take you a while to realize it and all that time is lost whereas if you had tried a different path first, you might have found the bug right away. This is why it’s almost useless to try to estimate how long it will take to find a bug. You could find it in the next five minutes or five weeks from now. Once you find the bug, of course, you also have to fix it but that tends to be less random — unlike a pregnancy, which always lasts about nine months after conception, different bugs will require more or less work to fix, but once you’ve found it you can usually characterize how big a job it’ll be. And for many, if not most, bugs finding them is the hard part — once you’ve well and truly tracked them down, the fix is often trivial.

All of which suggests we can use the same technique for speeding up debugging as we did on Project New Baby — throw bodies at it. Suppose we’re ten days from the end of a release and there’s one last serious bug to be tracked down. Suppose my chance of finding it is 10% per day. The chance that I won’t find it in the next ten days is (1 − .1)¹⁰ or about 35%. But if there’s someone else who can also look for it — say a pair programming partner — who also has a 10% chance of finding it per day, and we both work at it separately, then the chance that the bug will remain at large by the end of the release drops to 12%. If we can throw even more developers at it, then the chances of the bug escaping drop even more: 4% chance with three developers, 1% with four, 0.5% with five.
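
The same arithmetic in Python (a sketch of mine, assuming each developer independently has a 10% chance per day of finding the bug and there are ten days left):

    def chance_bug_survives(developers, days=10, p_find_per_day=0.10):
        # The bug stays at large only if every developer misses it on every
        # one of the remaining days.
        return (1 - p_find_per_day) ** (developers * days)

    for d in range(1, 6):
        print(d, round(chance_bug_survives(d), 3))
    # 1 -> 0.349, 2 -> 0.122, 3 -> 0.042, 4 -> 0.015, 5 -> 0.005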

Obviously, to be able to take advantage of this strategy requires having multiple developers with enough familiarity with the code to be able to pitch in. Which seems to me a strong argument for practices such as pair programming and collective code ownership. An interesting side question is whether, if you do have developers to throw at debugging in this way, it is better for them to work independently or should they pair up for the debugging on the grounds that two heads are better than one?

If Sisyphus had only had a partner

May 8, 2007

While working on another blog entry (still in progress) about Brooks’s Law, I got to thinking about pair programming and how it’s possible that two people working together, sitting at one computer, can be more productive than the same two people working on their own and combining their work. I certainly believe they can, based on my own experiences with pair programming. But after immersing myself in Brooksian notions of how communication costs quickly eat up all available productivity it seems a bit of a paradox. To the extent that writing software is like carrying rocks up a hill — and doesn’t it often feel that way? — here’s an explanation.

Suppose you have a hundred heavy rocks that you need to carry up a hill. They’re not so heavy that you can’t do it but they’re heavy enough that moderately often you’ll lose your grip and the rock will roll back to the bottom of the hill. Let’s say on each attempt to carry a rock up the hill there’s a 70% chance you’ll lose your grip. Assume that when you don’t drop the rock it takes five minutes to carry it up the hill and a minute to walk back down. First question: how long will it take you to get all the rocks to the top of the hill? Obviously in practice it depends on how often that 70% chance of dropping the rock actually bites you, but we can figure out an expected value. If the drops are randomly distributed — sometimes near the bottom of the hill and sometimes near the top — you’ll lose an average of three minutes per drop. But once you drop a rock you have to start all over again with it and there’s a chance you’ll drop it again. Thus the amount of time you should expect to spend on each rock is six minutes plus the sum of this infinite series:

    3×(.7) + 3×(.7)² + 3×(.7)³ + ⋯ = 3 × (.7/.3) = 7 minutes

Add those seven minutes to the six minutes it takes to get a rock to the top of the hill without dropping it and we get an average of thirteen minutes per rock, or 1,300 minutes for all one hundred rocks.
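
Here’s that expected-value calculation as a small Python function (my own sketch, using the numbers from the example: a 70% chance of dropping per attempt, five minutes up, one minute back down, and three minutes lost per drop on average):

    def expected_minutes_per_rock(p_drop=0.7, climb=5, walk_down=1, loss_per_drop=3):
        # A successful trip costs climb + walk_down minutes; before that success
        # we expect p / (1 - p) failed attempts, each costing loss_per_drop
        # minutes on average.
        expected_failures = p_drop / (1 - p_drop)
        return (climb + walk_down) + loss_per_drop * expected_failures

    print(round(expected_minutes_per_rock(), 1))        # 13.0 minutes per rock
    print(round(100 * expected_minutes_per_rock(), 1))  # 1300.0 minutes for all the rocks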

Now suppose you had a partner. Assuming there’s room for two people to carry rocks at the same time, one way to reduce the time it takes to get all the rocks to the top of the hill would be to simply each carry fifty rocks — the 1,300 minutes would be cut in half, to 650 minutes. But there’s another possibility — since the rocks are just a bit too heavy for one person to manage 100% reliably perhaps the two of you working together would be strong enough to never drop a rock. In that case, you could carry all hundred rocks up without dropping any and the whole job would take only 600 minutes, even better than splitting the work.

Of course if the chance of one person dropping a rock were lower, then working separately might be a better bet. In fact we can figure out exactly what probability makes it better to work separately or together by solving this inequality for p:

    (x + (x/2)×(p/(1 − p))) / 2 > x

The numerator of the left hand side represents the expected time it’ll take for one person to get one rock to the top if it takes x minutes with no drops. We divide by two to account for the fact that there are two people working at it. The right hand side represents the time taken with both folks working together and never dropping a rock. After some algebra the xs all go away and it turns out that when the probability of dropping a rock is greater than ⅔ it’s better to pair up than to work separately.

Now, a ⅔ chance may seem fairly high but it’s worth thinking about where that probability comes from. Let’s consider how a ⅔ chance of dropping the rock over five minutes relates to the chance that we’ll drop it in any single minute. To back out the per-minute chance of dropping, given the total probability of dropping and the number of minutes, we start by recognizing that the probability of dropping is equivalent to one minus the probability of not dropping. And to not drop for five minutes we need to not drop for one minute, five times in a row. More generally, to not drop for m minutes, we need to not drop for one minute, m times in a row. If h is the probability of holding (i.e. not dropping) for one minute, and the probability of holding in any one minute is independent of any other minute (i.e. dropping is more or less random and not the result of fatigue), then the combined probability of holding for m minutes is hᵐ. Thus if D is the probability that we’ll drop a rock any time in m minutes, then we can figure out h, the probability that we can hold a rock for a minute, and from there, trivially, d, the probability that we’ll drop it in any one minute, for a given D and number of minutes m as shown here:

    h = (1 − D)^(1/m)    and so    d = 1 − h = 1 − (1 − D)^(1/m)

Plugging our ⅔ chance and 5 minutes into this formula we find that a ⅔ chance of dropping over five minutes is equivalent to about a 20% chance of dropping in any single minute. If we want to find out the probability that we’ll drop a rock over an m-minute trip, given d, we can use this formula:

    D = 1 − (1 − d)ᵐ

Or, perhaps more to the point, we can solve this inequality:

    1 − (1 − d)ᵐ > ⅔

to determine the relationship between d and m that tells us when the total probability of dropping is greater than the ⅔ chance that makes it worthwhile to pair up rather than working separately:

    d > 1 − (⅓)^(1/m)

With this formula we can see that if it only took us three minutes to climb the hill, we could live with up to a 31% chance of dropping per minute before pairing would make sense. But if it took us 20 minutes, then we’d do well to pair up even if every minute we had a 95% chance of keeping our grip.
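
And the last formula in Python, confirming the two examples just given (another sketch of mine):

    def per_minute_drop_threshold(trip_minutes, whole_trip_threshold=2/3):
        # Pairing wins once the chance of dropping somewhere on the trip exceeds
        # 2/3, which works out to d > 1 - (1/3) ** (1/m).
        return 1 - (1 - whole_trip_threshold) ** (1 / trip_minutes)

    print(round(per_minute_drop_threshold(3), 2))   # 0.31 -- about a 31% chance per minute
    print(round(per_minute_drop_threshold(20), 2))  # 0.05 -- about a 5% chance per minute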

So is developing software like carrying rocks? I’d argue that in many ways it is. Programming requires keeping a bunch of things in mind and if you lose your mental grip on any of them for a moment you either have to backtrack to re-figure out how things fit together or, worse yet, you proceed with a faulty understanding and introduce a bug which later requires a lot of time to track down and fix. In fact programming is in some ways worse than carrying rocks because the cost of a momentary slip of concentration can be much more than simply the equivalent of a rock rolling back to where you started. A bug that you create a few hours into a programming session may take many hours or even days to track down and fix. Luckily pairing can help there too — while one partner is focusing their mind on the next thing the other partner’s mind may linger for a moment and have a “Wait a sec’” moment that catches a bug before it gets too far away, the equivalent of catching a dropped rock before it rolls all the way back down the hill. Or when bugs do get in, a pair can often find them faster than a single programmer, much the way two people would be able to find a dropped rock if it didn’t just roll back to the bottom of the hill but bounced off in some random direction into thick weeds.