Unit testing in Coders at Work

In his now infamous blog post “The Duct Tape Programmer”, Joel Spolsky quoted Jamie Zawinski from my interview with him in Coders at Work talking about how they didn’t use many unit tests when developing Netscape. “Uncle Bob” Martin, chiming in to say that Spolsky’s post was right in general but wrong in almost all his specific claims and criticisms, was particularly riled by Spolsky’s implication that maybe unit tests aren’t 100% necessary at all times:

As for Joel’s consistent dismissal of unit testing, he’s just wrong about that. Unit testing (done TDD style) does not slow you down, it speeds you up. One day I hope Joel eventually realizes this. Programmers who say they don’t have time to write tests are living in the stone age. They might as well be saying that man wasn’t meant to fly.

Tim Bray also jumped in to strongly agree with Uncle Bob on the importance of unit tests, though he couldn’t bring himself to actually agree with much else Uncle Bob said.

Joel is wrong to piss on unit testing, and buys into the common fantasy that it slows you down. It doesn’t slow you down, it speeds you up. It’s been a while since I’ve run a dev team, but it could happen again. If it does, the developers will use TDD or they’ll be looking for another job.

Since this all started from the Zawinski interview in Coders at Work and since there were other people interviewed for the book, I figured it might be interesting to see what some of the other folks I talked to had to say about unit testing and things like TDD (“test driven development” or sometimes “test driven design”, for those of you behind on your acronyms).

To start with, here’s a bit more context from the Zawinski interview:

Seibel: What about developer-level tests like unit tests?

Zawinski: Nah. We never did any of that. I did occasionally for some things. The date parser for mail headers had a gigantic set of test cases. Back then, at least, no one really paid a whole lot of attention to the standards. So you got all kinds of crap in the headers. And whatever you’re throwing at us, people are going to be annoyed if their mail sorts wrong. So I collected a whole bunch of examples online and just made stuff up and had this giant list of crappily formatted dates and the number I thought that should turn into. And every time I’d change the code I’d run through the tests and some of them would flip. Well, do I agree with that or not?

Seibel: Did that kind of thing get folded into any kind of automated testing?

Zawinski: No, when I was writing unit tests like that for my code they would basically only run when I ran them. We did a little bit of that later with Grendel, the Java rewrite, because it was just so much easier to write a unit test when you write a new class.

Seibel: In retrospect, do you think you suffered at all because of that? Would development have been easier or faster if you guys had been more disciplined about testing?

Zawinski: I don’t think so. I think it would have just slowed us down. There’s a lot to be said for just getting it right the first time. In the early days we were so focused on speed. We had to ship the thing even if it wasn’t perfect. We can ship it later and it would be higher quality but someone else might have eaten our lunch by then.

There’s bound to be stuff where this would have gone faster if we’d had unit tests or smaller modules or whatever. That all sounds great in principle. Given a leisurely development pace, that’s certainly the way to go. But when you’re looking at, “We’ve got to go from zero to done in six weeks,” well, I can’t do that unless I cut something out. And what I’m going to cut out is the stuff that’s not absolutely critical. And unit tests are not critical. If there’s no unit test the customer isn’t going to complain about that. That’s an upstream issue.

I hope I don’t sound like I’m saying, “Testing is for chumps.” It’s not. It’s a matter of priorities. Are you trying to write good software or are you trying to be done by next week? You can’t do both. One of the jokes we made at Netscape a lot was, “We’re absolutely 100 percent committed to quality. We’re going to ship the highest-quality product we can on March 31st.”

So Zawinski says unit testing would have slowed them down. Uncle Bob and Tim Bray both say that unit testing “doesn’t slow you down, it speeds you up.” Did Zawinski and the rest of the Netscape gang just blow it? They were going all out to develop their software as fast as they could; could they have sped things up with more unit testing? Maybe they were just living in the stone age.

Now, if Uncle Bob and Bray wanted to make a less radical claim than that unit testing always speeds you up, they could point out that unit tests can help you go faster over the longer term, and it’s not clear even Zawinski would disagree. And they’d definitely get some strong support for that claim from the subject of chapter two of Coders, Brad Fitzpatrick. I asked him about any big differences between his early and later programming style:

Fitzpatrick: I’ve also done a lot of testing since LiveJournal. Once I started working with other people especially. And once I realized that code I write never fucking goes away and I’m going to be a maintainer for life. I get comments about blog posts that are almost 10 years old. “Hey, I found this code. I found a bug,” and I’m suddenly maintaining code.

I now maintain so much code, and there’s other people working with it, if there’s anything halfway clever at all, I just assume that somebody else is going to not understand some invariants I have. So basically anytime I do something clever, I make sure I have a test in there to break really loudly and to tell them that they messed up. I had to force a lot of people to write tests, mostly people who were working for me. I would write tests to guard against my own code breaking, and then once they wrote code, I was like, “Are you even sure that works? Write a test. Prove it to me.” At a certain point, people realize, “Holy crap, it does pay off,” especially maintenance costs later.

Another interviewee, Joshua Bloch, described how he designs code by starting with the APIs. He claimed this is a sort of “test-first programming and refactoring applied to APIs” since the first thing he does with a newly designed API is test whether it would support the use cases that had led to creating the API in the first place. But since he does all that before writing any runnable code, that could also be called old-fashioned “thinking about what you’re going to do before you do it” programming. Bloch did dispute the claim of those TDD advocates who say the tests produced by TDD can function as a spec for the code under test:

I don’t think tests are even remotely an acceptable substitute for documentation. Once you’re trying to write something that other people can code to, you need precise specs, and the tests should test that the code conforms to those specs.

Elsewhere Bloch described how he used both system and unit testing when he was working on an implementation of transactional shared-memory:

To test the code, I wrote a monstrous “basher.” It ran lots of transactions, each of which contained nested transactions, recursively up to some maximum nesting depth. Each of the nested transactions would lock and read several elements of a shared array in ascending order and add something to each element, preserving the invariant that the sum of all the elements in the array was zero. Each subtransaction was either committed or aborted—90 percent commits, 10 percent aborts, or whatever. Multiple threads ran these transactions concurrently and beat on the array for a prolonged period. Since it was a shared-memory facility that I was testing, I ran multiple multithreaded bashers concurrently, each in its own process.

At reasonable concurrency levels, the basher passed with flying colors. But when I really cranked up the concurrency, I found that occasionally, just occasionally, the basher would fail its consistency check. I had no idea what was going on. Of course I assumed it was my fault because I had written all of this new code.

After the system test demonstrated the presence of a bug he turned to unit tests to find it:

I spent a week or so writing painfully thorough unit tests of each component, and all the tests passed. Then I wrote detailed consistency checks for each internal data structure, so I could call the consistency checks after every mutation until a test failed. Finally I caught a low-level consistency check failing—not repeatably, but in a way that allowed me to analyze what was going on. And I came to the inescapable conclusion that my locks weren’t working. I had concurrent read-modify-write sequences taking place in which two transactions locked, read, and wrote the same value and the last write was clobbering the first.

I had written my own lock manager, so of course I suspected it. But the lock manager was passing its unit tests with flying colors. In the end, I determined that what was broken wasn’t the lock manager, but the underlying mutex implementation! This was before the days when operating systems supported threads, so we had to write our own threading package. It turned out that the engineer responsible for the mutex code had accidentally exchanged the labels on the lock and try-lock routines in the assembly code for our Solaris threading implementation. So every time you thought you were calling lock, you were actually calling try-lock, and vice versa. Which means that when there was actual contention, rare in those days, the second thread just sailed into the critical section as if the first thread didn’t have the lock. The funny thing was that this meant the whole company had been running without mutexes for a couple weeks, and nobody noticed.

I asked him if he thought the author of the mutex code that had been the cause of his problems could or even should have caught the bug with his own unit tests:

I think a good automated unit test of the mutex facility could have saved me from this particular agony, but keep in mind that this was in the early ’90s. It never even occurred to me to blame the engineer involved for not writing good enough unit tests. Even today, writing unit tests for concurrency utilities is an art form.
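Bloch's basher can be sketched in miniature. What follows is a heavily simplified illustration, not a reconstruction of his code: it uses one global lock instead of a lock manager, threads in a single process instead of several processes, and flat transactions instead of nested ones. The names `basher` and `run_bashers` and all the parameters are assumptions made up for this sketch. What it keeps is the core idea: every transaction adds deltas that sum to zero, so the array's total must still be zero at the end, whether a given transaction committed or aborted.

```python
import random
import threading


def basher(array, lock, n_transactions, rng):
    """Run many zero-sum 'transactions' against a shared array."""
    for _ in range(n_transactions):
        # Pick three distinct elements and visit them in ascending order.
        indices = sorted(rng.sample(range(len(array)), 3))
        # Deltas that sum to zero preserve the invariant sum(array) == 0.
        a = rng.randint(-10, 10)
        b = rng.randint(-10, 10)
        deltas = [a, b, -(a + b)]
        with lock:
            before = [array[i] for i in indices]
            for i, d in zip(indices, deltas):
                array[i] += d
            if rng.random() < 0.1:
                # Roughly 10 percent of transactions abort:
                # restore the values read at the start.
                for i, old in zip(indices, before):
                    array[i] = old


def run_bashers(n_threads=4, n_transactions=1000, size=16, seed=0):
    """Beat on one shared array from several threads, then return it."""
    array = [0] * size
    lock = threading.Lock()
    threads = [
        threading.Thread(
            target=basher,
            args=(array, lock, n_transactions, random.Random(seed + t)),
        )
        for t in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return array


# Consistency check: if the lock works, the total is still zero.
result = run_bashers()
print(sum(result))  # prints 0
```

If the lock silently failed to exclude other threads, as in Bloch's story, concurrent read-modify-write sequences could clobber each other and the final sum could drift away from zero. That drift is exactly the kind of failure a basher exists to expose.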

Donald Knuth, who is also a fan of after-the-fact torture tests, described an approach to coding about as far away from TDD as you can imagine, which he used when originally developing his typesetting system, TeX:

Knuth: When I wrote TeX originally in 1977 and ’78, of course I didn’t have literate programming but I did have structured programming. I wrote it in a big notebook in longhand, in pencil.

Six months later, after I had gone through the whole project, I started typing into the computer. And did the debugging in March of ’78 while I had started writing the program in October of ’77. The code for that is in the Stanford archives—it’s all in pencil—and of course I would come back and change a subroutine as I learned what it should be.

This was a first-generation system, so lots of different architectures were possible and had to be discarded until I’d lived with it for a while and knew what was there. And it was a chicken-and-egg problem—you couldn’t typeset until you had fonts but then you couldn’t have fonts until you could typeset.

But structured programming gave me the idea of invariants and knowing how to make black boxes that I could understand. So I had the confidence that the code would work when I finally would debug it. I felt that I would be saving a lot of time if I waited six months before testing anything. I had enough confidence that the code was approximately right.

Seibel: And the time savings would be because you wouldn’t spend time building scaffolding and stubs to test incomplete code?

Knuth: Right.

So Knuth too disagrees with the notion that unit testing always makes you go faster. Maybe he too is living in the stone age.

Joe Armstrong, on the other hand, says he has moved toward a test-first development style recently:

Seibel: At the point that you start typing code, do you code top-down or bottom-up or middle-out?

Armstrong: Bottom up. I write a little bit and test it, write a little bit and test it. I’ve gone over to this writing test cases first, now. Unit testing. Just write the test cases and then write the code. I feel fairly confident that it works.

The only interviewee who touched directly on TDD versus other approaches was Peter Norvig. He said he does more unit testing than he used to and even said some nice things about TDD but pointed out:

It’s also important to know what you’re doing. When I wrote my Sudoku solver, some bloggers commented on that. They said, “Look at the contrast—here’s Norvig’s Sudoku thing and then there’s this other guy,” whose name I’ve forgotten, one of these test-driven design gurus. He starts off and he says, “Well, I’m going to do Sudoku and I’m going to have this class and first thing I’m going to do is write a bunch of tests.” But then he never got anywhere. He had five different blog posts and in each one he wrote a little bit more and wrote lots of tests but he never got anything working because he didn’t know how to solve the problem.

A bit later Norvig said:

Then bloggers were arguing back and forth about what this means. I don’t think it means much of anything—I think test-driven design is great. I do that a lot more than I used to do. But you can test all you want and if you don’t know how to approach the problem, you’re not going to get a solution.

Ignoring Norvig’s suggestion that the difference between the two attempts doesn’t mean much of anything, it is instructive (or at least interesting, in a rubber-necking kind of way) to look at the two writeups. The “other guy” turns out to be Ron Jeffries, one of the inventors of Extreme Programming, the author of two books on XP, and according to his website an “experienced XP author, trainer, coach, and practitioner”.

Norvig’s writeup is a short essay explaining about 100 lines of Python that can solve any Sudoku. Jeffries’s writeup, by contrast, is spread over five lengthy blog postings and ends without coming anywhere close to actually producing a program that can solve any but a tiny subset of all Sudoku problems.

At some level the difference between the two simply boils down, as Norvig suggests, to knowledge: Norvig knew how to solve the problem because it’s a specific instance of a kind of problem he already knew how to solve. Jeffries, obviously, did not. But he did choose to tackle this particular problem using TDD, a technique in which he is supposed to be the expert. Why did he have so little success?

One thing I noticed, reading through Jeffries’s blog posts, was that he got fixated on the problem of how to represent a Sudoku board. He immediately started writing tests of the low-level details of a few functions for manipulating a data structure representing the 9×9 Sudoku board and a few functions for getting at the rows, columns, and boxes of the board. (“Boxes” are what Sudoku players call the 3×3 subsquares of the 9×9 board.)

Then he basically wandered around for the rest of his five blog postings fiddling with the representation, making it more “object oriented” and then fixing up the tests to work with the new representation and so on until eventually, it seems, he just got bored and gave up, having made only one minor stab at the problem of actually solving puzzles.

I suspect, having done a small amount of TDD myself, that this is actually a pattern that arises when a programmer tries to apply TDD to a problem they just don’t know how to solve. If I was a high-priced consultant/trainer like Jeffries, I’d probably give this pattern a pithy name like “Going in Circles Means You Don’t Know What You’re Doing”. Because he had no idea how to tackle the real problem, the only kinds of tests he could think of were either the very high-level “the program works” kind which were obviously too much of a leap or low-level tests of nitty-gritty code that is necessary but not at all sufficient for a working solver.

However, since most of what Jeffries spent his time on was the code for representing a Sudoku board and determining which row, column, and box a given square on the board is in, let’s look at that part of Norvig’s code.

Norvig’s basic strategy is to represent a board using a hash table with the keys being row-column pairs like A1, A2, and so on up to I9. After seeing the mess Jeffries makes of trying to represent a board, this is a refreshingly simple choice. It also seems to me that it requires a bit of creativity: given that a Sudoku board is a 9×9 board, I suspect I’m not the only programmer in the world who might be inclined to start with a 2d array. In a language without true 2d arrays, I might be tempted, as Jeffries was, to use an 81-element array and then get all wrapped around the axle, as Jeffries did, making sure I haven’t screwed up the finicky math for converting between 1d and 2d indices. And for all we know Norvig fell into the same trap and only later realized that all he really needed was the easy random access provided by a hash table. But pretty clearly Jeffries’s approach of testing the heck out of an array-based implementation wasn’t sufficient to lead him to the much better hash table-based one.
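To make the "finicky math" concrete, here is a minimal sketch of the index conversions a flat 81-element array tends to require. It is an illustration only: it is neither Jeffries's code nor Norvig's, and the function names (`row_of`, `col_of`, `box_of`, `index_of`, `name_of`) are made up for this example.

```python
# Illustrative sketch only; not taken from either writeup.
# Squares are numbered 0..80, row by row, in a flat 81-element list.

def row_of(i):
    """Row (0..8) of the square at flat index i."""
    return i // 9


def col_of(i):
    """Column (0..8) of the square at flat index i."""
    return i % 9


def box_of(i):
    """Box (0..8, numbered left to right, top to bottom) of square i."""
    return (row_of(i) // 3) * 3 + col_of(i) // 3


def index_of(row, col):
    """Flat index of the square at a given row and column."""
    return row * 9 + col


def name_of(i):
    """The row-column name, such as 'C2', that a hash table would use as a key."""
    return 'ABCDEFGHI'[row_of(i)] + '123456789'[col_of(i)]
```

Each of these is easy to get subtly wrong (an off-by-one, a swapped `//` and `%`), which is why an array-based design invites a lot of low-level tests. With names like 'C2' as hash table keys, most of this arithmetic simply never needs to be written.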

Given his choice to use a hash table, Norvig needs a list of the keys, i.e. the Cartesian product of the row labels (A-I) and the column labels (1-9). So the first bit of code he shows is a function cross which implements a Cartesian product that combines the pairs of elements by concatenation. The standard mathematical definition of the Cartesian product is:

A × B = {(a,b) | a ∈ A and b ∈ B }
        

Python’s list comprehensions let Norvig express this function in essentially the same notation as the standard mathematical definition:

def cross(A, B):
    return [a+b for a in A for b in B]
        

Could he or should he have developed this function via a test-first strategy? Dunno. Would it have been faster? Possibly, if he had any problems getting it right. But given that he’s just transcribing a mathematical definition, maybe a quick check at Python’s REPL would be sufficient to make sure he hadn’t screwed anything up.
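A quick check of that sort might look like the sketch below. The inputs are arbitrary choices made for illustration, not anything taken from Norvig's essay, and the same lines could just as well be typed one at a time at the REPL.

```python
def cross(A, B):
    return [a+b for a in A for b in B]


# A small case that is easy to verify by eye.
assert cross('AB', '12') == ['A1', 'A2', 'B1', 'B2']

# The full-size case: 9 row labels times 9 column labels.
assert len(cross('ABCDEFGHI', '123456789')) == 81

# An empty input yields an empty product.
assert cross('', '123') == []
```

Whether these few lines are written before or after the function is the whole of the test-first question for something this small.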

Next he defines two variables to hold the row and column labels:

rows = 'ABCDEFGHI'
cols = '123456789'
        

Do TDD people unit test their data? I don’t know. Should they? I’m not even sure what that would mean, at least in a case like this. At any rate, he then feeds these two untested values to the cross function to produce the list of all 81 squares:

squares  = cross(rows, cols)
        

Now he’s basically done with the representation of the board. Any hash table with the elements of squares as its keys represents a Sudoku board. Later in his code Norvig will use a hash table with strings containing the possible digits that could be put in each square as values.
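As a rough sketch of what such a table looks like, the snippet below builds a board on which every square can still hold any digit. The variable name `values` and the idea of starting with all nine candidates everywhere are assumptions made for this illustration rather than a quotation of Norvig's solver.

```python
def cross(A, B):
    return [a+b for a in A for b in B]


rows = 'ABCDEFGHI'
cols = '123456789'
squares = cross(rows, cols)

# Hypothetical starting state: every square could still hold any digit.
values = dict((s, '123456789') for s in squares)

# Recording that square A1 holds a 7 just narrows its string of candidates.
values['A1'] = '7'
```

Looking up a square is then plain dictionary access, such as `values['C2']`, with no index arithmetic involved.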

However there is one other bit of work to be done: to solve a Sudoku you’re going to need to be able to map from a square to the other squares in the same row, column, and box. Having read a bit about Sudoku solving, Norvig has discovered that people use the term ‘units’ to refer to the rows, columns, and boxes and ‘peers’ to refer to all the squares that are in one of the three ‘units’ of another square. Norvig, unlike Jeffries, realized that this is better represented in data than in code and proceeds to compute the data once and for all.

First he makes a list of all the units, i.e. all rows, columns, and boxes, using list comprehensions and his cross function:

unitlist = ([cross(rows, c) for c in cols] +
            [cross(r, cols) for r in rows] +
            [cross(rs, cs) for rs in ('ABC','DEF','GHI') for cs in ('123','456','789')])
        

Now he can use unitlist to generate a dictionary, units, that maps each square name to a list of its three units. He completely brute-forces this, linearly scanning unitlist for each square and selecting the units containing the square, which itself requires a linear scan of each element of unitlist. But why write something more clever when unitlist is only 27 elements long, its elements are each 9 elements long, and this whole computation is only going to happen once anyway? Here’s the code:

units = dict((s, [u for u in unitlist if s in u]) 
             for s in squares)
        

Once he’s got units, which will also be used in its own right later, he can compute peers, a hash table that maps from square names to the set of peer squares:

peers = dict((s, set(s2 for u in units[s] for s2 in u if s2 != s))
             for s in squares)
        

And that’s it: 7 definitions in 12 lines of code and he’s done with data representation. I’m not sure how much code Jeffries ended up with. In his fourth installment he had about 81 lines devoted to providing slightly less functionality than Norvig provided in the code we just looked at. In the fifth (and mercifully final) installment, he started adding classes and subclasses and moving things around but never presented all the code again. Safe to say it ended up quite a lot more than 12 lines; if he’s lucky it stayed under 120.
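For anyone who wants to poke at those seven definitions, the sketch below collects them in one place and follows them with a handful of sanity checks. The definitions are the ones shown above; the checks, and the choice of C2 as the sample square, are additions made for illustration.

```python
def cross(A, B):
    return [a+b for a in A for b in B]


rows = 'ABCDEFGHI'
cols = '123456789'
squares = cross(rows, cols)
unitlist = ([cross(rows, c) for c in cols] +
            [cross(r, cols) for r in rows] +
            [cross(rs, cs) for rs in ('ABC','DEF','GHI') for cs in ('123','456','789')])
units = dict((s, [u for u in unitlist if s in u])
             for s in squares)
peers = dict((s, set(s2 for u in units[s] for s2 in u if s2 != s))
             for s in squares)

# 9 columns + 9 rows + 9 boxes, each holding 9 squares.
assert len(unitlist) == 27
assert all(len(u) == 9 for u in unitlist)

# Every square belongs to exactly one column, one row, and one box.
assert all(len(units[s]) == 3 for s in squares)

# 8 others in the column, 8 in the row, and 4 more in the box.
assert all(len(peers[s]) == 20 for s in squares)

# Units come back in unitlist order: column, then row, then box.
assert units['C2'] == [cross(rows, '2'), cross('C', cols), cross('ABC', '123')]
assert 'C2' not in peers['C2']
```

Checks like these are close to the domain-level tests a test-first approach could have started from: they talk about squares, units, and peers, not about array indices.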

I’m not a proponent (or particularly a detractor) of TDD. If I was, I’d be pretty strongly tempted to throw Jeffries under the bus—maybe TDD isn’t quite as bad as he makes it look in this exercise. It certainly seems that, within the constraints of TDD, he could have done a much better job. Perhaps if he had stopped to think a bit about what he was doing he could have, using TDD, ended up with code as simple as Norvig’s. For instance, if he had started a bit closer to the problem domain he could have started by writing tests of his ability to map from squares to peers and units and then implemented something to provide that functionality. So maybe Norvig is right—maybe there’s not much to learn from this episode except that Fred Brooks is still right and there are still no silver bullets.

Anyway, those are some of the highlights of, and some of the context around, what the folks I interviewed for Coders had to say about testing. There are probably some other good bits I’m forgetting at the moment. Feel free to buy a copy and look for them yourself.

6 Responses to “Unit testing in Coders at Work”

  1. Enzo Says:

    TDD slows you down for sure. Now you have to maintain your regular code and your test code base and refactor both constantly. Usually any bug found can be fixed quicker than maintaining all of this test code. No modern company like Ebay, Amazon, Facebook, Google, Zappos, etc. would even exist if they wasted time on TDD up front.

  2. haji Says:

    where can i get like lisp tutorials (videos)

  3. PK Says:

    Peter – From what I have learnt about TDD. It is awesome if you are building an enterprise software. But not otherwise. In the example you have mentioned Norvig was trying to solve a puzzle. TDD will work great if your aim is to display data on the web page from the database. That is pretty much the reach of TDD. You wont be able to solve complicated problems like building a search engine by thinking about it from a TDD angle. I doubt if Google has set of unit test cases for their search engine. I think there are two types of philosophies – design pattern philosophy and the algorithm functional programming philosophy. I have rarely seen a design pattern guru being an expert in algorithms etc or vice versa. Norvig is an algorithms guru, he is not going to like TDD.

  4. The value of DOING for the CIO: test-driven development Says:

    […] Seibel, “Unit testing in Coders at Work“, October 5, […]

  5. Quora Says:

    Is it worth it becoming a software engineer?…

    If you visit the top schools, you’ll see that there’s no distinction set between a software engineer and a computer scientist. Every course 6 graduate from MIT is both a computer scientist and an engineer. The same goes for Berkeley, Stanford, etc. W…

  6. Global Day of Code Retreat | Coding Is Like Cooking Says:

    […] “you can test all you want and if you don’t know how to approach the problem, you’re not going to get a solution” (from Peter Siebel’s book “Coders At Work”, an extract of which is available in his blog) […]
