In last week’s post, I touched on the story of making debug Windows work, and I think it’s worth a longer version. As a refresher (or for new readers), the short version is that I volunteered to work on a debug version of Windows that was in massive disrepair. The good news was that I had a chance to work with a developer who had worked on Windows forever. The bad news was that, after a few weeks of working together, he decided to retire. The worse news was that the project (and many related side projects) all fell on my shoulders. I was on my own.
There’s Always More to the Story
This story, like all good stories, requires context. I’m a largely self-taught programmer. In 1993, when I started at Midisoft (where I was either tech support, a software tester, or a network administrator, depending on the time of day), one of the five or so devs there decided to take me under his wing and teach me C. I started like a lot of new programmers - I read books, tried the exercises, and then asked questions. This was largely pre-internet, so I asked Chris (my mentor) and Kevin (a college student who worked part-time at Midisoft). I didn’t get good by any means, but those two taught me a good foundation, including one quote I share often:
“What are the three most important things in C?”
“Pointers, Pointers, and Pointers.”
To this day, C is my strongest language, and if you really want me to write a program with pointers to pointers to pointers and have it make sense, I can do it.
After I left Midisoft and joined Microsoft, my learning path changed. I discovered Steve Maguire’s Writing Solid Code, and I became infatuated with writing “defensive” code. In those days (and I suppose still in some places today), the “debug” version of a program wasn’t just the version without compiler optimizations. We packed our debug versions full of validation code. We raised exceptions (asserts) on invalid parameters, and we wrote debug-only code that ensured calculations were correct. In one famous story from Writing Solid Code, Maguire shares the story of early versions of Excel, where the shipping version of the code did calculations entirely in hand-written assembly (for speed), but the debug version did the calculations in both assembly and C so that the debug version of Excel could immediately identify calculation errors.
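The pattern Maguire describes is easy to sketch in plain C. This is my own minimal illustration, not Excel’s actual code: the fast path does the work, and in debug builds a slow, obviously-correct version recomputes the answer and asserts that the two agree.

```c
#include <assert.h>

/* Fast shipping path: sum an array with pointer arithmetic
 * (standing in for the hand-tuned assembly in the Excel story). */
static long sum_fast(const int *a, int n)
{
    long total = 0;
    const int *end = a + n;
    while (a < end)
        total += *a++;
    return total;
}

#ifndef NDEBUG
/* Slow, simple reference implementation, compiled only into
 * debug builds, used purely to cross-check the fast path. */
static long sum_slow(const int *a, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}
#endif

long sum(const int *a, int n)
{
    long result = sum_fast(a, n);
    /* In debug builds, any divergence breaks immediately at the
     * point of the error; in release builds (NDEBUG) the check
     * compiles away entirely. */
    assert(result == sum_slow(a, n));
    return result;
}
```

The payoff is that a miscalculation is caught the moment it happens, rather than surfacing later as a mysteriously wrong spreadsheet cell.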
On Windows 98, I kind of worked on user mode (application level) API testing. At this point in Windows, we weren’t adding many new APIs at that level, so I mainly worked on application compatibility - Windows relied on applications continuing to work after upgrades, but often we made changes that would cause an application that used one of our APIs in an odd (i.e. unsupported) way to break when we “fixed” the API. It was my job to figure out why the application’s behavior changed so that we could decide whether it was something we could fix/patch (usually), or whether we needed to contact the company. Eventually, Windows 98 shipped, and as it was - at the time - the last version of “consumer” Windows, we all moved on to different roles. I joined the NT5 (later Windows 2000) team working on video capture drivers. And then the NT5 ship date slipped.
A Tale of Dependencies
Windows NT5 was supposed to ship soon enough that Windows 98 users would be in line to “upgrade” to Windows XP when it came out. With NT5 slipping, that meant XP would slip, and that meant the Marketing team came knocking to ask if there could be just one more version of Windows 9x. In 1999, people didn’t often download applications from the internet, so for many Microsoft products, Windows was the only available ship vehicle. Hearts and Minesweeper (yes, teams worked on those) - as well as apps like Windows Health and Safety - only shipped with Windows. Without a consumer-facing version of Windows to ship, those apps would just sit inside the walls of Microsoft.
So we put together a team that made Windows ME (which to this day, I refer to as Windows Marketing Edition). It was a fairly small team if you consider that we were building a new operating system, but I was asked to join for two reasons. I was good at debugging, and I was good at figuring out hard stuff. In fact, when I joined, I was pretty much given the option of choosing my role.
Windows 98 was such a crunch that a few bad things happened. One big one was that the team gave up on the debug version of the OS. It was filled with checks and asserts that would have helped me immensely when debugging applications, but nobody had the time (or knowledge) to keep it up, so it was abandoned. I asked if my job could be to fix it. I also asked for help. The answer to both of those questions - in the short term, at least - was yes.
So - as I mentioned last week, after my “help” retired, I applied pure tenacity and curiosity to figure out how things worked and to make fixes - sometimes in Windows itself, but more often in user mode applications that shipped as part of Windows - until eventually the OS would boot without breaking into the debugger. The debug version of Windows was intense in its validation. In addition to obvious things like double freeing memory, using memory after it was freed, or letting you know when you had a buffer overflow, it validated just about every single Win32 object type. It was pretty much impossible to write an application that used the Win32 API incorrectly and have it run on debug windows without showing an error. This is why dogfooding (using your product as part of your daily workflow) was so important in those days - and why I was so passionate about getting debug windows working again.
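To give a flavor of the kind of checking a debug heap does, here’s a minimal sketch. The names, layout, and fill values here are my own illustration, not Windows’ actual debug heap (though fill-pattern bytes like 0xCD and 0xDD were a common convention in Microsoft debug runtimes): each allocation gets a guard byte past the user region to catch overflows, and freed memory is stomped with a recognizable pattern so use-after-free data looks obviously wrong in the debugger.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_BYTE 0xAB  /* planted just past the user's buffer     */
#define FREED_FILL 0xDD  /* written over freed memory (illustrative) */

/* Layout: [size_t size][user bytes...][guard byte] */
static void *dbg_alloc(size_t n)
{
    unsigned char *p = malloc(sizeof(size_t) + n + 1);
    if (!p)
        return NULL;
    memcpy(p, &n, sizeof n);        /* remember the size */
    p += sizeof(size_t);
    p[n] = GUARD_BYTE;              /* plant the overflow guard */
    return p;
}

static void dbg_free(void *user)
{
    unsigned char *p = (unsigned char *)user - sizeof(size_t);
    size_t n;
    memcpy(&n, p, sizeof n);
    /* If the app wrote past its buffer, the guard byte is gone
     * and we break right here instead of corrupting silently. */
    assert(((unsigned char *)user)[n] == GUARD_BYTE);
    /* Stomp the whole block so stale pointers read garbage.
     * (A real debug heap would also quarantine the block rather
     * than returning it to the allocator immediately.) */
    memset(p, FREED_FILL, sizeof(size_t) + n + 1);
    free(p);
}
```

Debug windows applied this same idea far beyond the heap - to essentially every Win32 object type.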
Almost There
As I mentioned above, I got to a point where Windows would boot to the login screen, and then, after logging in, it would hit three asserts - in three different applications - while the desktop was loading.
<sigh>
The challenge with these three applications was that none of them were in the Windows code base. They were all applications delivered to us in binary form from different teams at Microsoft. In each case, we had a debug version of the binary from the team that we used in debug windows, and in each case, the application had an assert that was firing for non-critical reasons - meaning that the condition behind the assert didn’t actually cause any program errors, just a break in the debugger.
At this point, I was 6 years into my career in tech, and only in my 3rd or 4th year programming full time. I was a young developer, a young leader, and not nearly as good at influence as I am today. I thought I could just track down the owners of each of these three programs, tell them about the bug, and then wait for a fix. In every case, they told me that it wasn’t a priority and they weren’t going to do anything. They told me, “As you said, Alan, there’s no underlying error, so we don’t want to touch the binary.”
<sigh>
I had a few options at this point. One was to make a big fuss about it until they did the thing I wanted. While I was young, I had enough common sense to know that this wasn’t the answer. I talked to our build lead about using the retail binaries on debug windows, but for a lot of complex reasons, we didn’t want to do that either. So I cheated.
Stupid Tricks
One of the “other” things I got to own on the team was the actual kernel debugger we used. I had, of course, used our kernel debugger for years, so fortunately, I really only had to maintain it - which inevitably meant adding only a few features (and allowed me to play some practical jokes that I should remember to talk about in an April 1 post). Between working on low levels of Windows and owning the debugger, I learned a reasonable amount of x86 assembly. To this day, I’m pretty comfortable stepping through source at the assembly level and knowing what’s going on - and that was handy.
In x86 assembly, the opcode for a debug break is CC, generated by the int 3 instruction. This means that at the point where that invalid assert was firing, there was an int 3 instruction in the code (which is pretty much what the DebugBreak() function does on Windows). Another important opcode in x86 is 90, generated by the nop (no operation) instruction. At this point, I think you can see where this is going. I noted the address of the offending int 3, loaded the program in a hex editor, replaced the offending CC with a 90, and saved the file. Voila, I now had debug binaries of these three applications that didn’t break in the debugger anymore. (It’s worth noting, for curious minds, that the programs had other int 3s which were valid, but I left those alone.)
Then, I let our build lead know that I had new binaries for the three offending apps. He asked where they came from, and I think I told him that it’s better that he didn’t know.
Stress and Games
Within another month, the debug system was flying. Somewhere in the middle of all of this, I was given a small team of people to help - one super smart developer right out of college, an app compatibility tester to track down issues with 3rd party apps, and the person who ran our nightly stress runs.
As I expect Windows still has today, we had a nightly stress run. Basically, there was an agent installed as part of our internal builds, and every night (after your machine was idle), it would distribute hundreds of tests running in a loop. Then in the morning, we’d triage all of the crashes (also part of my job), enter bugs, and make fixes. If we were lucky enough to have a crash on debug windows, we were rewarded with a plethora of additional information. Debug windows fared pretty well here too, and it’s worth noting that my small team and I were able to play networked DirectX games on debug windows with no issues.
Epilogue
Now - in the end, the failures of Windows ME were only forgotten because of the failures of Windows Vista. We did a lot of stuff really well, but obviously missed the boat on some things as well. While I can’t be super proud in hindsight about the end result, I still have a lot of good memories about what we accomplished and how we did it. In the end, there was a lot of learning, and that’s definitely a good thing.