Saturday, May 31, 2008

Probability and Certainty

I've been reading Fooled by Randomness (highly recommended - it will either change your life or you won't finish it because Taleb's writing style annoys you) - and it has me thinking about the nature of certainty in software development.  Consider two approaches to uncertainty that are completely at odds with each other.

No Weird Behavior

The "No Weird Behavior" approach goes like this: you never want a harmless behavior inside your code that you don't understand.  The reason is that if you don't understand the behavior, you don't really know that it's harmless!

In fact this isn't just a theoretical problem - I only truly started to believe in "No Weird Behavior" after fixing several very hard-to-find, hard-to-reproduce crash bugs, only to discover (once the analysis was done) that the broken code also produced a very frequent, less harmful behavior.  Had I taken the time to investigate that "weird behavior" first (despite its apparent unimportance) it would have led me to the ticking time bomb.

No Weird Behavior implies that a bug isn't fixed until you know why it's fixed, and that randomly changing code until the behavior goes away is absolutely unacceptable.  This shouldn't be surprising; if a bridge were swaying, would you accept an engineer who fixed it by randomly tightening and loosening screws until it stopped?

Wait And See

I get a lot of email with bug reports, questions, and reports of strange sim behavior.  These emails have some unfortunate statistical properties:
  • A good number of them are resolved within a day or two by the user who emailed.
  • A certain percentage are simply never resolved.  (Often I email a question that would point to a clear diagnosis and the user never writes back.  I can only speculate that the user answered the question, found the problem, fixed it, and promptly forgot about our emails.)
  • A certain percentage are solved by the user, who tells me what the problem was, and it was something I would never, ever be able to help them with.  (Things like "I have this program called WickedMP3Player - it turns out if I set its visualizer setting to 'Rainbow' then X-Plane stops crashing" when I've never even heard of the WickedMP3Player program to begin with.)
  • There is a high correlation between bugs reported by only a very small number of users and a resolution from the above categories, or a resolution by the user changing his or her local configuration.
So playing the odds, the right thing to do when a third party presents us with weird behavior is to wait and see who else reports it; the frequency of reports gives us information about the likely resolution.
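To put some (entirely made-up) numbers on that rule of thumb, here is a Bayesian back-of-the-envelope sketch - not anything we actually compute, and every probability in it is an assumption - of how the count of independent reports shifts the odds that a weird behavior is a real bug rather than a local configuration problem:

    # Hypothetical numbers - a sketch, not real tech support data.
    def posterior_real_bug(n_reports,
                           prior_real=0.05,        # assume 5% of weird reports are real bugs
                           p_report_if_real=0.3,   # assume 30% of affected users write in
                           p_report_if_local=0.1): # local problems rarely recur across users
        """Odds the behavior is a real bug, given n independent reports.
        Crude model: treats each report as an independent observation and
        ignores the users who saw the problem but stayed silent."""
        like_real  = p_report_if_real ** n_reports
        like_local = p_report_if_local ** n_reports
        evidence = prior_real * like_real + (1.0 - prior_real) * like_local
        return prior_real * like_real / evidence

    for n in (1, 2, 3, 5):
        print("%d report(s) -> P(real bug) = %.2f" % (n, posterior_real_bug(n)))

With one report the smart money is still on a local problem; by the third or fourth independent report the odds have flipped - which is exactly why the frequency of reports is worth waiting for.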

(And for those of you who think our tech support folks are being lame when they ask if you've updated your drivers: they are playing the odds to the hilt - they ask about drivers first because updating them fixes an absurdly huge number of the tech support calls we get.)

Getting It Wrong

So we have motivation to investigate everything (no weird behavior), motivation to ignore everything (wait and see) and a rule of thumb that the frequency of reports can help us pick which strategy is best.  Of course, sometimes the rule of thumb is completely wrong.

One user reported a crash in the current web updater (version 2.04) - I had not seen this crash anywhere else, and it was inside the operating system's UI code, so I assumed it was a local problem, e.g. some kind of extension or add-on causing trouble for the OS.

The problem, it turns out, was simply low frequency - once the incorrect code made it into 902b1, I got a few reports from more users and realized that this wasn't a local problem, it was weird behavior!  (The bug will be fixed in the 2.05 installer and 902b2, both of which will be out in June.)

Consider this: if you make a measurement of a known quantity with a dubious measuring device, you learn more about the measuring device than the object you are measuring.  (If you have a ruler and you don't know the units, you could determine them by measuring yourself.)

In a number of cases, we have seen serious "should-happen-all-the-time" crash bugs that get reported by very few users.  (Once we know the actual root cause of a bug, we can deduce logically whether it should happen to every user with a given configuration or only intermittently.)  We can then look back at the actual number of reports to make a judgment call on how much testing is really happening.
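Here's the flavor of that judgment call, with invented numbers (the real ones vary bug by bug):

    # Hypothetical back-of-envelope: how big is the testing population?
    reports_received = 3     # reports for a bug that must hit everyone with the setup
    p_user_reports   = 0.25  # assume 1 in 4 affected users actually writes in
    p_has_config     = 0.20  # assume 20% of testers run the affected setup

    affected = reports_received / p_user_reports  # ~12 users actually hit the bug
    testers  = affected / p_has_config            # ~60 people are really testing
    print("Estimated active testers: %.0f" % testers)

If three reports imply only sixty or so people are really exercising a configuration, that tells us more about our beta coverage than about the bug.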

For example, we can make some guesses about how quickly a new Macintosh has saturated the X-Plane user base when there is a serious hardware-specific bug on just that machine.

We had this with the iMacs (where the runway lights were broken) and we could watch the machines sell - slowly at first, but then quite quickly.  (BTW, I think that 10.5.3 fixes this and the anti-aliasing problems.)  We can even see who runs with anti-aliasing on when there is a setting-specific bug (a lot of you do)!

In the end, I think the right approach to balancing "no weird behavior" and "wait and see" is to remember a truth about uncertainty that is very hard for humans to grasp:

The most likely outcome of an uncertain situation is not guaranteed to happen - in fact a lot of the time it won't.*

So we can play the rule of thumb and wait and see, but we always have to keep one eye on the improbable, which is still possible!
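To see why, run some made-up numbers: even if "wait and see" is the right call 90% of the time for any single report (the 0.9 is pure assumption), across many reports the improbable case is nearly certain to show up at least once.

    # Hypothetical: odds that at least one "improbable" case slips
    # through if we always bet on the likely outcome.
    p_right = 0.9  # assumed odds that "wait and see" is correct per report
    for n in (1, 10, 50):
        print("%3d triage calls -> P(at least one miss) = %.2f" % (n, 1 - p_right ** n))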

* Blatant blog cross-promotion... the point of my big rant here was essentially the same: it's easy to expect the most likely outcome, but the unlikely outcome will happen some of the time.
