A few of the interesting mistakes I made in Basis. Part 1: Groundhog Day.
A few days after Shopi’s launch I was checking the server logs, scanning down to get an idea of how things were going. And out of the blue I was in Groundhog Day: the same transactions were being repeated over and over again as I scrolled down. I started paging quickly to see where they ended … and they didn’t. Eventually one cycle ended, then some normal interactions, followed by another seemingly endless cycle. It was at this point I thought to check the log size: half a gigabyte and climbing rapidly.
Now, tales of product launches going horribly awry are old hat these days when everyone’s dog seems to run an online service. I enjoy these war stories as much as the next nerd, and I had thought I knew what it would be like. But let me tell you there’s a special sort of exquisite gut-wrenching dread that you can only experience when something you thought was going great, something you put your heart into, something you had tested literally for over a year, dammit, starts going wrong before your eyes and you have no idea why. All you know is that the poor sods on the other end of it are probably starting to think that maybe they should try something else, you know, something that actually works.
Now, this may not seem related, but bear with me: did you know that floating point numbers are the devil’s work? Floating point numbers kill. They cause accidents costing billions of dollars. And they lie: we can’t even ask them to represent honest numbers like 0.1 without them deceiving us.
Floating point numbers also caused me 24 hours of barely-suppressed panic.1 Because, as well as being useless for actually storing numbers,2 it turns out they also can’t be reliably round-tripped in textual form. If you write a floating point number as a string to a certain precision, and then read it in again, you’re not guaranteed to get the same string if you write it out to again.
Boy, I wish I’d known that before I launched.
Anything that happens, happens. Anything that, in happening, causes something else to happen, causes something else to happen. Anything that, in happening, causes itself to happen again, happens again. It doesn’t necessarily do it in chronological order, though.
— Douglas Adams, Mostly Harmless
A rite of passage when building any sort of system with two or more entities sending messages to each other is to be frantically hitting Ctrl-C (or its moral equivalent) when you trigger your first infinite event loop.
So one of the things you become conditioned to think about, when multiple entities in a system can respond to an event by emitting an event, is whether loops are possible, and how you’re going to detect and terminate them. Because if you don’t, Murphy’s Law has a special corollary with your name on it.
And here I am presuming to lecture you, when I went ahead and ignored this advice in Basis. Oh, I knew that there was an theoretical event cycle in there, but I also ‘knew’ it wasn’t possible to trigger it. And yet there I was, looking at a half-gigabyte log documenting an event cycle that couldn’t happen.
Long story short: the event cycle was due to the server and the client disagreeing about the hash ID for states. And because they disagreed consistently — as opposed to one-off’s caused by communication errors — an update caused a reset, which caused an update, which caused a reset, …
The reason they disagreed about state ID’s? Floating point numbers.
The protocol between client and server is text-based, and FP numbers are transferred using their scientific representation: for example
1.38641674071679E+02. And while I was aware that this approach could lose some (miniscule amount of) of precision, what I didn’t realise was that certain — let’s call them evil — numbers convert to different strings on iOS and the JVM. Since I was using the textual value of these numbers3 in the state ID hash, the ID’s were different.
For example, when the number
138.6416740716795 is read into, and then written out again on iOS, we get:
And on the JVM we get:
The kicker was why I hadn’t seen this bug in over a year of testing: the numbers were GPS coordinates. I had managed to write a location-based bug. And qualified for a new badge of honour: “works at my coordinates”.
Even though I had tested at simulated locations around the world, the simulated coordinates lacked the precision to trigger the bug, and my local coordinates apparently just weren’t evil enough.
Just Don’t Do That Then
Another long story short (*i.e.* most of those 24 horrible hours): it turned out that … I couldn’t fix it. iOS and the JVM simply have different ways of approximating FP numbers that cannot be precisely represented in decimal form — and I wasn’t about to try to try to write my own.4 The root of the problem is: there is no absolutely correct way to do this, because FP numbers simply cannot be perfectly represented in decimal.
Decimal. That word that turned out to be the key to the solution. Crazy idea: why not store decimal numbers, as decimals? Deal with the problems of FP numbers by simply not dealing with them.
And in fact Clojure already has excellent support for decimals (they’re Java
BigDecimal’s under the hood), and iOS has
NSDecimalNumber. All I had to do was add support for
NSDecimalNumber to my EDN library for iOS and OS X.
Now You Have Two Problems
Floating point is the default way to represent real numbers in the vast majority of languages, and this gives many of us a false sense of security. I don’t think most developers truly understand how slippery floating point is, even if they’ve been forewarned by their CS training. Floating point numbers simply do not behave as we intuitively expect. Worse, they lull us into thinking we understand them by working fine in many situations, but then fail violently in edge cases.
The Patriot missile failure I linked to earlier happened to a missile battery which had been operating just fine, right up to the point that its clock value became large enough that, after being multiplied by
0.1 (FP can’t accurately represent
1/10 remember), the approximation error ballooned out of control. The algorithm failed, and an incoming enemy missile was not intercepted. The missile interception algorithm was correct only for certain ranges of numbers, a design error that I suspect many of us would make.
Beware of floating point numbers! Consider whether you’d be better served by a decimal representation if you’re manipulating numbers whose decimal form you care about.
And when something is core to the functioning of your application, do not allow any fuzziness to creep in. I’ve been down on floating point in this post, but the blame for the problem lies solely with me: I failed to understand the implications of converting floating point numbers to and from decimal, and paid the price.
Update (June 9, 2014)
I got some great feedback on this post: thanks to those people who got in touch. I’ve updated the post to take them into account.
One excellent point made by several people is that I should simply keep in mind that FP numbers are, like any number on the computer, represented in base 2, and thus can precisely represent any binary number (so long as it fits). But, just like
1/3 can be represented trivially in base 3 (it’s simply
0.1) but can’t be represented exactly in base 10 (
1/10 can’t be repesented precisely in binary.
I originally mistyped this as “gloating point numbers”, which just has to be Freudian, or something. ↩︎
I hope it’s obvious I’m trying to be amusing here. Of course I’m not really claiming that floating point numbers are useless, because quite clearly they’re not. I am going to go ahead and claim they’re a hazard and that, for most of us there is a better solution. But I’m getting ahead of myself. ↩︎
Due to the fact that the numbers might not be exactly transferred, I couldn’t use their binary (IEEE 754) representation for the hash.
Would a binary format have mitigated this? Yep. And perhaps a binary format would be better in general here — but I had already launched. Plus I would have added the overhead of rolling by own IO for a binary EDN protocol on iOS. ↩︎
There are a number of standard ways to round FP numbers. The closest one to iOS’s and Java’s behaviour seemed to be “half even”, otherwise known as “banker's rounding”. But even that didn’t didn’t always produce the same result on both. ↩︎