From BIOS to Chaos · blog.friedl.net

I should really start a whole category for these kinds of posts. The madness in our computing boxes seemingly knows no bounds. I guess few who have been with software for a while would be surprised if they woke up tomorrow to find that the whole financial system had broken down because someone turned on their toaster.¹ Anyhow, here’s how a quirk in the OG PC (the IBM PC) haunted the boot process of the little box next to you for 30 years: the ominous A20 line.

In short, the original IBM PC, based on the Intel 8088, could address 1 MB of memory. That’s 20 address lines. The processor registers were only 16 bits wide. Hence, an addressing scheme based on a 16-bit segment and a 16-bit offset was used. It’s quite simple: first shift the 16-bit segment left by 4 bits, then add the 16-bit offset and finally chop off everything above the 20th bit.
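To make the arithmetic concrete, here is a minimal sketch in Python. The function name `physical_address` and the example values are made up for illustration; it models only the calculation described above, not any real hardware.

```python
ADDRESS_BITS = 20  # the 8088 has 20 address lines


def physical_address(segment: int, offset: int) -> int:
    """Combine a 16-bit segment and a 16-bit offset into a 20-bit address."""
    assert 0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF
    linear = (segment << 4) + offset            # shift left by 4 bits, then add
    return linear & ((1 << ADDRESS_BITS) - 1)   # keep only the low 20 bits


print(hex(physical_address(0x1234, 0x0010)))  # 0x12350
print(hex(physical_address(0x1235, 0x0000)))  # 0x12350
```

Note that both calls name the same byte: many different segment:offset pairs map to one physical address.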

Now this can obviously overflow, like any addition can. And like any integer overflow, if you do nothing about it, this causes a wrap-around. It’s essentially an addition modulo 2^20. This is nothing new. It pops up again and again. The same kind of wrap-around is also the underlying glitch in the Y2K problem.

So far so good. What was maybe less good was that the Intel 80286 could handle 24-bit addresses, but at the same time aimed to be 100% compatible with the 8088 in real mode. Now with a wider address bus our “natural” wrap-around doesn’t happen that naturally anymore, does it? Not when overflowing just 20 bits anyway.
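A small sketch of the difference, again in Python and again with made-up names: `real_mode_address` takes the address bus width as a parameter so the same segment:offset pair can be evaluated for a 20-bit bus (8088) and a 24-bit bus (80286). It is a model of the arithmetic only.

```python
def real_mode_address(segment: int, offset: int, address_bits: int) -> int:
    """Segment:offset translation, truncated to the given bus width."""
    linear = (segment << 4) + offset
    return linear & ((1 << address_bits) - 1)


segment, offset = 0xFFFF, 0x0010  # one byte past the 1 MB mark

print(hex(real_mode_address(segment, offset, 20)))  # 0x0      (wraps on the 8088)
print(hex(real_mode_address(segment, offset, 24)))  # 0x100000 (no wrap on the 80286)

# The highest address a segment:offset pair can form needs 21 bits.
print(hex(real_mode_address(0xFFFF, 0xFFFF, 24)))   # 0x10ffef
```

So software that expected FFFF:0010 to land at address 0 would instead touch memory just above 1 MB on the newer chip.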

Apparently some software made use of that overflow. Why they did, I don’t know. Laziness, carelessness, oversight, or, in the best case, an intentional hack. IBM didn’t want to break said software on their machines and so they devised a hack - note how IBM hacked a solution for a non-backwards-compatible behavior in Intel processors that only came about because some software relied on overflowing addresses. This sounds an awful lot like some managers did not know how to say “no” here.

The hack used the keyboard controller. Yes, a pin in the keyboard controller would be used to disable (i.e. null out) address line A20 (which is the 21st bit in the address). Somehow this is supposed to simulate a wrap-around and apparently it worked well enough. This is insanity already. If your bones aren’t chilled yet, wait for what came next.
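Why forcing a single address line to zero works “well enough” can be sketched as follows. This is a toy model in Python, not a description of the actual circuitry; `bus_address` and `wrap_8088` are hypothetical names. The point is that real-mode addresses never exceed 0x10FFEF, so bit 20 is the only bit above the first megabyte that can ever be set.

```python
A20_MASK = 1 << 20  # address line A20 is the 21st bit (bit index 20)


def bus_address(linear: int, a20_enabled: bool) -> int:
    """Model of the gate: with A20 disabled, bit 20 is forced to zero."""
    return linear if a20_enabled else linear & ~A20_MASK


def wrap_8088(linear: int) -> int:
    """True 20-bit wrap-around as on the original 8088."""
    return linear & 0xFFFFF


highest = (0xFFFF << 4) + 0xFFFF  # 0x10FFEF, the top of real-mode reach

# In the overflow region, masking A20 matches a true 20-bit wrap.
same = all(bus_address(a, False) == wrap_8088(a)
           for a in range(0x100000, highest + 1))
print(same)  # True

# Outside real-mode reach the two differ: only one bit is cleared.
print(hex(bus_address(0x300000, False)))  # 0x200000
print(hex(wrap_8088(0x300000)))           # 0x0
```

The last two lines show why the gate has to be switched off again before using memory above 1 MB: with A20 forced low, every odd megabyte aliases the even one below it.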

Over time people sought to create simpler solutions for the A20 line problem. Admittedly, using the keyboard controller is also butt ugly. But alas, as it so happens, we ended up with many incarnations of “the simple solution” and in the end nobody knew which of them was the right one to use. So here we are with 171 lines of assembler to maybe, just maybe, enable the A20 line. Something akin to this ran in the major bootloader of every x86-based system until 2010. It looks like x64 got rid of this for the most part, but then again people are still confused to this day.

And if you haven’t abandoned all hope yet, here’s how Microsoft invalidated the interrupt descriptor table to intentionally cause a triple fault, causing the system to reset and forcefully enter real mode: https://web.archive.org/web/20110629015711/http://blogs.msdn.com/b/larryosterman/archive/2005/02/08/369243.aspx.

It’s just mad hacks all over the place. That the world is still running on these layers upon layers of crapware is nothing short of a miracle. But I can’t help but marvel at the alluring ingenuity of these hacks. At least until the world ends because this house of cards inevitably and completely falls apart. Then my marveling is definitely over.

¹ The post mortem will later read: […] causing a server in IBM’s basement to reboot. Said server was set up for a product demo in 1973. The server was later sold as an AI trading robot to the Agriculture Bank of China. An engineer at the ABC hacked together a workaround for the clock skew that had built up over 5 decades in IBM’s machine. In sprint 342. Because their scrum master convinced management that perfect is the enemy of finished. Also they had to groom their burndown chart so their scrum master could get a promotion. The reboot caused the clocks to synchronize via NTP. The corrected clock caused the workaround to malfunction. How the IBM machine knew about NTP is still being investigated. Also, where it got the clock from is still unclear. […]