[SIGCIS-Members] Numerical errors

Ceruzzi, Paul CeruzziP at si.edu
Sat Jul 4 14:44:23 PDT 2020


I have two contributions to this fascinating discussion -- I hope you find them of interest. Something for the Fourth of July. One is serious, the other less so (but you be the judge):

As most of you know, the Space Shuttle was equipped with five identical IBM 4-pi computers, with a voting circuit to override any hardware fault in one of them. The fifth computer was programmed by a separate team. The reasoning was that an error in the software would vitiate the redundancy of the hardware, since all five would possibly have the same "bug." After the Shuttle was in operation for a while, NASA realized that an error in the _specifications_ would have been common to all five, regardless of who programmed them or whether they were programmed correctly. In other words, the "belt and suspenders" philosophy was perhaps flawed.
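
To make that reasoning concrete, here is a toy sketch in Python -- purely my own illustration, nothing like the actual Shuttle avionics logic. A majority voter masks a fault in one redundant channel, but if every channel runs the same buggy software, or implements the same mistaken specification, the voter happily passes the common wrong answer through.

from collections import Counter

def vote(outputs):
    """Return the value a majority of redundant channels agree on, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

print(vote([42, 42, 42, 41, 42]))   # one faulty channel: masked, returns 42
print(vote([41, 41, 41, 41, 41]))   # common software or spec error: returns 41, confidently wrong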

The second is my recollection of a meeting at the Charles Babbage Institute in its inaugural year, when George Stibitz of Bell Labs was in attendance. While at Bell Labs, Stibitz was active in developing error-detecting codes for relay computers, which were notorious for their tendency to encounter intermittent hardware faults. As others mentioned, Hamming extended this work, and error-detecting and error-correcting codes are now common in most digital systems. After retiring from Bell Labs, Stibitz moved to Vermont and took a post at Dartmouth, across the Connecticut River. He told us that when he went to get a Vermont driver's license, he was told that they couldn't give him the license that day because "the computer made an error." He replied "That's impossible. I invented the computer, and when I did I made sure that it could not make errors." He had a wonderful sense of humor, so perhaps he was just messin' with us. If anyone else was there & remembers the story, let me know.

Paul Ceruzzi
________________________________
From: Members <members-bounces at lists.sigcis.org> on behalf of thomas.haigh at gmail.com <thomas.haigh at gmail.com>
Sent: Saturday, July 4, 2020 1:05 PM
To: 'Matthew Kirschenbaum' <mkirschenbaum at gmail.com>; 'members' <members at sigcis.org>
Subject: [SIGCIS-Members] Numerical errors



Hello Matt,



Great question. I’m going to reply first on the normal treatment of error in numerical applications, and separately on the larger question of design mistakes in hardware and software. You are correct that the Pentium bug fits into the second category, but many of the replies have focused on the first and they are both relevant.



I’m not actually competent in numerical mathematics, but a spell in 2004-6 conducting full-career oral history interviews with numerical software specialists, as a subcontractor for the Society for Industrial and Applied Mathematics on a DOE grant, exposed me to a lot of the history of this area in ways that have occasionally surfaced in my other work. The oral histories from the project are at http://history.siam.org/oralhistories.htm.



One of the things it taught me is that the question of a “correct” numerical answer is not nearly as straightforward as most of us assume. In integer arithmetic, sure, 2+2=4 and so on. But the kinds of problems scientists needed early computers for invariably involved some very large and small quantities. So even though the hardware didn’t support floating point, they basically had to do the same thing manually, storing a certain number of significant digits and tracking the scaling factor that related these to the actual quantity. If you look at the ENIAC flow diagram on our poster, https://eniacinaction.com/docs/MonteCarloPoster.pdf, you’ll see little notations tracking the power-of-ten scaling factors in front of many of the variable names in the program boxes. That manual process was itself a major source of error and frustration, so from the 1950s onward all large computers intended for scientific use included hardware floating point, so that if, for example, a very small quantity was multiplied by a very large constant the computer would figure out both the significant digits and the power of ten (or two) needed to scale them.
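
Here is a minimal sketch in Python of the bookkeeping involved -- my own illustration, not actual ENIAC practice in any detail. The programmer carries only the significant digits as an integer and tracks a power-of-ten scale factor by hand; surplus digits get thrown away after each multiplication, which is exactly where error creeps in.

def scaled_multiply(sig_a, exp_a, sig_b, exp_b, digits=5):
    """Multiply two hand-scaled numbers, keeping only `digits` significant digits."""
    sig = sig_a * sig_b
    exp = exp_a + exp_b
    while abs(sig) >= 10 ** digits:      # throw surplus digits away, adjusting the scale
        sig = round(sig / 10)            # each discarded digit is a little rounding error
        exp += 1
    return sig, exp

# 0.00012345 * 6789000.0, carried as (12345, -8) and (67890, +2)
sig, exp = scaled_multiply(12345, -8, 67890, 2)
print(sig, exp)                # (83810, -2), i.e. 838.10 -- the exact product is 838.10205
print(0.00012345 * 6789000.0)  # hardware floating point does the same bookkeeping automatically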



But whether done manually or automatically, the numbers being represented are only approximations of the actual quantities. When doing calculations manually, scientists and engineers have always had to decide how many digits to use, and doing that responsibly required some knowledge of how reliable the final answer would be, given the initial rounding and the potential for compounding errors as surplus digits were thrown away each time numbers were multiplied.
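
A quick way to see the effect, using Python's decimal module to stand in for a human computer carrying a fixed number of significant digits (the numbers are arbitrary, chosen only for illustration):

from decimal import Decimal, getcontext

def chain_product(precision, n=1000):
    """(1.001)**n computed step by step, carrying `precision` significant digits."""
    getcontext().prec = precision
    x = Decimal(1)
    for _ in range(n):
        x *= Decimal("1.001")   # each product is rounded back to `precision` digits
    return x

low = chain_product(4)          # the sort of precision a hand computation might carry
ref = chain_product(30)         # effectively exact, for comparison
getcontext().prec = 30
print(low, ref, abs(low - ref) / ref, sep="\n")   # last line: accumulated relative error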



The other important thing to understand here is that in real-world computing even things like differential equations, which college calculus might fool you into thinking can be solved exactly, are solved approximately with numerical methods. These methods are usually iterative, based on measuring how far off target the current answer is so that an initial guess eventually converges on an accurate approximation. The conventional numerical methods found in textbooks, etc. were not well suited to automatic computers. Digital computers could carry out operations thousands of times faster than human computers, which in the worst case allowed errors to compound thousands of times faster.
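
For a deliberately simple illustration (my choice of example, the most basic textbook scheme): Euler's method applied to dy/dt = -y in Python. The answer is only an approximation, and its accuracy depends on how many steps are taken.

import math

def euler(f, y0, t_end, steps):
    """Step an initial value y0 forward to t_end in `steps` small increments."""
    y, t = y0, 0.0
    h = t_end / steps
    for _ in range(steps):
        y += h * f(t, y)     # advance by one small step
        t += h
    return y

exact = math.exp(-1.0)       # the true solution of dy/dt = -y, y(0) = 1, at t = 1
for steps in (10, 100, 1000):
    approx = euler(lambda t, y: -y, 1.0, 1.0, steps)
    print(steps, approx, abs(approx - exact))   # error shrinks roughly in proportion to the step size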



The new field of “numerical analysis” grew up at the intersection of computing and applied mathematics to address this. It included new methods to track the compounding of numerical errors through computations, and the development of more efficient and accurate algorithms for common mathematical chores such as calculating matrix eigenvalues. I heard the terms “overflow,” “underflow,” “truncation error” and “rounding error” a lot in the interviews, as well as more esoteric terms such as “successive overrelaxation.” One stream of work, on “backward error analysis,” led to an early Turing Award for Jim Wilkinson (https://amturing.acm.org/award_winners/wilkinson_0671216.cfm). Those methods were also more complex and harder for non-specialists to implement reliably, which led to some of the earliest initiatives in software libraries (SHARE), peer review of software, portable software (BLAS, PFORT), and software packaging and distribution (LINPACK and EISPACK). One side of the story I did tell was through biographies of Cleve Moler (https://tomandmaria.com/Tom/Writing/MolerBio.pdf) and Jack Dongarra (https://tomandmaria.com/Tom/Writing/DongarraBio.pdf). Moler founded MathWorks (which you probably hear sponsoring things on NPR). The specialists also complained that ordinary scientists and engineers didn’t want to develop the skills needed to understand which methods could safely be applied to which classes of equation, and so would introduce errors by grabbing the code for an inappropriate method. (A very popular book, Numerical Recipes, was accused of encouraging this and earned the disdain of some of my interviewees).
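
A standard textbook example -- not from my interviews -- of why the choice of method matters: two algebraically equivalent ways to compute the small root of a quadratic, one of which loses nearly all its accuracy to cancellation on ordinary double-precision arithmetic.

import math

b, c = 1e8, 1.0                          # roots of x**2 + b*x + c are about -1e8 and -1e-8

d = math.sqrt(b * b - 4 * c)
naive_small_root = (-b + d) / 2          # subtracts two nearly equal numbers
stable_small_root = (2 * c) / (-b - d)   # algebraically the same root, but no cancellation

print(naive_small_root)      # roughly -7.45e-9 on IEEE doubles: badly wrong
print(stable_small_root)     # roughly -1e-8: essentially exact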



Doing the interviews, I was struck by the very strong and personal aesthetic preferences the numerical software producers expressed for the floating point arithmetic of particular machines. The IBM 709X machines were acclaimed, whereas the CDC supercomputers were disdained. I sneaked a little of this into the Revised History of Modern Computing with Paul Ceruzzi, in terms of the terrible step back introduced with the IBM System/360 arithmetic. Like the Pentium bug, this needed expensive fixes to installed computers, but it wasn’t a bug, just the result of the design engineers making decisions without a good idea of how they would impact scientific users.



Although System/360 was intended to work equally well for scientific and data processing applications, it was much more successful for data processing. The problems began with the System/360 floating point. It used hexadecimal (base 16) rather than binary, which was efficient for smaller, business-oriented machines but would create major problems with rounding errors for scientific users. The new general-purpose registers raised more problems with the handling of single and double precision numbers. When IBM described its new architecture, William Kahan, then at the University of Waterloo, and others “went nuts” as they “recognized something really perverse about the arithmetic.” IBM found ways to work around some of the issues in software libraries, but Kahan recalls that after the full scale of the problem was acknowledged in 1966, following lobbying by SHARE, the company spent millions tweaking the hardware of machines already installed.
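
A back-of-envelope comparison of why base 16 hurt, assuming the S/360 short format's 24-bit fraction read as six hexadecimal digits, against a binary format carrying 24 significand bits. Because a normalized leading hex digit can begin with up to three zero bits, the worst-case gap between adjacent representable numbers is about eight times larger.

import math

def relative_spacing(base, digits):
    """Worst-case gap between adjacent representable numbers, relative to the number itself."""
    return base ** (1 - digits)

hex_worst = relative_spacing(16, 6)   # about 9.5e-7 for the hexadecimal short format
bin_worst = relative_spacing(2, 24)   # about 1.2e-7 for a 24-bit binary significand

print(hex_worst, bin_worst)
print(math.log2(1 / hex_worst), math.log2(1 / bin_worst))  # roughly 20 versus 23 effective bits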



The wide range of approaches to floating point arithmetic was also a threat to portability. FORTRAN code could be moved from one system to another, but it would give different answers when run on them. There might also be relatively large shifts in answers based on tiny variations in the initial inputs. So the question of error gets complicated, as the “right” answer depends on the machine the code is being run on. Also, an algorithm might run accurately but give misleading answers because it is being applied to an equation with unsuitable characteristics.
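
The sensitivity point is easy to demonstrate. Here is a contrived 2x2 system of equations (my own toy example): changing one right-hand-side value in the sixth decimal place moves the computed answer from about (2, 0) to about (1, 1).

def solve2x2(a, b, c, d, e, f):
    """Solve a*x + b*y = e, c*x + d*y = f by Cramer's rule."""
    det = a * d - b * c
    return ((e * d - b * f) / det, (a * f - e * c) / det)

print(solve2x2(1.0, 1.0, 1.0, 1.000001, 2.0, 2.0))        # roughly (2.0, 0.0)
print(solve2x2(1.0, 1.0, 1.0, 1.000001, 2.0, 2.000001))   # roughly (1.0, 1.0)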



Kahan is the central figure in addressing these problems, leading the IEEE standards effort to come up with an optimal floating point design that could be standardized across manufacturers. Luckily, Intel was the first adopter, thanks to a consulting contract Kahan had. He’s a fascinating figure (I wrote the profile at https://amturing.acm.org/award_winners/kahan_1023746.cfm) but relatively little known because floating point is seen as such a niche area. When I showed up for the interview he talked for 24 hours spread over four days (http://history.siam.org/pdfs2/Kahan_final.pdf). Here’s how we tell that story in the Revised History:
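
One small way to see what the standard bought us, assuming the Python interpreter uses IEEE 754 binary64 for its floats, as essentially all current machines do: every conforming system stores 0.1 as exactly the same 64-bit pattern and agrees on how special values behave.

import struct

def double_bits(x):
    """Return the 64-bit IEEE 754 pattern of a Python float as hex."""
    return struct.pack(">d", x).hex()

print(double_bits(0.1))                            # '3fb999999999999a' wherever IEEE 754 is used
print(float("inf"), float("inf") - float("inf"))   # inf and nan: behaviour fixed by the standard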



Doing engineering calculations or financial modelling cost a lot less with a personal computer, such as the Apple II, than with a mainframe or timesharing system. But only small jobs would fit into its limited memory and run acceptably quickly. Complex models still needed big computers. That began to change with the IBM PC. Even the original IBM PC could be expanded to much larger memory capacities than the Apple.

The other big difference was floating point. Since the 1950s capable floating-point hardware support had been the defining characteristic of large scientifically-oriented computers. The 8088 used in the original PC did not support floating point and its performance on technical calculations was mediocre. But every PC included an empty socket waiting for a new kind of chip, the 8087 “floating point coprocessor.” The 8087 was the first chip to implement a new approach to floating point, proposed by William Kahan and later formalized in the standard IEEE 754. Its adoption by firms including DEC and IBM was a major advance for scientific computing. Code, even in a standard language like FORTRAN, had previously produced inconsistent floating-point results when run on different computers. According to Jerome Coonen, a student of Kahan’s who managed software development for the original Macintosh, this standardization on robust mechanisms was a “huge step forward” from the previous “dismal situation…. Kahan’s achievement was having floating point taken for granted for 40 years.”

The 8087 was announced in 1980 but trickled onto the market because it pushed the limits of Intel’s production processes. Writing in Byte, Steven S. Fried called it “a full-blown 80-bit processor that performs numerical operations up to 100 times faster… at the same speed as a medium-sized minicomputer, while providing more accuracy than most mainframes.” The 8088 itself had only 29,000 transistors, but its coprocessor needed 45,000 to implement its own registers and stack.

Code had to be rewritten to use the special floating-point instructions, which were executed in parallel with whatever the main processor was doing. Scientific users quickly embraced the 8087, which made the PC a credible alternative to minicomputers. Fried had promised that “the 8087 can also work wonders with business applications” but software support was limited. Even Lotus 1-2-3, which existed only to crunch numbers, did not utilize it. Fried began a business selling patches to add coprocessor support to such packages. Over time, IEEE-style floating point became a core part of every processor. By the time Intel launched the 80486 in 1989, its factories were just about able to manufacture a one-million-transistor chip with a coprocessor built in. Software developers, particularly videogame programmers, began to use floating point instructions. By the late 1990s, PC processors competed largely on the strength of their floating-point capabilities.



So that’s two big kinds of error to dig into: errors related to the handling of arithmetic in a particular machine, and errors introduced by the algorithm (or, as the specialists say, the “method”) chosen to solve an equation numerically. Thanks to reliance on IEEE standard floating point and the eclipse of FORTRAN by modern systems like MATLAB, both have been largely black-boxed from typical scientific users.



Best wishes,



Tom





From: Members <members-bounces at lists.sigcis.org> On Behalf Of Matthew Kirschenbaum
Sent: Friday, July 3, 2020 12:55 PM
To: members <members at sigcis.org>
Subject: [SIGCIS-Members] the nature of computational error



Hello all,



I am interested in a better understanding of the nature of computational error. My sense is that actual, literal (mathematical) mistakes in modern computers are quite rare; the notorious Pentium bug of the early 1990s is the exception that proves the rule. Most bugs are, rather, code proceeding to a perfectly correct logical outcome that just so happens to be inimical or intractable to the user and/or other dependent elements of the system. The Y2K "bug," for instance, was actually code executing in ways that were entirely internally self-consistent, however much havoc the code would wreak (or was expected to wreak).
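
[For reference, the division most often quoted in accounts of the Pentium FDIV bug, from the reports by Thomas Nicely and Tim Coe; the flawed-chip value below is the widely published figure, not something re-verified here.]

print(4195835 / 3145727)   # correct hardware: about 1.3338204491...
# Flawed Pentiums were widely reported to return about 1.3337390689,
# wrong from the fifth significant digit onward.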



Can anyone recommend reading that will help me formulate such thoughts with greater confidence and accuracy? Or serve as a corrective? I'd like to read something fundamental and even philosophical about, as my subject line has it, the nature of computational error. I'd also be interested in collecting other instances comparable to the Pentium bug--bugs that were actual flaws and mistakes hardwired at the deepest levels of a system.



Thank you-- Matt



--

Matthew Kirschenbaum
Professor of English and Digital Studies
Director, Graduate Certificate in Digital Studies
Printer's Devil, BookLab
University of Maryland

mgk at umd.edu

