Real-life examples of floating point errors

The Patriot missile failures

On February 25, 1991, during the Gulf War, an American Patriot missile battery in Dhahran, Saudi Arabia, failed to track and intercept an incoming Iraqi Scud missile. The Scud struck an American Army barracks, killing 28 soldiers and injuring around 100 other people. A report of the General Accounting Office, GAO/IMTEC-92-26, entitled Patriot Missile Defense: Software Problem Led to System Failure at Dhahran, Saudi Arabia, reported on the cause of the failure.

The failure was due to roundoff error in the system's clock arithmetic. The system was designed to be on alert for only short periods at a time. Instead, the battery had been running continuously for days (about 100 hours, according to the report). The GAO report describes the clock as counting time in tenths of a second and converting it to seconds by multiplying by 1/10 in a 24-bit fixed-point register, rather than in floating point. Because 1/10 has no exact binary representation, every tick carried a tiny error, and after days of uptime the accumulated error was too large for accurate timing of incoming Scud velocities.
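To get a feel for the size of the clock error, here is a minimal Python sketch. It assumes the constant 1/10 was chopped to 23 bits after the binary point, which is one commonly quoted reading of the 24-bit register in the GAO report. The names `FRACTION_BITS`, `chopped_tenth`, and `clock_drift_seconds` are illustrative, and this is not the Patriot code.

```python
from fractions import Fraction

# Assumption: 1/10 is chopped (not rounded) to 23 bits after the binary point.
FRACTION_BITS = 23


def chopped_tenth(bits: int = FRACTION_BITS) -> Fraction:
    """Return 1/10 truncated to `bits` binary fraction digits, as an exact fraction."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)


def clock_drift_seconds(hours_up: float) -> float:
    """Error in the computed time after `hours_up` hours of 0.1-second ticks."""
    ticks = int(hours_up * 3600 * 10)      # the clock counts tenths of a second
    exact_seconds = Fraction(ticks, 10)    # what the time should be
    computed_seconds = ticks * chopped_tenth()
    return float(exact_seconds - computed_seconds)


if __name__ == "__main__":
    for hours in (1, 8, 100):
        print(f"{hours:>3} h uptime: clock is off by {clock_drift_seconds(hours):.4f} s")
```

Under these assumptions the drift after 100 hours comes out at roughly a third of a second. That sounds small, but a missile moving at well over a kilometer per second covers several hundred meters in that time.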


The Vancouver Stock Exchange

The Vancouver Stock Exchange ran for 22 months with a bad algorithm for computing its index.  The initial value of the index was normalized to 1000.000.  Rather than recomputing the index from all the current stock prices, somebody thought it would be more efficient to include only the stocks whose prices had actually changed, since some stocks don't change price on some days.  So, instead of the obvious "add all the stock prices up" algorithm, they used:
 
    for each stock S
        if S changed in price today then
            index := index + change in S
        end if
    end for

Of course, the changes are almost always small compared to 1000.000, so each update adds a small number to a big one and loses some of the small number's low-order digits.

Result: loss of precision. After 22 months, when the actual value of the VSE index should have been about 1098.892, the accumulated roundoff error had left the index at about 520.
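
The effect of a one-sided error on a running total can be sketched in Python. The example assumes the index was kept to three decimal places and truncated (rather than rounded) after every update, which is how this incident is usually reported. The price changes are random made-up data, and `run_index` and `make_changes` are illustrative names.

```python
import random
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

THREE_PLACES = Decimal("0.001")
START = Decimal("1000.000")


def run_index(changes, rounding):
    """Apply each change to the index, keeping three decimals after every update."""
    index = START
    for change in changes:
        # ROUND_DOWN truncates; for a positive index that always loses value.
        index = (index + change).quantize(THREE_PLACES, rounding=rounding)
    return index


def make_changes(count, seed=0):
    """Hypothetical zero-mean price changes with six decimal places."""
    rng = random.Random(seed)
    return [Decimal(rng.randint(-50_000, 50_000)) / Decimal(1_000_000)
            for _ in range(count)]


if __name__ == "__main__":
    changes = make_changes(10_000)
    exact = START + sum(changes)
    print("exact    :", exact)
    print("truncated:", run_index(changes, ROUND_DOWN))
    print("rounded  :", run_index(changes, ROUND_HALF_EVEN))
```

Each truncation discards on average about 0.0005, always in the same direction, so the truncated index drifts steadily below the exact one while the rounded index stays close to it. With thousands of updates a day, a drift of this kind adds up to hundreds of points over 22 months.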

The sinking of the Sleipner A offshore platform

(Taken from http://www.ima.umn.edu/~arnold/disasters/)

The Sleipner A platform produces oil and gas in the North Sea and is supported on the seabed at a water depth of 82 m. It is a Condeep type platform with a concrete gravity base structure consisting of 24 cells and with a total base area of 16,000 m². Four cells are elongated to shafts supporting the platform deck. The first concrete base structure for Sleipner A sprang a leak and sank during a controlled ballasting operation in preparation for deck mating in Gandsfjorden outside Stavanger, Norway, on 23 August 1991.

Immediately after the accident, the owner of the platform, Statoil, a Norwegian oil company, appointed an investigation group, and SINTEF was contracted to be the technical advisor for this group.

The investigation into the accident is described in 16 reports...

The conclusion of the investigation was that the loss was caused by a failure in a cell wall, resulting in a serious crack and a leakage that the pumps were not able to cope with. The wall failed as a result of a combination of a serious error in the finite element analysis and insufficient anchorage of the reinforcement in a critical zone.

The top deck weighs 57,000 tons, and provides accommodation for about 200 people and support for drilling equipment weighing about 40,000 tons. When the first base structure sank in August 1991, the crash caused a seismic event registering 3.0 on the Richter scale, and left nothing but a pile of debris at a depth of 220 m. The failure involved a total economic loss of about $700 million. (The source page includes a photo and a sketch of the platform.)


The cells are 12 m in diameter. The cell wall failure was traced to a tricell, a triangular concrete frame placed where the cells meet. (The source page shows the 24 cells and 4 shafts at the sea surface, and a tricell undergoing failure testing.)

The post-accident investigation traced the error to an inaccurate finite element approximation of the linear elastic model of the tricell (using the popular finite element program NASTRAN). The shear stresses were underestimated by 47%, leading to insufficient design. In particular, certain concrete walls were not thick enough. More careful finite element analysis, made after the accident, predicted that failure would occur with this design at a depth of 62 m, which matches well with the actual occurrence at 65 m.

 

The explosion of the Ariane 5

(Taken from http://www.ima.umn.edu/~arnold/disasters/)

On June 4, 1996, an unmanned Ariane 5 rocket launched by the European Space Agency exploded just forty seconds after its lift-off from Kourou, French Guiana. The rocket was on its first voyage, after a decade of development costing $7 billion. The destroyed rocket and its cargo were valued at $500 million. A board of inquiry investigated the causes of the explosion and in two weeks issued a report. It turned out that the cause of the failure was a software error in the inertial reference system. Specifically, a 64-bit floating point number relating to the horizontal velocity of the rocket with respect to the platform was converted to a 16-bit signed integer. The number was larger than 32,767, the largest integer storable in a 16-bit signed integer, and thus the conversion failed.

The following paragraphs are extracted from the report of the Inquiry Board. An interesting article on the accident and its implications by James Gleick appeared in The New York Times Magazine of 1 December 1996. The CNN article reporting the explosion, from which the above graphics were taken, is also available.

On 4 June 1996, the maiden flight of the Ariane 5 launcher ended in a failure. Only about 40 seconds after initiation of the flight sequence, at an altitude of about 3700 m, the launcher veered off its flight path, broke up and exploded.

The failure of the Ariane 501 was caused by the complete loss of guidance and attitude information 37 seconds after start of the main engine ignition sequence (30 seconds after lift-off). This loss of information was due to specification and design errors in the software of the inertial reference system.

The internal SRI [inertial reference system] software exception was caused during execution of a data conversion from 64-bit floating point to 16-bit signed integer value. The floating point number which was converted had a value greater than what could be represented by a 16-bit signed integer.
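
The kind of conversion described here can be illustrated with a short Python sketch. This is not the flight software. The function names and sample values are hypothetical, and the saturating variant is shown only as one possible way to handle an out-of-range value, not as the fix the Inquiry Board recommended.

```python
INT16_MIN = -(2 ** 15)      # -32768
INT16_MAX = 2 ** 15 - 1     #  32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value does not fit."""
    truncated = int(value)  # drops the fractional part (rounds toward zero)
    if not INT16_MIN <= truncated <= INT16_MAX:
        raise OverflowError(f"{value!r} does not fit in a 16-bit signed integer")
    return truncated


def to_int16_saturating(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping out-of-range values."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


if __name__ == "__main__":
    for sample in (1234.9, 40000.0):   # hypothetical readings
        try:
            print(sample, "->", to_int16_checked(sample))
        except OverflowError as exc:
            print("conversion failed:", exc)
    print("saturated:", to_int16_saturating(40000.0))
```

The point of the checked version is that an out-of-range value produces an exception that some caller has to handle. If nothing handles it, a single conversion can take down the whole computation.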

London Millennium Bridge, wobbling (compare Tacoma Narrows Bridge)

(Simulation fails because of wrong estimates for pedestrian forces, 2000)
http://www.arup.com/MilleniumBridge/ Choose: IN DEPTH, Technical and Video
More stories ...

Collection of Software Bugs

by Prof. Thomas Huckle

http://wwwzenger.informatik.tu-muenchen.de/persons/huckle/bugse.html
