Ken Shirriff's blog: reverse-engineering


Reverse engineering circuitry in a Spacelab computer from 1980

Spacelab was a reusable laboratory that could be carried in the cargo bay of the Space Shuttle, providing lab space for astronauts and experiments. Spacelab was controlled by a French-built minicomputer, called the Mitra 125 MS. Unlike modern computers, this computer didn't contain a microprocessor chip. Instead, its 16-bit processor was constructed from several boards of chips. In this article, I reverse-engineer one of the processor boards, shown below, part of the computer's Arithmetic/Logic Unit (ALU).

The Mitra 125 MS computer, built by CIMSA, with one of the ALU/register cards shown.

Spacelab consisted of a pressurized cylindrical laboratory that held experiments, computers, and work areas for researchers. A tunnel connected the laboratory to the Shuttle, allowing researchers to move between the Shuttle and Spacelab. Spacelab also supported up to five unpressurized "pallets" that were exposed to space, holding experiments such as telescopes and sensors. The illustration below shows the tunnel, the Spacelab laboratory, and a pallet installed in the Shuttle's cargo bay.1

Illustration of the Spacelab-3 mission. From NASA.

Because Spacelab was a European project, it used a European computer, the Mitra 125 MS. The Mitra line started in 1971 when a French company called CII introduced the Mitra 15 minicomputer, a 16-bit computer that used magnetic core memory. Mitra is a French acronym2 that translates as "Mini-machine for Real-Time and Automatic Computing." As the name suggests, Mitra was both small and designed for real-time computing, making it suitable for controlling experiments. The Mitra 15 was a popular computer, with almost 8000 units sold.

In 1975, CII produced a successor called the Mitra 125. The Mitra 125 improved on the Mitra 15 by adding memory management, I/O processors, higher performance, and additional instructions. Spacelab used the Mitra 125 MS minicomputer,3 a militarized variant of the Mitra 125 that was produced by a company called CIMSA. A Spacelab mission had three of these computers: the Subsystem Computer controlled and managed Spacelab itself, while the Experiment Computer handled the experiments. A Backup Computer could take over if either computer failed.1 These computers were part of Spacelab's Command and Data Management Subsystem, which controlled experiments and collected data.4

The three computers were normally mounted in the Spacelab laboratory underneath the Work Bench Rack (details). The computers were controlled through a keyboard and a color CRT display, called the Data Display System (DDS). The computer installation and a DDS are visible in the photo below.

This photo shows astronauts inside Spacelab (but not in space). The Spacelab computers were mounted under the Work Bench (right arrow). The Data Display System (left arrow) provided the interface to the computers. Photo is STS-51B Crew Portrait, 1984.

For some Spacelab missions, the laboratory was omitted entirely, providing more room for experiment pallets. In this case, the computers were mounted in a small pressurized cylinder called the igloo. The researchers remained in the Shuttle, controlling experiments through two Data Display Systems that were mounted in the Shuttle's rear flight deck (photo).

The 74181 ALU chip

The Spacelab computer didn't use a microprocessor chip. Instead, like most minicomputers at the time, it was built from simple integrated circuits that were combined to implement the computer's circuitry. Unlike modern CMOS integrated circuits, these chips were built with a technology known as TTL (transistor-transistor logic), which used bipolar transistors that were fast, but large and power-hungry. Electronics hobbyists of a certain age will recall the popular 7400 series of TTL chips. The Spacelab computer was built from the military grade of these chips, the 5400 series.

The most complex chip in the computer was probably the '181 Arithmetic/Logic Unit (ALU) chip, containing about 170 transistors. The arithmetic/logic unit is the heart of a computer, performing arithmetic operations as well as Boolean logic operations. In 1970, Texas Instruments put a complete 4-bit arithmetic/logic unit on a single chip, called the 74181. Since the chip was fast, compact, and inexpensive, it was widely used, providing the ALU in computers from the popular PDP-11 and Xerox Alto to the powerful VAX-11/780 "superminicomputer".

The 74181 provides a full set of binary logical operations, including AND, OR, XOR, and complement. For arithmetic, it includes addition, subtraction, incrementing, and decrementing.5 Inconveniently, the 74181 doesn't support shifting right. Moreover, multiplication and division were much too complicated to be included in the 74181. Instead, a processor implemented multiplication and division through repeated addition or subtraction, combined with shifting. Likewise, floating-point operations were way beyond the capability of the 74181, but a processor could use the 74181 when performing the steps of a floating-point operation.
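As a rough illustration of multiplication through repeated addition combined with shifting, the Python sketch below uses the generic textbook shift-and-add algorithm. It is not the Mitra's microcode; the function name and the 16-bit default width are assumptions chosen for the example.

```python
def shift_add_multiply(a: int, b: int, bits: int = 16) -> int:
    """Multiply two unsigned `bits`-wide integers using only add and shift."""
    mask = (1 << bits) - 1
    a &= mask
    b &= mask
    product = 0
    for _ in range(bits):
        if b & 1:        # low bit of the multiplier is set: add the multiplicand
            product += a
        a <<= 1          # shift the multiplicand left for the next bit position
        b >>= 1          # shift the multiplier right to expose its next bit
    return product       # the result can be up to 2 * bits wide


print(hex(shift_add_multiply(0xFFFF, 0xFFFF)))  # 0xfffe0001
```

Note that the product of two 16-bit values needs up to 32 bits, and the loop performs one conditional addition and two shifts per multiplier bit.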

Although the 74181 only handled four bits, multiple 74181 chips could be combined to handle larger words, such as 16 bits or 32 bits. To handle carries, the chips could be chained together, with the carry-out from one chip fed into the carry-in of the next chip. This approach was simple but slow, since the carry had to "ripple" through all the chips before the answer could be obtained. The carry process could be sped up by using a carry-lookahead chip called the 74182, which speeds up addition by computing the carries from four 74181 chips (i.e., 16 bits) in parallel.
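The difference between ripple carry and carry lookahead can be sketched in Python. This is a simplified model of addition only: it ignores the real chip's function-select inputs and active-low signals, and the function names are hypothetical. The lookahead function computes the carries in a loop for readability, whereas a hardware lookahead circuit expands the same generate/propagate equations into parallel logic.

```python
from typing import List, Tuple


def alu4_add(a: int, b: int, carry_in: int) -> Tuple[int, int]:
    """Model one 4-bit slice doing addition: returns (4-bit sum, carry-out)."""
    total = (a & 0xF) + (b & 0xF) + (carry_in & 1)
    return total & 0xF, total >> 4


def ripple_add(a: int, b: int, slices: int = 4, carry_in: int = 0) -> Tuple[int, int]:
    """Chain 4-bit slices, feeding each carry-out into the next carry-in."""
    result = 0
    carry = carry_in
    for i in range(slices):                      # least-significant slice first
        nibble_a = (a >> (4 * i)) & 0xF
        nibble_b = (b >> (4 * i)) & 0xF
        nibble_sum, carry = alu4_add(nibble_a, nibble_b, carry)
        result |= nibble_sum << (4 * i)
    return result, carry


def lookahead_carries(a: int, b: int, slices: int = 4, carry_in: int = 0) -> List[int]:
    """Carry into each slice (plus the final carry-out) from generate/propagate."""
    carries = [carry_in]
    for i in range(slices):
        nibble_a = (a >> (4 * i)) & 0xF
        nibble_b = (b >> (4 * i)) & 0xF
        generate = nibble_a + nibble_b > 0xF     # slice creates a carry by itself
        propagate = nibble_a + nibble_b == 0xF   # slice passes an incoming carry on
        carries.append(int(generate or (propagate and carries[i] == 1)))
    return carries


print(ripple_add(0x0FFF, 0x0001))         # (4096, 0): the carry ripples through three slices
print(lookahead_carries(0x0FFF, 0x0001))  # [0, 1, 1, 1, 0]
```

The generate and propagate values depend only on each slice's own inputs, so they are all available at the same time; that is what lets a lookahead circuit produce the carries without waiting for the ripple.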

The Mitra's ALU/register boards

The Spacelab computer used eight '181 ALU chips to implement a 32-bit adder.6 (Specifically, these chips are the 54S181, a variant of the 74181: "54" indicates that the chips handle the military temperature range, and "S" indicates that the chip is built from high-speed Schottky logic.) However, the ALU boards required numerous additional chips. Depending on the instruction, eight different inputs could be selected for the ALU. Chips called multiplexers selected the desired value, requiring 32 multiplexer chips. Three 32-bit registers provided storage for ALU inputs and outputs, requiring 24 chips. Two 54S182 carry-lookahead chips provided fast carry computation. Finally, some simple logic chips (inverters and NAND gates) tied things together.
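The chip counts above follow from simple arithmetic, which can be checked with a few lines of Python. The bits-per-chip figures come from the parts on the board (4-bit ALU chips, dual 4-to-1 multiplexers, and 4-bit register chips); the variable names are just for this sketch.

```python
WORD_BITS = 32
ALU_BITS_PER_CHIP = 4   # one '181 handles four bits
MUX_BITS_PER_CHIP = 2   # a dual 4-to-1 multiplexer selects two bits per chip
REG_BITS_PER_CHIP = 4   # 4-bit shift register or quad flip-flop
ALU_OPERANDS = 2        # the A and B inputs, each with four selectable sources
REGISTERS = 3

alu_chips = WORD_BITS // ALU_BITS_PER_CHIP
mux_chips = ALU_OPERANDS * WORD_BITS // MUX_BITS_PER_CHIP
reg_chips = REGISTERS * WORD_BITS // REG_BITS_PER_CHIP
lookahead_chips = alu_chips // 4   # one '182 serves a group of four '181 chips

print(alu_chips, mux_chips, reg_chips, lookahead_chips)  # 8 32 24 2
```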

Due to the number of chips required, the ALU/register circuitry was spread across three boards, as shown below. (I reverse-engineered the board on the right.7) The '181 chips are immediately visible as they are much larger than the other chips; they have 24 pins, compared to 14 or 16 pins for the other chips. The first board has two '181 chips, while the last two boards each have three '181 chips. The last two boards are similar, but not identical.

The three ALU/register boards from the Spacelab computer. Click this image (or any other) for a larger version.

Finding a 32-bit ALU was a surprise to me, since the computer is a 16-bit computer. The expanded ALU was probably implemented to improve performance. Multiplying two 16-bit numbers yields a 32-bit result, so a 32-bit ALU makes multiplication faster. Moreover, the computer supports 32-bit floating-point numbers, so the 32-bit ALU presumably makes floating-point operations faster.

The diagram below shows the architecture of the computer's 32-bit ALU system. In the middle is the ALU itself, operating on two 32-bit operands: A and B. At the left, multiplexers ("mux") select one of four values for A and one of four values for B. At the right, the output of the ALU can be stored in three 32-bit registers, or sent to the rest of the computer via the bus. The first two registers are shift registers, allowing the value to be shifted left or right, while the third register simply holds the value in flip-flops. The first two registers are connected by buses to the rest of the computer, while the value of the third register can only be accessed by using it for another arithmetic operation.8 I suspect that the shift registers are used for multiplication and division to shift the arguments at each step.

Block diagram of the ALU/register board.

The inputs to the multiplexers provide flexibility. For instance, you can add register 1 to a number from the bus, or add register 2 shifted to the right to register 3. (Note that this shifting is implemented by wiring the inputs to the multiplexer shifted left or right, completely separate from the shift register's shifting.) The "all 1's" input is either a zero input (with negative logic), or -1 (in two's-complement). The B input can be taken from the bus, allowing the value to come from memory or from a general-purpose register. The mix input is a jumble of signal lines, register bits, a shift register input, and a pull-up with no apparent pattern. I describe a few more mysteries in the footnote;9 presumably, the mysteries would be resolved if I reverse-engineered the whole computer.
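A minimal Python sketch of this arrangement is below. The assignment of sources to multiplexer inputs is hypothetical (the real ordering isn't given in the text), the mix input is omitted, and the ALU is assumed to be set to "add". The point is that a "shifted" input costs nothing beyond wiring each register bit to a neighboring multiplexer position, and that adding the all-1's value acts as a decrement in two's complement.

```python
MASK32 = 0xFFFFFFFF


def mux4(select: int, inputs: tuple) -> int:
    """4-to-1 multiplexer: pass through the input chosen by the 2-bit select."""
    return inputs[select & 0b11] & MASK32


def alu_step(reg1: int, reg2: int, reg3: int, bus: int,
             a_select: int, b_select: int) -> int:
    """One pass through the operand multiplexers and a 32-bit add."""
    # Hypothetical input assignments, for illustration only.
    a_inputs = (reg1, reg2 >> 1, reg2 << 1, MASK32)  # shifts are just wiring offsets
    b_inputs = (bus, reg3, 0, MASK32)                # MASK32 is the "all 1's" input
    a = mux4(a_select, a_inputs)
    b = mux4(b_select, b_inputs)
    return (a + b) & MASK32


print(alu_step(5, 8, 3, 100, a_select=0, b_select=0))  # 105: register 1 plus the bus
print(alu_step(5, 8, 3, 100, a_select=1, b_select=1))  # 7: register 2 shifted right, plus register 3
print(alu_step(5, 8, 3, 100, a_select=0, b_select=3))  # 4: adding all 1's (-1) decrements
```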

The functions of the multiplexers, ALU chips, and registers depend on what instruction is being executed. Specifically, the computer's microcode engine generates control signals for the computer, including the ALU/register boards. Some of these control signals select which multiplexer inputs are used. Other control signals select the ALU's function. Finally, control signals select which register receives the ALU's output.

The board that I reverse engineered implements 12 of the 32 bits of the ALU and registers. The image below shows the role of each chip on the board. The three 4-bit ALU chips are indicated 2, 1, and 0. Each ALU chip has two multiplexer chips to select the four A input bits and two multiplexer chips to select the four B input bits.10 Thus, there are 12 multiplexer chips on the board. The three 12-bit registers A, B, and C are each implemented with three 4-bit chips. Three hex inverter chips and a 4-input NAND chip complete the board.11

The ALU/register board with the chips labeled.

These printed-circuit boards (PCBs) have some interesting features. In most electronics, circuit boards have holes only where they are needed, but the Spacelab boards have holes in a fixed grid pattern. (IBM used similar boards in its System/360 computers in the 1960s.12) A hole can hold an IC pin or other component. Or a hole can be used as a via, connecting PCB traces on different layers. Another interesting feature of the boards is the vertical metal bars underneath the integrated circuits. These bars carry heat away from the integrated circuits.

The PCB traces are more visible on the back of the board (below). The traces are thin enough that two traces can pass between a pair of holes. Note the yellow "bodge" wires, correcting errors on the circuit board. I assume that these errors were fixed for the computers used in flight.

Back of an ALU/register board. This is a different board from the one I reverse engineered, since I wanted to show the yellow wires.

Each board has a 96-pin connector at the bottom, which plugs into the computer's motherboard. Note the three cylindrical pins sticking out of the connector. These pins are keyed to ensure that a board can only be plugged into the correct slot. That is, each pin has a metal tab oriented in one of six directions. On the motherboard, the connectors have corresponding notches. If the tabs and the notches don't match up, the board can't be plugged in.

A close-up of the connector, showing the keying. Also note that the zig-zag pin numbering on the left changes to an irregular numbering on the right. Unexpectedly, pin 52 is between pins 49 and 51, for example.

The boards in the Spacelab computer are dense, tightly packing integrated circuits to minimize the size of the computer. However, the boards are considerably less dense than American aerospace computers. In particular, the Spacelab computer used the same integrated circuit packages that were used in consumer electronics: through-hole DIPs (dual in-line packages with two rows of pins). In contrast, IBM's line of 4 Pi aerospace computers used "flat-pack" integrated circuits that were considerably smaller and thinner (details). As a result, IBM's double-sided circuit boards could hold 156 integrated circuits compared to 30 on a single-sided Mitra board of roughly the same size.

A brief history of the French computer industry leading up to this computer

Bull is one of France's earliest computing companies, created in 1931. Bull initially sold punch-card equipment, competing with IBM. By the 1960s, Bull was a major computer company with products such as the transistorized Gamma 60 computer, a large-scale mainframe that was said to be the first system specifically designed for parallel and multiprogramming. Unfortunately, Bull had difficulty competing with IBM, its stock collapsed, and Bull was acquired by General Electric in 1964, forming Bull-GE. The collapse and controversial takeover were a blow to the French computer industry, and the incident was dubbed the Affaire Bull. To make things worse, GE soon canceled two of Bull's computers, focusing instead on GE's computer line.

The Affaire Bull was not only an affront to French pride, but an indication that France was largely dependent on the US for computer technology. A second incident revealed the critical military consequences of France's weakness. In the early 1960s, France was attempting to improve its nuclear strength by developing a hydrogen bomb. The mathematics of fusion is computationally intense, so France attempted to buy powerful American computers: the CDC 6600 supercomputer and the IBM 360/92.13 However, the US government blocked the export of these computers to France in an attempt to limit nuclear proliferation.

These problems led French president Charles de Gaulle to decide that France needed a strong computer industry of its own. In 1966, he developed a plan for computing (Plan Calcul)14, where the French government would reorganize the computer industry, picking companies to lead in each sector from minicomputers to semiconductors.

In the minicomputer sector, the government created a company called CII by combining three French computer companies: SEA, CAE, and SETI. CII was primarily owned by a large French company called Thomson-CSF (now Thales).15 CII played a key role in the Spacelab computer, since CII developed the Mitra line of computers. In the mid-1970s, CII and the American company Honeywell merged, with the computer division spun off to form a new company called SEMS, with majority shareholder Thomson. Another Thomson subsidiary, CIMSA, focused on military electronics and produced the militarized versions of the Mitra line. In particular, CIMSA produced the computer for Spacelab.16

France's Plan Calcul is generally viewed as a failure. Despite expensive subsidies, the French computer industry remained weak and unable to escape American dominance. When Giscard d'Estaing was elected president of France in 1974, he ended Plan Calcul. There are various interpretations, such as the failure of government planning versus the free market, but my view is that in the 1960s and 1970s, IBM crushed most challengers in the computer industry, both American and foreign, so Plan Calcul didn't have a chance. As for Bull, the company went through a dizzying sequence of American takeovers and nationalizations by France.17 Just two months ago (March 2026), the company was reacquired by the French government.

Replacement by the IBM AP-101SL computer

Since Spacelab was a European project, using a European computer was a point of pride. Unfortunately, the French computers were eventually replaced by IBM computers due to performance needs and undoubtedly political factors.

During the Space Shuttle program, the computers on the Shuttle and in Spacelab became obsolete as computer technology rapidly advanced. Although the computers were originally considered powerful, their performance and memory capacity became problems over time. The Space Shuttle's IBM AP-101 computers were upgraded to IBM AP-101S computers, first flying in 1991. The AP-101S was half the size, three times faster, and had more than twice the memory, using semiconductor memory instead of magnetic core memory.

The Spacelab computer system needed a similar upgrade, and in 1991, the CIMSA computers on Spacelab were replaced with IBM AP-101SL computers. The AP-101SL was based on the Shuttle's upgraded AP-101S computer, but modified to support the Mitra's hardware architecture, instruction set, and I/O capabilities. The packaging of IBM's computer was slightly changed to match the dimensions of the CIMSA computer and to use an external heat exchanger rather than an internal heat exchanger.

The IBM AP-101SL Spacelab computer. The circuit boards are much larger than the original Spacelab computer boards or the original AP-101B boards. Note the flat-pack ICs on the boards. Photo courtesy of Kyle Owen.

Changing the Shuttle's 32-bit AP-101S computer to run the 16-bit Mitra instruction set was easier than you might expect, since the AP-101S already supported multiple instruction sets: a 32-bit instruction set derived from the IBM System/360 and a 16-bit instruction set called 1750A that was an Air Force Standard. Because the AP-101S implemented its instructions in microcode—low-level software that specified the steps of a machine instruction—the instruction set could be modified by updating the microcode.

I compared the circuit boards in an AP-101S with the boards in an AP-101SL to quantify the changes. The semiconductor memory boards and power supplies were essentially identical. The CPU boards had minor changes. Unsurprisingly, the I/O boards were completely different, and the complex I/O Processor (IOP) in the Shuttle's AP-101S was omitted. For more on the IBM AP-101 line, see my History of IBM's 4 Pi computers.

Conclusions

The Spacelab computer provides an interesting look at how computers were built before microprocessors took over. The components of a computer, such as the ALU, registers, and control circuitry, were constructed from simple chips. Since each chip didn't do much, the computer required 36 boards full of chips. Even so, the computer was compact enough to go into space. By modern standards, these computers aren't much—each computer had a memory capacity of just 128 KB of magnetic core memory—but they played a critical part in the space program.

I'm not going to reverse-engineer the full computer, but I may write some more about it. For updates, follow me on Bluesky (@righto.com), Mastodon (@[email protected]), or RSS.

Credits: Thanks to Steve Jurvetson for providing the Spacelab computer for examination. Thanks to Poul-Henning Kamp for comments.

AI statement: Despite the presence of the em dash, no AI was used in the writing of this article (details).

Notes and references

For details on Spacelab, see Spacelab News Reference. ↩↩
To avoid cluttering the main article, I'll summarize the French acronyms and companies in this footnote.
- CAE: Compagnie européenne d'automatisme électronique (European Electronic Automation Company). A French computer company founded in 1960, selling versions of American computers such as TRW's RW-300. Part of the 1966 merger that formed CII.
- CII: Compagnie internationale d'informatique (International Computer Company): the company that created the Mitra line of minicomputers. CII also sold computers designed by the American company SDS (Scientific Data Systems), which was bought by Xerox in 1969 and became XDS (Xerox Data Systems). XDS was shut down in 1975, costing Xerox hundreds of millions of dollars.
- CIMSA: Compagnie d'informatique militaire, spatiale et aéronautique (Military, Space, and Aeronautical Computing Company): the company that manufactured the Spacelab computer.
- CSF: Compagnie Générale de Télégraphie Sans Fil (General Wireless Telegraphy Company). A radio company dating back to 1918. It merged with Thomson in 1968 to form Thomson-CSF.
- MATRA: Mécanique Aviation Traction (Mechanics-Aviation-Traction). An electronics company that was the contractor for Spacelab's data systems.
- Mitra: Mini-machine pour l'Informatique Temps Réel et Automatique ("Mini-machine for Real-Time and Automatic Computing"). A line of minicomputers.
- SEA: Société d'électronique et d'automatisme (Electronics and Automation Company): a French computer manufacturer, started in 1947 and merged into CII in 1966.
- SEMS: Société Européenne de Mini-informatique et de Systèmes (European Society for Minicomputers and Systems). A subsidiary of Thomson, created by the French government in 1976 during the merger of CII and Honeywell. SEMS took over the manufacturing of Mitra computers from CII.
- SETI: Société européenne de traitement de l'information (European Information Processing Society). SETI was a French computer company formed in 1961. The American computer company Packard Bell owned a quarter of SETI, and SETI sold the desk-sized Packard Bell 250 computer.
↩
On the ground, the Spacelab project used Mitra 125 S computers that were functionally identical to the Mitra 125 MS (details). ↩
Spacelab's Command and Data Management Subsystem (CDMS) is surprisingly complicated because of the data communication paths between Spacelab, the Shuttle, and the ground. Moreover, multiple units store, encode, and decode data. In the CDMS block diagram below, I've highlighted the three computers; they are just a small part of the CDMS. See Section 3.5 of Spacelab News Reference or The Command and Data Management System of Spacelab for details on CDMS.

A block diagram of Spacelab's Command and Data Management Subsystem. From The Command and Data Management System of Spacelab. Click for a larger version.

↩
I reverse-engineered the 74181 ALU chip in this article and explained the motivation for its quirky set of operations in this article. ↩
Another board in the Spacelab computer has four 74S181 chips implementing a 16-bit ALU. My guess is that this board is part of the I/O processor. The board has the cryptic label "HMSG". ↩
My reverse-engineering process was straightforward but tedious. I used a multimeter to beep out the connections between the integrated circuits as well as the connections to the connector. (Unlike many systems that I look at, these boards didn't have conformal coating, which made beeping out the connections practical.) I created a schematic in KiCad from this data; this schematic was "physical", with the layout of the chips and pins matching their physical location on the board. Next, I converted the integrated circuit symbols from physical rectangles to logical symbols. Finally, I moved the symbols around on the schematic to make a reasonable schematic. (I had to go back and beep out more connections as I discovered errors or missing connections.) Theoretically, I could reverse-engineer the entire computer, but reverse-engineering one of the 36 boards is enough for me. ↩
My full reverse-engineered schematic of the ALU/register board is below. Click for a larger version.

Schematic of the ALU/register board.

↩
A few mysteries remain in the ALU/register board. The three registers probably act as an accumulator, a temporary register, and an extra register for multiplication/division, but it's not clear which register is which. I don't understand why the inputs are organized as they are; for instance, you can't add register 1 to register 2 shifted. The mix input seems very random; maybe these signals are part of a self test? On the board, I expected to see 12 bits out of a uniform 32-bit ALU. However, the top two 4-bit "nibbles" have different control lines and different zero-detection from the third. Perhaps this is because the Mitra floating-point numbers have 24 bits of mantissa and 8 bits of exponent. It would make sense for the ALU/register board to handle these parts separately. Another mystery is that the board has a circuit to test two hardwired bits and two external bits to see if they are all 0 or all 1, for some reason. ↩
The multiplexer chips are dual 4-to-1 multiplexers. Thus, two multiplexer chips are required to support four bits. ↩
The chips in the Spacelab computer use a variety of part number systems. A few chips have standard industry part numbers such as "SNJ5483" (equivalent to a 7483 adder). Most of the chips are labeled with military part numbers such as JM38510/07801 BJB, using the MIL-M-38510 standard. These part numbers can be cross-referenced using the MIL-HDBK-983 handbook. Other chips, like the ones below, have Fairchild part numbers that are a mystery to me. The first line is presumably the part number, "929 567" and "929 705", but I can't find these numbers anywhere. If you know what these numbers mean, please let me know! (07263 is the CAGE code for Fairchild, and the last line is the date code.)

Two Fairchild ICs with mysterious part numbers.

The ALU/register board that I examined uses the following JM38510 part numbers, which I have converted to standard parts:
/01403 = 54153 dual 4-1 multiplexer
/07003 = 54S04 hex inverter
/07006 = 54S20 4-input NAND
/07601 = 54S194 4-bit shift register
/07801 = 54S181 4-bit ALU
/30107 = 54LS175 quad flip-flop ↩
The photo below compares an IBM board (top) with a Spacelab board (bottom), both from the early 1980s. It's interesting how similar the boards are. Both use a 0.1" grid of holes, unlike most printed-circuit boards, which only use holes where needed. Both boards are multi-layer with integrated circuits on one side. The IBM board is denser; the chips are spaced 0.1" apart rather than 0.3" apart.

An IBM computer board (top) and a board from the Spacelab computer (bottom).

I don't know which IBM system used this board, but it was a commercial system, not an aerospace system. This board is a bit unusual for IBM, since most of the chips are standard DIPs rather than the square metal cans that IBM typically used. ↩
The US blocked computer exports to France with NSAM 294, a 1964 National Security Action Memorandum. The US later allowed sales of the CDC 6600 and IBM 360/91 computers to France on the condition that France not use the computers for atomic weapon development, a condition that France apparently violated. See A.E.C. Bids Industry Avoid Sales Aiding French Tests (1964) and Paris Promises Not to Use Equipment for Atomic Weapons (1966). The CDC 6600 supercomputer executed up to 10 million instructions per second (MIPS) while the IBM 360/91 executed about 17 MIPS. (In comparison, a 1995 Pentium Pro or a 2012 cell phone is faster than these computers.) In 1971, Henry Kissinger was still blocking computer exports to France, as shown in this transcript. (One confusing issue in these articles is that IBM announced the 360/92 computer in 1964, but renamed it as the 360/91 before it shipped in 1967.) ↩
Some contemporary articles on Plan Calcul are France Entering Computer Battle: Starts All-French Company to Compete (New York Times, 1967) and France: First the Bomb, Then the "Plan Calcul" (Science, 1967). See History of Computing in France: A Brief Sketch for an overview of the French computer industry. ↩
Thomson has a complicated history. In 1883, two Americans, Thomson and Houston, started the Thomson-Houston Electric Company. A decade later, this company became General Electric, with a French subsidiary: Thomson Houston International. After various mergers, the French subsidiary became Thomson-CSF, a major defense and electronics firm.

In a sense, Thomson-Houston both created and destroyed GE. The Thomson-Houston Electric Company became GE, but the French subsidiary of Thomson-Houston ended up being a key part of GE's collapse almost a century later. Specifically, the French rail transport company Alsthom (later Alstom) was formed from the French heavy engineering subsidiary of Thomson-Houston in 1928; the "thom" in "Alsthom" comes from "Thomson". In 2014, General Electric acquired Alstom for $10.1 billion. The acquisition was a disaster, and in 2018, GE wrote off $23 billion. This loss, along with other financial problems, led to GE's announcement in 2021 that it would break up into three companies. ↩
One more company should be mentioned: MATRA. MATRA was the contractor for Spacelab's data systems, so the Spacelab computer was produced under a contract from MATRA. People often confuse Mitra (the name of the computer line) with MATRA. ↩
Due to financial difficulties, Bull was acquired by General Electric in 1964, then was acquired by Honeywell, nationalized by France, partnered with NEC, acquired Zenith, privatized by France, and acquired by Atos. Less than two months ago, France acquired Bull, continuing the series of reorganizations. ↩

Instruction decoding in the Intel 8087 floating-point chip

In the 1980s, if you wanted your IBM PC to run faster, you could buy the Intel 8087 floating-point coprocessor chip. With this chip, CAD software, spreadsheets, flight simulators, and other programs were much speedier. The 8087 chip could add, subtract, multiply, and divide, of course, but it could also compute transcendental functions such as tangent and logarithms, as well as provide constants such as π. In total, the 8087 added 62 new instructions to the computer.

But how does a PC decide if an instruction is a floating-point instruction for the 8087 or a regular instruction for the 8086 or 8088 CPU? And how does the 8087 chip interpret instructions to determine what they mean? It turns out that decoding an instruction inside the 8087 is more complicated than you might expect. The 8087 uses multiple techniques, with decoding circuitry spread across the chip. In this blog post, I'll explain how these decoding circuits work.

To reverse-engineer the 8087, I chiseled open the ceramic package of an 8087 chip and took numerous photos of the silicon die with a microscope. The complex patterns on the die are formed by its metal wiring, as well as the polysilicon and silicon underneath. The bottom half of the chip is the "datapath", the circuitry that performs calculations on 80-bit floating point values. At the left of the datapath, a constant ROM holds important constants such as π. At the right are the eight registers that the programmer uses to hold floating-point values; in an unusual design decision, these registers are arranged as a stack. Floating-point numbers cover a huge range by representing numbers with a fractional part and an exponent; the 8087 has separate circuitry to process the fractional part and the exponent.

Die of the Intel 8087 floating point unit chip, with main functional blocks labeled. The die is 5 mm×6 mm. Click this image (or any others) for a larger image.

The chip's instructions are defined by the large microcode ROM in the middle.1 To execute an instruction, the 8087 decodes the instruction and the microcode engine starts executing the appropriate micro-instructions from the microcode ROM. In the upper right part of the chip, the Bus Interface Unit (BIU) communicates with the main processor and memory over the computer's bus. For the most part, the BIU and the rest of the chip operate independently, but as we will see, the BIU plays important roles in instruction decoding and execution.

Cooperation with the main 8086/8088 processor

The 8087 chip acted as a coprocessor with the main 8086 (or 8088) processor. When a floating-point instruction was encountered, the 8086 would let the 8087 floating-point chip carry out the floating-point instruction. But how do the 8086 and the 8087 determine which chip executes a particular instruction? You might expect the 8086 to tell the 8087 when it should execute an instruction, but this cooperation turns out to be more complicated.

The 8086 has eight opcodes that are assigned to the coprocessor, called ESCAPE opcodes. The 8087 determines what instruction the 8086 is executing by watching the bus, a task performed by the BIU (Bus Interface Unit).2 If the instruction is an ESCAPE, the instruction is intended for the 8087. However, there's a problem. The 8087 doesn't have any access to the 8086's registers (and vice versa), so the only way that they can exchange data is through memory. But the 8086 addresses memory through a complicated scheme involving offset registers and segment registers. How can the 8087 determine what memory address to use when it doesn't have access to the registers?

The trick is that when an ESCAPE instruction is encountered, the 8086 processor starts executing the instruction, even though it is intended for the 8087. The 8086 computes the memory address that the instruction references and reads that memory address, but ignores the result. Meanwhile, the 8087 watches the memory bus to see what address is accessed and stores this address internally in a BIU register. When the 8087 starts executing the instruction, it uses the address from the 8086 to read and write memory. In effect, the 8087 offloads address computation to the 8086 processor.
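To make the division of labor concrete, here is a minimal Python sketch of the idea. The 20-bit address formula (segment shifted left four bits plus offset) is the standard 8086 scheme; the class name, method names, and register values are hypothetical and only illustrate the flow, not the actual circuitry.

```python
def physical_address(segment: int, offset: int) -> int:
    """Standard 8086 address: 16-bit segment * 16 + 16-bit offset, wrapped to 20 bits."""
    return ((segment << 4) + offset) & 0xFFFFF


class SnoopingCoprocessor:
    """Hypothetical stand-in for the 8087's BIU: it only watches the bus."""

    def __init__(self) -> None:
        self.latched_address = None

    def observe_bus(self, address: int) -> None:
        # The coprocessor never sees DS, BX, or SI; it just records the address.
        self.latched_address = address


# Made-up register values for illustration.
ds, bx, si = 0x1234, 0x0100, 0x0020

fpu = SnoopingCoprocessor()
# The 8086 computes the operand address (here BX+SI in segment DS) and does
# a read whose result it ignores. The address appears on the bus anyway.
address = physical_address(ds, (bx + si) & 0xFFFF)
fpu.observe_bus(address)
```

After this, `fpu.latched_address` holds `0x12460`, the address the coprocessor would use for its own memory accesses.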

The structure of 8087 instructions

To understand the 8087's instructions, we need to take a closer look at the structure of 8086 instructions. In particular, something called the ModR/M byte is important since all 8087 instructions use it.

The 8086 uses a complex system of opcodes with a mixture of single-byte opcodes, prefix bytes, and longer instructions. About a quarter of the opcodes use a second byte, called ModR/M, that specifies the registers and/or memory address to use through a complicated encoding. For instance, the memory address can be computed by adding the BX and SI registers, or from the BP register plus a two-byte offset. The first two bits of the ModR/M byte are the "MOD" bits. For a memory access, the MOD bits indicate how many address displacement bytes follow the ModR/M byte (0, 1, or 2), while the "R/M" bits specify how the address is computed. A MOD value of 3, however, indicates that the instruction operates on registers and does not access memory.

Structure of an 8087 instruction

The diagram above shows how an 8087 instruction consists of an ESCAPE opcode, followed by a ModR/M byte. An ESCAPE opcode is indicated by the special bit pattern 11011, leaving three bits (green) available in the first byte to specify the type of 8087 instruction. As mentioned above, the ModR/M byte has two forms. The first form performs a memory access; it has MOD bits of 00, 01, or 10 and the R/M bits specify how the memory address is computed. This leaves three bits (green) to specify the instruction. The second form operates internally, without a memory access; it has MOD bits of 11. Since the R/M bits aren't used in the second form, six bits (green) are available in the ModR/M byte to specify the instruction.
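The field layout described above can be sketched in a few lines of Python. The names `EscapeFields` and `decode_escape` are made up for this illustration. The one detail not mentioned in the text is the standard 8086 special case where MOD=00 with R/M=110 means a direct address with a two-byte displacement.

```python
from typing import NamedTuple, Optional


class EscapeFields(NamedTuple):
    y: int                   # low 3 bits of the first byte
    mod: int                 # top 2 bits of the ModR/M byte
    x: int                   # middle 3 bits of the ModR/M byte
    rm: int                  # low 3 bits: addressing mode or stack register
    is_memory: bool          # True for MOD 0, 1, 2; False for MOD 3
    displacement_bytes: int  # address bytes following the ModR/M byte


def decode_escape(byte1: int, byte2: int) -> Optional[EscapeFields]:
    """Split an ESCAPE instruction into its fields; None if not an ESCAPE."""
    if byte1 >> 3 != 0b11011:
        return None
    mod = byte2 >> 6
    rm = byte2 & 0b111
    if mod == 0:
        # Standard 8086 special case: MOD=00, R/M=110 is a direct 16-bit address.
        displacement = 2 if rm == 0b110 else 0
    elif mod == 3:
        displacement = 0  # register form, no memory access
    else:
        displacement = mod  # MOD=01 -> 1 byte, MOD=10 -> 2 bytes
    return EscapeFields(byte1 & 0b111, mod, (byte2 >> 3) & 0b111, rm,
                        mod != 3, displacement)


fld1 = decode_escape(0xD9, 0xE8)      # FLD1, a register-form instruction
fadd_mem = decode_escape(0xDC, 0x06)  # FADD with a direct memory address
```

Here `fld1` has MOD=3 (no memory access), while `fadd_mem` has MOD=0 with a two-byte displacement.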

The challenge for the designers of the 8087 was to fit all the instructions into the available bits in such a way that decoding is straightforward. The diagram below shows a few 8087 instructions, illustrating how they achieve this. The first three instructions operate internally, so they have MOD bits of 11; the green bits specify the particular instruction. Addition is more complicated because it can act on memory (first format) or registers (second format), depending on the MOD bits. The four bits highlighted in bright green (0000) are the same for all ADD instructions; the subtract, multiplication, and division instructions use the same structure but have different values for the dark green bits. For instance, 0001 indicates multiplication and 0100 indicates subtraction. The other green bits (MF, d, and P) select variants of the addition instruction, changing the data format, direction, and popping the stack at the end. The last three bits select the R/M addressing mode for a memory operation, or the stack register ST(i) for a register operation.

The bit patterns for some 8087 instructions. Based on the datasheet.

Selecting a microcode routine

Most of the 8087's instructions are implemented in microcode, which carries out each step of an instruction with low-level "micro-instructions". The 8087 chip contains a microcode engine; you can think of it as the mini-CPU that controls the 8087 by executing a microcode routine, one micro-instruction at a time. The microcode engine provides an 11-bit micro-address to the ROM, specifying the micro-instruction to execute. Normally, the microcode engine steps through the microcode sequentially, but it also supports conditional jumps and subroutine calls.

But how does the microcode engine know where to start executing the microcode for a particular machine instruction? Conceptually, you could feed the instruction opcode into a ROM that would provide the starting micro-address. However, this would be impractical since you'd need a 2048-word ROM to decode an 11-bit opcode.3 (While a 2K ROM is small nowadays, it was large at the time; the 8087's microcode ROM was a tight fit at just 1648 words.) Instead, the 8087 uses a more efficient (but complicated) instruction decode system constructed from a combination of logic gates and PLAs (Programmable Logic Arrays). This system holds 22 microcode entry points, much more practical than 2048.

Processors often use a circuit called a PLA (Programmable Logic Array) as part of instruction decoding. The idea of a PLA is to provide a dense and flexible way of implementing arbitrary logic functions. Any Boolean logic function can be expressed as a "sum-of-products", a collection of AND terms (products) that are OR'd together (summed). A PLA has a block of circuitry called the AND plane that generates the desired product terms. The outputs of the AND plane are fed into a second block, the OR plane, which ORs the terms together. Physically, a PLA is implemented as a grid, where each spot in the grid can either have a transistor or not. By changing the transistor pattern, the PLA implements the desired function.

A simplified diagram of a PLA.

A PLA can implement arbitrary logic, but in the 8087, PLAs often act as optimized ROMs.4 The AND plane matches bit patterns,5 selecting an entry from the OR plane, which holds the output values, the micro-address for each routine. The advantage of the PLA over a standard ROM is that one output column can be used for many different inputs, reducing the size.
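The "PLA as optimized ROM" idea can be modeled roughly in Python. In this sketch, each AND-plane term is a bit pattern with don't-care positions (`x`), and the matching term selects an output value. The three entry-point addresses come from the table in the notes. The bit patterns are my own assumption, derived from the documented instruction encodings rather than read off the die, and the function names are hypothetical.

```python
def matches(pattern: str, bits: str) -> bool:
    """True if every pattern position is a don't-care or equals the input bit."""
    return all(p in ("x", b) for p, b in zip(pattern, bits))


# (AND-plane pattern over the 11 opcode bits, OR-plane output)
# Patterns are assumptions based on the published encodings.
ENTRY_POINTS = [
    ("00111001xxx", "#0200"),  # FXCH ST(i): D9 C8+i, any stack register
    ("00111111010", "#1527"),  # FSQRT: D9 FA
    ("00111101110", "#1020"),  # FLDZ: D9 EE
]


def opcode_bits(byte1: int, byte2: int) -> str:
    """The 11 relevant bits: low 3 bits of byte 1, then all of byte 2."""
    return format(byte1 & 0b111, "03b") + format(byte2, "08b")


def entry_point(byte1: int, byte2: int):
    """Return the microcode entry point of the first matching term, else None."""
    bits = opcode_bits(byte1, byte2)
    for pattern, address in ENTRY_POINTS:
        if matches(pattern, bits):
            return address
    return None
```

A single pattern with three don't-care bits covers all eight FXCH opcodes, which is the saving over a full ROM with one row per opcode. A real PLA would OR together the outputs of every matching row; this sketch assumes at most one row matches.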

The image below shows part of the instruction decoding PLA.6 The horizontal input lines are polysilicon wires on top of the silicon. The pinkish regions are doped silicon. When polysilicon crosses doped silicon, it creates a transistor (green). Where there is a gap in the doped silicon, there is no transistor (red). (The output wires run vertically, but are not visible here; I dissolved the metal layer to show the silicon underneath.) If a polysilicon line is energized, it turns on all the transistors in its row, pulling the associated output columns to ground. (If no transistors are turned on, the pull-up transistor pulls the output high.) Thus, the pattern of doped silicon regions creates a grid of transistors in the PLA that implements the desired logic function.7

Part of the PLA for instruction decoding.

The standard way to decode instructions with a PLA is to take the instruction bits (and their complements) as inputs. The PLA can then pattern-match against bit patterns in the instruction. However, the 8087 also uses some pre-processing to reduce the size of the PLA. For instance, the MOD bits are processed to generate a signal if the bits are 0, 1, or 2 (i.e. a memory operation) and a second signal if the bits are 3 (i.e. a register operation). This allows the 0, 1, and 2 cases to be handled by a single PLA pattern. Another signal indicates that the top bits are 001 111xxxxx; this indicates that the R/M field takes part in instruction selection.8 Sometimes a PLA output is fed back in as an input, so a decoded group of instructions can be excluded from another group. These techniques all reduce the size of the PLA at the cost of some additional logic gates.

The result of the instruction decoding PLA's AND plane is 22 signals, where each signal corresponds to an instruction or group of instructions with a shared microcode entry point. The lower part of the instruction decoding PLA acts as a ROM that holds the 22 microcode entry points and provides the selected one.9

Instruction decoding inside the microcode

Many 8087 instructions share the same microcode routines. For instance, the addition, subtraction, multiplication, division, reverse subtraction, and reverse division instructions all go to the same microcode routine. This reduces the size of the microcode since these instructions share the microcode that sets up the instruction and handles the result. However, the microcode obviously needs to diverge at some point to perform the specific operation. Moreover, some arithmetic opcodes access the top of the stack, some access an arbitrary location in the stack, some access memory, and some reverse the operands, requiring different microcode actions. How does the microcode do different things for different opcodes while sharing code?

The trick is that the 8087's microcode engine supports conditional subroutine calls, returns, and jumps, based on 49 different conditions (details). In particular, fifteen conditions examine the instruction. Some conditions test specific bit patterns, such as branching if the lowest bit is set, or more complex patterns such as an opcode matching 0xx 11xxxxxx. Other conditions detect specific instructions such as FMUL. The result is that the microcode can take different paths for different instructions. For instance, a reverse subtraction or reverse division is implemented in the microcode by testing the instruction and reversing the arguments if necessary, while sharing the rest of the code.

The microcode also has a special jump target that performs a three-way jump depending on the current machine instruction that is being executed. The microcode engine has a jump ROM that holds 22 entry points for jumps or subroutine calls.10 However, a jump to target 0 uses special circuitry so it will instead jump to target 1 for a multiplication instruction, target 2 for an addition/subtraction, or target 3 for division. This special jump is implemented by gates in the upper right corner of the jump decoder.

The jump decoder and ROM. Note that the rows are not in numerical order; presumably, this made the layout slightly more compact.
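The three-way jump can be summarized with a small Python sketch. The addresses are the first few entries of the jump table listed in the notes, and the instruction grouping follows that table. The function and dictionary names are hypothetical. What the hardware does for target 0 with any other instruction isn't stated, so the sketch simply raises an error in that case.

```python
# First few entries of the jump table, from the notes.
JUMP_ROM = {1: "#0359", 2: "#0232", 3: "#0410", 4: "#0083"}

# Target 0 is redirected according to the current machine instruction.
THREE_WAY = {
    "FMUL": 1,
    "FADD": 2, "FSUB": 2, "FSUBR": 2,
    "FDIV": 3, "FDIVR": 3,
}


def resolve_jump(target: int, instruction: str) -> str:
    """Return the micro-address for a jump target, handling the special target 0."""
    if target == 0:
        # KeyError for instructions outside the arithmetic group (behavior unknown).
        target = THREE_WAY[instruction]
    return JUMP_ROM[target]
```

With this, the shared arithmetic routine can use one micro-instruction, a jump to target 0, and end up in the multiplication, addition/subtraction, or division code depending on the opcode.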

Hardwired instruction handling

Some of the 8087's instructions are implemented directly by hardware in the Bus Interface Unit (BIU), rather than using microcode. For example, instructions to enable or disable interrupts, or to save or restore state are implemented in hardware. The decoding for these instructions is performed by separate circuitry from the instruction decoder described above.

In the first step, a small PLA decodes the top 5 bits of the instruction. Most importantly, if these bits are 11011, it indicates an ESCAPE instruction, the start of an 8087 operation. This causes the 8087 to start interpreting the instruction and stores the opcode in a BIU register for use by the instruction decoder. A second small PLA takes the outputs from the top-5 PLA and combines them with the lower three bits. It decodes specific instruction values: D9, DB, DD, E0, E1, E2, or E3. The first three values correspond to specific ESCAPE instructions, and are recorded in latches.

The two PLAs decode the second byte in the same way. Logic gates combine the PLA outputs from the second byte with the latched values from the first byte, detecting eleven hardwired instructions.11 Some of these instructions operate directly on registers, such as clearing exceptions; the decoded instruction signal goes to the relevant register and modifies it in an ad hoc way.12 Other hardwired instructions are more complicated, writing chip state to memory or reading chip state from memory. These instructions require multiple memory operations, controlled by the Bus Interface Unit's state machine. Each of these instructions has a flip-flop that is triggered by the decoded instruction to keep track of which instruction is active.

For the instructions that save and restore the 8087's state (FSAVE and FRSTOR), there's one more complication. These instructions are partially implemented in the BIU, which moves the relevant BIU registers to or from memory. But then, instruction processing switches to microcode, where a microcode routine saves or loads the floating-point registers. Jumping to the microcode routine is not implemented through the regular microcode jump circuitry. Instead, two hardcoded values force the microcode address to the save or restore routine.13

Constants

The 8087 has seven instructions to load floating-point constants such as π, 1, or log₁₀(2). The 8087 has a constant ROM that holds these constants, as well as constants for transcendental operations. You might expect that the 8087 simply loads the specified constant from the constant ROM, using the instruction to select the desired constant. However, the process is much more complicated.14

Looking at the instruction decode ROM shows that different constants are implemented with different microcode routines: the constant-loading instructions FLDLG2 and FLDLN2 have one entry point; FLD1, FLDL2E, FLDL2T, and FLDPI have a second entry point, and FLDZ (zero) has a third entry point. It's understandable that zero is a special case, but why are there two routines for the other constants?

The explanation is that the fraction part of each constant is stored in the constant ROM, but the exponent is stored in a separate, smaller ROM. To reduce the size of the exponent ROM, only some of the necessary exponents are stored. If a constant needs an exponent one larger than a value in the ROM, the microcode adds one to the exponent ROM value, computing the exponent on the fly.

Thus, the load-constant instructions use three separate instruction decoding mechanisms. First, the instruction decode ROM determines the appropriate microcode routine for the constant instruction, as before. Then, the constant PLA decodes the instruction to select the appropriate constant. Finally, the microcode routine tests the bottom bit of the instruction and increments the exponent if necessary.

Conclusions

To wrap up the discussion of the decoding circuitry, the diagram below shows how the different circuits are arranged on the die. This image shows the upper-right part of the die; the microcode engine is at the left and part of the ROM is at the bottom.

The upper-right portion of the 8087 die, with functional blocks labeled.

The 8087 doesn't have a clean architecture, but instead is full of ad hoc circuits and corner cases. The 8087's instruction decoding is an example of this. Decoding is complicated to start with due to the 8086's convoluted instruction formats and the ModR/M byte. On top of that, the 8087's instruction decoding has multiple layers: the instruction decode PLA, microcode conditional jumps that depend on the instruction, a special jump target that depends on the instruction, constants selected based on the instruction, and instructions decoded by the BIU.

The 8087 has a reason for this complicated architecture: at the time, the chip was on the edge of what was possible, so the designers needed to use whatever techniques they could to reduce the size of the chip. If implementing a corner case could shave a few transistors off the chip or make the microcode ROM slightly smaller, the corner case was worthwhile. Even so, the 8087 was barely manufacturable at first; early yield was just two working chips per silicon wafer. Despite this difficult start, a floating-point standard based on the 8087 is now part of almost every processor.

Thanks to the members of the "Opcode Collective" for their contributions, especially Smartest Blob and Gloriouscow.

For updates, follow me on Bluesky (@righto.com), Mastodon, or RSS.

Notes and references

The contents of the microcode ROM are available here, partially decoded thanks to Smartest Blob. ↩
It is difficult for the 8087 to determine what the 8086 is doing because the 8086 prefetches instructions. Thus, when an instruction is seen on the bus, the 8086 may execute it at some point in the future, or it may end up discarded.

In order to tell what instruction is being executed, the 8087 floating-point chip internally duplicates the 8086 processor's queue. The 8087 watches the memory bus and copies any instructions that are prefetched. Since the 8087 can't tell from the bus when the 8086 starts a new instruction or when the 8086 empties the queue when jumping to a new address, the 8086 processor provides two queue status signals to the 8087. With the help of these signals, the 8087 knows exactly what the 8086 is executing.
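As a rough model of this queue tracking, here is a Python sketch. The two-bit status encoding is the one documented for the 8086's QS1/QS0 pins (01 = first byte of an opcode taken from the queue, 11 = subsequent byte taken, 10 = queue emptied, 00 = nothing). The class and method names are hypothetical, and real bus timing is ignored.

```python
from collections import deque


class QueueMirror:
    """Hypothetical model of the 8087's copy of the 8086 prefetch queue."""

    def __init__(self) -> None:
        self.queue = deque()  # bytes prefetched but not yet executed
        self.current = []     # bytes of the instruction now being executed

    def bus_fetch(self, data: bytes) -> None:
        """Copy prefetched instruction bytes seen on the memory bus."""
        self.queue.extend(data)

    def queue_status(self, qs: int) -> None:
        """Apply one queue status code from the 8086 (QS1, QS0 as two bits)."""
        if qs == 0b01:    # first byte of a new instruction leaves the queue
            self.current = [self.queue.popleft()]
        elif qs == 0b11:  # a subsequent byte of the same instruction
            self.current.append(self.queue.popleft())
        elif qs == 0b10:  # the 8086 jumped and flushed its queue
            self.queue.clear()
        # 0b00: no queue operation


mirror = QueueMirror()
mirror.bus_fetch(bytes([0xD9, 0xE8, 0x90]))  # prefetched: FLD1, then NOP
mirror.queue_status(0b01)                    # 8086 starts the ESCAPE opcode
mirror.queue_status(0b11)                    # and takes the ModR/M byte
```

At this point `mirror.current` holds `[0xD9, 0xE8]`, so the mirror knows that the prefetched ESCAPE instruction is really being executed rather than discarded.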

The 8087's instruction queue has six 8-bit registers, the same as the 8086. Surprisingly, the last two queue registers in the 8087 are tied together, so there are only five usable queue registers. My hypothesis is that since the 8087 copies the active instruction into separate registers (unlike the 8086), only five queue registers are needed. This raises the question of why the excess register wasn't removed from the die, rather than wasting valuable space.

The 8088 processor, used in the IBM PC, has a four-byte queue instead of a six-byte queue. The 8088 is almost identical to the 8086 except it has an 8-bit memory bus instead of a 16-bit memory bus. With the narrower memory bus, prefetching is more likely to get in the way of other memory accesses, so a smaller prefetch queue was implemented.

Knowing the queue size is essential to the 8087 floating-point chip. To indicate this, when the processor boots, a signal lets the 8087 determine if the attached processor is an 8086 or an 8088. ↩
The relevant part of the opcode is 11 bits: the top 5 bits are always 11011 for an ESCAPE opcode, so they can be ignored during decoding. The Bus Interface Unit has a 3-bit register to hold the remaining three bits of the first byte of the instruction and an 8-bit register to hold the second byte. The BIU registers have an irregular appearance because there are 3-bit registers, 8-bit registers, and 10-bit registers (holding half of a 20-bit address). ↩
What's the difference between a PLA and a ROM? There is a lot of overlap: a ROM can replace a PLA, while a PLA can implement a ROM. A ROM is essentially a PLA where the first stage is a binary decoder, so the ROM has a separate row for each input value. However, the first stage of a ROM can be optimized so multiple inputs share the same output value; is this a ROM or a PLA?

The "official" difference is that in a ROM, one row is activated at a time, while in a PLA, multiple rows can be activated at once, so the output values are combined. (Thus, it is straightforward to read the values out of a ROM, but more difficult to read the values out of a PLA.)

I consider the instruction decoding PLA to be best described as a PLA first stage with the second stage acting as a ROM. You could also call it a partially-decoded ROM, or just a PLA. Hopefully my terminology isn't too confusing. ↩
To match a bit pattern in an instruction, the bits of the instruction are fed into the PLA, along with the complements of these bits; this allows the PLA to match against a 0 bit or a 1 bit. Each row of a PLA will match a particular bit pattern in the instruction: bits that must be 1, bits that must be 0, and bits that don't matter. If the instruction opcodes are assigned rationally, a small number of bit patterns will match all the opcodes, reducing the size of the decoder.

I may be going too far with this analogy, but a PLA is a lot like a neural net. Each column in the AND plane is like a neuron that fires when it recognizes a particular input pattern. The OR plane is like a second layer in a neural net, combining signals from the first layer. The PLA's "weights", however, are fixed at 0 or 1, so it's not as flexible as a "real" neural net. ↩
The instruction decoding PLA has an unusual layout, where the second plane is rotated 90°. In a regular PLA (left), the inputs (red) go into the first plane, the perpendicular outputs from the first plane (purple) go into the second plane, and the PLA outputs (blue) exit parallel to the inputs. In the address PLA, however, the second plane is rotated 90°, so the outputs are perpendicular to the inputs. This approach requires additional wiring (horizontal purple lines), but presumably, this layout worked better in the 8087 since the outputs are lined up with the rest of the microcode engine.

Conceptual diagram of a regular PLA on the left and a rotated PLA on the right.

↩
To describe the implementation of a PLA in more detail, the transistors in each row of the AND plane form a NOR gate, since if any transistor is turned on, it pulls the output low. Likewise, the transistors in each column of the OR plane form a NOR gate. So why is the PLA described as having an AND plane and an OR plane, rather than two NOR planes? By using De Morgan's law, you can treat the NOR-NOR Boolean equations as equivalent to AND-OR Boolean equations (with the inputs and outputs inverted). It's usually much easier to understand the logic as AND terms OR'd together.
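The De Morgan equivalence is easy to check by brute force. This Python sketch uses a tiny made-up PLA with two terms of two inputs each and confirms that NOR-NOR gives the same result as AND-OR with the inputs and output inverted.

```python
from itertools import product


def nor(*inputs: int) -> int:
    """NOR gate: output is 1 only if every input is 0."""
    return int(not any(inputs))


def nor_nor(a: int, b: int, c: int, d: int) -> int:
    """The physical structure: a NOR in each plane."""
    term1 = nor(a, b)  # first plane
    term2 = nor(c, d)
    return nor(term1, term2)  # second plane


def and_or_inverted(a: int, b: int, c: int, d: int) -> int:
    """The conceptual structure: AND-OR on inverted inputs, output inverted."""
    na, nb, nc, nd = 1 - a, 1 - b, 1 - c, 1 - d
    return 1 - ((na & nb) | (nc & nd))


# Compare the two forms over all 16 input combinations.
equivalent = all(
    nor_nor(*values) == and_or_inverted(*values)
    for values in product((0, 1), repeat=4)
)
```

`equivalent` comes out `True`, which is why the two descriptions can be used interchangeably.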

The converse question is why don't they build the PLA from AND and OR gates instead of NOR gates? The reason is that AND and OR gates are harder to build with NMOS transistors, since you need to add explicit inverter circuits. Moreover, NMOS NOR gates are typically faster than NAND gates because the transistors are in parallel. (CMOS is the opposite; NAND gates are faster because the weaker PMOS transistors are in parallel.) ↩

The 8087's opcodes can be organized into tables, showing the underlying structure. (In each table, the row (Y) coordinate is the bottom 3 bits of the first byte and the column (X) coordinate is the 3 bits after the MOD bits in the second byte.)

Memory operations use the following encoding with MOD = 0, 1, or 2. Each box represents 8 different addressing modes.

	0	1	2	3	4	5	6	7
0	FADD	FMUL	FCOM	FCOMP	FSUB	FSUBR	FDIV	FDIVR
1	FLD		FST	FSTP	FLDENV	FLDCW	FSTENV	FSTCW
2	FIADD	FIMUL	FICOM	FICOMP	FISUB	FISUBR	FIDIV	FIDIVR
3	FILD		FIST	FISTP		FLD		FSTP
4	FADD	FMUL	FCOM	FCOMP	FSUB	FSUBR	FDIV	FDIVR
5	FLD		FST	FSTP	FRSTOR		FSAVE	FSTSW
6	FIADD	FIMUL	FICOM	FICOMP	FISUB	FISUBR	FIDIV	FIDIVR
7	FILD		FIST	FISTP	FBLD	FILD	FBSTP	FISTP

The important point is that the instruction encoding has a lot of regularity, making the decoding process easier. For instance, the basic arithmetic operations (FADD through FDIVR) are repeated on alternating rows. However, the table also has significant irregularities, which complicate the decoding process.

The register operations (MOD = 3) have a related layout, but there are even more irregularities.

	0	1	2	3	4	5	6	7
0	FADD	FMUL	FCOM	FCOMP	FSUB	FSUBR	FDIV	FDIVR
1	FLD	FXCH	FNOP		misc1	misc2	misc3	misc4
2
3					misc5
4	FADD	FMUL			FSUB	FSUBR	FDIV	FDIVR
5	FFREE		FST	FSTP
6	FADDP	FMULP		FCOMPP	FSUBP	FSUBRP	FDIVP	FDIVRP
7

In most cases, each box indicates 8 different values for the stack register, but there are exceptions. The FNOP and FCOMPP instructions each have a single opcode, "wasting" the rest of the box.

Five of the boxes in the table encode multiple instructions instead of the register number. The first four (red) are miscellaneous instructions handled by the decoding PLA:
misc1 = FCHS, FABS, FTST, FXAM
misc2 = FLD1, FLDL2T, FLDL2E, FLDPI, FLDLG2, FLDLN2, FLDZ (the constant-loading instructions)
misc3 = F2XM1, FYL2X, FPTAN, FPATAN, FXTRACT, FDECSTP, FINCSTP
misc4 = FPREM, FYL2XP1, FSQRT, FRNDINT, FSCALE

The last miscellaneous box (yellow) holds instructions that are handled by the BIU.
misc5 = FENI, FDISI, FCLEX, FINIT

Curiously, the 8087's opcodes (like the 8086's) make much more sense in octal than in hexadecimal. In octal, an 8087 opcode is simply 33Y MXR, where X and Y are the table coordinates above, M is the MOD value (0, 1, 2, or 3), and R is the R/M field or the stack register number. ↩
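The octal pattern is easy to see with a couple of lines of Python. The function names here are made up for illustration; the example opcodes (D9 E8 for FLD1 and DE C1 for FADDP ST(1)) are the standard encodings.

```python
def octal_opcode(byte1: int, byte2: int) -> str:
    """Format a two-byte 8087 opcode as two three-digit octal numbers."""
    return f"{byte1:03o} {byte2:03o}"


def table_coordinates(byte1: int, byte2: int) -> dict:
    """Extract Y, M, X, and R, matching the octal digits 33Y MXR."""
    return {
        "Y": byte1 & 0b111,         # row in the tables above
        "M": byte2 >> 6,            # MOD value
        "X": (byte2 >> 3) & 0b111,  # column in the tables above
        "R": byte2 & 0b111,         # R/M field or stack register number
    }


fld1 = octal_opcode(0xD9, 0xE8)   # "331 350"
faddp = octal_opcode(0xDE, 0xC1)  # "336 301"
```

Reading `331 350` digit by digit gives Y=1, M=3, X=5, R=0, which is the misc2 box (the constant-loading instructions) in the register table. Likewise `336 301` is row 6, column 0 (FADDP) with stack register 1.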

The 22 outputs from the instruction decoder PLA correspond to the following groups of instructions, activating one row of ROM and producing the corresponding microcode address. From this table, you can see which instructions are grouped together in the microcode.

 0 #0200 FXCH
 1 #0597 FSTP (BCD)
 2 #0808 FCOM FCOMP FCOMPP
 3 #1008 FLDLG2 FLDLN2
 4 #1527 FSQRT
 5 #1586 FPREM
 6 #1138 FPATAN
 7 #1039 FPTAN
 8 #0900 F2XM1
 9 #1020 FLDZ
10 #0710 FRNDINT
11 #1463 FDECSTP FINCSTP
12 #0812 FTST
13 #0892 FABS FCHS
14 #0065 FFREE FLD
15 #0217 FNOP FST FSTP (not BCD)
16 #0001 FADD FDIV FDIVR FMUL FSUB FSUBR
17 #0748 FSCALE
18 #1028 FXTRACT
19 #1257 FYL2X FYL2XP1
20 #1003 FLD1 FLDL2E FLDL2T FLDPI
21 #1468 FXAM

↩

The instruction decoding PLA has 22 entries, and the jump table also has 22 entries. It's a coincidence that these values are the same.

An entry in the jump table ROM is selected by five bits of the micro-instruction. The ROM is structured with two 11-bit words per row, interleaved. (It's also a coincidence that there are 22 bits.) The upper four bits of the jump number select a row in the ROM, while the bottom bit selects one of the two words in that row.

This implementation is modified for target 0, the three-way jump. The first ROM row is selected for target 0 if the current instruction is multiplication, or for target 1. The second row is selected for target 0 if the current instruction is addition or subtraction, or for target 2. The third row is selected for target 0 if the current instruction is division, or for target 3. Thus, target 0 ends up selecting rows 1, 2, or 3. However, remember that there are two words per row, selected by the low bit of the target number. The problem is that target 0 with multiplication will access the left word of row 1, while target 1 will access the right word of row 1, but both should provide the same address. The solution is that rows 1, 2, and 3 have the same address stored twice in the row, so these rows each "waste" a value.

For reference, the contents of the jump table are:
```
 0: Jumps to target 1 for FMUL, 2 for FADD/FSUB/FSUBR, 3 for FDIV/FDIVR
 1: #0359
 2: #0232
 3: #0410
 4: #0083
 5: #1484
 6: #0122
 7: #0173
 8: #0439
 9: #0655
10: #0534
11: #0299
12: #1572
13: #1446
14: #0859
15: #0396
16: #0318
17: #0380
18: #0779
19: #0868
20: #0522
21: #0801
```
↩
Eleven instructions are implemented in the BIU hardware. Four of these are relatively simple, setting or clearing bits: FINIT (initialize), FENI (enable interrupts), FDISI (disable interrupts), and FCLEX (clear exceptions). Seven of these are more complicated, storing state to memory or loading state from memory: FLDCW (load control word), FSTCW (store control word), FSTSW (store status word), FSTENV (store environment), FLDENV (load environment), FSAVE (save state), and FRSTOR (restore state). As explained elsewhere, the last two instructions are partially implemented in microcode. ↩
Even a seemingly trivial instruction uses more circuitry than you might expect. For instance, after the FCLEX (clear exception) instruction is decoded, the signal goes through nine gates before it clears the exception bits in the status register. Along the way, it goes through a flip-flop to synchronize the timing, a gate to combine it with the reset signal, and various inverters and drivers. Even though these instructions seem like they should complete immediately, they typically take 5 clock cycles due to overhead in the 8087. ↩
I'll give more details here on the circuit that jumps to the save or restore microcode. The BIU sends two signals to the microcode engine, one to jump to the save code and one to jump to the restore code. These signals are buffered and delayed by a capacitor, probably to adjust the timing of the signal.

In the microcode engine, there are two hardcoded constants for the routines, just above the jump table; the BIU signal causes the appropriate constant to go onto the micro-address lines. Each bit in the address has a pull-up transistor to +5V or a pull-down transistor to ground. This approach is somewhat inefficient since it requires two transistor sites per bit. In comparison, the jump address ROM and the instruction address ROM use one transistor site per bit. (As in a PLA, each transistor is present or absent as needed, so the number of physical transistors is less than the number of transistor sites.)

Two capacitors in the 8087. This photo shows the metal layer with the silicon and polysilicon underneath.

Since capacitors are somewhat unusual in NMOS circuits, I'll show them in the photo above. If a polysilicon line crosses over doped silicon, it creates a transistor. However, if a polysilicon region sits on top of the doped silicon without crossing it, it forms a capacitor instead. (The capacitance exists for a transistor, too, but the gate capacitance is generally unwanted.) ↩
The documentation provides a hint that the microcode to load constants is complicated. Specifically, the documentation shows that different constants take different amounts of time to load. For instance, log₂(e) takes 18 cycles while log₂(10) takes 19 cycles and log₁₀(2) takes 21 cycles. You'd expect that pre-computed constants would all take the same time, so the varying times show that more is happening behind the scenes. ↩

Conditions in the Intel 8087 floating-point chip's microcode

In the 1980s, if you wanted your computer to do floating-point calculations faster, you could buy the Intel 8087 floating-point coprocessor chip. Plugging it into your IBM PC would make operations up to 100 times faster, a big boost for spreadsheets and other number-crunching applications. The 8087 uses complicated algorithms to compute trigonometric, logarithmic, and exponential functions. These algorithms are implemented inside the chip in microcode. I'm part of a group that is reverse-engineering this microcode. In this post, I examine the 49 types of conditional tests that the 8087's microcode uses inside its algorithms. Some conditions are simple, such as checking if a number is zero or negative, while others are specialized, such as determining what direction to round a number.

To explore the 8087's circuitry, I opened up an 8087 chip and took numerous photos of the silicon die with a microscope. Around the edges of the die, you can see the hair-thin bond wires that connect the chip to its 40 external pins. The complex patterns on the die are formed by its metal wiring, as well as the polysilicon and silicon underneath. The bottom half of the chip is the "datapath", the circuitry that performs calculations on 80-bit floating point values. At the left of the datapath, a constant ROM holds important constants such as π. At the right are the eight registers that the programmer uses to hold floating-point values; in an unusual design decision, these registers are arranged as a stack.

Die of the Intel 8087 floating point unit chip, with main functional blocks labeled. The die is 5mm×6mm.

The chip's instructions are defined by the large microcode ROM in the middle. To execute a floating-point instruction, the 8087 decodes the instruction and the microcode engine starts executing the appropriate micro-instructions from the microcode ROM. The microcode decode circuitry to the right of the ROM generates the appropriate control signals from each micro-instruction.1 The bus registers and control circuitry handle interactions with the main 8086 processor and the rest of the system.

The 8087's microcode

Executing an 8087 instruction such as arctan requires hundreds of internal steps to compute the result. These steps are implemented in microcode with micro-instructions specifying each step of the algorithm. (Keep in mind the difference between the assembly language instructions used by a programmer and the undocumented low-level micro-instructions used internally by the chip.) The microcode ROM holds 1648 micro-instructions, implementing the 8087's instruction set. Each micro-instruction is 16 bits long and performs a simple operation such as moving data inside the chip, adding two values, or shifting data. I'm working with the "Opcode Collective" to reverse engineer the micro-instructions and fully understand the microcode (link).

The microcode engine (below) controls the execution of micro-instructions, acting as the mini-CPU inside the 8087. Specifically, it generates an 11-bit micro-address, the address of a micro-instruction in the ROM. The microcode engine implements jumps, subroutine calls, and returns within the microcode. These jumps, subroutine calls, and returns are all conditional; the microcode engine will either perform the operation or skip it, depending on the value of a specified condition.

The microcode engine. In this image, the metal is removed, showing the underlying silicon and polysilicon.

I'll write more about the microcode engine later, but I'll give an overview here. At the top, the Instruction Decode PLA2 decodes an 8087 instruction to determine the starting address in microcode. Below that, the Jump PLA holds microcode addresses for jumps and subroutine calls. Below this, six 11-bit registers implement the microcode stack, allowing six levels of subroutine calls inside the microcode. (Note that this stack is completely different from the 8087's register stack that holds eight floating-point values.) The stack registers have associated read/write circuitry. The incrementer adds one to the micro-address to step through the code. The engine also implements relative jumps, using an adder to add an offset to the current location. At the bottom, the address latch and drivers boost the 11-bit address output and send it to the microcode ROM.
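To make the address sequencing concrete, here is a toy software model of the operations described above: incrementing, conditional relative jumps, and conditional subroutine calls and returns. This is only a sketch and not the 8087's actual logic. The class and method names are hypothetical, the overflow check is my own addition, and I assume that a relative offset is added to the address of the following micro-instruction (so a jump of +1 skips one micro-instruction).

```python
class MicroEngine:
    """Toy model of micro-address sequencing; names and details are illustrative."""
    ADDR_MASK = 0x7FF   # the micro-address is 11 bits
    STACK_DEPTH = 6     # levels of micro-subroutine calls, per the text

    def __init__(self, start):
        self.addr = start & self.ADDR_MASK
        self.stack = []

    def step(self):
        # Incrementer: advance to the next micro-instruction.
        self.addr = (self.addr + 1) & self.ADDR_MASK

    def jump_relative(self, offset, condition):
        # Adder: skip ahead by `offset` only if the condition holds (assumed semantics).
        self.step()
        if condition:
            self.addr = (self.addr + offset) & self.ADDR_MASK

    def call(self, target, condition):
        # Conditional subroutine call: save the return address on the micro-stack.
        if not condition:
            self.step()
            return
        if len(self.stack) >= self.STACK_DEPTH:
            raise OverflowError("micro-stack full")  # modeling choice, not hardware behavior
        self.stack.append((self.addr + 1) & self.ADDR_MASK)
        self.addr = target & self.ADDR_MASK

    def ret(self, condition):
        # Conditional return: pop the saved address, or fall through.
        if condition:
            self.addr = self.stack.pop()
        else:
            self.step()
```

In this model, every control-flow operation takes a condition value, which is the single bit produced by the condition-selection circuitry.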

Selecting a condition

A micro-instruction can say "jump ahead 5 micro-instructions if a register is zero" and the microcode engine will either perform the jump or ignore it, based on the register value. In the circuitry, the condition causes the microcode engine to either perform the jump or block the jump. But how does the hardware select one condition out of the large set of conditions?

Six bits of the micro-instruction can specify one of 64 conditions. A circuit similar to the idealized diagram below selects the specified condition. The key component is a multiplexer, represented by a trapezoid below. A multiplexer is a simple circuit that selects one of its four inputs. By arranging multiplexers in a tree, one of the 64 conditions on the left is selected and becomes the output, passed to the microcode engine.

A tree of multiplexers selects one of the conditions. This diagram is simplified.

For example, if bits J and K of the microcode are 00, the rightmost multiplexer will select the first input. If bits LM are 01, the middle multiplexer will select the second input, and if bits NO are 10, the left multiplexer will select its third input. The result is that condition 06 will pass through the tree and become the output.3 By changing the bits that control the multiplexers, any of the inputs can be used. (We've arbitrarily given the 16 microcode bits the letter names A through P.)
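The idealized tree can be sketched in a few lines of Python. The function names are hypothetical, and the code models the simplified diagram (three levels of four-input multiplexers), not the actual circuit.

```python
def mux4(inputs, select):
    """4-to-1 multiplexer: return the input chosen by a 2-bit select value."""
    return inputs[select]

def select_condition(conditions, jk, lm, no):
    """Pick one of 64 condition values with a three-level tree of 4-to-1 muxes."""
    assert len(conditions) == 64
    # Leaf level: 16 muxes, each choosing among 4 adjacent conditions using bits NO.
    level1 = [mux4(conditions[i:i + 4], no) for i in range(0, 64, 4)]
    # Middle level: 4 muxes controlled by bits LM.
    level2 = [mux4(level1[i:i + 4], lm) for i in range(0, 16, 4)]
    # Root: a single mux controlled by bits JK.
    return mux4(level2, jk)

# Label each input with its own number so the selected index is visible.
labels = list(range(64))
print(hex(select_condition(labels, 0b00, 0b01, 0b10)))  # prints 0x6
```

With JK=00, LM=01, and NO=10, the tree passes input 06 to the output, matching the example above.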

Physically, the conditions come from locations scattered across the die. For instance, conditions involving the opcode come from the instruction decoding part of the chip, while conditions involving a register are evaluated next to the register. It would be inefficient to run 64 wires for all the conditions to the microcode engine. The tree-based approach reduces the wiring since the "leaf" multiplexers can be located near the associated condition circuitry. Thus, only one wire needs to travel a long distance rather than multiple wires. In other words, the condition selection circuitry is distributed across the chip instead of being implemented as a centralized module.

Because the conditions don't always fall into groups of four, the actual implementation is slightly different from the idealized diagram above. In particular, the top-level multiplexer has five inputs, rather than four.4 Other multiplexers don't use all four inputs. This provides a better match between the physical locations of the condition circuits and the multiplexers. In total, 49 of the possible 64 conditions are implemented in the 8087.
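As a rough sketch of the five-way split, the function below maps a 6-bit condition number to one of the five top-level inputs, using the bit patterns listed in the footnote (00, 011, 10, 110, and 111, with 010 unused). The function name and the numbering of the inputs from 0 to 4 are arbitrary choices for illustration.

```python
def top_level_input(cond):
    """Return which of the five top-level mux inputs a condition uses, or None if unused."""
    jk = (cond >> 4) & 0b11   # bits J and K
    l = (cond >> 3) & 0b1     # bit L
    if jk == 0b00:
        return 0                  # pattern 00: conditions 0x00-0x0f
    if jk == 0b01:
        return 1 if l else None   # pattern 011: 0x18-0x1f; pattern 010 (0x10-0x17) is unused
    if jk == 0b10:
        return 2                  # pattern 10: conditions 0x20-0x2f
    return 3 + l                  # pattern 110: 0x30-0x37; pattern 111: 0x38-0x3f
```

Two of the inputs cover 16 conditions each, while the other three cover 8 conditions each.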

The circuit that selects one of the four conditions is called a multiplexer. It is constructed from pass transistors, transistors that are configured to either pass a signal through or block it. To operate the multiplexer, one of the select lines is energized, turning on the corresponding pass transistor. This allows the selected input to pass through the transistor to the output, while the other inputs are blocked.

A 4-1 multiplexer, constructed from four pass transistors.

The diagram below shows how a multiplexer appears on the die. The pinkish regions are doped silicon. The white lines are polysilicon wires. When polysilicon crosses over doped silicon, a transistor is formed. On the left is a four-way multiplexer, constructed from four pass transistors. It takes inputs (black) for four conditions, numbered 38, 39, 3a, and 3b. There are four control signals (red) corresponding to the four combinations of bits N and O. One of the inputs will pass through a transistor to the output, selected by the active control signal. The right half contains the logic (four NOR gates and two inverters) to generate the control signals from the microcode bits. (Metal lines run horizontally from the logic to the control signal contacts, but I dissolved the metal for this photo.) Each multiplexer in the 8087 has a completely different layout, manually optimized based on the location of the signals and surrounding circuitry. Although the circuit for a multiplexer is regular (four transistors in parallel), the physical layout looks somewhat chaotic.

Multiplexers as they appear on the die. The metal layer has been removed to show the polysilicon and silicon. The "tie-dye" patterns are due to thin-film effects where the oxide layer wasn't completely removed.
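The following sketch models this multiplexer at the logic level: two inverters and four NOR gates turn bits N and O into four one-hot control signals, and each control signal turns on one pass transistor. The exact wiring of the gates is my assumption (the standard way to build a 2-to-4 decoder from NOR gates), and the function names are hypothetical.

```python
def nor(a, b):
    """Two-input NOR gate on 0/1 values."""
    return int(not (a or b))

def decode_select(n, o):
    """One-hot control signals for bits N and O, from two inverters and four NOR gates."""
    n_bar, o_bar = int(not n), int(not o)   # the two inverters
    return [
        nor(n, o),          # active when NO = 00
        nor(n, o_bar),      # active when NO = 01
        nor(n_bar, o),      # active when NO = 10
        nor(n_bar, o_bar),  # active when NO = 11
    ]

def pass_mux(inputs, select_lines):
    """Pass-transistor mux: the one energized select line lets its input through."""
    assert sum(select_lines) == 1, "exactly one select line should be active"
    for value, enabled in zip(inputs, select_lines):
        if enabled:
            return value

conditions = ["cond 38", "cond 39", "cond 3a", "cond 3b"]
print(pass_mux(conditions, decode_select(1, 0)))  # prints cond 3a
```

Unlike this model, a real pass transistor provides no amplification, so the signal that comes out is weaker than the signal that went in.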

The 8087 uses pass transistors for many circuits, not just multiplexers. Circuits with pass transistors are different from regular logic gates because the pass transistors provide no amplification. Instead, signals get weaker as they go through pass transistors. To solve this problem, inverters or buffers are inserted into the condition tree to boost signals; they are omitted from the diagram above.

The conditions

Of the 8087's 49 different conditions, some are widely used in the microcode, while others are designed for a specific purpose and are only used once. The full set of conditions is described in a footnote7 but I'll give some highlights here.

Fifteen conditions examine the bits of the current instruction's opcode. This allows one microcode routine to handle a group of similar instructions and then change behavior based on the specific instruction. For example, conditions test if the instruction is multiplication, if the instruction is an FILD/FIST (integer load or store), or if the bottom bit of the opcode is set.5

The 8087 has three temporary registers—tmpA, tmpB, and tmpC—that hold values during computation. Various conditions examine the values in the tmpA and tmpB registers.6 In particular, the 8087 uses an interesting way to store numbers internally: each 80-bit floating-point value also has two "tag" bits. These bits are mostly invisible to the programmer and can be thought of as metadata. The tag bits indicate if a register is empty, contains zero, contains a "normal" number, or contains a special value such as NaN (Not a Number) or infinity. The 8087 uses the tag bits to optimize operations. The tags also detect stack overflow (storing to a non-empty stack register) or stack underflow (reading from an empty stack register).

Other conditions are highly specialized. For instance, one condition looks at the rounding mode setting and the sign of the value to determine if the value should be rounded up or down. Other conditions deal with exceptions such as numbers that are too small (i.e. denormalized) or numbers that lose precision. Another condition tests if two values have the same sign or not. Yet another condition tests if two values have the same sign or not, but inverts the result if the current instruction is subtraction. The simplest condition is simply "true", allowing an unconditional branch.

For flexibility, conditions can be "flipped", either jumping if the condition is true or jumping if the condition is false. This is controlled by bit P of the microcode. In the circuitry, this is implemented by a gate that XORs the P bit with the condition. The result is that the state of the condition is flipped if bit P is set.
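In software terms, the flip is a one-line XOR. The function name below is hypothetical.

```python
def jump_taken(condition, p_bit):
    """XOR the selected condition with bit P; P=1 inverts the sense of the test."""
    return bool(condition) ^ bool(p_bit)

for condition in (False, True):
    for p_bit in (0, 1):
        print(condition, p_bit, jump_taken(condition, p_bit))
```

With P clear, the jump happens when the condition is true. With P set, the jump happens when the condition is false.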

For a concrete example of how conditions are used, consider the microcode routine that implements FCHS and FABS, the instructions to change the sign and compute the absolute value, respectively. These operations are almost the same (toggling the sign bit versus clearing the sign bit), so the same microcode routine handles both instructions, with a jump instruction to handle the difference. The FABS and FCHS instructions were designed with identical opcodes, except that the bottom bit is set for FABS. Thus, the microcode routine uses a condition that tests the bottom bit, allowing the routine to branch and change its behavior for FABS vs FCHS.

Looking at the relevant micro-instruction, it has the hex value 0xc094, or in binary 110 000001 001010 0. The first three bits (ABC=110) specify the relative jump operation (100 would jump to a fixed target and 101 would perform a subroutine call). Bits D through I (000001) indicate the amount of the jump (+1). Bits J through O (001010, hex 0a) specify the condition to test, in this case, the last bit of the instruction opcode. The final bit (P) would toggle the condition if set (i.e. jump if false). Thus, for FABS, the jump instruction will jump ahead one micro-instruction. This has the effect of skipping the next micro-instruction, which sets the appropriate sign bit for FCHS.
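The field breakdown can be checked with a short decoder. This sketch assumes that bit A is the most significant bit of the 16-bit word and treats the jump amount as an unsigned value; the function name and dictionary keys are made up for illustration, and the layout applies only to jump-type micro-instructions.

```python
def decode_jump(word):
    """Split a 16-bit jump micro-instruction into its fields (bit A is the top bit)."""
    return {
        "op": (word >> 13) & 0b111,            # bits A-C: type of jump
        "offset": (word >> 7) & 0b111111,      # bits D-I: amount of the jump
        "condition": (word >> 1) & 0b111111,   # bits J-O: condition number
        "invert": word & 1,                    # bit P: flip the condition
    }

fields = decode_jump(0xc094)
print(fields)  # {'op': 6, 'offset': 1, 'condition': 10, 'invert': 0}
```

The decoded fields are operation 110 (relative jump), an offset of 1, condition 0a, and no inversion.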

Conclusions

The 8087 performs floating-point operations much faster than the 8086 by using special hardware, optimized for floating-point. The condition code circuitry is one example of this: the 8087 can test a complicated condition in a single operation. However, these complicated conditions make it much harder to understand the microcode. But by a combination of examining the circuitry and looking at the microcode, we're making progress. Thanks to the members of the "Opcode Collective" for their hard work, especially Smartest Blob and Gloriouscow.

For updates, follow me on Bluesky (@righto.com), Mastodon (@[email protected]), or RSS.

Notes and references

The section of the die that I've labeled "Microcode decode" performs some of the microcode decoding, but large parts of the decoding are scattered across the chip, close to the circuitry that needs the signals. This makes reverse-engineering the microcode much more difficult. I thought that understanding the microcode would be straightforward, just examining a block of decode circuitry. But this project turned out to be much more complicated and I need to reverse-engineer the entire chip. ↩
A PLA is a "Programmable Logic Array". It is a technique to implement logic functions with grids of transistors. A PLA can be used as a compressed ROM, holding data in a more compact representation. (Saving space was very important in chips of this era.) In the 8087, PLAs are used to hold tables of microcode addresses. ↩
Note that the multiplexer circuit selects the condition corresponding to the binary value of the bits. In the example, bits 000110 (0x06) select condition 06. ↩
The five top-level multiplexer inputs correspond to bit patterns 00, 011, 10, 110, and 111. That is, two inputs depend on bits J and K, while three inputs depend on bits J, K, and L. The bit pattern 010 is unused, corresponding to conditions 0x10 through 0x17, which aren't implemented. ↩
The 8087 acts as a co-processor with the 8086 processor. The 8086 instruction set is designed so instructions with a special "ESCAPE" sequence in the top 5 bits are processed by the co-processor, in this case the 8087. Thus, the 8087 receives a 16-bit instruction, but only the bottom 11 bits are usable. For a memory operation, the second byte of the instruction is an 8086-style ModR/M byte. For instructions that don't access memory, the second byte specifies more of the instruction and sometimes specifies the stack register to use for the instruction.

The relevance of this is that the 8087's microcode engine uses the 11 bits of the instruction to determine which microcode routine to execute. The microcode also uses various condition codes to change behavior depending on different bits of the instruction. ↩
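As an illustration, the 11 usable bits can be extracted as follows. The ESCAPE prefix is the bit pattern 11011 (opcode bytes 0xD8 through 0xDF); the function name is hypothetical, and the sketch ignores whether the second byte is a ModR/M byte.

```python
ESC_PREFIX = 0b11011  # top five bits of an 8086 ESCAPE opcode byte (0xD8-0xDF)

def opcode_11bit(byte1, byte2):
    """Combine the low 3 bits of the first byte with all 8 bits of the second byte."""
    if byte1 >> 3 != ESC_PREFIX:
        raise ValueError("not an ESCAPE instruction")
    return ((byte1 & 0b111) << 8) | byte2

fchs = opcode_11bit(0xD9, 0xE0)  # FCHS is encoded as D9 E0
fabs = opcode_11bit(0xD9, 0xE1)  # FABS is encoded as D9 E1
print(f"{fchs:011b} {fabs:011b}")  # prints 00111100000 00111100001
```

The two opcodes differ only in the bottom bit, which is the bit that the FCHS/FABS microcode routine tests.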
There is a complication with the tmpA and tmpB registers: they can be swapped with the micro-instruction "ABC.EF". The motivation behind this is that if you have two arguments, you can use a micro-subroutine to load an argument into tmpA, swap the registers, and then use the same subroutine to load the second argument into tmpA. The result is that the two arguments end up in tmpB and tmpA without any special coding in the subroutine.

The implementation doesn't physically swap the registers, but renames them internally, which is much more efficient. A flip-flop is toggled every time the registers are swapped. If the flip-flop is set, a request goes to one register, while if the flip-flop is clear, a request goes to the other register. (Many processors use the same trick. For instance, the Intel 8080 has an instruction to exchange the DE and HL registers. The Z80 has an instruction to swap register banks. In both cases, a flip-flop renames the registers, so the data doesn't need to move.) ↩
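The renaming trick can be sketched as follows, with a boolean standing in for the flip-flop. The class and its methods are hypothetical and only illustrate the idea, not the 8087's circuitry.

```python
class SwappableRegisters:
    """Two physical registers whose names are exchanged by toggling a flip-flop."""

    def __init__(self):
        self.physical = [0, 0]
        self.swapped = False   # the flip-flop

    def swap(self):
        # Toggle the flip-flop; no data moves.
        self.swapped = not self.swapped

    def _index(self, name):
        # Steer the request to one physical register or the other.
        base = {"tmpA": 0, "tmpB": 1}[name]
        return base ^ int(self.swapped)

    def read(self, name):
        return self.physical[self._index(name)]

    def write(self, name, value):
        self.physical[self._index(name)] = value

regs = SwappableRegisters()
regs.write("tmpA", 1.5)   # load the first argument into tmpA
regs.swap()
regs.write("tmpA", 2.5)   # the same load code now fills the other register
print(regs.read("tmpB"), regs.read("tmpA"))  # prints 1.5 2.5
```

The two arguments end up in tmpB and tmpA even though the load code only ever writes to tmpA.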

The table below is the real meat of this post, the result of much circuit analysis. These details probably aren't interesting to most people, so I've relegated the table to a footnote. Descriptions in italics are provided by Smartest Blob based on examination of the microcode. Grayed-out lines are unused conditions.

The table has five sections, corresponding to the 5 inputs to the top-level condition multiplexer. These inputs come from different parts of the chip, so the sections correspond to different categories of conditions.

The first section consists of instruction parsing, with circuitry near the microcode engine. The description shows the 11-bit opcode pattern that triggers the condition, with 0 bits and 1 bits as specified, and X indicating a "don't care" bit that can be 0 or 1. Where simpler, I list the relevant instructions instead.
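To show how the pattern notation works, here is a small matcher that tests an 11-bit opcode against a pattern of 0, 1, and X. The function name is hypothetical, and the sketch handles only plain patterns, not entries with "not" or "or".

```python
def matches(pattern, opcode):
    """Test an 11-bit opcode against a pattern of 0, 1, and X (don't care)."""
    bits = pattern.replace(" ", "").upper()
    assert len(bits) == 11, "pattern must describe 11 bits"
    for position, want in enumerate(bits):
        actual = (opcode >> (10 - position)) & 1   # leftmost character is the top bit
        if want != "X" and int(want) != actual:
            return False
    return True

print(matches("XXX XXXXXXX1", 0b00111100001))  # FABS (D9 E1): True
print(matches("XXX XXXXXXX1", 0b00111100000))  # FCHS (D9 E0): False
```

The pattern XXX XXXXXXX1 corresponds to condition 0a, the test of the bottom opcode bit.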

The next section indicates conditions on the exponent. I am still investigating these conditions, so the descriptions are incomplete. The third section is conditions on the temporary registers or conditions related to the control register. These circuits are to the right of the microcode ROM.

Conditions in the fourth section examine the floating-point bus, with circuitry near the bottom of the chip. Conditions 34 and 35 use a special 16-bit bidirectional shift register, at the far right of the chip. The top bit from the floating-point bus is shifted in. Maybe this shift register is used for CORDIC calculations? The conditions in the final block are miscellaneous, including the always-true condition 3e, which is used for unconditional jumps.

Cond.	Description
00	not XXX 11XXXXXX
01	1XX 11XXXXXX
02	0XX 11XXXXXX
03	X0X XXXXXXXX
04	not cond 07 or 1XX XXXXXXXX
05	not FLD/FSTP temp-real or BCD
06	110 XXXXXXXX or 111 XX0XXXXX
07	FLD/FSTP temp-real
08	FBLD/FBSTP
09
0a	XXX XXXXXXX1
0b	XXX XXXX1XXX
0c	FMUL
0d	FDIV FDIVR
0e	FADD FCOM FCOMP FCOMPP FDIV FDIVR FFREE FLD FMUL FST FSTP FSUB FSUBR FXCH
0f	FCOM FCOMP FCOMPP FTST
10
11
12
13
14
15
16
17
18	exponent condition
19	exponent condition
1a	exponent condition
1b	exponent condition
1c	exponent condition
1d	exponent condition
1e	eight exponent zero bits
1f	exponent condition
20	tmpA tag ZERO
21	tmpA tag SPECIAL
22	tmpA tag VALID
23	stack overflow
24	tmpB tag ZERO
25	tmpB tag SPECIAL
26	tmpB tag VALID
27	st(i) doesn't exist (A)?
28	tmpA sign
29	tmpB top bit
2a	tmpA zero
2b	tmpA top bit
2c	Control Reg bit 12: infinity control
2d	round up/down
2e	unmasked interrupt
2f	DE (denormalized) interrupt
30	top reg bit
31
32	reg bit 64
33	reg bit 63
34	Shifted top bits, all zero
35	Shifted top bits, one out
36
37
38	const latch zero
39	tmpA vs tmpB sign, flipped for subtraction
3a	precision exception
3b	tmpA vs tmpB sign
3c
3d
3e	unconditional
3f

This table is under development and undoubtedly has errors. ↩
