
Photo from IBM.
It’s nice to know that our desktop computers, and maybe even our cell phones, have more power than a room full of vacuum tubes and wires from the early days of computing. But now imagine a computer that is 100,000 times more powerful than a typical personal computer, that can perform more operations per second than a stack of laptops a mile and a half high, and that can solve a problem in a day that would take a PC a lifetime. These are the touted capabilities of the next generation of supercomputers, the petascale supercomputers, which are now under construction. Petascale computing represents the next rung in the ongoing climb towards faster and faster supercomputing, and these machines will be approximately 1,000 times faster than their predecessors.
The exciting thing about these machines, other than the sheer magnitude of the technical achievement, is that they can solve problems and answer questions that could never have been considered before—the “grand challenge” problems in science, engineering, and other fields.
What is a Petascale Supercomputer?
The short answer is that a petascale supercomputer is one that can perform at least a quadrillion (1,000,000,000,000,000) floating-point operations per second. A quadrillion floating-point operations per second is often abbreviated as petaflops. (Similarly, a trillion floating-point operations per second is abbreviated as teraflops, a million as megaflops, and a single floating-point operation per second as flops.) A floating-point operation is an arithmetic operation, such as an addition or multiplication, on numbers that are stored in floating-point format. This is the most common format used to store real numbers, that is, numbers that may have a fractional component, such as 123.45.
A problem with computer performance ratings, however, is that they can be somewhat vague or misleading. Say a manufacturer has rated a supercomputer at 1 petaflops. Great. But which floating-point operations were performed, and in what application context? As Hennessy and Patterson point out in their venerable Computer Architecture textbook, the most meaningful performance metric is the speed at which a computer executes the applications you actually run. However, to have a universal metric for comparing supercomputer performance, it is useful to test all computers using a common benchmark program.
In 1993, Jack Dongarra, a computer scientist at the University of Tennessee, started compiling the now prestigious and universally recognized TOP500 list, which ranks supercomputers according to their performance. To measure performance, TOP500 uses a benchmark program known as Linpack, which solves a system of linear equations and, for a given problem size, involves a known number of floating-point operations (specifically, additions and multiplications). Therefore, measuring the time it takes a supercomputer to run Linpack for a given problem size tells us how many flops it can perform when running Linpack (not necessarily when running some other type of application).
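The idea of deriving a flops rating from a timed solve can be sketched in a few lines of Python. This is only an illustration, not the actual Linpack code: the function names are hypothetical, the solver is a plain Gaussian elimination with partial pivoting, and the operation count uses the conventional approximation of (2/3)n³ + 2n² for an n-by-n system, which is an assumption not stated in this article.

```python
import random
import time


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on copies so the caller's data is left untouched.
    a = [row[:] for row in a]
    b = b[:]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(a[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (b[row] - tail) / a[row][row]
    return x


def estimated_flop_count(n):
    """Assumed approximate operation count for an n-by-n solve."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def measure_flops(n, seed=0):
    """Time one n-by-n solve on random data and return the implied flops rate."""
    rng = random.Random(seed)
    a = [[rng.random() for _ in range(n)] for _ in range(n)]
    b = [rng.random() for _ in range(n)]
    start = time.perf_counter()
    solve_linear_system(a, b)
    elapsed = time.perf_counter() - start
    return estimated_flop_count(n) / elapsed


print(f"{measure_flops(100) / 1e6:.1f} megaflops")
```

Because the operation count is fixed by the problem size, only the elapsed time varies from machine to machine, which is what makes the resulting number comparable across systems.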
So when a computer manufacturer claims a certain number of flops for its machine, it could be referring to performance on the Linpack benchmark or possibly on another benchmark. Or, worse, it could be referring to a theoretical, calculated peak number of flops based on the maximum number of operations that can possibly be executed during a machine cycle, rather than the performance that is measured when the computer actually runs a specific application. (In fact, the TOP500 list displays such a theoretical value, known as Rpeak, but uses the actual performance on the benchmark, Rmax, to rank the computers.)
Even if we know that the flops rating is based on a benchmark such as Linpack, Andy Bechtolsheim (the chief designer of the Sun Constellation) has pointed out that some machines can excel on a benchmark involving floating-point calculations, but then slow to a crawl when running programs that involve moving significant amounts of data between processors, as many applications do.
The main point is that in the quest for supercomputing speed, announcements of reaching certain landmarks, such as petascale computing, should be taken with a grain of salt. We need to look more closely at what the performance ratings actually represent.
The Rivalry
We are in an interesting chapter in the ongoing quest for greater supercomputing speeds. Not only are we on the threshold of petascale computing, but the reigning supercomputer champion, the IBM Blue Gene, is also now facing serious competition from a contender just reentering the supercomputing marketplace: Sun Microsystems, with its new Constellation supercomputer. So the quest involves not only designer vs. silicon, but also designer vs. designer.
The TOP500 list is compiled and announced twice a year, in June at the International Supercomputing Conference and in November at the SC conference (SC06, SC07, and so on). On the most recent list, announced last June at the International Supercomputing Conference in Dresden, Germany, the world’s fastest supercomputer was the IBM Blue Gene/L, rated at 280.6 teraflops. (It’s been number one four times in a row, ever since its debut in 2004.) Sun was nowhere to be seen in the top-ten entries on the list. The new Sun Constellation, however, is clearly intended to be a Blue Gene/L killer. (It’s designed to be three times faster than a similarly configured Blue Gene/L.) But in the meantime, IBM is furiously working on the next Blue Gene generation, the Blue Gene/P. A fully loaded Blue Gene/P (assuming that anyone can afford to commission one) is designed to run about 75% faster than a fully loaded Constellation (approximately 3 petaflops vs. 1.7 petaflops).
The Blue Gene/P and the Constellation were both announced at the June International Supercomputing Conference. The first Blue Gene/P and Constellation supercomputers are currently under construction. The completion dates are uncertain, although the installations are targeted for late 2007 or early 2008. The initial machines will have relatively minimal configurations.
The first Blue Gene/P is slated for installation at the U.S. Department of Energy’s Argonne National Laboratory, and will be a 111-teraflop system. The first Sun Constellation is destined for the Texas Advanced Computing Center (TACC) at the University of Texas in Austin, and is designed to run at 504 teraflops.
So whether we see an upset—perhaps only a temporary one—in the number one computer on the TOP500 list will depend upon the actual system completion dates, and watching the list in the next year or so should be interesting.
The Architectures
A common theme in the architectures of modern supercomputers, such as the Blue Gene/P and the Constellation, is that performance is achieved not by designing some sort of high-clock-rate “superprocessor,” but rather by linking a massive number of relatively low-speed commodity processors and by maximizing the speed of the connections between these processors and other components such as memory and disk storage.
Such an architecture obviously lends itself to applications that can be broken up into separate parallel tasks (each one running as a separate process or thread). This is another issue to consider when evaluating a supercomputer’s speed rating, as discussed above. A petascale computer is certainly not one that can execute a single stream of floating-point operations at a petaflops speed! Achieving the full flops rating means breaking up the running application(s) into as many separate tasks as there are processor cores in the system (and as you’ll see, there can be a lot of them). This is the same programming challenge that is now facing desktop computing due to the emergence of multicore chips.
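The decomposition described here can be sketched as follows. This is a minimal illustration with hypothetical function names: a floating-point workload (a sum of squares) is split into independent chunks, each chunk is handled by its own task, and the partial results are combined at the end. It uses Python threads purely to show the structure, which is an assumption made for simplicity. A real supercomputer application would distribute the chunks across processor cores on many nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, n_tasks):
    """Divide values into n_tasks contiguous chunks of near-equal size."""
    base, extra = divmod(len(values), n_tasks)
    chunks, start = [], 0
    for i in range(n_tasks):
        # The first `extra` chunks each take one additional element.
        size = base + (1 if i < extra else 0)
        chunks.append(values[start:start + size])
        start += size
    return chunks


def sum_of_squares(chunk):
    """The independent floating-point work performed by one task."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(values, n_tasks):
    """Run one task per chunk, then combine the partial results."""
    chunks = split_into_chunks(values, n_tasks)
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


values = [float(i) for i in range(1000)]
print(parallel_sum_of_squares(values, 4))
```

The sketch works only because no chunk depends on another. Applications whose tasks must constantly exchange data gain far less from adding cores, which is why the connections between processors matter so much.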
The moderate clock speed of the individual processor cores allows the chips to use less power and creates a more energy-efficient supercomputer. IBM claims that the Blue Gene/P is seven times more energy efficient than any other supercomputer. However, “energy efficient” is a highly relative term. IBM also says that a one-petaflops system will consume about 2.9 megawatts, the approximate power consumption of 2,900 homes (and this figure doesn’t include the significant energy requirements for cooling the system).
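The power figures quoted above can be turned into two rough ratios with a little arithmetic. The sketch below uses only the numbers in this article (1 petaflops, 2.9 megawatts, 2,900 homes). The function names are hypothetical, and the resulting flops-per-watt value is a derived estimate, not a figure published by IBM.

```python
SYSTEM_FLOPS = 1e15   # one petaflops, as quoted
SYSTEM_WATTS = 2.9e6  # 2.9 megawatts, as quoted (cooling excluded)
HOMES = 2900          # number of homes in the comparison


def watts_per_home(system_watts, homes):
    """Average household draw implied by the homes comparison."""
    return system_watts / homes


def megaflops_per_watt(system_flops, system_watts):
    """Derived efficiency: millions of flops delivered per watt consumed."""
    return system_flops / system_watts / 1e6


print(watts_per_home(SYSTEM_WATTS, HOMES))
print(round(megaflops_per_watt(SYSTEM_FLOPS, SYSTEM_WATTS)))
```

In other words, the comparison assumes about one kilowatt per home, and the system delivers on the order of a few hundred megaflops for every watt it draws.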
Both IBM and Sun have also worked to increase component densities, thereby reducing the floor space required for their new architectures. However, high density is also a relative term. According to Sun’s Jonathan Schwartz, the total size of the Constellation computing facility in Texas will be about half the size of an NBA basketball court.
The initial installations of both systems will run the Linux operating system.
The IBM Blue Gene/P

Researcher working on a Blue Gene/P. Photo from IBM.
Jack Dongarra, of TOP500 fame, likens the Blue Gene/P to the Hubble telescope. The architecture has a peak capability of 3 petaflops, but a more realistic estimate for continuous operation in real-world situations is 1 petaflops.
The Blue Gene/P chip consists of four PowerPC 450 processor cores running at 850 MHz. A 2-foot by 2-foot circuit board contains 32 chips, and a 6-foot-high rack holds 32 circuit boards. An entire Blue Gene/P system, which is known as a cluster, consists of multiple racks tied together. A 1-petaflops system needs 72 racks (294,912 processor cores) and a 3-petaflops system needs 216 racks (884,736 processor cores!). Racks can be added as requirements expand (and budget allows).
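The core counts above follow directly from the chip, board, and rack figures, as this short check shows. The constants come from this article, while the function names are hypothetical. The last function derives a value the article does not state: how many floating-point operations each core would have to complete per clock cycle for 72 racks to reach 1 petaflops.

```python
# Figures quoted in the article for the Blue Gene/P.
CORES_PER_CHIP = 4
CHIPS_PER_BOARD = 32
BOARDS_PER_RACK = 32
CLOCK_HZ = 850e6


def cores_per_rack():
    """Cores in one rack: cores per chip x chips per board x boards per rack."""
    return CORES_PER_CHIP * CHIPS_PER_BOARD * BOARDS_PER_RACK


def total_cores(racks):
    """Cores in a cluster made up of the given number of racks."""
    return racks * cores_per_rack()


def implied_flops_per_cycle(system_flops, racks):
    """Derived estimate: flops each core must complete per clock cycle."""
    per_core_flops = system_flops / total_cores(racks)
    return per_core_flops / CLOCK_HZ


print(cores_per_rack())    # 4096
print(total_cores(72))     # 294912
print(total_cores(216))    # 884736
print(implied_flops_per_cycle(1e15, 72))
```

The implied rate comes out very close to four operations per core per cycle, which suggests that the petaflops ratings assume every core is kept fully busy on every cycle.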
IBM has maximized the speed of the connections between the chips and other components by using a high-speed optical network. (IBM is less forthcoming about their interconnection design than Sun.)
The Sun Constellation

A Constellation blade rack. Photo from Sun Microsystems.
The Constellation architecture has a potential peak capacity of 1.7 petaflops, although the first model being built in Texas will clock in at “only” 504 teraflops.
The Constellation architecture supports various chips, including the Sun UltraSparc as well as chips from Intel. The Texas model uses four-core AMD Barcelona chips. (The Barcelona runs at clock frequencies up to 2 GHz.) A blade contains 4 chips, and a rack holds 48 blades (a blade is a narrow module that can be plugged into a rack). A complete Constellation system is a cluster consisting of multiple connected racks. The Texas model has 82 racks, for a total of 62,976 processor cores.
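The total core count of the Texas system is just the product of the levels in this hierarchy. The sketch below multiplies them out and, as a derived estimate that is not stated in the article, works out the per-core rate implied by the 504-teraflops rating at the 2 GHz maximum clock frequency. The variable and function names are hypothetical.

```python
from math import prod

# Hierarchy of the Texas Constellation, as quoted in the article.
HIERARCHY = {
    "cores per chip": 4,
    "chips per blade": 4,
    "blades per rack": 48,
    "racks": 82,
}
SYSTEM_FLOPS = 504e12  # 504 teraflops
CLOCK_HZ = 2e9         # Barcelona clock frequency, up to 2 GHz


def total_cores(hierarchy):
    """Multiply the counts at every level of the hierarchy."""
    return prod(hierarchy.values())


def implied_flops_per_cycle(system_flops, cores, clock_hz):
    """Derived estimate: flops each core must complete per clock cycle."""
    return system_flops / cores / clock_hz


cores = total_cores(HIERARCHY)
print(cores)  # 62976
print(implied_flops_per_cycle(SYSTEM_FLOPS, cores, CLOCK_HZ))
```

The implied figure is again about four operations per core per cycle, so the rating appears to be based on each core running at its maximum rate.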

A Constellation switch. Photo from Sun Microsystems.
Sun has maximized the speed of the connections between the chips and other components by using a novel switching architecture that is a greatly enhanced version of the InfiniBand high-speed networking technology. Many blades connect directly to a very large, central switch (called the Magnum switch) in a star-like pattern (hence the name “Constellation,” according to one source). Because the switches are so large (with 3,456 ports each), only a few switches are required (the Texas model has just 2). The small number of switches means that individual blades are connected much more directly (there are fewer intermediate switches that data must pass through), which significantly reduces the interconnection latency. Also, having fewer switches lessens the amount of expensive cable and floor space that is required for the system.
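A quick calculation suggests why the Texas system needs two switches rather than one. The sketch assumes, purely for illustration, that every blade occupies exactly one switch port; the article does not describe the actual wiring, and the function names are hypothetical. The rack, blade, and port counts are the ones quoted in this article.

```python
import math

BLADES_PER_RACK = 48
RACKS = 82
PORTS_PER_SWITCH = 3456


def total_blades(racks, blades_per_rack):
    """Number of blades that need a connection to a central switch."""
    return racks * blades_per_rack


def minimum_switches(blades, ports_per_switch):
    """Fewest switches needed, assuming one port per blade (an assumption)."""
    return math.ceil(blades / ports_per_switch)


blades = total_blades(RACKS, BLADES_PER_RACK)
print(blades)                                      # 3936
print(minimum_switches(blades, PORTS_PER_SWITCH))  # 2
```

Under that assumption, the 3,936 blades slightly exceed the 3,456 ports of a single switch, which is consistent with the two switches in the Texas installation.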
The Applications
As the cost per teraflops falls, supercomputers are increasingly being used in the commercial sector for product design, virtual prototyping, economic forecasting, and other purposes. However, their chief application is still solving grand challenge science and engineering problems. For example, the latest supercomputers are being used to
- simulate and forecast worldwide weather patterns and climate changes
- study the impact of global warming
- explore the formation and evolution of galaxies in the early universe
- learn the properties of proteins at the atomic scale
- model the human brain
- analyze the properties of elementary particles
- map genes and discover treatments for genetic diseases
- determine the state of nuclear devices
- design technology to be more energy efficient
- run simulated trials for new drugs
According to Jack Dongarra, computing is becoming a third pillar of science, on a par with theory and experiment. Although we already have terascale supercomputers that are being applied to advanced scientific problems, the new petascale generation of supercomputers will be able to look at more variables and examine scientific phenomena at finer levels of resolution, thereby extending the questions we can ask into new and undreamed-of realms.