Future Programming—Art or Engineering?

September 24th, 2007

The emergence of multicore computer systems—ranging from portable and desktop computers to petascale supercomputers—is placing new demands on computer software, and is catalyzing a reexamination of the fundamental methods of computer programming.

When computer architects focused on increasing the speed of single processors, software performance improvements accrued automatically. But to take advantage of multiple cores, programmers must explicitly divide the code into multiple (and possibly conflicting) threads of execution that all run in parallel.

This is a non-trivial programming task. And the problem of creating languages and tools to make parallel programming easy and reliable is one that David Patterson, University of California computer science professor, feels will require a new Manhattan Project to solve.

One of the fundamental questions that this reexamination of programming techniques is raising is the perennial one of whether computer programming is or should be a form of art or a type of engineering.

Gordon Morrison’s Proposal

This article was prompted by a news item I read in EETimes (“Inventor forges fresh approach to writing software”) about Gordon Morrison’s announcement of his Coherent Object Software Architecture (COSA) approach to writing parallel computer programs.

Morrison is a long-time, independent computer architect who funds his work using royalties derived from computer architecture patents that are licensed to several large electronics firms (including IBM, Motorola, and Intel).

His COSA system is a highly systematic, parallel programming approach based upon state machines, tree data structures, and mathematical formulas. Its purpose is to simplify and bring more rigor to parallel software development, thereby reducing development costs and even eliminating the use of languages such as Java or C++ (which he considers too low-level). In Morrison’s own words: “Ten programmers could not produce the same program to solve a problem because programming is an artful approach. I want to eliminate the art and make it an engineering approach.” Regarding contemporary software project management, he has also stated that “coding standards are left to the artistic nature of the developer, hence the term herding cats.”

In other words, Morrison feels that computer programming is currently an art form (meaning unsystematic and unpredictable), but should be converted to engineering (meaning systematic and predictable).

The big questions this statement raises, however, are whether computer programming should be fully “engineerized,” whether the ten programmers in his example should produce the same program to solve a given problem, or whether a systematic, engineering approach might produce predictable, adequate programs yet stifle the independent creativity required to produce truly great programs.

Without attempting to answer this fundamental question, I would simply like to juxtapose the interesting story of Kazushige Goto, plus a few words from two renowned computing experts.

The Goto Story

Kazushige Goto works at the Texas Advanced Computing Center at the University of Texas, Austin. He has no formal training in computer programming and, mainly for fun, taught himself high-performance programming during his free time. His business card reads simply “high performance computing.”

Goto has developed a set of linear algebra subroutines called the Goto Basic Linear Algebra Subroutines, or BLAS, which in 2003 were used on 7 of the fastest 10 supercomputers as they competed for ranking on the TOP500 list. He hand optimizes his code for each system, meticulously reordering the instructions using his native skill, well-honed intuition, and—in his own words—trial and error. His hand-crafted code is consistently faster than that generated using either automated approaches or teams of developers. It has become legendary.

Goto’s story certainly sounds like the triumph of art over engineering, rare genius over standardized, automated development tools.

(See “Writing the Fastest Code, by Hand, for Fun: A Human Computer Keeps Speeding Up Chips.”)

A Few Words from the Masters

I also came across an article by Jim Sexton, IBM lead scientist for Blue Gene Applications (“Petascale Era Will Force Software Rethink”). His viewpoint is the exact opposite of Morrison’s. He feels that the challenges of parallel architectures actually demand a more artful approach to computer programming. In his own words, “On parallel systems, programming has changed from being a routine technical effort to being a creative art form.”

And finally, I discovered an article from Stanford Magazine (“Love at First Byte”) about Donald Knuth, professor emeritus of computer science at Stanford and one of the founders of academic computer science. Although Knuth was originally educated as a mathematician, a quote from this article reveals a very artful approach to programming. He says, “Like a poet has to write poetry, I wake up in the morning and I have to write a computer program.” It is no wonder that he named his multi-volume masterwork The Art of Computer Programming.

The Quest for Petascale Computing

September 15th, 2007

bluegenerack.jpg
Photo from IBM.

It’s nice to know that our desktop computers, and maybe even our cell phones, have more power than a room full of vacuum tubes and wires from the early days of computing. But now imagine a computer that is 100,000 times more powerful than a typical personal computer, that can perform more operations per second than a stack of laptops a mile and a half high, and that can solve a problem in a day that would take a PC a lifetime. These are the touted capabilities of the next generation of supercomputers, the petascale supercomputers, which are now under construction. Petascale computing represents the next rung in the ongoing climb towards faster and faster supercomputing, and these machines will be approximately 1,000 times faster than their predecessors.

The exciting thing about these machines, other than the sheer magnitude of the technical achievement, is that they can solve problems and answer questions that could never have been considered before—the “grand challenge” problems in science, engineering, and other fields.

What is a Petascale Supercomputer?

The short answer is that a petascale supercomputer is one that can perform at least a quadrillion (1,000,000,000,000,000) floating-point operations per second. A quadrillion floating-point operations per second is often abbreviated as petaflops. (Similarly, a trillion floating-point operations per second is abbreviated as teraflops, a million operations per second as megaflops, and single operation per second as flops.) A floating-point operation is an arithmetic process such as an addition or multiplication on numbers that are stored in floating-point format (this is the most common format used to store real numbers—that is, numbers that may have a decimal component, such as 123.45).

A problem with computer performance ratings, however, is that they can be somewhat vague or misleading. Say a manufacture has rated a supercomputer at 1 petaflops. Great. But which floating point operations were performed and in what application context? As Hennessy and Patterson point out in their venerable Computer Architecture textbook, the most meaningful performance metric is the speed at which a computer executes the applications you actually run. However, to have a universal metric for comparing supercomputer performance, it is useful to test all computers using a common benchmark program.

In 1993, Jack Dongarra, a computer scientist at the University of Tennessee, started compiling the now prestigious and universally recognized TOP500 list, which ranks supercomputers according to their performance. To measure performance, TOP500 uses a benchmark program known as Linpack, which solves a system of linear equations, and—for a given problem size—involves a known number of floating point operations (specifically, additions and multiplications). Therefore, measuring the time it takes a supercomputer to run Linpack for a given problem size tells us how many flops it can perform when running Linpack (not necessarily when running some other type of application).

So when a computer manufacture claims a certain number of flops for their machine, they could be referring to performance on the Linpack benchmark or possibly on another benchmark. Or, worse, they could be referring to a theoretical, calculated peak number of flops based on the maximum number of operations that can possibly be executed during a machine cycle, rather than the performance that is measured when the computer actually runs a specific application. (In fact, the TOP500 list displays such a theoretical value, known as Rpeak, but uses the actual performance on the benchmark, Rmax, to rank the computers.)

Even if we know that the flops rating is based on a benchmark such as Linpack, Andy Bechtolsheim (the chief designer of the Sun Constellation) has pointed out that some machines can excel on a benchmark involving floating-point calculations, but then slow to a crawl when running programs that involve moving significant amounts of data between processors, as many applications do.

The main point is that in the quest for supercomputing speed, announcements of reaching certain landmarks, such as petascale computing, should be taken lightly and we need to look more closely at what the performance ratings actually represent.

The Rivalry

We are in an interesting chapter in the ongoing saga for greater supercomputing speeds. Not only are we on the threshold of petascale computing, but also the reigning supercomputer champion, the IBM Blue Gene, is now facing serious competition from a contender just reentering the supercomputing marketplace, Sun Microsystems with their new Constellation supercomputer. So the quest involves not only designer vs. silicon, but also designer vs. designer.

The TOP500 list is compiled and announced twice a year, in June at the International Supercomputing Conference and in November at the SC conference (SC06, SC07, and so on). On the most recent list, announced last June at the International Supercomputing Conference in Dresden, Germany, the world’s fastest supercomputer was the IBM Blue Gene/L, rated at 280.6 teraflops. (It’s been number one four times in a row, ever since its debut in 2004.) Sun was nowhere to be seen in the top-ten entries on the list. The new Sun Constellation, however, is clearly intended to be a Blue Gene/L killer. (It’s designed to be three times faster than a similarly configured Blue Gene/L.) But in the meanwhile, IBM is furiously working on the next Blue Gene generation, the Blue Gene/P, and a fully-loaded Blue Gene/P—assuming that anyone can afford to commission one—is designed to run about 75% faster than a fully-loaded Constellation (approximately 3 petaflops vs. 1.7 petaflops).

The Blue Gene/P and the Constellation were both announced at the June International Supercomputing Conference. The first Blue Gene/P and Constellation supercomputers are currently under construction. The completion dates are uncertain, although the installations are targeted for the late 2007 or early 2008 timeframe. The initial machines will have relatively minimal configurations.

The first Blue Gene/P is slated for installation at the U.S. Department of Energy’s Argonne National Laboratory, and will be a 111-teraflop system. The first Sun Constellation is destined for the Texas Advanced Computing Center (TACC) at the University of Texas in Austin, and is designed to run at 504 teraflops.

So whether we see an upset—perhaps only a temporary one—in the number one computer on the TOP500 list will depend upon the actual system completion dates, and watching the list in the next year or so should be interesting.

The Architectures

A common theme in the architectures of modern supercomputers, such as the Blue Gene/P and the Constellation, is that performance is achieved not by designing some sort of high-clock-rate “superprocessor,” but rather by linking a massive number of relatively low-speed commodity processors and by maximizing the speed of the connections between these processors and other components such as memory and disk storage.

Such an architecture obviously lends itself to applications that can be broken up into separate parallel tasks (each one running as a separate process or thread). This is another issue to consider when evaluating a supercomputer’s speed rating, as discussed above. A petascale computer is certainly not one that can execute a single stream of floating-point operations at a petaflops speed! Achieving the full flops rating means breaking up the running application(s) into as many separate tasks as there are processor cores in the system (and as you’ll see, there can be a lot of them). This is the same programming challenge that is now facing desktop computing due to the emergence of multicore chips.

The moderate clock speed of the individual processor cores allows the chips to use less power and creates a more energy efficient supercomputer. IBM claims that the Blue Gene/P is seven times more energy efficient than any other supercomputer. However, energy efficient is a highly relative term. IBM also says that a one-petaflops system will consume about 2.9 megawatts, the approximate power consumption of 2,900 homes (and this figure doesn’t include the significant energy requirements for cooling the system).

Both IBM and Sun have also worked to increase component densities, thereby reducing the floor space required for their new architectures. However, high density is also a relative term. According to Sun’s Jonathan Schwartz, the total size of the Constellation computing facility in Texas will be about half the size of an NBA basketball court.

The initial installations of both systems will run the Linux operating system.

The IBM Blue Gene/P

bluegenep.jpg
Researcher working on a Blue Gene/P. Photo from IBM.

Jack Dongarra, of TOP500 fame, likens the Blue Gene/P to the Hubble telescope. The architecture has a peak capability of 3 petaflops, but a more realistic estimate for continuous operation in real-world situations is 1 petaflops.

The Blue Gene/P chip consists of four PowerPC 450 processor cores running at 850 MHz. A 2 foot by 2 foot circuit board contains 32 chips, and a 6-foot high rack holds 32 circuit boards. An entire Blue Gene/P system, which is known as a cluster, consists of multiple racks tied together. A 1-petaflops system needs 72 racks (294,912 processor cores) and a 3-petaflops system needs 216 racks (884,736 processor cores!). Racks can be added as requirements expand (and budget allows).

IBM has maximized the speed of the connections between the chips and other components by using a high-speed optical network. (IBM is less forthcoming about their interconnection design than Sun.)

The Sun Constellation

constellation.jpg
A Constellation blade rack. Photo from Sun Microsystems.

The Constellation architecture has a potential peak capacity of 1.7 petaflops, although the first model being built in Texas will clock in at “only” 504 teraflops.

The Constellation architecture supports various chips, including the Sun UltraSparc as well as chips from Intel. The Texas model uses four-core AMD Barcelona chips. (The Barcelona runs at clock frequencies up to 2 GHz.) A blade contains 4 chips, and a rack holds 48 blades (a blade is a narrow module that can be plugged into a rack). A complete Constellation system is a cluster consisting of multiple connected racks. The Texas model has 82 racks, for a total of 62,976 processor cores.

constellationswitch.jpg
A Constellation switch. Photo from Sun Microsystems.

Sun has maximized the speed of the connections between the chips and other components by using a novel switching architecture that is a greatly enhanced version of the InfiniBand high-speed networking technology. Many blades connect directly to a very large, central switch (called the Magnum switch) in a star-like pattern (hence the name “Constellation,” according to one source). Because the switches are so large (with 3,456 ports each), only a few switches are required (the Texas model has just 2). The small number of switches means that individual blades are connected much more directly (there are fewer intermediate switches that data must pass through), which significantly reduces the interconnection latency. Also, having fewer switches lessens the amount of expensive cable and floor space that is required for the system.

The Applications

As the cost per teraflops falls, supercomputers are increasingly being used in the commercial sector for product design, virtual prototyping, economic forecasting, and other purposes. However, their chief application is still solving grand challenge science and engineering problems. For example, the latest supercomputers are being used to

  • simulate and forecast worldwide weather patterns and climate changes
  • study the impact of global warming
  • explore the formation and evolution of galaxies in the early universe
  • learn the properties of proteins at the atomic scale
  • model the human brain
  • analyze the properties of elementary particles
  • map genes and discover treatments for genetic diseases
  • determine the state of nuclear devices
  • design technology to be more energy efficient
  • run simulated trials for new drugs

According to Jack Dongarra, computing is becoming a third pillar of science, on a par with theory and experiment. Although we already have terascale supercomputers that are being applied to advanced scientific problems, the new petascale generation of supercomputers will be able to look at more variables and examine scientific phenomena at finer levels of resolution, thereby extending the questions we can ask into new and undreamed of realms.

Links

“New IBM supercomputer achieves petaflop”

“IBM unveils BlueGene/P supercomputer to break Petaflop barrier”

“Which supercomputers rule?”

“Blue Gene”

“IBM Triples Performance of World’s Fastest, Most Energy-Efficient Supercomputer”

“Sun eyes supercomputing glory”

“Texas-Sized Supercomputer to Break Computing Power Record”

“Billionaire Thinks in Trillions for His Computer Designs”

“Sun to challenge IBM with new “Constellation” supercomputer”

“TOP500 Supercomputer Sites”

“Constellation Petascale Computing”

“Switching Subjects”

The Ancient Game of Go Succumbs to the Monte Carlo Method

September 7th, 2007

goboardwikipedia.jpg
from Wikipedia

An article in the Seattle Times paints a vivid picture of the fascination of computer programs that play the ancient game of Go. One evening in 2006, Peter Drake, an assistant professor of Computer Science at Portland’s Lewis & Clark College stays late in his office and begins adjusting the algorithms in his computer Go program, known as Orego. The next morning, he is still staring at the computer in his office.

What is so exciting about this particular computer program that it motivates a grown man to put in an all-nighter? It turns out that Drake’s most recent strategy, known as the Monte Carlo method, coupled with his latest algorithmic tactics, are finally allowing Orego to beat a tough competing program in a challenge over the Internet.

Go

The game of Go originated more than 2,000 years ago in China. One player uses black stones, the other white. Players lay stones on a grid, traditionally 19 x 19 lines, in an attempt to capture the most territory. The rules are simple, but mastering the game is not.

Writing a computer program to play Go is especially challenging. Although recent computer versions of chess can beat even world champions, writing a computer Go program that can compete against expert players is still one of the grand challenges in the field of artificial intelligence.

Why are computer Go programs so hard to write?

A traditional game program chooses the next move by constructing a tree representing all possible next moves, followed by all possible countermoves, and then all possible countermoves to the countermoves, and so on. Although the tree broadens very rapidly, it’s often possible to “prune the tree” by evaluating potential intermediate game states without going all the way to the end of the game. (For example, in chess a non-end game state might be evaluated by simply counting the pieces on both sides.)

Go has two problems with this approach. First, because of the large size of the board and the huge number of allowed moves, the branching factor in a game tree would much larger than in chess and its size would soon grow out of control. Also, in Go it is much less feasible to stop building the tree before the endgame, because it is much more difficult to evaluate non-end game states. (I’m not a Go player, but I’ve read that this has something to do with effectively “dead” stones remaining on the board for many moves.)

The superiority of humans over computers in Go might also be explained in more humanistic terms. Go challenges computer algorithms because it requires broad pattern matching skills and a relatively intuitive approach, while chess favors an algorithmic solution because the focus is more localized (on the capture of a single piece) and the approach is more logical. The game of Go can demand such a whole-body, visceral level of concentration that the fictional Go master in Trevanian’s novel Shibumi eventually develops stomach cancer due to the constant strain of it.

goboardmattryall.jpg
“My Go Board” by Matt Ryall

Enter Monte Carlo

A recent and surprisingly effective breakthrough in computer Go is the use of the clever and conceptually simple Monte Carlo method. On small-sized boards (9 x 9 lines), Monte Carlo based Go programs can now challenge human experts. (Although Monte Carlo Go on a 19 x 19 board is not yet a strong contender against either computer or human opponents.)

The most famous early use of the Monte Carlo method was for running simulations during the development of the atom bomb in the Manhattan Project at Los Alamos, New Mexico. According to Stanislaw Ulam, a Manhattan Project researcher and famous mathematician, the Monte Carlo method was named in honor of his uncle, an inveterate gambler. (See the fascinating autobiography Adventures of a Mathematician by Stanislaw Ulam.)

How does the Monte Carlo method work?

Sometimes it’s easy to calculate the probability of a given event. For example, the probability of getting two heads with two flips of a coin is simply 1/2 * 1/2 = 1/4. Other times, however, it’s difficult or impossible to calculate the probability of a particular outcome of an experiment. In this case, we can use the Monte Carlo method to actually do the experiment many times (or at least simulate it) and find out the probability directly.

An example of a probability problem that is somewhat cumbersome to calculate is the classic puzzle known as Galileo’s Dice. According to legend, a gambler once ask Galileo to find the probability of getting a sum of 9 when three dice are rolled. To solve this problem using the Monte Carlo approach, we can write a program that uses a random number generator to simulate throwing three dice, and repeat that simulation maybe 1000 times. The number of times we get 9, divided by 1000, is the observed relative frequency of getting 9. If the simulated experiment is repeated a large number of times (1000 is a good choice), the observed relative frequency is an accurate point estimate for the actual probability of 9.

When a Monte Carlo Go program chooses the next play, it uses the Monte Carlo method to estimate, for each possible move, the probability that the move will result in a winning game. It then simply chooses the move that has the highest estimated probability of leading to a win. The probability estimate for each possible move works just like the dice example—but instead of simulating three dice throws to estimate the probability of getting a 9, it simulates a sequence of subsequent Go moves until the hypothetical game ends, to estimate the probability of winning with that move. The simulation might be repeated 100 or 1000 times. (The more times the better, but the rules of play might limit the permissible response time.)

A big advantage of a pure Monte Carlo approach is that it is based simply on random numbers and simulations, and the program does not need to use complex if-then rules based on expert knowledge specific to the game of Go. However, more effective Monte Carlo Go programs do incorporate a limited amount of expert Go knowledge in the form of heuristics (i.e. rules-of-thumb). Using heuristics is time consuming so a program can’t use too many.

The Monte Carlo method makes heavy use of memory and processing cycles. However, according to Remi Coulom, a pioneer in the development of Monte Carlo Go, the algorithms are easy to parallelize and can therefore take good advantage of today’s multi-core processors.

The Significance of Computer Go Research

It seems that serious computer-science types involved in game programming often feel a need to justify their fixation by describing the real-world problems that game research might solve. In the case of Go, however, it’s likely that the challenges encountered in game programming might truly resemble problems that are found in a high-entropy, complex, non-black-and-white world (more so than the challenges encountered in programming a more neatly logical game such as chess). Examples of these real-world problems suggested by computer Go researchers include designing electronic circuits, building self-driving cars, targeting web advertising, optimizing channel allocation in cellular systems, and determining the best sites for industrial plants. The techniques developed in computer Go programming might also transfer to other academic fields—such as cognitive science, pattern recognition, and machine learning—to a greater degree than chess programming techniques.

So creating a winning Go program is not only a grand challenge in pure artificial intelligence, but might also have real-world and academic payoffs.

Links

“What’s black, white and can stump computers?”

“AI Invades Go Territory”

“Orego”

“Computer Go”

“Monte Carlo Method”