The Mathematics Extreme Computation Cluster At Harvard

In Spring 2002 I assembled and configured MECCAH, which is a
rack-mounted cluster of six fast computers for use by mathematicians
who are doing demanding computational work. This article is about my
experience building and maintaining MECCAH. It should be of use to
anyone considering undertaking or funding a similar project at their

When I was a graduate student at Berkeley and then a faculty member at
Harvard, the computational resources available to me at my host
universities consisted of scattered Sun workstations running at about
one-fourth the raw speed of the current Pentium processors. These
machines spent much of their time running Netscape and TeXing
documents, so they were not suitable for demanding computations that
could easily use all available resources. Sure, at each institution a
senior faculty member had a powerful computer (McMullen at Berkeley,
Elkies at Harvard), but that was for his own personal use.

In 2001, the Harvard sysadmin, Arthur Gaer, mentioned that the
department was tentatively considering spending several tens of
thousands of dollars (that they didn't yet have) on a single
multi-processor Sun workstation to support computation-intensive work.
My opinion was that such a workstation would be solid but hardly
useful; the raw computational power would scarcely touch what two
cheap Intel-based Linux boxes could do, though the Linux boxes would
likely be less reliable.

I decided to build a cluster of dual processor machines running Linux.
I did research and discussed possible configurations with Berkeley
grad student Wayne Whitney and a Harvard undergrad named Alex Healy,
and requested money. Finally, I secured a grant of $6000 from
Harvard, and Harvard alumnus William Randolph Hearst III gave me an
additional $14000, which made the budget $20000.

I decided to assemble an Athlon-based system. The Athlon 2000MP is a
multi-processor-ready Pentium-like CPU that AMD claims has
performance similar to that of a 2GHz Pentium 4. I selected the
Athlon 2000MP processor in March because it was the fastest
budget-priced multi-processor capable CPU on the market. Intel's only
fast multi-processor capable CPU was the Xeon, which was then much
more expensive (the Xeon might be a good choice today). Six months
later, AMD has just announced the 2200MP, so I don't feel like the
Athlon 2000MPs are out of date.

In February 2002, I ordered first one, then five more, custom-built
Athlon 2000MP machines in 2U-sized rack-mount cases from
http://www.pcsforeveryone.com/, which is a local Cambridge computer
shop. They ordered the parts I wanted, assembled them, tested them,
found surprisingly often that the parts were defective, got
replacements, and finally delivered the individual computers. I still
have occasional hardware reliability problems with two of the nodes,
but they are a

Unwrapping the rack and putting the computers in it took Alex Healy a
full afternoon. Once the cluster was assembled, I had to keep it in my
office, because the math department's server closet was tiny and
already full of equipment. It would be several months until we made
room in the server closet for the cluster. In the meantime, I kept a
rack of noisy and hot computers running in my office. When students
came to see me during office hours, they had to shout over the 30
cooling fans in MECCAH.

And the circuit breaker kept tripping! My neighbor's office is on the
same circuit as mine, and when he returned from vacation and turned
his computer on, the breaker tripped, so I had to call the
electricians out to reset it. I went back to running only four
machines, then once increased to five, which tripped the breaker
again.

MECCAH's operating system is Red Hat 7.2 with Linux kernel version
2.4.16 on all six nodes. MECCAH also uses openMosix, which makes the
rack of six computers appear to the user as a single computer with 12
processors and 13GB memory (though a single process should not use
more memory than is available on any one node). Under openMosix, jobs
are automatically migrated from one node to another to dynamically
balance the overall system load. Users have accounts and login
privileges only on the master node, and never need to worry about
logging into other nodes. I also configured MECCAH to use the ext3
journaling file system, so, e.g., I can pull the plug from the wall,
plug it back in, and have MECCAH back up in five minutes with
absolutely no file system corruption.

For computations, people mainly use MAGMA, PARI, Python, C++, and
Mathematica. Though Harvard has a Mathematica site license, I HATE
administering Mathematica because the licenses regularly expire and
limit the number of copies of Mathematica that can be run at once
(there should be a way around the latter problem). MAGMA for Linux,
on the other hand, requires no license and is free to me because I'm a
MAGMA developer. Evidently, Maple is expensive, so we have only a
limited Sun license for Maple in the math department.

Here is how I organize computation of a basis for the space of modular
forms with level N and weight 2 for N between 1 and 1000. I run 12
jobs simultaneously, each of which looks for the next level that
hasn't been computed, computes that level, and saves the result. If it
took 1 day to do this computation on my 1GHz Pentium III last year, it
will take only 1 hour to do it on MECCAH. When I am in the throes of a
big computation, having this kind of computational resource available
to me is extremely exciting. Instead of waiting 1 day, I wait only an
hour to generate more than enough data to stimulate theorem proving!

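The "look for the next uncomputed level" loop described above can be
sketched roughly as follows. This is only an illustration in Python,
not the code I actually run: the file names, the shared results
directory, and the compute_basis function are all hypothetical
placeholders, and the real basis computation would be done by
something like MAGMA. The sketch assumes all 12 jobs see the same
directory and that exclusive file creation is atomic there.

```python
import os


def claim_next_level(workdir, max_level=1000):
    """Claim the smallest level that has no marker file yet.

    Returns the claimed level, or None once every level is taken.
    """
    for level in range(1, max_level + 1):
        marker = os.path.join(workdir, f"level_{level:04d}.claimed")
        try:
            # O_CREAT | O_EXCL fails if the marker already exists,
            # so at most one job can claim any given level.
            fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue
        os.close(fd)
        return level
    return None


def compute_basis(level):
    """Hypothetical stand-in for the real modular forms computation."""
    return f"placeholder basis for level {level}, weight 2"


def run_worker(workdir, max_level=1000, compute=compute_basis):
    """Claim, compute, and save levels until none remain.

    Returns the list of levels this worker handled.
    """
    done = []
    while True:
        level = claim_next_level(workdir, max_level)
        if level is None:
            return done
        result = compute(level)
        out = os.path.join(workdir, f"level_{level:04d}.result")
        with open(out, "w") as f:
            f.write(result)
        done.append(level)
```

Starting 12 copies of a script that calls run_worker on the same
directory would mirror the setup described above. Because each job
takes whichever level is free next, no job sits idle while another is
stuck on a slow level.
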
I've given MECCAH accounts to nearly 80 mathematicians all over the
world. Abuse of the system by users is rare but not unheard of.
Somewhat surprisingly, the usage pattern comes in bursts. There are
almost always at least two or three jobs running, but every so often
many mathematicians simultaneously become inspired to run lots of
computations all at once.

I am the only systems administrator of MECCAH, and I typically spend
under five hours a week on administrative responsibilities. I still
haven't upgraded the Linux kernel or openMosix since March, but I
probably should since there have been a few unexplained problems that
might be fixed by a Linux and openMosix upgrade. I use a 30GB
Onstream ADRx2 tape drive to make regular backups.

If I were to build a similar cluster from scratch again, I would
probably buy more expensive and better warrantied pre-configured
dual-processor rack-mount nodes instead of custom designing the nodes
myself. I definitely would not keep the cluster in my office.
When first designing MECCAH, I thought long about whether to stack a
bunch of conventional cases on shelves or to buy a rack and
rack-mount cases. A rack costs nearly $1000 and rack-mount cases cost
more than double what ordinary cases cost. In retrospect, it would
have been madness to buy conventional cases and shelves, because I've
had to move the cluster around many times, and it barely fits in the
tiny server closet. The $1500 premium for a rack-mounted system was
well worth it. I also deliberated between a fancy serial console and a
KVM (keyboard, video, mouse) switch; I went with the $500 KVM, which
turned out to be an excellent choice.

The six nodes are networked via a switched 100Mbps ethernet network.
I wish the network were faster, because it takes a few minutes to
transfer 1GB from one computer to another. Since user programs
migrate between machines and frequently do use in excess of 1GB of
memory, this transfer time is significant. I purchased 100Mbps
ethernet instead of 1Gbps ethernet, because I read that 1Gbps ethernet
with Linux is not very reliable, and there can be significant latency
problems. Since I didn't have the resources to experiment with many
configurations, I opted for 100Mbps, which is very easy to set up.

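As a rough sanity check on the transfer time mentioned above, here is
a small back-of-the-envelope calculation in Python. It assumes a
decimal gigabyte (10^9 bytes) and a link running at its full nominal
rate, and it ignores protocol overhead, disk speed, and competing
traffic, so it gives only a lower bound; the few minutes observed in
practice are naturally higher.

```python
def ideal_transfer_seconds(size_bytes, link_bits_per_second):
    """Lower bound on transfer time: payload bits over raw link rate."""
    return size_bytes * 8 / link_bits_per_second


GB = 10**9  # decimal gigabyte (an assumption for this estimate)

fast_ethernet = ideal_transfer_seconds(1 * GB, 100 * 10**6)  # 100Mbps
gigabit = ideal_transfer_seconds(1 * GB, 10**9)              # 1Gbps

print(f"100Mbps: at least {fast_ethernet:.0f} seconds per GB")  # 80
print(f"1Gbps:   at least {gigabit:.0f} seconds per GB")        # 8
```

Even in the ideal case, migrating a 1GB process over 100Mbps ethernet
costs well over a minute, which is why the network speed matters so
much for openMosix.
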
In summary, I love this machine. It satisfies not only my current
computational needs, but also those of many other pure mathematicians
all over the world.