The Atlas computing cluster
Atlas is located in a 450 square meter basement of the Max Planck Institute’s laboratory building. To allow continuous and secure round-the-clock operations, power is provided by two uninterruptible power supply systems. Four cooling machines transfer heat from the cluster basement to the outside. With a total power rating of one megawatt, these systems are able to provide power and cooling for about six minutes at full load in the event of an external power failure.
In early 2020, the cluster was extended to more than 50,000 physical CPU cores (about 90,000 logical ones) in about 3,000 compute servers. These servers range from 2,000 older systems with 4 CPU cores and 16 GB RAM each, through 550 systems with 28 CPU cores and 192 GB RAM each, to the latest 444 systems with 64 CPU cores and 512 GB RAM each. Additionally, about 350 high-performance, specialized graphics processing unit (GPU) cards were added to the existing set of about 2,000 for special-purpose computing. These additions raise the theoretical peak computing power of Atlas to more than 2 PFLOP/s.
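The totals quoted above can be cross-checked against the per-class figures. The following is a minimal sketch in Python that uses only the numbers given in the text; the variable names are illustrative, and the RAM total is a derived figure that the text does not state.

```python
# Node classes as listed in the text:
# (number of servers, physical CPU cores per server, RAM per server in GB)
node_classes = [
    (2000, 4, 16),
    (550, 28, 192),
    (444, 64, 512),
]

# Sum over all classes to obtain cluster-wide totals.
total_servers = sum(n for n, _, _ in node_classes)
total_cores = sum(n * cores for n, cores, _ in node_classes)
total_ram_tb = sum(n * ram for n, _, ram in node_classes) / 1000  # decimal terabytes

print(f"servers: {total_servers}")
print(f"physical cores: {total_cores}")
print(f"RAM: {total_ram_tb:.1f} TB")
```

The sums come to 2,994 servers and 51,816 physical cores, consistent with the rounded figures of about 3,000 servers and more than 50,000 cores.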
All these computers are connected to all other compute and storage servers via common Gigabit Ethernet. A total of 15 kilometers of Ethernet cable has been used to connect the compute nodes. The total bandwidth is about 20 terabit/s.
Each kind of data (detector data, intermediate data products, results) has its own class of storage servers, which together use thousands of hard disk and flash drives. While detector data are available on servers optimized for massively parallel read access, temporary or intermediate data products are stored either locally on the compute nodes or on dedicated “scratch” servers optimized for fast reading and writing.
Final results and everything needed for interactive and development use can be stored in the users' “home” file systems, which are backed by a tiered storage architecture. A large data cache consisting of a mix of hard disk and flash drives offers fast access to frequently used files. In the background, a robotic tape archive stores up to 15 petabytes of old and rarely used data at the expense of quick access.
Gravitational-wave data analysis
The most important research area of the Observational Relativity and Cosmology division is the development and implementation of data analysis algorithms to search for the different expected types of gravitational-wave sources. This includes burst, stochastic, continuous wave, and inspiral signals in data from the LIGO and Virgo gravitational-wave detectors.
Searches for weak gravitational-wave signals are very compute-intensive. In some cases, limited computing resources make the searches substantially less sensitive than they could be on the same data with unlimited computing power. For this reason, one of the central activities of the division is to maintain and improve Atlas.
Atlas also plays an important role for the distributed volunteer computing project Einstein@Home, which uses computing power donated by the general public to search for gravitational waves and electromagnetic emission from neutron stars. Here, Atlas is used for the preparation of data sets and new search runs, and for the analysis of the results from Einstein@Home.
Operating system and high-throughput computing
Atlas is a high-throughput computing (HTC) cluster, i.e., it is well suited to efficiently executing a large number of loosely coupled tasks. The main design goal was to provide very high computing throughput at very low cost, primarily for “trivially parallel” analyses. However, it can also efficiently run highly parallel, low-latency codes such as parameter estimation for gravitational-wave signals.
About 40 users are actively using Atlas at the moment. They submit compute jobs to the Atlas nodes via the batch scheduler HTCondor. Interactive data analysis and job implementation are possible on one of four dedicated machines (head nodes).
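As an illustration of what a “trivially parallel” submission to HTCondor can look like, the following Python sketch generates a submit description for a batch of independent jobs. It is not taken from the Atlas documentation: the executable name, file names, and resource requests are hypothetical placeholders, and the settings actually required on Atlas may differ.

```python
def make_submit_description(executable, n_jobs, request_cpus=1, request_memory="2GB"):
    """Return the text of an HTCondor submit description for n_jobs independent tasks.

    All values are illustrative assumptions, not Atlas defaults.
    """
    lines = [
        f"executable = {executable}",
        # $(Process) is expanded by HTCondor to 0, 1, ..., n_jobs - 1,
        # so each job can pick its own slice of the data.
        "arguments = $(Process)",
        "output = job_$(Process).out",
        "error = job_$(Process).err",
        "log = jobs.log",
        f"request_cpus = {request_cpus}",
        f"request_memory = {request_memory}",
        # Queue n_jobs copies of the job; they run independently of each other.
        f"queue {n_jobs}",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example: 100 independent jobs running a placeholder script.
submit_text = make_submit_description("analyze_segment.sh", 100)
print(submit_text)
```

The resulting text would be saved to a file and handed to HTCondor's condor_submit command on a head node. Because the jobs do not communicate with each other, the scheduler is free to spread them over whichever nodes happen to be idle.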
The main operating system is Debian GNU/Linux, thoroughly optimized for full automation using FAI (Fully Automatic Installation). For example, a new or repaired server is completely set up and functioning within a few minutes of power-on, without anyone having to touch the machine again.
Awards and history
Atlas was designed by Bruce Allen, Carsten Aulbert, and Henning Fehrmann, and is primarily intended for the analysis of gravitational-wave detector data. Atlas was officially launched in May 2008 with 1,344 quad-core compute nodes. One month later it was ranked number 58 on the June 2008 TOP500 list of the world's fastest computers. This also made it the sixth fastest computer in Germany at that time.
Latest news and detailed information about Atlas, its compute nodes, storage servers, and how to use them can be found in the Atlas Wiki.