The Atlas computing cluster
The high-throughput computing cluster Atlas is the world's largest and most powerful resource dedicated to gravitational-wave searches and gravitational-wave data analysis. It offers scientists a unique environment for solving computational problems far too large for a single PC or a small group of machines.
Atlas is located in a 450 square meter basement of the Max Planck Institute’s laboratory building. To allow continuous and secure round-the-clock operations, power is provided by two uninterruptible power supply systems. Four cooling machines transfer heat from the cluster basement to the outside. With a total power rating of one megawatt, these systems are able to provide power and cooling for about six minutes at full load in the event of an external power failure.
In early 2020, the cluster was extended to more than 50,000 physical CPU cores (about 90,000 logical cores) in roughly 3,000 servers. These range from 2,000 older 4-core systems with 16 GB of RAM each, through 550 systems with 28 CPU cores and 192 GB of RAM, to the newest 444 machines with 64 CPU cores and 512 GB of RAM each. In addition, about 350 high-performance, specialized graphics cards (GPUs) were added alongside roughly 2,000 existing cards used for specialized applications. These additions raise Atlas' theoretical peak computing performance to more than 2 PFLOP/s.
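As a quick cross-check, the headline totals follow directly from the per-generation counts quoted above (a small Python sketch; the numbers are those given in the text):

```python
# Per-generation (node count, CPU cores per node, RAM per node in GB),
# as listed in the text.
generations = [
    (2000, 4, 16),    # older 4-core systems
    (550, 28, 192),   # 28-core systems
    (444, 64, 512),   # newest 64-core systems
]

total_nodes = sum(n for n, _, _ in generations)
total_cores = sum(n * c for n, c, _ in generations)

print(total_nodes)  # 2994  -> "roughly 3,000 servers"
print(total_cores)  # 51816 -> "more than 50,000 physical CPU cores"
```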
All these computers are connected to one another and to the storage servers via standard Gigabit Ethernet. A total of 15 kilometers of Ethernet cabling was laid to connect the compute nodes, providing an aggregate bandwidth of about 20 terabit/s.
Each kind of data set (detector data, intermediate data products, results) is served by its own class of storage servers, comprising thousands of hard disks and flash drives. While detector data are held on servers optimized for massively parallel read access, temporary and intermediate data products are stored either locally on the compute nodes or on dedicated “scratch” servers optimized for fast reads and writes.
Final results, and everything needed for interactive and development work, can be stored in the users' “home” file systems, which are backed by a tiered storage architecture. A large cache consisting of a mix of hard disks and flash drives offers fast access to frequently used files. In the background, a robotic tape archive stores up to 4.5 petabytes of old and rarely used data, trading access speed for capacity.
Gravitational-wave data analysis
The most important research area of the Observational Relativity and Cosmology department is the development and implementation of data analysis algorithms to search for the different expected types of gravitational-wave sources. This includes burst, stochastic, continuous wave, and inspiral signals in data from the LIGO and Virgo gravitational-wave detectors.
Searches for weak gravitational-wave signals are very compute-intensive. In some cases, the lack of computing resources makes the searches substantially less sensitive than they could be with the same data and unlimited computing power. For this reason, one of the central activities of the department is to maintain and improve Atlas.
Atlas also plays an important role for the distributed volunteer computing project Einstein@Home, which uses computing power donated by the general public to search for gravitational waves and electromagnetic emission from neutron stars. Here, Atlas is used for the preparation of data sets and new search runs, and for the analysis of the results from Einstein@Home.
Operating system and high-throughput computing
Atlas is a high-throughput computing (HTC) cluster, i.e. it is well suited to efficiently executing a large number of loosely coupled tasks. The main design goal was to deliver very high computing throughput at very low cost, primarily for “trivially parallel” analyses. However, it can also efficiently run highly parallel, low-latency codes such as parameter estimation for gravitational-wave signals.
About 40 users actively use Atlas at the moment. They submit compute jobs to the Atlas nodes via the HTCondor batch scheduler. Interactive data analysis and job development are possible on one of four dedicated machines (head nodes).
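As a rough illustration of this workflow, a minimal HTCondor submit description for a trivially parallel analysis might look like the sketch below (the script name, resource requests, and job count are invented for this example, not taken from Atlas documentation):

```
# analysis.sub -- hypothetical HTCondor submit description
executable     = analyze.sh        # hypothetical per-task analysis script
arguments      = $(Process)        # pass the job index to each task
request_cpus   = 1
request_memory = 2 GB
output         = out/job.$(Process).out
error          = out/job.$(Process).err
log            = analysis.log
queue 1000                         # submit 1,000 independent jobs
```

Running `condor_submit analysis.sub` on a head node would then queue the jobs, and HTCondor distributes them across the free compute nodes.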
The main operating system is Debian GNU/Linux, thoroughly optimized for full automation: a new or repaired server is completely set up and running within a few minutes of power-on, without any further need to touch the machine.
Awards and history
Atlas was designed by Bruce Allen, Carsten Aulbert, and Henning Fehrmann, and is primarily intended for the analysis of gravitational-wave detector data. Atlas was officially launched in May 2008 with 1,344 quad-core compute nodes. One month later it was ranked number 58 on the June 2008 TOP500 list of the world's fastest computers. This also made it the sixth fastest computer in Germany at that time.
Latest news and detailed information about Atlas, its compute nodes, storage servers, and how to use them can be found in the Atlas Wiki.