Fastest computer cluster in Berlin-Brandenburg

Science Minister Sabine Kunst unveils new Datura high performance computing cluster at the Albert Einstein Institute

April 05, 2011

2,400 processor cores, 200 servers, 4.8 terabytes of main memory and a theoretical peak performance of 25.5 TeraFlops, which corresponds to 25.5 trillion arithmetic operations per second: these are some of the key figures of the Datura high performance computer, which will now help scientists at the Max Planck Institute for Gravitational Physics (Albert Einstein Institute/AEI) to calculate the collisions of black holes and neutron stars.

Thanks to the new, faster computing cluster, scientists in the Numerical Relativity Group will be able to carry out longer calculations and thus expect to track down new phenomena. Most recently, the team of scientists headed by Prof. Dr. Luciano Rezzolla attracted considerable attention with its computations on the braking behaviour of black holes and on merging neutron stars. Datura is now expected to make the simulation of gravitational wave signals even more precise. This will be of great benefit to the international community of gravitational wave researchers who, based on the simulations carried out at the AEI, are combing through the detector data in search of signals.

With the new high-performance computing cluster, calculations can be carried out two to three times faster than before, and much longer simulations become possible. “When we can observe neutron stars and black holes for longer in our ‘virtual lab’, we will presumably also discover new phenomena,” explains Prof. Luciano Rezzolla, head of the Numerical Relativity Group. “What’s more, more precise predictions of gravitational waveforms will be possible, as we will be able to simulate the mutual orbiting of neutron stars and black holes for a longer period of time.”

The official dedication of the high performance computer is the highlight of a scientific symposium entitled “German High Performance Computing in the New Decade”, hosted by the AEI on 5 April 2011. Representatives of various Brandenburg- and Berlin-based research institutions will gather here to exchange their experiences about applications, management and future strategies in high performance computing.

Background information

Numerical simulations on Datura

The Numerical Relativity Group at the AEI has long been a world leader in the simulation of extreme cosmic phenomena: on the high performance computers of the Institute, neutron stars collapse to black holes, stars explode and black holes spiral into each other. All of these processes have in common that gravitational waves are produced: tiny ripples in space-time, which Albert Einstein predicted in his general theory of relativity, but which have yet to be measured directly. The simulated wave signals are intended to help discover the real gravitational waves within the jungle of detector data. The reason: with a ‘mug shot’ that is as precise as possible, the chances of actually capturing and identifying a signal in the data are greatly increased. The new cluster supplements Damiana, the supercomputer already in operation at the Institute.
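
The basic idea behind such template-based searches can be pictured with the following minimal C++ sketch, which slides a simulated waveform “template” across a noisy data stream and reports the offset where the agreement is strongest. This is only an illustration of the principle: real gravitational wave searches use noise-weighted matched filtering, and all numbers, waveforms and names below are invented for the example.

// Toy illustration of template-based signal searching: slide a simulated
// waveform template across noisy "detector data" and look for the offset
// where the correlation is largest. Real searches work with noise-weighted
// matched filters in the frequency domain; this simplified time-domain
// version only conveys the principle.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Hypothetical template: a short, decaying oscillation.
    std::vector<double> tmpl(128);
    for (std::size_t i = 0; i < tmpl.size(); ++i)
        tmpl[i] = std::exp(-0.02 * i) * std::sin(0.3 * i);

    // Hypothetical data stream: the template hidden at offset 500,
    // superimposed on a stand-in for detector noise.
    std::vector<double> data(2048, 0.0);
    for (std::size_t i = 0; i < data.size(); ++i)
        data[i] = 0.2 * std::sin(1.7 * i);
    for (std::size_t i = 0; i < tmpl.size(); ++i)
        data[500 + i] += tmpl[i];

    // Slide the template over the data and record the best-matching offset.
    std::size_t best_offset = 0;
    double best_corr = -1e300;
    for (std::size_t off = 0; off + tmpl.size() <= data.size(); ++off) {
        double corr = 0.0;
        for (std::size_t i = 0; i < tmpl.size(); ++i)
            corr += tmpl[i] * data[off + i];
        if (corr > best_corr) { best_corr = corr; best_offset = off; }
    }
    std::printf("strongest match at offset %zu (correlation %.3f)\n",
                best_offset, best_corr);
    return 0;
}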

Today, there are five interferometric gravitational wave detectors worldwide: the German-British GEO600 project near Hannover, the three LIGO detectors in the USA in Louisiana and Washington State, and the French-Italian Virgo project near Pisa, Italy. In addition, the planned LISA space project (Laser Interferometer Space Antenna) is expected to be launched jointly by ESA and NASA in 2020. AEI scientists are playing a leading role in GEO600 and LISA and, within the framework of the LIGO-Virgo collaboration, are working closely with colleagues from the other detector projects.

Datura

The Numerical Relativity Group headed by Prof. Rezzolla named the cluster computer after the Jimson weed plant, Datura stramonium. Often also called ‘devil’s snare’, this member of the nightshade family contains various poisonous and hallucinogenic substances, but also produces very attractive white flowers.

The cluster is particularly well suited to problems that can be parallelized efficiently. Typical examples are the matrix operations that generally arise in simulation computations. For this, the individual nodes of the cluster must be able to communicate with each other particularly quickly and effectively. The solution of the Einstein equations for astrophysically interesting cases, such as the merger of black holes or neutron stars, is the main research area of the Numerical Relativity Group.
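
To give a rough idea of why fast node-to-node communication matters for such simulations, the following C++/MPI sketch distributes a toy one-dimensional problem across processes; at every time step each process must exchange boundary (“ghost zone”) values with its neighbours. The physics here is a simple heat equation, not the Einstein equations, and the sketch is not the group’s production code.

// Sketch of the communication pattern behind grid-based simulations: each
// MPI process owns a slice of the computational domain and must exchange
// ghost-zone boundary values with its neighbours at every time step. The
// physics here is a toy 1D heat equation; the point is that every step
// requires fast communication between the nodes.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n_local = 1000;                    // grid points per process
    std::vector<double> u(n_local + 2, 0.0);     // +2 ghost zones
    if (rank == 0) u[1] = 1.0;                   // toy initial condition

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    std::vector<double> unew(u);
    for (int step = 0; step < 100; ++step) {
        // Exchange ghost zones with the neighbouring processes.
        MPI_Sendrecv(&u[1], 1, MPI_DOUBLE, left, 0,
                     &u[n_local + 1], 1, MPI_DOUBLE, right, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[n_local], 1, MPI_DOUBLE, right, 1,
                     &u[0], 1, MPI_DOUBLE, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Local finite-difference update (explicit heat equation step).
        for (int i = 1; i <= n_local; ++i)
            unew[i] = u[i] + 0.1 * (u[i - 1] - 2.0 * u[i] + u[i + 1]);
        u.swap(unew);
    }

    if (rank == 0) std::printf("finished 100 steps on %d processes\n", size);
    MPI_Finalize();
    return 0;
}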

About NEC Deutschland GmbH

NEC Deutschland GmbH, founded in 1987 and headquartered in Düsseldorf, is a wholly-owned subsidiary of the NEC Corporation. The product portfolio encompasses supercomputers and high performance computers, telecommunication and IT solutions, as well as biometric security solutions for businesses and governmental institutions. www.nec.com/de

The NEC Corporation is one of the world’s leading integrators of IT and network technologies. NEC fulfils the complex and rapidly changing demands of customers with its highly-developed technologies and a unique combination of products and solutions. NEC thereby benefits from its longstanding experience and the synergistic deployment of global company resources.  NEC can boast more than 100 years of experience in technological innovation for the empowerment of individuals, companies and society.  
Further information available at http://www.nec.com.

Architecture of the cluster computer

Datura is a high performance Linux computing cluster with a theoretical peak performance of 25.5 TeraFlops. Flops (floating point operations per second) are a measure of the speed of the cluster computer.
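
The quoted peak figure can be reconstructed from the hardware data given below, assuming the usual value of four double-precision floating point operations per clock cycle for a Westmere core (that factor is an assumption of this sketch, not a number stated in the text):

// Back-of-the-envelope reconstruction of the quoted 25.5 TeraFlops peak,
// assuming 4 double-precision floating point operations per core per clock
// cycle for a Westmere processor (this factor is an assumption, not a
// number stated in the text).
#include <cstdio>

int main() {
    const double nodes = 200;          // compute nodes
    const double cpus_per_node = 2;    // Intel Xeon X5650 per node
    const double cores_per_cpu = 6;    // six cores per X5650
    const double clock_ghz = 2.66;     // clock speed in GHz
    const double flops_per_cycle = 4;  // assumed DP flops per core per cycle

    double peak_gflops = nodes * cpus_per_node * cores_per_cpu
                       * clock_ghz * flops_per_cycle;
    std::printf("theoretical peak: %.1f TeraFlops\n", peak_gflops / 1000.0);
    return 0;
}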

The cluster consists of 200 compute nodes, each with two Intel Xeon X5650 (Westmere) processors clocked at 2.66 GHz, 24 GB of RAM and 300 GB of local disk space. Six storage nodes with an available total capacity of 192 TB store the enormous amount of data produced by the numerical simulations in a parallel file system (Lustre). A head node enables users to communicate with the cluster and serves as the management base for the entire system. In addition, three networks handle the communication of the individual computers among one another. Each of these networks fulfils its own particular task.

At the heart of the high performance cluster is the network and thus the corresponding switch (Voltaire Grid Director 4700), which handles interprocess communication and connects the storage components. It is an Infiniband switch with a bandwidth of up to 51.8 Tbit/s. The other two networks are used for the system administration of the cluster.
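
To illustrate how simulation output typically reaches a parallel file system such as Lustre, the following minimal C++/MPI-IO sketch has every process write its own block of results into one shared file at a non-overlapping offset. The file name and data are placeholders for illustration; this is not the group’s actual output code.

// Minimal sketch of parallel output to a shared file, as used on parallel
// file systems such as Lustre: every MPI process writes its own block of
// results into one common file at a non-overlapping offset. File name and
// data are placeholders only.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n_local = 1024;                          // values per process
    std::vector<double> local(n_local, static_cast<double>(rank));

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "snapshot.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Each rank writes its block to a non-overlapping region of the file.
    MPI_Offset offset = static_cast<MPI_Offset>(rank) * n_local * sizeof(double);
    MPI_File_write_at_all(fh, offset, local.data(), n_local,
                          MPI_DOUBLE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}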

As typical numerical simulations take several days or even weeks, the jobs are administered through a batch system. A user logs in to the head node to compile programme code or to display the results, which are mostly presented graphically. An extremely important role in all the computing tasks of the scientists at the AEI is played by the Cactus code, which was developed at the AEI. This is a flexible set of tools that allows scientists to formulate their problems in a computer-compatible form and to have the calculations carried out.
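
Because a single run can occupy its nodes for days or weeks, long simulations are usually written so that they can periodically save their complete state and resume after an interruption, for example when the batch system ends a job. The following generic C++ sketch shows this checkpoint/restart pattern; it is an illustration only, not the checkpointing machinery of the Cactus code, and all file names are invented.

// Generic sketch of the checkpoint/restart pattern used by long-running
// batch jobs: periodically write the complete simulation state to disk so
// that a run can resume where it left off. Illustration only; not the
// checkpointing machinery of the Cactus code.
#include <cstdio>
#include <fstream>
#include <vector>

struct State {
    int step = 0;
    std::vector<double> field = std::vector<double>(1000, 0.0);
};

static void save_checkpoint(const State& s, const char* path) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(&s.step), sizeof(s.step));
    out.write(reinterpret_cast<const char*>(s.field.data()),
              s.field.size() * sizeof(double));
}

static bool load_checkpoint(State& s, const char* path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    in.read(reinterpret_cast<char*>(&s.step), sizeof(s.step));
    in.read(reinterpret_cast<char*>(s.field.data()),
            s.field.size() * sizeof(double));
    return static_cast<bool>(in);
}

int main() {
    State s;
    if (load_checkpoint(s, "checkpoint.bin"))
        std::printf("resuming from step %d\n", s.step);

    const int total_steps = 100000;
    for (; s.step < total_steps; ++s.step) {
        // ... one (placeholder) evolution step of the simulation ...
        s.field[0] += 1.0;

        if (s.step % 10000 == 0)                 // checkpoint periodically
            save_checkpoint(s, "checkpoint.bin");
    }
    save_checkpoint(s, "checkpoint.bin");
    std::printf("run complete at step %d\n", s.step);
    return 0;
}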

Further information

Technical data

200 Compute nodes each with
2 Intel Xeon X5650 (Westmere) processors, 2.66 GHz each
24 GB RAM
300 GB storage capacity
3 network connections (2 x Gigabit, 1 x Infiniband)
IPMI 2.0 card

6 Storage nodes each with
2 Intel Xeon E5520 (Nehalem) processors, 2.27 GHz each
8 GB RAM
2 x 300 GB internal discs
RAID controller with 30 TB net storage capacity attached (gross: 2 x 12 SAS HDDs of 2 TB each)
3 network connections (2 x Gigabit, 1 x Infiniband)
IPMI 2.0 card
Redundant power supply units

1 Head node (also serving as login, access and management node) with
2 Intel Xeon X5650 (Westmere) processors, 2.66 GHz each
24 GB RAM
300 GB storage capacity
3 network connections (2 x Gigabit, 1 x Infiniband)
IPMI 2.0 card
Redundant power supply units

These components are housed in eight air-cooled 19” racks. The CentOS 5.5 operating system has been installed on all computers.

The following system-affiliated software is used:
Compilers: GNU C++, Intel C++, Intel Fortran
Libraries: BLAS, LAPACK, Intel MKL, Intel MPI, OpenMPI, MVAPICH
Programming tools: Intel Cluster Toolkit
Batch system: Sun Grid Engine
Monitoring: Nagios, Ganglia (Open Source)
Management software: Perceus (Open Source)

Details

Each compute node has three network interfaces for three specific networks. The most important is the interprocess and storage network, which interconnects the compute nodes at 80 Gbit/s (bidirectional) via a high performance Infiniband switch. Here, the Voltaire Grid Director switch mentioned above is used; it has a backplane (circuit card) capacity of 51.8 Tbit/s.

The second network ensures that all components of the cluster can be operated and managed. Here, five switches from the company Netgear (5 x GSM7352S v2) are used. To keep the cable lengths as short as possible, the switches are cascaded.

A third network is not directly involved in the operation of the cluster, but supports the system administrator in the early detection of hardware problems. Via IPMI cards, which are installed in all nodes, sensor values such as CPU temperature and fan speed can be checked. Should preset thresholds be exceeded, the system automatically sends a message to the system administrator via e-mail or SMS, depending on the urgency. The system administrator can then take the necessary measures to prevent a failure.
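
The underlying decision logic can be pictured roughly as in the following C++ sketch, which compares sensor readings against preset limits and escalates by urgency. All sensor names, threshold values and notification functions are invented placeholders; the actual monitoring on Datura is done with tools such as Nagios and Ganglia (see the software list above).

// Rough illustration of threshold-based hardware monitoring: compare sensor
// readings (as delivered, for example, by IPMI) against preset limits and
// escalate by urgency. All names, values and notification functions here
// are invented placeholders.
#include <cstdio>
#include <string>
#include <vector>

struct Sensor {
    std::string name;
    double value;
    double warn_limit;     // exceeding this triggers an e-mail
    double critical_limit; // exceeding this triggers an SMS
};

static void notify_email(const std::string& msg) { std::printf("EMAIL: %s\n", msg.c_str()); }
static void notify_sms(const std::string& msg)   { std::printf("SMS:   %s\n", msg.c_str()); }

int main() {
    std::vector<Sensor> readings = {
        {"node017 CPU temperature [C]", 62.0, 70.0, 85.0},
        {"node042 CPU temperature [C]", 88.0, 70.0, 85.0},
    };

    for (const auto& s : readings) {
        if (s.value > s.critical_limit)
            notify_sms(s.name + " critically high");
        else if (s.value > s.warn_limit)
            notify_email(s.name + " above warning threshold");
    }
    return 0;
}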

Intelligent power distribution units (PDUs) are yet another tool for the system administrator of the cluster.

Cooling of the cluster

Since the end of 2006, the AEI has had at its disposal a large computer room for the high performance computers of the Numerical Relativity Group. The room measures approx. 127 m² with a ceiling height of 3.40 m and an air volume of approx. 433 m³. Cold air is supplied to the room through the raised floor. When setting up the cluster, the principle of hot and cold aisles was followed: cold air is blown from under the raised floor through perforated tiles into the cold aisle in front of the racks, flows from front to back through the devices and, now heated, reaches the hot aisle. There the warm air is drawn in by the cooling units, cooled down again, blown back under the raised floor and returned to the cold aisle.

For the clusters of the Numerical Relativity Group, a total of 380 kW of cooling capacity is currently available; of this, Datura requires approx. 90 kW. The rest is needed for the other clusters, Damiana and Peyote, as well as for further systems.

Power supply

The power for the Datura cluster is supplied via PDUs connected to 32 A cables. In the event of a power failure, a central uninterruptible power supply (UPS) keeps the storage and head nodes running for a maximum of 15 minutes. Special software ensures that these computers are automatically shut down and switched off as soon as the remaining UPS capacity drops to a specified level or the room temperature exceeds a certain value.
