Beagle2 General Design
Beagle is housed in 8 Cray XE6 cabinets:
- 182 blades are compute-node blades (728 nodes) described below under “Compute Resource”.
- There are also service nodes:
  - 3 job management nodes (MOM nodes, Torque batch system)
  - 6 login nodes (10 Gbps NICs)
  - 16 InfiniBand (IB) nodes, object store server (Lustre OSS nodes)
  - 1 boot node (ALPS resource manager, shared filesystem server)
  - 1 sdb node (system database, ALPS resource manager, Moab/Torque batch system, NFS file server)
  - 6 DSL service nodes (provide dynamically-linked libraries for CCM mode)
  - 1 monitor service node (Nagios, Ganglia)
  - 4 fiber channel nodes (Lustre MDS nodes)
  - 1 admin service node (Gold allocation system, NAT service for Lustre MDS nodes)
No node has local storage; all nodes access a shared file system.
The Lustre file system is direct-attached and is not backed up.
- CPU Nodes: 728
- GPU Nodes: 4
- Peak performance: 212 TFlop/s
- Total memory: 64GB per node * 724 compute nodes = 46336GB, 32GB per node * 4 GPU compute nodes = 128GB
- Max memory bandwidth for the Opteron 6380 is 102.4 GB/s
- An XE6 CPU node consists of 32 cores sharing a total of 64 GB of memory accessible through a NUMA (Non-Uniform Memory Architecture) design.
- The 32 cores are packaged as 2 AMD Opteron 6300 series processors with 16 cores each (model 6380), code-named “Abu Dhabi”.
- “Abu Dhabi” consists of:
- The Abu Dhabi processor is an example of a Directly Connected Multi Chip Module processor. Multi Chip Module (MCM) means that the processor is essentially two eight-core dies connected together within the same socket.
- Each Beagle2 node consists of two processors, and each processor consists of two dies. Each die is directly connected to one-quarter of the total Beagle2 node memory. Each die together with its memory is called a NUMA (Non-Uniform Memory Architecture) node, so there are four NUMA nodes per Beagle2 node. See the figure below.
- Each die has direct access to 16 GB of memory (one-quarter of the node's 64 GB) through 2 memory channels. Access to memory connected to other dies must go through a HyperTransport (HT) link.
Level 1 cache: 8 x 64 KB shared instruction caches, 16 x 16 KB data caches
Level 2 cache: 8 x 2 MB shared exclusive caches
Level 3 cache: 2 x 8 MB shared caches
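As a quick check of the figures above, the following sketch recomputes the system memory total and the per-NUMA-node share of cores and memory from the stated node counts. The constant and function names are illustrative only; the numbers are the ones quoted in the text.

```python
# Figures quoted in the text above; all names here are illustrative.
CPU_NODES = 724            # compute nodes with 64 GB each
GPU_NODES = 4              # GPU compute nodes with 32 GB each
CPU_NODE_MEM_GB = 64
GPU_NODE_MEM_GB = 32
CORES_PER_NODE = 32
PROCESSORS_PER_NODE = 2
DIES_PER_PROCESSOR = 2


def total_memory_gb():
    """Sum of memory across CPU and GPU compute nodes, in GB."""
    return CPU_NODES * CPU_NODE_MEM_GB + GPU_NODES * GPU_NODE_MEM_GB


def numa_layout():
    """Split one CPU node's cores and memory evenly across its dies."""
    numa_nodes = PROCESSORS_PER_NODE * DIES_PER_PROCESSOR
    return {
        "numa_nodes": numa_nodes,
        "cores_per_numa_node": CORES_PER_NODE // numa_nodes,
        "memory_per_numa_node_gb": CPU_NODE_MEM_GB / numa_nodes,
    }


if __name__ == "__main__":
    print("Total memory (GB):", total_memory_gb())
    print("NUMA layout per CPU node:", numa_layout())
```

With these inputs, each CPU node splits into four NUMA nodes of 8 cores each, which is why placing a process and its memory on the same die avoids HyperTransport traffic.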
Accelerator/GPU nodes are equipped with one 2.6-GHz AMD Opteron 6380 processor and one NVIDIA K20X accelerator. The CPU acts as a host processor to the accelerator. The NVIDIA accelerator does not directly interact with the Gemini interconnect. Each GPU node has 32 GB of system memory, while the accelerator has 6 GB of memory.
MOM nodes are where PBS scripts are executed and where the aprun command is launched. Any script or executable run outside of the aprun command executes on the MOM node's own processor, which is usually a bad idea.
MOM nodes are service nodes, not compute nodes, and they do not participate in MPI applications. They are on the Gemini high speed network (HSN).
Avoid running computations on MOM nodes: doing so can seriously affect other users’ computations and is therefore against the usage policy. Such jobs usually overload the MOM node and might even kill all the jobs running on it. Usage is monitored, and violations will not be tolerated.
- “Gemini” network is connected in a 3D torus
- Each pair of nodes (containing a total of 4 processors or 64 cores) is connected to 1 Gemini Interconnect Application-Specific Integrated Circuit (ASIC). Latency between cores is <1 μs for two cores connected to the same Gemini chip, plus a little over 0.1 μs for each Gemini chip that is between the sender and the receiver node.
- Each Gemini chip has 168 GB/s of switching capacity and 64 ports.
- Injection bandwidth per node is 20 GB/s.
- The Gemini chips are arranged on a 3-dimensional torus with 10 links from each router onto the high-performance network. The peak bi-directional bandwidth of each link is 8 GB/s and the latency is around 1-1.5 μs.
- MPI: ~1-2 μs latency. ~ 15 Million independent messages per second per NIC (two NICs per Gemini).
- PGAS support: In addition to supporting MPI over the standard programming languages of C, C++ and Fortran, the Gemini interconnect has direct hardware support for partitioned global address space (PGAS) programming models including Unified Parallel C (UPC), Co-array Fortran and Chapel. Gemini allows remote references to be pipelined in these programming models, which can result in orders-of-magnitude performance improvement over library-based message passing models.
- The Lustre filesystem data is stored on two DDN 10000 and two DDN 12000 storage arrays, connected via InfiniBand to sixteen dedicated Cray XIO service nodes.
- These storage arrays provide 600TB raw (450TB usable) and 2.0PB (1.6PB usable) of storage for the Lustre fast scratch filesystems.
- The Lustre filesystem metadata is stored on two Fiber Channel storage arrays connected to dedicated XIO service nodes.
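The Gemini latency figures quoted in the interconnect bullets above (under 1 μs for two cores on the same Gemini chip, plus a little over 0.1 μs for each Gemini chip in between) suggest a simple hop-count estimate. The sketch below is a rough model only. The torus dimensions in the example are hypothetical, since the text does not give the shape of Beagle2's torus, and the default base and per-chip values are rounded from the text.

```python
def torus_hops(a, b, dims):
    """Minimal hop count between coordinates a and b on a torus.

    Each dimension wraps around, so the distance along one axis is the
    shorter of the direct path and the wraparound path.
    """
    hops = 0
    for x, y, size in zip(a, b, dims):
        d = abs(x - y) % size
        hops += min(d, size - d)
    return hops


def estimated_latency_us(a, b, dims, base_us=1.0, per_chip_us=0.1):
    """Rough latency estimate in microseconds.

    base_us:     bound for the same-chip case ("<1 us" in the text).
    per_chip_us: extra cost per Gemini chip between sender and receiver
                 ("a little over 0.1 us" in the text).
    """
    hops = torus_hops(a, b, dims)
    chips_between = max(hops - 1, 0)
    return base_us + per_chip_us * chips_between


if __name__ == "__main__":
    dims = (8, 8, 8)  # hypothetical torus shape, for illustration only
    print(torus_hops((0, 0, 0), (7, 0, 0), dims))
    print(estimated_latency_us((0, 0, 0), (4, 4, 4), dims))
```

The model ignores congestion and message size, so it is useful only for comparing placements, not for predicting measured latency.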
Lustre file system
- Users have to create their own directory on Lustre.
- Compute nodes only have access to Lustre (/lustre/beagle2/*), not your home directory.
- Files which are read or written by compute nodes must be on Lustre.
- Design your applications to use fewer but larger files: the Lustre filesystem handles a small number of large files better than many small ones.
- Lustre is scratch space, not long-term storage, and it is not backed up. In the event of a hardware or software failure, we cannot guarantee that data will not be lost.
- Lustre allows users to control the striping parameters when storing data on the filesystem.
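To illustrate the last two points, here is a minimal sketch that builds a per-user directory path under the Lustre root and the matching `lfs setstripe` command for a chosen stripe count. The directory layout below `/lustre/beagle2`, the user name, and the helper names are assumptions made for illustration; only the `/lustre/beagle2` prefix comes from the text. The sketch prints the commands rather than running them.

```python
import posixpath
import shlex

LUSTRE_ROOT = "/lustre/beagle2"  # prefix quoted in the text


def user_scratch_dir(username, run_name="myrun"):
    """Hypothetical layout: <root>/<username>/<run_name>."""
    return posixpath.join(LUSTRE_ROOT, username, run_name)


def setstripe_command(directory, stripe_count):
    """Build an `lfs setstripe` command that sets the stripe count.

    A stripe count of -1 asks Lustre to stripe across all available OSTs;
    otherwise the count must be a positive integer.
    """
    if stripe_count != -1 and stripe_count < 1:
        raise ValueError("stripe_count must be -1 or a positive integer")
    return ["lfs", "setstripe", "-c", str(stripe_count), directory]


if __name__ == "__main__":
    target = user_scratch_dir("alice")  # "alice" is a placeholder user
    print(shlex.join(["mkdir", "-p", target]))
    print(shlex.join(setstripe_command(target, 4)))
```

New files created in a directory inherit its striping, so setting the stripe count on the directory before writing large files is usually simpler than setting it per file.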
Beagle2 has a High Speed Network (HSN) with a 10-Gb connection to the internet, specifically to the Argonne Mathematics and Computer Science (MCS) Division’s HPC switch, which has 10-Gb connectivity to MREN (Metropolitan Research & Education Network) and ESNet (Energy Sciences Network). The University of Chicago campus is also connected at 10-Gb to MREN.
Beagle uses Cray Linux Environment v3 (CLE3), a suite of high performance software including a SUSE Linux-based operating system designed to run large, complex applications. The Linux environment features Compute Node Linux (CNL) as the default compute kernel.
- When running highly scalable applications, CNL runs in Extreme Scalability Mode (ESM), which ensures that operating system services do not interfere with application scalability. Real-world applications have shown that this scales to more than 200,000 cores. To execute an application on Beagle2’s compute nodes in the ESM execution environment, you must invoke the aprun application launch command in your batch job script.
Note: Submitting your batch job to TORQUE (using the qsub command) places your job on one of the aprun service nodes. These nodes have limited resources shared between all users on the system and are not intended for computational use. You must invoke the aprun command in your job script or from the aprun command line to launch your application on one or more compute nodes in the ESM execution environment. To read more about submitting your jobs on Beagle2, please visit our wiki.
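The sketch below generates a small batch script in which the only heavy work is launched through aprun, as the note above requires. It is an illustration, not Beagle2's official template: the `mppwidth` resource directive and the `-n`/`-N` aprun options are the forms commonly used with Torque on Cray XE systems, and the job name, walltime, and executable are placeholders. Check the Beagle2 wiki for the directives the site actually expects.

```python
def render_job_script(total_procs, procs_per_node, walltime, executable,
                      job_name="example"):
    """Return the text of a minimal PBS script that launches via aprun."""
    if total_procs < 1 or procs_per_node < 1:
        raise ValueError("process counts must be positive")
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",
        f"#PBS -l mppwidth={total_procs}",
        f"#PBS -l walltime={walltime}",
        "",
        "# Commands here run on the service (MOM) node, so keep them light.",
        "cd $PBS_O_WORKDIR",
        "",
        "# aprun launches the application on the compute nodes.",
        f"aprun -n {total_procs} -N {procs_per_node} {executable}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values: 64 processes, 32 per node, 10 minutes.
    print(render_job_script(64, 32, "00:10:00", "./my_app"), end="")
```

The generated script would be submitted with qsub from a directory on Lustre, since compute nodes cannot read the home directory.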
Beagle2 uses the Moab Adaptive Computing Suite and Torque.
There are two queues (defined in Torque) on Beagle. Each queue was created for a specific phase of the code development cycle:
|Queue Name||Max Walltime||Max Nodes||Default Nodes|
|Advanced||No max walltime and no max number of nodes. Because it has no restrictions, only authorized users may use it.|
Beagle2 is essentially a general-purpose resource, so in general no special preference is given to any user, job, or group. If there are research deadlines or other short-term needs, we may temporarily increase priorities for specific users or groups.
Under the normal policy, every job's priority increases by one for each minute that it sits in the queue. Please read the general scheduling FAQ for more specific information about how the scheduler uses priorities.
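The aging rule above can be sketched in a few lines. This is a simplified model that assumes priority depends only on queue time; it ignores the temporary priority boosts mentioned above and any other factors the real scheduler applies. The function names and job names are hypothetical.

```python
def job_priority(minutes_in_queue, base_priority=0):
    """Priority grows by one for every full minute spent in the queue."""
    if minutes_in_queue < 0:
        raise ValueError("minutes_in_queue must be non-negative")
    return base_priority + int(minutes_in_queue)


def rank_jobs(waiting_minutes):
    """Return job names ordered from highest to lowest priority.

    waiting_minutes maps a job name to the minutes it has been queued.
    """
    return sorted(
        waiting_minutes,
        key=lambda name: job_priority(waiting_minutes[name]),
        reverse=True,
    )


if __name__ == "__main__":
    queue = {"job_a": 5, "job_b": 120, "job_c": 30}  # hypothetical jobs
    print(rank_jobs(queue))
```

Under this rule alone, the job that has waited longest always ranks first, regardless of its size.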
Each job “slot” corresponds to a node, so Beagle2 can accommodate up to 732 concurrent jobs.
A user can request up to 500 nodes per job.
Nodes are not shared between users, or even between jobs of the same user.
It is possible to start multiple processes on a node from one job. Please refer to the Beagle2 FAQ for more information on how to do this. Please refer to the Job Management FAQ for more information on general job submission parameters.
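Because nodes are allocated whole and a job may use at most 500 of them, it can help to work out the node count before submitting. The sketch below does this arithmetic, assuming 32 cores per CPU node as described earlier and at most one process per core; the function name is illustrative.

```python
import math

CORES_PER_NODE = 32       # cores per XE6 CPU node, from the text
MAX_NODES_PER_JOB = 500   # per-job limit, from the text


def nodes_needed(n_processes, procs_per_node=CORES_PER_NODE):
    """Number of whole nodes required to host n_processes."""
    if n_processes < 1:
        raise ValueError("n_processes must be positive")
    if not 1 <= procs_per_node <= CORES_PER_NODE:
        raise ValueError("procs_per_node must be between 1 and 32")
    nodes = math.ceil(n_processes / procs_per_node)
    if nodes > MAX_NODES_PER_JOB:
        raise ValueError(f"{nodes} nodes exceeds the {MAX_NODES_PER_JOB}-node limit")
    return nodes


if __name__ == "__main__":
    print(nodes_needed(100))       # fully packed nodes
    print(nodes_needed(100, 8))    # fewer processes per node, more memory each
```

Running fewer processes per node leaves more memory per process, at the cost of occupying more nodes.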
- Programming Languages: A fully integrated Cray programming environment with C, C++ and Fortran, plus supported parallel programming models including MPI, OpenMP, Cray SHMEM, UPC and Co-Array Fortran.
- Compilers: Those from GNU and Cray. The default compiler and related environment is Cray.
- Libraries: High performance-optimized math libraries: BLAS, FFTs, LAPACK, ScaLAPACK, SuperLU, and Cray Scientific Libraries.
- MPI: The MPI implementation is compliant with the MPI 2.0 standard and is optimized to take advantage of the Gemini interconnect in the Cray XE6 system.
- Profiling: The performance analysis tools CrayPat and Cray Apprentice2 allow users to analyze resource utilization throughout their code at scale and to eliminate bottlenecks and load imbalance issues.