Sunnyvale
About
Sunnyvale is the Beowulf cluster at the Canadian Institute for Theoretical Astrophysics. It is composed of 200 Dell PE1950 compute nodes. Each node contains 2 quad-core Intel(R) Xeon(R) E5310 @ 1.60GHz processors, 4GB of RAM, 2 gigE network interfaces, and a 40 GB disk.
Sunnyvale currently runs CentOS (http://www.centos.org/) 5.5 x86_64 GNU/Linux. OSCAR (http://oscar.openclustergroup.org/) version 5.0 forms the basis of our server/compute node image, with a diskless NFS read-only root partition using the oneSIS (http://onesis.org/index.php) system imager v2.0rc10 release + patches (see their mail archive). This entire software stack is free and open source, with no licensing restrictions.
Naming of the Sunnyvale cluster was inspired by Trailer Park Boys (http://en.wikipedia.org/wiki/Trailer_Park_Boys), a popular Canadian comedy tv/movie series.
User Documentation
Accounts
To request an account on Sunnyvale, email requests(at)cita.utoronto.ca.
Passwords
At present only administrators can modify user passwords. Please email requests to have your password changed.
When your account is created it will use the same password as your regular CITA account.
Logging In
Once you are on the CITA network, log into Sunnyvale via the 2 login nodes, bubbles or ricky. These nodes also serve as the cluster development nodes (aliased as devel1 & devel2 ) and can be used for compiling codes, running interactive debugging jobs, submitting batch jobs to the queueing system, and transferring data to/from cluster storage. These are the only cluster nodes which you should directly access.
Shell Environment
Home disks on Sunnyvale are not the same as those on the CITA workstation network, so you will have to copy any required files to your Sunnyvale /home.
The default shell is tcsh. If you would like to have your shell changed please email requests.
Users also need to write their own shell configuration file. Do not copy your shell configuration from your /cita/home directory or from your McKenzie /home directory, since these may contain incompatible configuration statements. For a bare-bones configuration that maximizes shell limits and uses our default MPI (LAM 7.1.4) and Intel compilers, one would put the following into their ~/.tcshrc:
unlimit
limit coredumpsize 0
module load intel
module load lam
Modules and System Software
All system-installed software is accessible through the module command. A brief summary of how to use modules on McKenzie is given on the Module page, and the command works in exactly the same way on Sunnyvale (we've even kept the MCKENZIE_ prefix for LIB and INC paths). The major difference is that Sunnyvale does not load any default modules; users are responsible for loading them in their shell configuration.
Any requests for further software installation should be sent to requests.
Storage
RAM
There is 4047564 KB of RAM on each of the compute nodes. An active operating system leaves ~ 3941940KB / 3849MB / 3.76GB of RAM free. 8 GB swap space exists on all nodes on local disks.
Disk
The Lustre network filesystems scratch-3week, raid-cita, raid-project and raid-project2 are all accessible on Sunnyvale nodes. The mount prefix for these filesystems is /mnt, e.g. /mnt/scratch-3week. Batch jobs should run primarily on the Lustre filesystems.
/home
Home directories are currently capped with a 10GB quota.
It is highly recommended that users do not use /home for job output. Since it is served from the cluster head node, having large parallel jobs write to it simultaneously will cause performance issues for the entire cluster. There is no problem with short and infrequent accesses, like reading a small configuration file or submitting your job from it, for example.
/mnt/raid-cita, raid-project, raid-project2
Main page (http://www.cita.utoronto.ca/mediawiki/index.php/Disk_Space#raid-cita)
These are fast parallel filesystems that are available to Sunnyvale nodes. All users have at least 10 GB available on /mnt/raid-cita, except for the SRAs, post-docs and staff, who have 1TB by default. If you have exceptional data requirements, please contact requests@cita. The spaces raid-project and raid-project2 are overseen by CITA faculty, and permission from individual researchers is necessary to get a directory on these filesystems.
You do not need an other=raid-cita directive.
/mnt/scratch-3week
Main page (http://www.cita.utoronto.ca/mediawiki/index.php/Disk_Space#scratch-3week)
This scratch is a large, fast parallel filesystem. It is meant to be used for temporary intermediate and final output of cluster jobs. This filesystem is for short-term storage and postprocessing only; archival data should be migrated to raid-cita or raid-project.
You do not need an other=scratch-3week directive.
External storage
home-1, home-2, and home-3 are only visible on the head nodes.
/cita/scratch/$WORKSTATION
Workstation scratch directories are visible from the cluster compute nodes, but should only be accessed from the login nodes ricky and bubbles.
/tmp
DO NOT use /tmp to store any files! This filesystem is part of the scratch disk on each node and has to have enough space free for various system daemons to function properly. Any files found in /tmp will be deleted without warning.
/mnt/node_scratch
Local scratch disks are available on all nodes at
/mnt/node_scratch/<username>
They are 67 GB in size, but check how much space is actually free before using them.
If you are being I/O limited by the cluster file systems then you may be better off using these disks. Use sparingly, as this can severely degrade network performance.
Accessing
They can be accessed on other nodes as /mnt/scratch/<node>, where <node> is the host name of the node, e.g. tpb78.
File Deletion
You MUST tidy up after your jobs. rm all the local_scratch files that you create at the end of your job.
These scratch disks will be purged daily of all files older than one week.
WARNING: if the disks are chronically filled, a more severe method will be used. The system will automatically remove all the user's files after the end of the job.
/mnt/scratch/tpbNNN
All nodes export their scratch disk /mnt/node_scratch to all other nodes. They can be accessed at /mnt/scratch/tpbNNN.
PBS
Sunnyvale uses the TORQUE batch system and the Maui job scheduler to process user jobs. This system allows users to submit a batch script (a shell script with a special header) that describes their computing work, which is then held in a queue and run non-interactively once sufficient resources are available.
This software is also used on McKenzie, so users should be able to port their batch scripts from McKenzie to Sunnyvale with little modification.
PBS Limits
The batch system will limit jobs to 96 nodes (768 cores) if the cluster is busy and 128 nodes (1024 cores) if the cluster is idle.
Jobs are limited to 48 hours in duration.
If you have special job requirements please contact requests@cita.utoronto.ca to apply for a reservation on the cluster.
Submitting Jobs
Jobs are submitted to the queue by executing qsub $BATCH_SCRIPT_NAME.
The following is an example of a batch script for a parallel job that would use 40 cpus (5 nodes) for 6 hours using the older LAM-MPI:
#!/bin/csh
#PBS -l nodes=5:ppn=8
#PBS -q workq
#PBS -r n
#PBS -l walltime=06:00:00
#PBS -N my_job_name
# EVERYTHING ABOVE THIS COMMENT IS NECESSARY, SHOULD ONLY CHANGE nodes,ppn,walltime and my_job_name VALUES
cd $PBS_O_WORKDIR
lamboot
mpirun C a.out
lamhalt
# lamboot/lamhalt are necessary to initialize nodes for lam-mpi
If you are running a simple openmpi job instead that uses all the available cores, your batch script would look like this:
#!/bin/csh
#PBS -l nodes=5:ppn=8
#PBS -q workq
#PBS -r n
#PBS -l walltime=06:00:00
#PBS -N my_job_name
# EVERYTHING ABOVE THIS COMMENT IS NECESSARY, SHOULD ONLY CHANGE nodes,ppn,walltime and my_job_name VALUES
cd $PBS_O_WORKDIR
mpirun a.out
The header lines (the shebang and the five #PBS directives) are mandatory and must appear in the script. Set the number of nodes in nodes and the number of processors per node in ppn. There are four queues: workq, fastq, gpuq, and c6100q (described under Queues). Note: if you are running single-core serial jobs, all you need to do is set ppn=1 above and the batch system will find a node for your code to run on.
For users who just want to use the nodes for serial jobs, an example batch script asking for 1-core on 1-node might be as follows:
#!/bin/csh
#PBS -l nodes=1:ppn=1
#PBS -q workq
#PBS -r n
#PBS -l walltime=06:00:00
#PBS -N my_job_name
# EVERYTHING ABOVE THIS COMMENT IS NECESSARY, SHOULD ONLY CHANGE nodes,ppn,walltime and my_job_name VALUES
cd $PBS_O_WORKDIR
./my_serial_programme my_arguments
In principle, you might submit dozens of jobs like this to complete a specific task.
There is a 48 hour limit for jobs running on Sunnyvale, but the walltime option can (and should) be set to less than this as the scheduler will advance your job to the top of the queue faster if it doesn't require as much wallclock time. Efficient job scheduling improves job throughput.
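When many similar jobs are submitted, the batch header can be generated rather than edited by hand. The following is a minimal Python sketch, not an official CITA tool: the function name make_batch_script and its arguments are hypothetical, and it only fills in the csh template used in the examples above while checking the limits stated on this page (128 nodes, 8 cores per node, 48 hours).

```python
MAX_NODES = 128             # largest job size quoted under "PBS Limits" (idle cluster)
CORES_PER_NODE = 8          # cores on each workq node
MAX_WALLTIME_MIN = 48 * 60  # 48 hour job limit, expressed in minutes


def make_batch_script(nodes, ppn, walltime_min, job_name, command, queue="workq"):
    """Return the text of a csh batch script following the template on this page.

    The function and argument names are illustrative assumptions; only the
    #PBS lines mirror the documented examples.
    """
    if not 1 <= nodes <= MAX_NODES:
        raise ValueError(f"nodes must be between 1 and {MAX_NODES}")
    if not 1 <= ppn <= CORES_PER_NODE:
        raise ValueError(f"ppn must be between 1 and {CORES_PER_NODE}")
    if not 1 <= walltime_min <= MAX_WALLTIME_MIN:
        raise ValueError("walltime must be between 1 minute and 48 hours")

    hours, minutes = divmod(walltime_min, 60)
    lines = [
        "#!/bin/csh",
        f"#PBS -l nodes={nodes}:ppn={ppn}",
        f"#PBS -q {queue}",
        "#PBS -r n",
        f"#PBS -l walltime={hours:02d}:{minutes:02d}:00",
        f"#PBS -N {job_name}",
        "cd $PBS_O_WORKDIR",
        command,
    ]
    return "\n".join(lines) + "\n"


# Example: a serial job requesting 1 core on 1 node for 6 hours.
script = make_batch_script(1, 1, 6 * 60, "my_job_name",
                           "./my_serial_programme my_arguments")
print(script)
```

The returned text can be written to a file and passed to qsub. Requesting the shortest walltime that safely covers the run keeps the benefit described above.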
Since there are 8 cores per node, rather than 2 as on McKenzie, users will want to specify ppn=8 if they would like to have nodes fully allocated to their own job. Submitting with ppn<8 will allow multiple jobs to be assigned to the same node and could result in the memory being oversubscribed and both jobs failing and/or crashing the node. If your serial jobs are memory limited, a good rule of thumb is to specify ppn=(total job memory / 485MB) to prevent this.
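The rule of thumb can be worked through in a few lines of Python. This is only a sketch: the function name ppn_for_memory is hypothetical, and rounding the quotient up to the next whole core is an assumption, since ppn must be an integer and rounding down would under-reserve memory.

```python
import math

CORES_PER_NODE = 8   # cores on each workq node
MB_PER_CORE = 485    # per-core memory share used in the rule of thumb


def ppn_for_memory(total_job_mb):
    """Return the ppn to request for a single-node job needing total_job_mb of RAM."""
    # Round up (an assumption): a job needing slightly more than N shares gets N+1.
    ppn = max(1, math.ceil(total_job_mb / MB_PER_CORE))
    if ppn > CORES_PER_NODE:
        raise ValueError("job needs more memory than a single node provides")
    return ppn


print(ppn_for_memory(1500))  # a serial job that needs about 1.5 GB
```

For a job needing 1500 MB this gives ppn=4, so the scheduler reserves four cores' worth of memory even though only one core does the work.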
Based on the current networking configuration, users may want all of their processes to run on the same rack. All 40 nodes on a rack are connected directly to a single switch, and as such they will exhibit better networking performance than a job which spans racks (and hence switches). This can be accomplished by adding the rack designation to your batch script, i.e.:
#PBS -l nodes=20:ppn=8:rack2
would get you 20 nodes on rack2. Keep in mind that restricting your job to a single rack will likely cause it to spend longer in the queue before the nodes on the particular rack become available.
The following PBS environment variables are useful for tailoring batch scripts:
Variables that contain information about job submission:
PBS_O_HOST: The host machine on which the qsub command was run.
PBS_O_LOGNAME: The login name on the machine on which the qsub was run.
PBS_O_HOME: The home directory from which the qsub was run.
PBS_O_WORKDIR: The working directory from which the qsub was run.
Variables that relate to the environment where the job is executing:
PBS_ENVIRONMENT: This is set to PBS_BATCH for batch jobs and to PBS_INTERACTIVE for interactive jobs.
PBS_O_QUEUE: The original queue to which the job was submitted.
PBS_JOBID: The identifier that PBS assigns to the job.
PBS_JOBNAME: The name of the job.
PBS_NODEFILE: The file containing the list of nodes assigned to a parallel job.
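As an illustration of using these variables, the sketch below reads the file named by PBS_NODEFILE and counts the cores allocated on each node. TORQUE writes one line per allocated core, so a host requested with ppn=8 appears eight times. The function name cores_per_node is hypothetical and the snippet is not part of the cluster software.

```python
import os
from collections import Counter


def cores_per_node(nodefile_path):
    """Count how many times each host appears in a PBS node file."""
    with open(nodefile_path) as fh:
        hosts = [line.strip() for line in fh if line.strip()]
    return Counter(hosts)


# PBS_NODEFILE is only set inside a running job, so do nothing otherwise.
nodefile = os.environ.get("PBS_NODEFILE")
if nodefile:
    for host, cores in cores_per_node(nodefile).items():
        print(f"{host}: {cores} cores")
```

Run from inside a batch script, this prints one line per node, which ends up in the PBS standard output file.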
Queues
Sunnyvale has several batch queues.
workq: General purpose queue with approximately 190 Dell Power Edge 1950 nodes with 8 cores and 4GB of memory. Most jobs are submitted here.
fastq: This is a small 6-node queue made of newer and faster Dell R410 8-core nodes with 2GB of memory. This queue is suitable for smaller jobs requiring more speed.
gpuq: This is a 2-node queue of R410s connected to a TESLA GPU unit. Each node has access to two 4GB GPU cards each with 240 GPU cores.
c6100q: This is a 3-node queue that accesses the large-memory Dell C6100. Each node has 12 X5670 cores with 48GB of memory and has access to 4X NVIDIA Tesla M2050 GPUs.
Remember to add a line such as #PBS -q fastq to your batch script so that your job is submitted to the intended queue.
Submitting Interactive Jobs
To submit an interactive job for 2 nodes, execute the following on a login node:
qsub -I -l nodes=2:ppn=8
This will open a shell on the first node of your job. To see which nodes you have been assigned:
cat $PBS_NODEFILE
One can then start a lamboot by hand, debug, etc. interactively. The job will terminate once you log out of the node that you logged in on, or once the walltime limit has been exceeded.
NOTE: Please use this method sparingly! Most jobs should be submitted non-interactively. If users leave nodes idle while using them interactively the policy on Sunnyvale will change, as it is a shared resource and tying up nodes is unfair to the other users.
Submitting Jobs on Reserved Nodes
If you want a particular job to run on a reservation made for you then make sure to include the following flag
-W x=FLAGS:ADVRES:{$RESERVATION NAME}
Monitoring Jobs
bobMon System Monitor
The web-based graphical System_Monitor is now available for Sunnyvale. If you are local, point your browser to bobMon (http://barb.cita.utoronto.ca/bobMon).
A general page (http://barb.cita.utoronto.ca) for monitoring cluster stats exists for detailed monitoring of the cluster.
If you are logging in remotely you will have to create an ssh tunnel to barb to forward the http port. To do so, when you ssh into gw.cita.utoronto.ca add the following to your ssh command: ssh gw.cita.utoronto.ca -L 10101:sunnyvale:80 -- this connection has to stay active, so don't close it. Then you should be able to view it in your local browser with tunneled bobMon (http://localhost:10101/bobMon/)
Command Line Monitoring
Upon submission, users can monitor their jobs in the queue using qstat -a. The showbf command will show resource availability, and the showstart command will estimate when a particular job will begin. Additional queue and total cluster node usage can be displayed with showq. NOTE: there may be a discrepancy between the number of CPUs reported as used and the real cluster usage. This is a bug in showq triggered by the particular way some users request jobs. The scheduler itself should work as advertised.
A brief synopsis of all the above is available by running freeNodes.
Job Identifier and Deletion
Each job submitted to the queue is assigned a $PBS_JOBID. This is the number on the left-hand side of the qstat output. To stop a running job or delete it from the queue, run qdel $PBS_JOBID.
PBS Output Files
After a job has finished PBS will place an output file and an error file in the directory where the job was submitted. These files are named job_name.o$PBS_JOBID and job_name.e$PBS_JOBID, where the "job_name" is the value supplied with #PBS -N job_name in the batch script.
The *.e* file will contain any output produced on standard error by the job (that was not redirected inside the batch script), and the *.o* file will contain the standard output. In addition, the pbs output file will list the amount of resources requested for the job, the amount of resources used, and a copy of the batch script that was submitted to run the job.
If you don't receive a PBS job report, it is likely that the disk where your script was launched was unavailable when the job terminated (/cita/d/scratch-3month in particular). Undelivered reports can be found in /var/spool/pbs/undelivered/ on either of the login nodes. It is suggested that users submit all batch jobs from their /home directories to avoid this. If you'd still like your job to run from a different disk, put a change directory command at the start of your script (after the #PBS header!).
What nodes are my job using?
If your job is actively running, you can get a list of the nodes it is using with jobNodes PBS_JOBID.
Alternatively one can put cat $PBS_NODEFILE in their batch script, which will print out the list of nodes that the job is using in the PBS standard output file, and this can then be used.
How do I check the status of my nodes?
One can get a brief summary for each node at the command line by issuing gstat -a | head -10; gstat -a | grep -A1 tpb<tpb#> for each of the nodes (<tpb#> is just the numerical portion of the node's hostname, i.e. tpb123.sunnyvale -> <tpb#>=123).
It is suggested that users monitor their jobs via bobMon, but if you'd like detailed information you can look at the ganglia (http://julian/ganglia) data for each node. If you are not on the CITA network you will have to create an ssh tunnel to port 80 on julian, as described for bobMon, and then view ganglia through the tunnel (http://localhost:10101/ganglia).
Programming Issues
64 bit-isms
Sunnyvale is an em64t/amd64/x86_64 (64 bit) platform. Be aware of the following:
- FFTW plan variables must be 8 bytes
- "ld: skipping incompatible" message during linking
  - the linker is finding 32 bit rather than the required 64 bit libraries. Most or all of the 64 bit libraries should be found in /usr/lib64/
- "relocation truncated to fit: R_X86_64_PC32"
  - this only seems to be a problem with the Intel compiler.
  - can occur when trying to compile code that contains >2G of static allocations with the Intel compilers. The suggested fix is to compile with -shared-intel
  - if this doesn't work, you may have to change the mcmodel as well (man ifort, search for mcmodel). -mcmodel=medium should suffice for all cases.
Compilers
Intel (ifort/icc/icpc), GCC3 (g77/gcc/g++) and GCC4 (gfortran/gcc4/g++4) compilers are installed. Different versions of the intel compiler will get their own module.
The following compiler flags may be of use:
-O3 for full optimizations, -O0 for none
-[a]xT (intel only) for processor specific optimizations
-CB (ifort) -fbounds-check (gfortran) for runtime array bound checking
-openmp (intel) -fopenmp (GCC4) to use OpenMP 2.0
Run man $COMPILER_NAME for further details.
The ifort version 10.0 beta documentation can be found here (http://www.cita.utoronto.ca/~merz/intel_f10b/Doc_Index.htm).
The icc/icpc version 10.0 beta documentation can be found here (http://www.cita.utoronto.ca/~merz/intel_c10b/Doc_Index.htm).
Debugging
Start by compiling your code with the -g flag to produce a symbol table. It's also a good idea to remove any optimizations. Then you can launch an xterm and a debugger (idbe/gdb) for N processes on a devel node with:
mpirun -np N xterm -e gdb a.out
Parallel Programming
Please see the CITA introductory tutorial (http://www.cita.utoronto.ca/~merz/pi/), which contains a brief presentation on parallel computing, a simple tutorial, and links / suggestions on where to find further information.
A simple MPI util Module (f90) is available at [1] (http://www.astro.utoronto.ca/~zqhuang/academic/papers/MPIutils.f90). Also see [2] (http://www.astro.utoronto.ca/~zqhuang/academic/papers/hello-mpi.f90) to understand how to use this Module.
Using Open MPI
Open MPI (http://www.open-mpi.org) is the successor to LAM/MPI (http://www.lam-mpi.org). While we've had a good history with LAM/MPI, it is now in a maintenance state.
Once you are familiar with LAM, it's easy to start using Open MPI. Note the following:
- only requires the following module:
module load openmpi
This will load the current default version, but run module avail to see what other versions exist. Newer versions might work better for you.
- there are no lamboot/lamhalt commands, mpirun starts and shuts down everything automatically. man mpirun for more details.
- make sure to use the openmpi version of fftw v2.1.5
If you'd like to use the intel compilers, set the Open MPI wrapper compilers to use them in your ~/.tcshrc or ~/.bashrc
setenv OMPI_MPICC icc
setenv OMPI_MPICXX icpc
setenv OMPI_MPIFC ifort
setenv OMPI_MPIF77 ifort
Using Mpich2
NOTE: In the latest upgrade, we have not built the mpich2 module. If you need this flavour of MPI in particular, please send a note to requests@cita.
mpich2 (http://www-unix.mcs.anl.gov/mpi/mpich2/) is another MPI that is installed on the cluster. If you experience problems with LAM/Open MPI, then you can try mpich2 to see if it will work for your code.
- requires the following module (make sure to unload mpi/lam/openmpi first):
module load mpich
- requires an "mpd ring" which is a set of daemons that runs on each node and facilitates MPI.
- this is invoked using the mpdboot command, and cleaned up with mpdallexit
- One also needs to create the mpd configuration file, ~/.mpd.conf. This file should contain an entry like the following:
MPD_SECRETWORD=notthatsecret
- Note that you should not use your login password for this - it can be any string and is basically irrelevant.
- Also note that the mpich commands are broken (mpdboot and mpirun in particular).
- One needs to specify the full path to any files (executables or hostfiles), as it won't find them in the present working directory.
To debug, log into a devel node and run:
mpif77 file.f90
mpdboot
# note the full path to a.out
mpirun -n 8 /home/merz/mpich/a.out
mpdallexit
A sample batch script for using mpich2 with PBS is:
#!/bin/csh
#PBS -l nodes=8:ppn=8
#PBS -q workq
#PBS -r n
#PBS -l walltime=00:35:00
set NNODES=8
set NCPUS=64
cd $PBS_O_WORKDIR
cat $PBS_NODEFILE > nodes
foreach NN (`cat $PBS_NODEFILE | uniq`)
  echo `echo $NN | cut -f1 -d.` >> machinefile.$PBS_JOBID
end
mpdboot -n $NNODES -f $PBS_O_WORKDIR/machinefile.$PBS_JOBID -v
mpiexec -n $NCPUS $PBS_O_WORKDIR/a.out
mpdallexit
unset NNODES
unset NCPUS
Using ClusterOpenMP
ClusterOpenMP is potentially a quick way to write parallel code, as long as care is taken not to access memory randomly.
Here is Neal Dalal's experience:
On Wed, Mar 11, 2009 at 01:56:17PM -0400, Neal Dalal wrote:
Hi Mike,
Intel describes cluster-openmp at the web page:
http://software.intel.com/en-us/articles/cluster-openmp-for-intel-compilers/
That page has a detailed user manual on it.
On sunnyvale, here's what I did to use cluster-openmp.
0. First you need a license file from Intel. I have a license in
~neal/intel/licenses/CLOMP_20100630.lic , until you guys get a
system-wide license.
1. Compiling your openmp code with cluster-openmp then just involves
replacing -openmp with -cluster-openmp. For example, I compiled the
following hello_world code with:
icc -cluster-openmp hello.c -o hello
______________
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <omp.h>
int main()
{
int i, n=100;
char *string;
#pragma omp parallel default(none) private(i,string) shared(n)
{ // begin parallel region
string = malloc(n);
gethostname(string,n);
i = omp_get_thread_num();
printf("hello from thread %d on host %s\n", i, string);
free(string);
} // end parallel region
return 0;
}
______________
2. To run the program, you need a file kmp_cluster.ini in the same
directory which describes how many threads to use, which hosts to run on,
etc. Here are the contents of the file that I used:
[neal@tpb5 test]$ cat kmp_cluster.ini
--process_threads=3 --hostfile=hosts --launch=ssh --sharable_heap=100M
This says that the list of hosts to use is in the file "hosts" in the
current directory, and that 3 threads per process should be spawned. It
also specifies that 100MB on each node is reserved for "shared" memory.
3. Here's an example pbs script:
[neal@tpb5 test]$ cat test.pbs
#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -q workq
#PBS -r n
#PBS -l walltime=1:00
cd $PBS_O_WORKDIR
uniq < $PBS_NODEFILE > hosts
./hello > hello.out
____________
This first generates the "hosts" file which lists the hosts to run on,
and then runs the program.
Here is the output:
hello from thread 0 on host tpb5
hello from thread 2 on host tpb5
hello from thread 1 on host tpb5
hello from thread 3 on host tpb4
hello from thread 5 on host tpb4
hello from thread 4 on host tpb4
_____________
I hope this helps. There's a lot more information on the web page that I
mentioned above. For example, you can start everything up using mpirun,
instead of making the "hosts" file.
best,
Neal
Loading modules inside non-PBS scripts
Loading modules inside scripts called interactively does not work. Add this line to make it work:
source /etc/profile.d/00-modules.sh
Networking
thin tree
- all 40 nodes on a given rack connect to a 48-port gigabit switch
- racks are linked together by connecting the 5 rack switches to a top-level switch via 4-port trunked links
- all disks, except for the ones named /mnt/scratch/local or /mnt/scratch/rack?, are effectively connected to the top-level switch via a single gigabit link
This has the following implications for network performance:
- you get full, non-blocking gigabit networking if all the nodes in your job are on the same rack
- there is a significant bandwidth hit if the nodes in a job span 2 or more racks;
- e.g. the 40 nodes on rack 1 effectively share a single 4Gb link to all other nodes in the cluster.
- any and all nodes which read or write to a non-local disk (like /cita/d/scratch-3month) are sharing a single gigabit connection
- read/write performance is almost certainly going to be maximized if you can make use of the scratch disks that are installed in each rack (e.g. /mnt/scratch/rack?).
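The bandwidth implication of the shared rack uplink can be made concrete with a short calculation. This is a best-case sketch under stated assumptions: it considers only the thin tree (a 4 Gb trunk per rack and a 1 Gb link per node, as listed above), assumes the uplink is shared evenly, and ignores protocol overhead. The function name offrack_share_gbps is hypothetical.

```python
RACK_UPLINK_GBPS = 4.0  # trunked link from a rack switch to the top-level switch
NODE_LINK_GBPS = 1.0    # gigabit interface on each node


def offrack_share_gbps(active_nodes):
    """Best-case off-rack bandwidth per node when active_nodes share one rack uplink."""
    if active_nodes < 1:
        raise ValueError("active_nodes must be at least 1")
    # A node can never exceed its own link, however idle the uplink is.
    return min(NODE_LINK_GBPS, RACK_UPLINK_GBPS / active_nodes)


for n in (4, 8, 40):
    print(f"{n:2d} nodes talking off-rack: {offrack_share_gbps(n) * 1000:.0f} Mb/s each")
```

Up to four nodes can each use their full gigabit link off-rack. With all 40 nodes on a rack communicating off-rack at once, each gets roughly 100 Mb/s.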
fat mesh
There is also a fat_mesh network that can be accessed by using the hallowed beer names. This network is composed of switches attached to groups of nodes (5 groups of 8) on each rack (i.e., perpendicular to the thin tree). Each of the switches is fully interconnected with the other 4 fat_mesh switches via 10gigE connections. In theory this should improve off-switch performance by increasing the bandwidth between racks.
The beer names resolve to the thin tree addresses for nodes on the same rack, but use the fat mesh for nodes that exist on other racks.
Which node is in which rack?
- rack 1: tpb 1-40
- rack 2: tpb 41-80
- rack 3: tpb 81-120
- rack 4: tpb 121-160
- rack 5: tpb 161-200
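Since the nodes are numbered consecutively with 40 per rack, the rack can be computed from the host name. The Python sketch below does this and reports which racks a list of nodes spans. The function names rack_of and racks_spanned are hypothetical helpers, not installed cluster commands.

```python
import re

NODES_PER_RACK = 40
TOTAL_NODES = 200


def rack_of(hostname):
    """Return the rack number (1-5) for a host such as 'tpb78' or 'tpb78.sunnyvale'."""
    match = re.match(r"tpb(\d+)(\.|$)", hostname)
    if not match:
        raise ValueError(f"not a tpb host name: {hostname!r}")
    number = int(match.group(1))
    if not 1 <= number <= TOTAL_NODES:
        raise ValueError(f"node number out of range: {number}")
    return (number - 1) // NODES_PER_RACK + 1


def racks_spanned(hostnames):
    """Return the sorted list of racks used by a collection of host names."""
    return sorted({rack_of(h) for h in hostnames})


print(racks_spanned(["tpb38", "tpb39", "tpb40", "tpb41"]))
```

In this example the job spans racks 1 and 2, because tpb41 is the first node of rack 2.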
A Note On Latency
As you might guess from the network configuration, the MPI latency between two nodes depends on where they sit relative to each other.
For nodes on the same rack or the same fat mesh stripe, point-to-point latency is around 0.0001 seconds. Between nodes on different stripes and different racks, the latency increases slightly to around 0.000107s.
When two nodes are under full load and saturating their network links with MPI communication, the latency jumps to about 0.000160s.
These figures were obtained by averaging the latencies of millions of MPI sends.
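To see what these figures mean for a code that sends many small messages, the sketch below multiplies the quoted latencies by a message count. It is a rough model under stated assumptions: sends are issued one after another, and bandwidth and any overlap of communication with computation are ignored. The scenario labels and the function name latency_seconds are hypothetical.

```python
# Point-to-point latencies quoted in this section, in microseconds.
LATENCY_US = {
    "same_rack_or_stripe": 100,
    "different_rack_and_stripe": 107,
    "full_load": 160,
}


def latency_seconds(n_messages, scenario):
    """Total time spent on latency alone for n_messages sequential small sends."""
    return n_messages * LATENCY_US[scenario] / 1e6


for name in LATENCY_US:
    print(f"{name}: {latency_seconds(1_000_000, name):.0f} s per million sends")
```

Under this model a million sequential sends cost about 100 s of pure latency on one rack, 107 s across racks, and 160 s under full load, so placement matters much less than network saturation.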
Codes
Please feel free to make your own wiki page for any codes that you have run on Sunnyvale -- these hints and performance results are useful to other users!
