Sunnyvale
From CITA Computing
| Table of contents |
About
Sunnyvale is the beowulf cluster at the Canadian Institute for Theoretical Astrophysics. It is composed of ~190 Dell PE1950 compute nodes and 18 Dell R410 nodes. PE1950 nodes contains 2 quad core Intel(R) Xeon(R) E5310 @ 1.60GHz processors, 4GB of RAM, and 2 gigE network interfaces with an 80GB disk for swap and local scratch storage. R410 nodes have 8 Intel Xeon E5560 2.13GHz cores, 24GB of RAM and 2 gigE network interfaces with ~100GB for local scratch.
Sunnyvale currently runs CentOS (http://www.centos.org/) 6.3 x86_64 GNU/Linux. Our server/compute node image runs from a diskless nfs read-only root partition using the oneSIS (http://onesis.org/index.php) system imager v2.0rc10 release + patches (see their mail archive). This entire software stack is free and open source, with no licensing restrictions.
Naming of the Sunnyvale cluster was inspired by Trailer Park Boys (http://en.wikipedia.org/wiki/Trailer_Park_Boys), a popular Canadian comedy tv/movie series.
User Documentation
Accounts
To request an account on Sunnyvale email requests(at)cita.utoronto.ca
Passwords
When your account is created, it will use the same password as your regular CITA account. To change your password, use the /cita/local/bin/passwd command on your CITA Desktop. It takes about an hour for password changes to migrate to CITA machines and Sunnyvale.
Logging In
Once you are on the CITA network, log into Sunnyvale via the 2 login nodes, bubbles or ricky. These nodes also serve as the cluster development nodes (aliased as devel1 & devel2 ) and can be used for compiling codes, running interactive debugging jobs, submitting batch jobs to the queueing system, and transferring data to/from cluster storage. These are the only cluster nodes which you should directly access.
Shell Environment
Home disks on Sunnyvale are not the same as the CITA workstation network, so you will have to copy any required files to your Sunnyvale /home. For convenience, you can access this filesystem on your desktops at the mount point /cita/d/sunny-home.
The default shell is /bin/tcsh but you may also use /bin/bash. If you would like to have your shell changed to /bin/bash please email requests.
Users also need to write their own shell configuration file. Do not copy your shell configuration from your /cita/home directory, since these may contain incompatible configuration statements. For a bare-bones configuration that maximizes shell limits and uses our default MPI (OpenMPI 1.6.1) and Intel compilers, one would put the following into their ~/.tcshrc:
unlimit limit coredumpsize 0 module load intel module load openmpi
Modules and System Software
All system-installed software is accessible through the use of the module command. A brief summary of how to use modules on Sunnyvale is explained on the Module page. Sunnyvale loads two default modules - torque and maui -- but all other modules must be loaded the user either manually, within their batch scripts or by default via their shell start-up definitions in ~/.tcshrc or ~/.bashrc.
Any requests for further software installation should be sent to requests@cita.
Storage
RAM
There is 4G RAM/8G swap on each of the PE1950 compute nodes tpb[1-200] and 24G RAM/48G swap on the R410 nodes tpb[201-218].
Disk
The lustre network filesystems scratch-3week, raid-cita, and raid-project are all accessible on Sunnyvale nodes. The mount prefix for these systems is /mnt e.g., /mnt/scratch-3week etc. Batch jobs should primarily run exclusively on the lustre filesystems.
/home
Home directories are currently capped with a 10GB quota.
It is highly recommended that users do not use /home for job output. Since it is served from the cluster head node, having large parallel jobs write to it simultaneously will cause performance issues for the entire cluster. There is no problem with short and infrequent accesses, like reading a small configuration file or submitting your job from it, for example.
/mnt/raid-cita, raid-project
Main page (http://www.cita.utoronto.ca/mediawiki/index.php/Disk_Space#raid-cita)
These are fast parallel filesystems that are available to Sunnyvale nodes. All users have at least 10 GB available on /mnt/raid-cita, except for the SRAs, post-docs and staff who have 2TB by default. If you have exceptional data requirements, please contact requests@cita and your quota can be increased. The space raid-project are overseen by CITA faculty and permission from individual researchers is necessary to get a directory on these filesystems.
You no longer need an other=raid-cita directive to mount disks.
/mnt/scratch-3week
Main page (http://www.cita.utoronto.ca/mediawiki/index.php/Disk_Space#scratch-3week)
This scratch is a large, fast parallel filesystem. It is meant to be used for temporary intermediate and final output of cluster jobs. This filesystem is for short term storage and postprocessing only and archival data should be migrated to raid-cita or raid-project
External storage
The desktop home spaces home-1, home-2, and home-3 are only visible on the head nodes.
/cita/scratch/$WORKSTATION
Workstation scratch directories are visible from the cluster compute nodes, but should only be accessed from the login nodes ricky and bubbles
/tmp
DO NOT use /tmp to store any files! This filesystem is part of the scratch disk on each node and has to have enough space free for various system daemons to function properly. Any files found in /tmp will be deleted without warning.
/mnt/node_scratch
Local scratch disks are available on all nodes at
/mnt/node_scratch/<username>
They are 67 GB in size, although check how much disk is available.
If you are I/O limited by the cluster file systems then you may be better off using these disks. Use sparingly, as this can severely degrade network performance.
Accessing
They can be accessed on other nodes as /mnt/scratch/<node>., where <node> is the host name of the node, e.g. tpb78.
File Deletion
You MUST tidy up after your jobs. rm all the local_scratch files that you create at the end of your job.
These scratch disks will be purged daily for all files older than _one_week_.
WARNING: if the disks are chronically filled, a more severe method will be used. The system will automatically remove all the user's files after the end of the job.
/mnt/scratch/tpbNNN
All nodes export their scratch disk /mnt/node_scratch to all other nodes. They can be accessed at /mnt/scratch/tpbNNN.
PBS
Sunnyvale uses the TORQUE batch system and the Maui job scheduler to process user jobs. This system allows users to submit a batch script (a shell script with a special header) that describes their computing work, which is then held in a queue and run non-interactively once sufficient resources are available.
This software is also used at SciNET, as such users should be able to port their batch scripts from Mckenzie to Sunnyvale with little modification.
PBS Limits
The batch system will limit jobs to 96 nodes (784 cores) if the cluster is busy and 128 nodes (1024 cores) if the cluster is idle.
Jobs are limited to 48 hours in duration.
If you have special job requirements please contact requests@cita.utoronto.ca to apply for a reservation on the cluster.
Submitting Jobs
Jobs are submitted to the queue by executing qsub $BATCH_SCRIPT_NAME.
The following is an example of a batch script for a parallel job that would use 40 cpus (5 nodes) for 6 hours using OpenMPI:
#!/bin/csh #PBS -l nodes=5:ppn=8 #PBS -q workq #PBS -r n #PBS -l walltime=06:00:00 #PBS -N my_job_name # EVERYTHING ABOVE THIS COMMENT IS NECESSARY, SHOULD ONLY CHANGE nodes,ppn,walltime and my_job_name VALUES cd $PBS_O_WORKDIR mpirun a.out
The first 5 lines are mandatory and must appear in the script. Set the number of nodes in nodes and processors per node in ppn. There are two queues: workq and fastq. Note: if you are running single core serial jobs all you need to do is set ppn=1 above and the batch system will find a node for your code to run on.
For users who just want to use the nodes for serial jobs, an example batch script asking for 1-core on 1-node might be as follows:
#!/bin/csh #PBS -l nodes=1:ppn=1 #PBS -q workq #PBS -r n #PBS -l walltime=06:00:00 #PBS -N my_job_name # EVERYTHING ABOVE THIS COMMENT IS NECESSARY, SHOULD ONLY CHANGE nodes,ppn,walltime and my_job_name VALUES cd $PBS_O_WORKDIR ./my_serial_programme my_arguments
In principle, you might submit dozens of jobs like this to complete a specific task.
There is a 48 hour limit for jobs running on Sunnyvale, but the walltime option can (and should) be set to less than this as the scheduler will advance your job to the top of the queue faster if it doesn't require as much wallclock time. Efficient job scheduling improves job throughput.
Since there are 8 cores / node users will want to specify ppn=8 if they would like to have nodes fully allocated to their own job. Submitting with ppn<8 will allow for multiple jobs to be assigned to the same node and could result in the memory being oversubscribed and both jobs failing and/or crashing the node. If your serial jobs are memory limited a good rule of thumb is to specify ppn=(total job memory / 485MB) to prevent this. If each process in your run requires more than 1/8th the memory, the trick is to use the -loadbalance flag with OpenMPI jobs, e.g., say your code could only used 3 cores/node because of memory requirements and you wanted to submit a 8-node/24-core job. You would then specify ppn=8 in the batch script but submit the MPI job with:
mpirun -np 24 -loadbalance your_prog ...
Based on the current networking configuration, users may desire to have all of their processes run on the same rack. All the nodes (40) which exist on the same rack are connected directly to a single switch and as such they will exhibit better networking performance than a job which spans racks (and hence switches). This can be accomplished by adding the rack designation to your batch script, ie:
#PBS -l nodes=20:ppn=8:rack2
would get you 20 nodes on rack2. Keep in mind that restricting your job to a single rack will likely cause it to spend longer in the queue before the nodes on the particular rack become available. (NOTE: this may not work any longer under the current configuration and you might be better off simply letting the system allocate the appropriate collection of nodes for your job automatically.)
The following PBS environment variables are useful for tailoring batch scripts:
Variables that contain information about job submission:
PBS_O_HOST The host machine on which the qsub command was run. PBS_O_LOGNAME The login name on the machine on which the qsub was run. PBS_O_HOME The home directory from which the qsub was run. PBS_O_WORKDIR The working directory from which the qsub was run.
Variables that relate to the environment where the job is executing:
PBS_ENVIRONMENT This is set to PBS_BATCH for batch jobs and to PBS_INTERACTIVE for interactive jobs. PBS_O_QUEUE The original queue to which the job was submitted. PBS_JOBID The identifier that PBS assigns to the job. PBS_JOBNAME The name of the job. PBS_NODEFILE The file containing the list of nodes assigned to a parallel job.
Queues
Sunnyvale has several batch queues.
workq: General purpose queue with approximately 190 Dell Power Edge 1950 nodes with 8 cores and 4GB of memory. Most jobs are submitted here.
fastq: This is a queue made of newer and faster Dell R410 8-core nodes with 24GB of memory. This queue is suitable for jobs requiring more memory per node and slightly greater speed.
Remember to add the line e.g., #PBS -q fastq" in your batch script so your job is submitted to these queues.
Submitting Interactive Jobs
To submit an interactive job for 2 nodes, execute the following on a login node:
qsub -I -l nodes=2:ppn=8
This will open a shell on the first node of your job. To see which nodes you have been assigned:
cat $PBS_NODEFILE
One can then start a lamboot by hand, debug, etc. interactively. The job will terminate once you log out of the node that you logged in on, or once the walltime limit has been exceeded.
NOTE: Please use this method sparingly! Most jobs should be submitted non-interactively. If users leave nodes idle while using them interactively the policy on Sunnyvale will change, as it is a shared resource and tying up nodes is unfair to the other users.
Submitting Jobs on Reserved Nodes
If you want a particular job to run on a reservation made for you then make sure to include the following flag
-W x=FLAGS:ADVRES:{$RESERVATION NAME}
Monitoring Jobs
bobMon System Monitor
The web-based graphical System_Monitor is now available for Sunnyvale. If you are local, point your browser to bobMon (http://doug.cita.utoronto.ca/bobMon).
If you are logging in remotely you will have to create an ssh tunnel to barb to forward the http port. To do so, when you ssh into gw.cita.utoronto.ca add the following to your ssh command: ssh gw.cita.utoronto.ca -L 10101:sunnyvale:80 -- this connection has to stay active, so don't close it. Then you should be able to view it in your local browser with tunneled bobMon (http://localhost:10101/bobMon/)
Command Line Monitoring
Upon submission users can monitor their jobs in the queue using qstat -a. The showbf command will show resource availability, and the showstart command will suggest when a particular job will begin. Additional queue and total cluster node usage can be displayed with showq. NOTE: there may be a discrepancy with the number of CPUs used vs. the real cluster ussage. This is a bug with showq and the particular method some individuals requests jobs. The scheduler should work as advertised.
A brief synopsis of all the above is available by running freeNodes.
Job Identifier and Deletion
Each job submitted to the queue is assigned a $PBS_JOBID. This is the number on the left-handside of the qstat output. If a user would like to stop a running job or delete it from the queue then they should run qdel $PBS_JOBID.
PBS Output Files
After a job has finished PBS will place an output file and an error file in the directory where the job was submitted. These files are named job_name.o$PBS_JOBID and job_name.e$PBS_JOBID, where the "job_name" is the value supplied with #PBS -N job_name in the batch script.
The *.e* file will contain any output produced on standard error by the job (that was not redirected inside the batch script), and the *.o* file will contain the standard output. In addition, the pbs output file will list the amount of resources requested for the job, the amount of resources used, and a copy of the batch script that was submitted to run the job.
If you don't recieve a PBS job report it is likely that the disk where your script was launched was unavailable when the job terminated ( /cita/d/scratch-3month in particular). These can be found in /var/spool/pbs/undelivered/ on either of the login nodes. It is suggested that users submit all batch jobs from their /home directories to avoid this. If you'd still like your job to run from a different disk then put a change directory command at the start of your script (after the #PBS header!).
What nodes are my job using?
If your job is actively running one can get a list of nodes it is using with jobNodes PBS_JOBID
Alternatively one can put cat $PBS_NODEFILE in their batch script, which will print out the list of nodes that the job is using in the PBS standard output file, and this can then be used.
How do I check the status of my nodes?
One can get a brief summary for each node at the command line by issuing gstat -a | head -10; gstat -a | grep -A1 tpb<tpb#> for each of the nodes (<tbp#> is just the numerical portion of the nodes hostname, ie tpb123.sunnyvale -> <tpb#>=123 ).
It is suggested that users monitor their jobs via bobMon, but if you'd like detailed information you can look at the ganglia (http://julian/ganglia) data for each node. If you are not on the cita network you will have to create an ssh-tunnel to port 80 on julian, as is indicated for bobMon, and using this ganglia via tunneling (http://localhost:10101/ganglia).
Programming Issues
64 bit-isms
Sunnyvale is an em64t/amd64/x86_64 (64 bit) platform. Be aware of the following:
- FFTW plan variables must be 8-bytes
- ld: skipping incompatible message during linking
- the linker is finding 32 bit rather than the required 64 bit libraries. Most or all of the 64bit libraries should be found in /usr/lib64/
- relocation truncated to fit: R_X86_64_PC32
- this only seems to be a problem with the Intel compiler.
- can occur when trying to compile code that contains >2G of static allocations with the Intel compilers. The suggested fix is to compile with -shared-intel
- if this doesn't work, you may have to change the mcmodel as well. (man ifort, search for mcmodel). -mcmodel=medium should suffice for all cases.
Compilers
Intel (ifort/icc/icpc), GCC3 (g77/gcc/g++) and GCC4 (gfortran/gcc4/g++4) compilers are installed. Different versions of the intel compiler will get their own module.
The following compiler flags may be of use:
-O3 for full optimizations, -O0 for none -[a]xT (intel only) for processor specific optimizations -CB (ifort) -fbounds-check (gfortran) for runtime array bound checking -openmp (intel) -fopenmp (GCC4) to use OpenMP 2.0
Run man $COMPILER_NAME for further details.
The ifort version 10.0 beta documentation can be found here (http://www.cita.utoronto.ca/~merz/intel_f10b/Doc_Index.htm).
The icc/icpc version 10.0 beta documentation can be found here (http://www.cita.utoronto.ca/~merz/intel_c10b/Doc_Index.htm).
Debugging
Start by compiling your code with the -g flag to produce a symbol table. It's also a good idea to remove any optimizations. Then you can launch an xterm and a debugger (idbe/gdb) for N processes on a devel node with:
mpirun -np N xterm -e gdb a.out
Parallel Programming
Please see the CITA introductory tutorial (http://www.cita.utoronto.ca/~merz/pi/), which contains a brief presentation on parallel computing, a simple tutorial, and links / suggestions on where to find further information.
A simple MPI util Module (f90) is available at [1] (http://www.astro.utoronto.ca/~zqhuang/academic/papers/MPIutils.f90). Also see [2] (http://www.astro.utoronto.ca/~zqhuang/academic/papers/hello-mpi.f90) to understand how to use this Module.
Using Open MPI
Open MPI (http://www.open-mpi.org) is the successor to LAM/MPI (http://www.lam-mpi.org). While we've had a good history with LAM/MPI, it is now in a maintenance state.
Once you are familiar with LAM it's easy to start Using Open MPI. Note the following:
- only requires the following module:
module load openmpi
This will load the current default version but do a module list to see what other versions may exist. Newer versions might work better for you.
- there are no lamboot/lamhalt commands, mpirun starts and shuts down everything automatically. man mpirun for more details.
- make sure to use the openmpi version of fftw v2.1.5
If you'd like to use the intel compilers, set the Open MPI wrapper compilers to use them in your ~/.tcshrc or ~/.bashrc
setenv OMPI_MPICC icc setenv OMPI_MPICXX icpc setenv OMPI_MPIFC ifort setenv OMPI_MPIF77 ifort
Using Mpich2
NOTE: In the latest upgrade, we have not built the mpich2 module. If you need this flavour of MPI in particular, please send a note to requests@cita.
mpich2 (http://www-unix.mcs.anl.gov/mpi/mpich2/) is another MPI that is installed on the cluster. If you experience problems with LAM/Open MPI, then you can try mpich2 to see if it will work for your code.
- requires the following module (make sure to unload mpi/lam/openmpi first):
module load mpich
- requires an "mpd ring" which is a set of daemons that runs on each node and facilitates MPI.
- this is invoked using the mpdboot command, and cleaned up with mpdallexit
- One also needs to create the mpd configuration file, ~/.mpd.conf. This file should contain an entry like the following:
MPD_SECRETWORD=notthatsecret
- Note that you should not use your login password for this - it can be any string and is basically irrelevant.
- Also note that the mpich commands are broken (mpdboot and mpirun in particular).
- One needs to specify the full path to any files (executables or hostfiles), as it won't find them in the present working directory.
To debug, log into a devel node and;
mpif77 file.f90 mpdboot # note the full path to a.out mpirun -n 8 /home/merz/mpich/a.out mpdallexit
A sample batch script for using mpich2 with PBS is:
#!/bin/csh #PBS -l nodes=8:ppn=8 #PBS -q workq #PBS -r n #PBS -l walltime=00:35:00 set NNODES=8 set NCPUS=64 cd $PBS_O_WORKDIR cat $PBS_NODEFILE > nodes foreach NN (`cat $PBS_NODEFILE | uniq`) echo `echo $NN | cut -f1 -d.` >> machinefile.$PBS_JOBID end mpdboot -n $NNODES -f $PBS_O_WORKDIR/machinefile.$PBS_JOBID -v mpiexec -n $NCPUS $PBS_O_WORKDIR/a.out mpdallexit unset NNODES unset NCPUS
Using ClusterOpenMP
NOTE: This is not set up with the current Sunnyvale configuration. (Dec-2012)
potentially a quick way to write parallel code, as long as care is taken to not randomly access memory.
here is Neal Dalal's experience:
On Wed, Mar 11, 2009 at 01:56:17PM -0400, Neal Dalal wrote:
Hi Mike,
Intel describes cluster-openmp at the web page:
http://software.intel.com/en-us/articles/cluster-openmp-for-intel-compilers/
That page has a detailed user manual on it.
On sunnyvale, here's what I did to use cluster-openmp.
0. First you need a license file from Intel. I have a license in
~neal/intel/licenses/CLOMP_20100630.lic , until you guys get a
system-wide license.
1. Compiling your openmp code with cluster-openmp then just involves
replacing -openmp with -cluster-openmp. For example, I compiled the
following hello_world code with:
icc -cluster-openmp hello.c -o hello
______________
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main()
{
int i, n=100;
char *string;
#pragma omp parallel default(none) private(i,string) shared(n)
{ // begin parallel region
string = malloc(n);
gethostname(string,n);
i = omp_get_thread_num();
printf("hello from thread %d on host %s\n", i, string);
free(string);
} // end parallel region
return 0;
}
______________
2. To run the program, you need a file kmp_cluster.ini in the same
directory which describes how many threads to use, which hosts to run on,
etc. Here are the contents of the file that I used:
[neal@tpb5 test]$ cat kmp_cluster.ini
--process_threads=3 --hostfile=hosts --launch=ssh --sharable_heap=100M
This says that the list of hosts to use is in the file "hosts" in the
current directory, and that 3 threads per process should be spawned. It
also specifies that 100MB on each node is reserved for "shared" memory.
3. Here's an example pbs script:
[neal@tpb5 test]$ cat test.pbs
#!/bin/bash
#PBS -l nodes=2:ppn=8
#PBS -q workq
#PBS -r n
#PBS -l walltime=1:00
cd $PBS_O_WORKDIR
uniq < $PBS_NODEFILE > hosts
./hello > hello.out
____________
This first generates the "hosts" file which lists the hosts to run on,
and then runs the program.
Here is the output:
hello from thread 0 on host tpb5
hello from thread 2 on host tpb5
hello from thread 1 on host tpb5
hello from thread 3 on host tpb4
hello from thread 5 on host tpb4
hello from thread 4 on host tpb4
_____________
I hope this helps. There's a lot more information on the web page that I
mentioned above. For example, you can start everything up using mpirun,
instead of making the "hosts" file.
best,
Neal
Networking
NOTE: There have been a lot of network changes since the time this part of the wiki was written - most of this info is obsolete so please disregard it. We keep it here as a reference for admins.
thin tree
- all 40 nodes on a given rack connect to a 48-port gigabit switch
- racks are linked together by connecting the 5 rack switches to a top-level switch via 4-port trunked links
- all disks, except for the ones named /mnt/scratch/local or /mnt/scratch/rack?, are effectively connected to the top-level switch via a single gigabit link
This has the following implications for network performance:
- you get full, non-blocking gigabit networking if all the nodes in your job are on the same rack
- there is a significant bandwidth hit if the nodes in a job span 2 or more racks;
- e.g. the 40 nodes on rack 1 effectively share a single 4Gb link to all other nodes in the cluster.
- any and all nodes which read or write to a non-local disk (like /cita/d/scratch-3month) are sharing a single gigabit connection
- read/write performance is almost certainly going to be maximized if you can make use of the scratch disks thatare installed in each rack (e.g. /mnt/scratch/rack?).
fat mesh
There is also a fat_mesh network that can be accessed by using the hallowed beer names. This network is composed of switches attached to groups of nodes (5 groups of 8) on each rack (ie, perpendicular to the thin tree). Each of the switches is fully-interconnected with the other 4 fat_mesh switches via 10gigE connections. In theory this should improve off-switch performance by increasing the bandwidth between racks.
The beer names resolve to the thin tree addresses for nodes on the same rack, but use the fat mesh for nodes that exist on other racks.
Which node is in which rack?
- rack 1: tpb 1-40
- rack 2: tpb 41-80
- rack 3: tpb 81-120
- rack 4: tpb121-160
- rack 5: tpb161-200
A Note On Latency
As you might guess by reading the network configurations you can get different MPI latencies between different nodes relative to where they are to each other.
When dealing with nodes on the same rack or same fat mesh stripe you are looking at around 0.0001 seconds of latency for point to point communication. Going between nodes on different stripes and different racks the latency slightly increases to around 0.000107s.
When two nodes under full load and maxing out their network with MPI communication the latency jumps to about 0.000160s.
These tests were conducted by averaging out the latencies of millions MPI sends.
Codes
Please feel free to make your own wiki page for any codes that you have run on Sunnyvale -- these hints and performance results are useful to other users!
