GLOW Cluster Users Guide
Logging in to the cluster
Access to the cluster is allowed through ssh and scp to the "submit node" only. To log in, use:
Submitting Jobs to the GLOW pool using Condor
All jobs must be submitted to the "compute nodes" through the Condor job scheduler only. Condor can schedule both serial and parallel jobs. Serial jobs can run in either the vanilla universe or the standard universe. Ideally, run in the standard universe to take advantage of Condor's checkpointing, job migration, and automatic file transfer features. To enable this, use the condor_compile command to link or build your code. For example:
condor_compile g77 -o bigcode bigcode.f
condor_compile gcc -o program sub1.o sub2.o main.o
Now you can submit your job using the condor_submit command and a submit description file, for example:
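A minimal sketch of what such a submit description file might look like for the bigcode binary built above. The file names bigcode.sub, bigcode.out, bigcode.err, and bigcode.log are illustrative assumptions, not names used on the cluster; the Condor manual remains the reference for the available submit commands.

```shell
# Write a minimal submit description file for a standard-universe job.
# "bigcode" is the binary linked with condor_compile; the other file
# names are placeholders chosen for this sketch.
cat > bigcode.sub <<'EOF'
universe   = standard
executable = bigcode
output     = bigcode.out
error      = bigcode.err
log        = bigcode.log
queue
EOF

# Hand the description file to Condor (skipped where Condor is not installed).
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit bigcode.sub
fi
```

The log file records job events such as checkpoints and evictions, so it is usually the first place to look when a job seems stuck.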
Example submission scripts are in the Condor manual. Additional sample scripts can be viewed below:
If you do not have access to source code or object files, then you may run your executable code in the vanilla universe. In this case, your job cannot checkpoint and migrate if another job of higher priority preempts it, and you must explicitly list the files that need to be transferred with your executable.
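As a rough sketch of a vanilla-universe submit description file with explicit file transfer: the executable name myprog and the input file input.dat are hypothetical, and the example assumes a Condor version that supports the should_transfer_files and when_to_transfer_output commands (older releases used a different syntax, so check the manual for the installed version).

```shell
# Write a vanilla-universe submit file that names the files to transfer.
# "myprog" and "input.dat" are placeholders for this sketch.
cat > myprog.sub <<'EOF'
universe                = vanilla
executable              = myprog
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
transfer_input_files    = input.dat
output                  = myprog.out
error                   = myprog.err
log                     = myprog.log
queue
EOF

# Submit only where Condor is actually available.
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit myprog.sub
fi
```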
For running MATLAB (v6r13) jobs, a special wrapper script has been created to ensure that research groups use only their own group's licenses. Once that wrapper script is enabled for a given user or group, they can submit jobs using this example.
MPI jobs using Condor
Parallel MPI jobs are submitted in the MPI universe and use only compute nodes that are "dedicated" (currently our 95 local compute nodes). There is no need to relink your code; simply follow your normal MPI development process. Currently Condor supports only the MPICH-1.2.4 implementation of MPI. Sample scripts for running:
Note that in Condor, "machine_count" refers to the number of CPUs.
Compiling MPICH code
All code is compiled on the master node using the standard MPICH procedure for C, Fortran 77, and C++:
mpicc -o foo foo.c
mpif77 -o foo foo.f
mpiCC -o foo foo.C
Since there are several versions of MPICH on the submit node, you must be sure your PATH environment variable is set to the version of MPICH you're interested in using. By default, the PATH is set to the GNU compiled version of MPICH v1.2.4. If you want the Portland Group compiled version you should set the following in your .bashrc file:
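As a sketch of the general pattern only: the lines below prepend an MPICH bin directory to PATH. The install prefix /usr/local/mpich-1.2.4-pgi is a hypothetical placeholder, not the confirmed location of the Portland Group build; check the actual directory name on the submit node before using it.

```shell
# Hypothetical install prefix for the Portland Group build of MPICH;
# substitute the real directory found on the submit node.
MPICH_HOME=/usr/local/mpich-1.2.4-pgi

# Prepend its bin directory so its mpicc/mpif77/mpiCC are found first.
export PATH="$MPICH_HOME/bin:$PATH"

# Show which compiler wrapper will now be picked up, if any.
command -v mpicc || echo "mpicc not found on PATH"
```

Because the shell searches PATH from left to right, prepending (rather than appending) is what makes this version take precedence over the default GNU build.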
Complete documentation for MPICH (in HTML format) and its associated tracing/visualization tools is available locally in /usr/local/mpich-1.2.4/www/index.html. A PostScript version of the MPICH user's guide is available in /usr/local/mpich-1.2.4/doc/. Complete UNIX style man pages are also installed. More information, including tutorials, can be found on the MPICH web site at www.unix.mcs.anl.gov/mpi/mpich/
MPI jobs are submitted with the condor_submit command as above, but now the submit description file uses the MPI universe.
- An example submit file
- An example for running the mpi version of mcnp5 which has been modified to run under Condor
Note again that in Condor, "machine_count" refers to the number of CPUs.
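For orientation, a minimal sketch of an MPI-universe submit description file for the foo binary compiled above. The file names and the CPU count of 4 are illustrative assumptions; the $(NODE) macro, which gives each MPI process its own output file, follows the pattern shown in the Condor manual for the MPI universe.

```shell
# Write an MPI-universe submit file requesting 4 CPUs.
# The heredoc delimiter is quoted so the shell leaves $(NODE) untouched
# for Condor to expand.
cat > foo_mpi.sub <<'EOF'
universe      = MPI
executable    = foo
machine_count = 4
output        = foo.out.$(NODE)
error         = foo.err.$(NODE)
log           = foo.log
queue
EOF

# Submit only where Condor is actually available.
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit foo_mpi.sub
fi
```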
Complete documentation for all Condor and MPI commands is available using the standard UNIX/LINUX style man pages (e.g.
Checking the Status of a Condor Job
To check the status of your jobs in the Medical Physics queue type:
condor_q
Note that job startup may take several minutes at times. Please be patient while the scheduler finds free compute nodes.
To check the status of all jobs in the pool type:
condor_q -global
To check how many "busy" medical physics cpus are being used by non-medical physics users:
condor_status -constraint 'Subnet == "128.104.3" && Activity == "Busy"' -v | grep '^RemoteUser =' | grep -v medphys | wc -l
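To see what the filter stages of that pipeline do, the same grep and wc steps can be run on a few made-up lines. The user names and host names below are invented for illustration and are not real pool output.

```shell
# Feed invented sample lines through the same filters used above.
count=$(printf '%s\n' \
    'RemoteUser = "alice@medphys.example.edu"' \
    'RemoteUser = "bob@cs.example.edu"' \
    'Machine = "node01.medphys.example.edu"' \
    'RemoteUser = "carol@physics.example.edu"' |
    grep '^RemoteUser =' |   # keep only the job-owner lines
    grep -v medphys |        # drop medical physics users
    wc -l)                   # count what is left

# $((count)) strips any padding that wc adds on some systems.
echo "busy cpus used by outside users: $((count))"
```

The Machine line is discarded by the first grep even though it contains "medphys", which is why the pipeline anchors on the RemoteUser attribute before excluding local users.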
To check your priority and total usage type:
condor_userprio -pool glow.cs.wisc.edu -all
To check which compute nodes are actually running your jobs:
condor_status -pool glow.cs.wisc.edu -constraint 'RemoteUser == "email@example.com"' -v | grep ^'Machine'
Other useful commands are in freqcommands.
Terminating Condor Jobs
Jobs may be deleted from the queues using:
condor_rm jobid
jobid is the Condor job identifier (e.g. 121.0)
Note that deleting a running job may take several minutes. Please be patient.
Filesystem and Backups
The user filesystem for the cluster is the /home directory. Please make an effort to keep this filesystem clean. Because of limited resources, there is no backup system for user files. All users are responsible for deciding which files are important to them and for backing those files up on their own.