“Large data analysis on someone else’s computers”
For you to know the basics of scripting your analysis
For you to know it is not that hard to get started with HPC
Running out of memory on local computer 👈 today
Analysis takes too long on local computer 👈 today
Data files too large to host on local computer
Makes it easier to collaborate within the group, e.g. common data folder
We can leave it running for days
What’s your reason?
A very large number of Linux computers 👈 today
All computers have identical configuration and software 👈 today
All computers have access to a shared file system
Multiple users are using the cluster at the same time
Not magic!
Higher total throughput - not necessarily lower latency
| | Wynton HPC | C4 |
|---|---|---|
| Since | 2018 | 2020 |
| For whom? | All of UCSF | Cancer Center affiliates |
| Number of users | ~1,400 | ~300 |
| Number of computers | ~500 | ~40 |
| Number of cores | ~17,500 | ~2,800 |
| Memory (RAM) | 48–1,512 GiB | 32–1,024 GiB |
| Free disk space | 500 GB/user | 1,000 GB/user |
| Communal computers | 100% | 25% |
| Paying contributors | VIP priority | dedicated machines |
| Software | core + shared + DIY | core + shared + DIY |
| Linux | Rocky 8 | Rocky 8 |
| Automatic backups | home directory (only) | home directory (only) |
| GPUs | yes | yes |
| Web browser GUI | no | soon (reviving OnDemand) |
| Documentation | https://wynton.ucsf.edu/ | https://www.c4.ucsf.edu/ |
Basically everything you will do on a compute cluster involves scripted instructions.
For example, if you type `date` at the command-line prompt and press ENTER, the machine will reply with the current timestamp and then wait for your next command:
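For instance (the timestamp below is just an illustration):

```sh
$ date
Mon Nov  4 11:02:31 PST 2024
```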
We can use the `echo` command to output a message, e.g.
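(The exact message is up to us; "Hello world" is used throughout this section.)

```sh
$ echo "Hello world"
Hello world
```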
Instead of retyping the commands manually, we can put them in a text file using our favorite text editor. Here is a shell script `hello.sh` that contains the above two commands:
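For instance (the commented version later in this section adds a specific date format):

```sh
echo "Hello world"
date
```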
We can run these two lines using the `bash` command:
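For example (output and timestamp are illustrative):

```sh
$ bash hello.sh
Hello world
Mon Nov  4 11:02:35 PST 2024
```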
We can call it any number of times we'd like, e.g.
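(Timestamps below are illustrative.)

```sh
$ bash hello.sh
Hello world
Mon Nov  4 11:02:35 PST 2024
$ bash hello.sh
Hello world
Mon Nov  4 11:03:10 PST 2024
```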
All scripting languages support adding comments in the code. It's an excellent way to make notes for your future self, for your collaborators, and for that researcher who will build on your work two years from now. Do yourself a favor and add such notes!
“Running a script on a compute cluster”
is similar to
“Queue the script to be run on the next available compute node”
We’re going to focus on shell scripts and how to run them on a compute cluster.
We can run the script manually at the command line:
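For example (timestamp illustrative):

```sh
$ bash hello.sh
Hello world
Mon Nov  4 11:26:50 PST 2024
```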
Note, the above commands run sequentially, one after the other.
Instead of running it on the single machine we're logged into, we can run it on any of the hundreds of machines that are available.
To do this, we use `qsub` to submit it to a job queue:
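For example (the job ID is assigned by the scheduler and will differ):

```sh
[alice@dev1]$ qsub -cwd hello.sh
Your job 8529911 ("hello.sh") has been submitted
```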
Here I use the command-line option `-cwd` to tell it to run `hello.sh` in the current working directory. You almost always want to do this.
Then we wait for it to get started. We can use `qstat` to see the queue status of our jobs:
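For example (job ID, times, and other details are illustrative):

```sh
[alice@dev1]$ qstat
job-ID   prior    name       user    state  submit/start at      queue            slots
-----------------------------------------------------------------------------------------
8529911  0.00000  hello.sh   alice   qw     11/04/2024 11:26:58                   1
```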
Here `qw` means queued and waiting for a machine to become available.
The job scheduler will send jobs in the queue to available machines. The waiting time depends on how many users and jobs are already running and queued. It also depends on how many resources your job needs.
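Eventually the job gets dispatched to a compute node (details illustrative):

```sh
[alice@dev1]$ qstat
job-ID   prior    name       user    state  submit/start at      queue            slots
-----------------------------------------------------------------------------------------
8529911  0.07069  hello.sh   alice   r      11/04/2024 11:27:22  long.q@qb3-id3   1
```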
Here `r` means running. Now we need to wait …
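After a while, the job disappears from the listing (illustrative):

```sh
[alice@dev1]$ qstat
[alice@dev1]$
```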
Job is no longer listed, which means it finished.
But, now what? Where are the results?
A job scheduler directive: a specially formatted comment starting with `#$`, followed by `qsub` options.
```sh
#$ -cwd   # run in current working directory

# This script says hello to the world
# Author: Alice
# Date: 2024-11-04
echo "Hello world"

# Display current timestamp in format yyyy-mm-dd hh:mm:ss
date +"%F %T"
```
This way we can just do:
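i.e. submit without spelling out `-cwd` on the command line (job ID illustrative):

```sh
[alice@dev1]$ qsub hello.sh
Your job 8529913 ("hello.sh") has been submitted
```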
To run a non-shell script, for instance an R or a Python script, on a compute cluster, we need to create a little helper shell script that calls our R or Python script.
`random.R`:
```r
# This R script outputs a random number
# Author: Alice
# Date: 2024-11-04
cat("A random number in [0,1]:", runif(1), "\n")
```
To run this from the command line, we can use `Rscript`:
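For example (the random number will of course differ between runs):

```sh
$ Rscript random.R
A random number in [0,1]: 0.1538203
```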
`random.sh`:
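The wrapper can be very short; a minimal sketch (the exact content used in the course material may differ):

```sh
#$ -cwd            # run in the current working directory
Rscript random.R
```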
With the `random.sh` wrapper script, we can now run our R script `random.R` on the cluster.
For the sake of it, let’s submit it twice to the job scheduler:
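For example (job IDs are assigned by the scheduler and will differ):

```sh
[alice@dev1]$ qsub random.sh
Your job 8529912 ("random.sh") has been submitted
[alice@dev1]$ qsub random.sh
Your job 8529919 ("random.sh") has been submitted
```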
As usual, it will take some time before the jobs start:
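Right after submission both jobs sit waiting in the queue (details illustrative):

```sh
[alice@dev1]$ qstat
job-ID   prior    name        user    state  submit/start at      queue            slots
-----------------------------------------------------------------------------------------
8529912  0.00000  random.sh   alice   qw     11/04/2024 11:27:41                   1
8529919  0.00000  random.sh   alice   qw     11/04/2024 11:27:44                   1
```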
But eventually, one starts:
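For example (details illustrative):

```sh
[alice@dev1]$ qstat
job-ID   prior    name        user    state  submit/start at      queue            slots
-----------------------------------------------------------------------------------------
8529912  0.07069  random.sh   alice   r      11/04/2024 11:28:32  long.q@qb3-id3   1
8529919  0.00000  random.sh   alice   qw     11/04/2024 11:27:44                   1
```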
And then the other:
```sh
[alice@dev1]$ qstat
job-ID   prior    name        user    state  submit/start at      queue            slots
-----------------------------------------------------------------------------------------
8529912  0.07069  random.sh   alice   r      11/04/2024 11:28:32  long.q@qb3-id3   1
8529919  0.07069  random.sh   alice   r      11/04/2024 11:28:54  long.q@msg-id19  1
```
Note how the two jobs run on two different machines.
When the jobs are finished, any output generated by the script is stored in the corresponding `random.sh.o<job_id>` output file.
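For example, with the job IDs above (numbers are illustrative):

```sh
[alice@dev1]$ cat random.sh.o8529912
A random number in [0,1]: 0.2486574
[alice@dev1]$ cat random.sh.o8529919
A random number in [0,1]: 0.8376924
```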
Community Slack (faster)
Email to the cluster team (slower, because fewer people to answer questions)
Rule #1: There is no such thing as “a stupid question”!!!
Help the helper help you
Better with too much detail than too little
It is really hard to destroy things for others
(⚠️ but, you can `rm` your own files and files in shared folders)
Overload the development nodes - or use more job resources than you requested
When you're sloppy or lazy - make sure to read the online help, etc.
More seriously, sysadmins are great folks too, so never hesitate to ask for help or pointers
Wynton has 1,400 users and C4 has 300 users - many of them on Slack
The Slack channel is welcoming and friendly
Lots of users are willing to help out
There is almost certainly at least one other user who has the same question as you
Sooner than you think, you might be the one helping someone else out
Again, there are no “silly” questions - it’s only silly if you don’t ask!
Example: An R script `hello.R` containing:
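A minimal sketch of what such a script might look like, mirroring the shell example above:

```r
cat("Hello world\n")
Sys.time()
```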
We can run these two lines using:
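presumably `Rscript`, as for `random.R` (output matches the sketch above and is illustrative):

```sh
$ Rscript hello.R
Hello world
[1] "2024-11-04 11:02:31 PST"
```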
Example: A Python script `hello.py` containing:
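Again a minimal sketch, kept to two lines, that prints a greeting and the current timestamp:

```python
from datetime import datetime
print("Hello world,", datetime.now())
```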
We can run these two lines using:
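presumably something like the following, assuming `python3` is on the `PATH` (output matches the sketch above):

```sh
$ python3 hello.py
Hello world, 2024-11-04 11:02:31.123456
```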
Common filename extensions for scripting languages:
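For instance (these are conventions, nothing more):

- `.sh` - shell (e.g. Bash) scripts
- `.R` - R scripts
- `.py` - Python scripts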
The extension is non-critical - it's only for us humans to keep track. A script could be named just `hello`, but then we have to peek into the file to figure out what scripting language is used before we can call it.
A shebang is a script comment at the first line with a specific format:
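For a Bash script, it typically looks something like this (the exact interpreter path may vary):

```sh
#! /usr/bin/env bash
```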
The “-bang” in “shebang” is because `!` is pronounced “bang” in the computer world.
The script still works as usual (because it is just a comment):
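For example (timestamp illustrative):

```sh
$ bash hello.sh
Hello world
2024-11-04 11:02:31
```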
Now, if we set the executable flag (`x`) on this file:
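e.g. with `chmod`:

```sh
$ chmod ugo+x hello.sh
```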
we can call the script as:
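that is, by its path, without spelling out the interpreter (output illustrative):

```sh
$ ./hello.sh
Hello world
2024-11-04 11:02:31
```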
By default, output files end up in your home directory (`~`). To output to the current working directory, use `-cwd`.
Then we keep checking `qstat` to see when it's done.
With the shebang and executable flag in place, we can now call the script as:
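presumably by its path, as in the executable-flag example above:

```sh
$ ./hello.sh
```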