“Large data analysis on someone else’s computers”
For you to know the basics of scripting your analysis
For you to know it is not that hard to get started with HPC
Running out of memory on local computer 👈 today
Analysis takes too long on local computer 👈 today
Data files too large to host on local computer
Makes it easier to collaborate within the group, e.g. common data folder
We can leave it running for days
What’s your reason?
A very large number of Linux computers 👈 today
All computers have identical configuration and software 👈 today
All computers have access to a shared file system
Multiple users are using the cluster at the same time
Not magic!
Higher total throughput - not necessarily lower latency
| | Wynton HPC | C4 |
|---|---|---|
| Since | 2018 | 2020 |
| For whom? | All of UCSF | Cancer Center affiliates |
| Number of users | ~1,400 | ~300 |
| Number of computers | ~500 | ~40 |
| Number of cores | ~17,500 | ~2,800 |
| Memory (RAM) | 48–1,512 GiB | 32–1,024 GiB |
| Free disk space | 500 GB/user | 1,000 GB/user |
| Communal computers | 100% | 25% |
| Paying contributors | VIP priority | dedicated machines |
| Software | core + shared + DIY | core + shared + DIY |
| Linux | Rocky 8 | Rocky 8 |
| Automatic backups | home directory (only) | home directory (only) |
| GPUs | yes | yes |
| Web browser GUI | no | soon (reviving OnDemand) |
| Documentation | https://wynton.ucsf.edu/ | https://www.c4.ucsf.edu/ |
Basically everything you will do on a compute cluster involves scripted instructions.
For example, if you type `date` at the command-line prompt and press ENTER, the machine will reply with the current timestamp and then wait for your next command:
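For instance (the timestamp below is just an illustration):

```sh
$ date
Mon Nov  4 11:02:31 PST 2024
```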
We can use the `echo` command to output a message, e.g.
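(The exact message is up to us; "Hello world" is used throughout this section.)

```sh
$ echo "Hello world"
Hello world
```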
Instead of retyping the commands manually, we can put them in a text file using our favorite text editor. Here is a shell script `hello.sh` that contains the above two commands:
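For instance (the commented version later in this section adds a specific date format):

```sh
echo "Hello world"
date
```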
We can run these two lines using the `bash` command:
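For example (output and timestamp are illustrative):

```sh
$ bash hello.sh
Hello world
Mon Nov  4 11:02:35 PST 2024
```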
We can call it any number of times we'd like, e.g.
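(Timestamps below are illustrative.)

```sh
$ bash hello.sh
Hello world
Mon Nov  4 11:02:35 PST 2024
$ bash hello.sh
Hello world
Mon Nov  4 11:03:10 PST 2024
```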
All scripting languages support adding comments in the code. It's an excellent way to make notes for your future self, for your collaborators, and for that researcher who will build on your work two years from now. Do yourself a favor and add such notes!
“Running a script on a compute cluster”
is similar to
“Queue the script to be run on the next available compute node”
We’re going to focus on shell scripts and how to run them on a compute cluster.
We can run the script manually at the command line:
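For example (timestamp illustrative):

```sh
$ bash hello.sh
Hello world
Mon Nov  4 11:26:50 PST 2024
```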
Note, the above commands run sequentially, one after the other.
Instead of running it on the single machine we're logged into, we can run it on any of the hundreds of machines that are available.
To do this, we use `qsub` to submit it to a job queue:
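For example (the job ID is assigned by the scheduler and will differ):

```sh
[alice@dev1]$ qsub -cwd hello.sh
Your job 8529911 ("hello.sh") has been submitted
```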
Here I use the command-line option `-cwd` to tell it to run `hello.sh` in the current working directory. You almost always want to do this.
Then we wait for it to get started. We can use `qstat` to see the queue status of our jobs:
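For example (job ID, times, and other details are illustrative):

```sh
[alice@dev1]$ qstat
job-ID   prior    name       user    state  submit/start at      queue            slots
-----------------------------------------------------------------------------------------
8529911  0.00000  hello.sh   alice   qw     11/04/2024 11:26:58                   1
```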
Here `qw` means queued and waiting for a machine to become available.
The job scheduler will send jobs in the queue to available machines. The waiting time depends on how many users and jobs are already running and queued. It also depends on how many resources your job needs.
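Eventually the job gets dispatched to a compute node (details illustrative):

```sh
[alice@dev1]$ qstat
job-ID   prior    name       user    state  submit/start at      queue            slots
-----------------------------------------------------------------------------------------
8529911  0.07069  hello.sh   alice   r      11/04/2024 11:27:22  long.q@qb3-id3   1
```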
Here `r` means running. Now we need to wait …
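After a while, the job disappears from the listing (illustrative):

```sh
[alice@dev1]$ qstat
[alice@dev1]$
```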
Job is no longer listed, which means it finished.
But, now what? Where are the results?
A job scheduler directive: a specially formatted comment starting with `#$`, followed by `qsub` options.
```sh
#$ -cwd   # run in current working directory

# This script says hello to the world
# Author: Alice
# Date: 2024-11-04
echo "Hello world"

# Display current timestamp in format yyyy-mm-dd hh:mm:ss
date +"%F %T"
```
This way we can just do:
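i.e. submit without spelling out `-cwd` on the command line (job ID illustrative):

```sh
[alice@dev1]$ qsub hello.sh
Your job 8529913 ("hello.sh") has been submitted
```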
To run a non-shell script, for instance an R or a Python script, on a compute cluster, we need to create a little helper shell script that calls our R or Python script.
`random.R`:
```r
# This R script outputs a random number
# Author: Alice
# Date: 2024-11-04
cat("A random number in [0,1]:", runif(1), "\n")
```
To run this from the command line, we can use `Rscript`:
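For example (the random number will of course differ between runs):

```sh
$ Rscript random.R
A random number in [0,1]: 0.1538203
```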
`random.sh`:
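The wrapper can be very short; a minimal sketch (the exact content used in the course material may differ):

```sh
#$ -cwd            # run in the current working directory
Rscript random.R
```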
With the `random.sh` wrapper script, we can now run our R script `random.R` on the cluster.
For the sake of it, let’s submit it twice to the job scheduler:
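For example (job IDs are assigned by the scheduler and will differ):

```sh
[alice@dev1]$ qsub random.sh
Your job 8529912 ("random.sh") has been submitted
[alice@dev1]$ qsub random.sh
Your job 8529919 ("random.sh") has been submitted
```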
As usual, it will take some time before the jobs start:
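Right after submission both jobs sit waiting in the queue (details illustrative):

```sh
[alice@dev1]$ qstat
job-ID   prior    name        user    state  submit/start at      queue            slots
-----------------------------------------------------------------------------------------
8529912  0.00000  random.sh   alice   qw     11/04/2024 11:27:41                   1
8529919  0.00000  random.sh   alice   qw     11/04/2024 11:27:44                   1
```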
But eventually, one starts:
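For example (details illustrative):

```sh
[alice@dev1]$ qstat
job-ID   prior    name        user    state  submit/start at      queue            slots
-----------------------------------------------------------------------------------------
8529912  0.07069  random.sh   alice   r      11/04/2024 11:28:32  long.q@qb3-id3   1
8529919  0.00000  random.sh   alice   qw     11/04/2024 11:27:44                   1
```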
And then the other:
```sh
[alice@dev1]$ qstat
job-ID   prior    name        user    state  submit/start at      queue            slots
-----------------------------------------------------------------------------------------
8529912  0.07069  random.sh   alice   r      11/04/2024 11:28:32  long.q@qb3-id3   1
8529919  0.07069  random.sh   alice   r      11/04/2024 11:28:54  long.q@msg-id19  1
```
Note how the two jobs run on two different machines.
When the jobs are finished, any output generated by the script is stored in the corresponding `random.sh.o<job_id>` output file.
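For example, with the job IDs above (numbers are illustrative):

```sh
[alice@dev1]$ cat random.sh.o8529912
A random number in [0,1]: 0.2486574
[alice@dev1]$ cat random.sh.o8529919
A random number in [0,1]: 0.8376924
```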
Community Slack (faster)
Email to the cluster team (slower, because fewer people to answer questions)
Rule #1: There is no such thing as “a stupid question”!!!
Help the helper help you
Better with too much detail than too little
It is really hard to destroy things for others
(⚠️ but, you can `rm` your own files and files in shared folders)
Overload the development nodes - or use more job resources than you requested
When you're sloppy or lazy - make sure to read the online help, etc.
More seriously, sysadmins are great folks too, so never hesitate to ask for help or pointers
Wynton has 1,400 users and C4 has 300 users - many of them on Slack
The Slack channel is welcoming and friendly
Lots of users are willing to help out
There is almost certainly at least one other user who has the same question as you
Sooner than you think, you might be the one helping someone else out
Again, there are no “silly” questions - it’s only silly if you don’t ask!
Example: An R script `hello.R` containing:
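A minimal sketch of what such a script might look like, mirroring the shell example above:

```r
cat("Hello world\n")
Sys.time()
```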
We can run these two lines using:
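presumably `Rscript`, as for `random.R` (output matches the sketch above and is illustrative):

```sh
$ Rscript hello.R
Hello world
[1] "2024-11-04 11:02:31 PST"
```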
Example: A Python script `hello.py` containing:
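Again a minimal sketch, kept to two lines, that prints a greeting and the current timestamp:

```python
from datetime import datetime
print("Hello world,", datetime.now())
```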
We can run these two lines using:
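presumably something like the following, assuming `python3` is on the `PATH` (output matches the sketch above):

```sh
$ python3 hello.py
Hello world, 2024-11-04 11:02:31.123456
```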
Common filename extensions for scripting languages:
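For instance (these are conventions, nothing more):

- `.sh` - shell (e.g. Bash) scripts
- `.R` - R scripts
- `.py` - Python scripts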
The extension is non-critical - it's only for us humans to keep track. A script could be named just `hello`, but then we have to peek into the file to figure out what scripting language is used before we can call it.
A shebang is a script comment at the first line with a specific format:
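For a Bash script, it typically looks something like this (the exact interpreter path may vary):

```sh
#! /usr/bin/env bash
```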
The “-bang” in “shebang” is because `!` is pronounced “bang” in the computer world.
The script still works as usual (because it is just a comment):
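For example (timestamp illustrative):

```sh
$ bash hello.sh
Hello world
2024-11-04 11:02:31
```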
Now, if we set the executable flag (`x`) on this file:
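e.g. with `chmod`:

```sh
$ chmod ugo+x hello.sh
```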
we can call the script as:
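that is, by its path, without spelling out the interpreter (output illustrative):

```sh
$ ./hello.sh
Hello world
2024-11-04 11:02:31
```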
By default, output files end up in your home directory (`~`). To output to the current working directory, use `-cwd`.
Then we keep checking `qstat` to see when it's done.
With the shebang and executable flag in place, we can now call the script as:
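presumably by its path, as in the executable-flag example above:

```sh
$ ./hello.sh
```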