High-Performance Computing (HPC)

“Large data analysis on someone else’s computers”

Henrik Bengtsson

(Epidemiology & Biostatistics, Wynton HPC, C4 HPC)

Goals for today

  • For you to know the basics of scripting your analysis

  • For you to know it is not that hard to get started with HPC

Why do we use a compute cluster?

  • Running out of memory on local computer    👈 today

  • Analysis takes too long on local computer    👈 today

  • Data files too large to host on local computer

  • Makes it easier to collaborate within the group, e.g. common data folder

  • We can leave it running for days

  • What’s your reason?

What is a compute cluster?

  • A very large number of Linux computers    👈 today

  • All computers have identical configuration and software    👈 today

  • All computers have access to a shared file system

  • Multiple users are using the cluster at the same time

  • Not magic!

  • Higher total throughput - not necessarily lower latency

Two compute clusters at UCSF

                      Wynton HPC                C4
Since                 2018                      2020
For whom?             All of UCSF               Cancer Center affiliates
Number of users       ~1,400                    ~300
Number of computers   ~500                      ~40
Number of cores       ~17,500                   ~2,800
Memory (RAM)          48 – 1,512 GiB            32 – 1,024 GiB
Free disk space       500 GB/user               1,000 GB/user
Communal computers    100%                      25%
Paying contributors   VIP priority              dedicated machines
Software              core + shared + DIY       core + shared + DIY
Linux                 Rocky 8                   Rocky 8
Automatic backups     home directory (only)     home directory (only)
GPUs                  yes                       yes
Web browser GUI       no                        soon (reviving OnDemand)
Documentation         https://wynton.ucsf.edu/  https://www.c4.ucsf.edu/

Typical workflow using a compute cluster

  1. Log in to compute cluster
  2. Continue by logging in to a development node
  3. Go to project folder
  4. Edit scripts
  5. Submit one or more scripts to job queue
  6. Wait until done
  7. Look at produced files and logs
[alice@notebook]$ ssh alice@log1.wynton.ucsf.edu
[alice@log1]$ ssh dev1
[alice@dev1]$ cd /path/to/amazing_project/
[alice@dev1]$ emacs analysis.sh
[alice@dev1]$ qsub analysis.sh
[alice@dev1]$ qstat
...
[alice@dev1]$ cat analysis.o90303

Scripts

Basically everything you will do on a compute cluster involves scripted instructions.

Scripts are easy

  • Scripts are text files with commands that are executed line by line
  • You can edit scripts in a text editor
  • Scripts are run by an interpreter, e.g. Bash, R, Python, …

For example, if you type date at the command-line prompt and press ENTER, the machine replies with the current timestamp and then waits for your next command:

[alice@dev1]$ date
Mon Nov 4 11:36:33 PM PST 2024

[alice@dev1]$

We can use the echo command to output a message, e.g.

[alice@dev1]$ echo "Hello world"
Hello world

[alice@dev1]$

Scripts help us to repeat similar tasks many times

Instead of retyping the commands manually, we can put them in a text file using our favorite text editor. Here is a shell script hello.sh that contains the above two commands:

echo "Hello world"
date

We can run these two lines using the bash command:

[alice@dev1]$ bash hello.sh
Hello world
Mon Nov 4 11:37:43 PM PST 2024

[alice@dev1]$

We can call it any number of times we’d like, e.g.

[alice@dev1]$ bash hello.sh
Hello world
Mon Nov 4 11:38:05 PM PST 2024

[alice@dev1]$
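If we want to run it several times in one go, a small shell loop can do the retyping for us (a minimal sketch):

[alice@dev1]$ for i in 1 2 3; do bash hello.sh; done

This runs hello.sh three times in a row on the machine we are logged in to, producing the same output as above, just repeated.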

Tip: Comment your code for yourself and others

All scripting languages support adding comments to the code. Comments are an excellent way to make notes for your future self, for your collaborators, and for that researcher who will build on your work two years from now. Do yourself a favor and add such notes!

# This script says hello to the world
# Author: Alice

echo "Hello world"

# Display current timestamp in format yyyy-mm-dd hh:mm:ss
date +"%F %T"

Comments are ignored when the script is run, so the result is still the same:

[alice@dev1]$ bash hello.sh
Hello world
2024-11-04 11:43:03

Rule of thumb:
If it runs locally, it runs on the cluster

Running a script on a compute cluster

“Running a script on a compute cluster”

is similar to

“Queue the script to be run on the next available compute node”



We’re going to focus on shell scripts and how to run them on a compute cluster.

Run on the current machine

We can run the script manually at the command line:

[alice@dev1]$ bash hello.sh
Hello world
2024-11-04 11:45:14

We can run it multiple times:

[alice@dev1]$ bash hello.sh
Hello world
2024-11-04 11:45:32

… over and over this way:

[alice@dev1]$ bash hello.sh
Hello world
2024-11-04 11:45:51

Note that the above runs happen sequentially, one after the other.

Run on the cluster (of compute nodes)

Instead of running it on the single machine we’re logged into, we can run it on any of the hundreds of machines that are available.

To do this, we use qsub to submit it to a job queue:

[alice@dev1]$ qsub -cwd hello.sh
Your job 8522736 ("hello.sh") has been submitted

Here I use the command-line option -cwd to tell the scheduler to run hello.sh in the current working directory. You almost always want to do this.

Then we wait for it to get started. We can use qstat to see the queue status of our jobs:

[alice@dev1]$ qstat
job-ID  prior   name     user   state submit/start at     queue              slots
----------------------------------------------------------------------------------
8522736 0.00000 hello.sh alice  qw    11/04/2024 11:37:17                    1

Here qw means queued and waiting for a machine to become available.

Job is started by scheduler

The job scheduler will send jobs in the queue to available machines. The waiting time depends on how many users and jobs are already running and queued. It also depends on how many resources your job needs.

[alice@dev1]$ qstat
job-ID  prior   name     user   state submit/start at     queue              slots
----------------------------------------------------------------------------------
8522736 0.07069 hello.sh alice  r     11/04/2024 11:38:02 long.q@msg-iogpu6  1        

Here r means running. Now we need to wait …

[alice@dev1]$ qstat
job-ID  prior   name     user   state submit/start at     queue              slots
----------------------------------------------------------------------------------

The job is no longer listed, which means it has finished.

But, now what? Where are the results?

[alice@dev1]$ cat hello.sh.o8522736
Hello world
2024-11-04 11:46:54
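If you want to know more about a job while it is queued or running, qstat -j <job_id> prints its details, including why it is still waiting. After the job has finished, the qacct accounting tool can report how much run time and memory it actually used (a sketch; whether accounting data is available to regular users can vary between clusters):

[alice@dev1]$ qstat -j 8522736
[alice@dev1]$ qacct -j 8522736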

Tip: Set the qsub options inside the script

A job scheduler directive: A specially formatted comment starting with #$ followed by qsub options.

#$ -cwd    # run in current working directory

# This script says hello to the world
# Author: Alice
# Date: 2024-11-04

echo "Hello world"

# Display current timestamp in format yyyy-mm-dd hh:mm:ss
date +"%F %T"

This way we can just do:

[alice@dev1]$ qsub hello.sh
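Directives are also a convenient place to request resources, such as a maximum run time and amount of memory. The exact option names differ between clusters, so treat the following as a sketch of common SGE-style requests (h_rt and mem_free here are assumptions; check your own cluster’s documentation):

# This script says hello to the world
# Author: Alice
# Date: 2024-11-04

# Run in the current working directory
#$ -cwd
# Request at most 10 minutes of run time (hh:mm:ss)
#$ -l h_rt=00:10:00
# Request 1 GiB of memory
#$ -l mem_free=1G

echo "Hello world"

# Display current timestamp in format yyyy-mm-dd hh:mm:ss
date +"%F %T"

Jobs with modest resource requests generally spend less time waiting in the queue.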

Running an R script

To run a non-shell script, for instance an R or a Python script, on a compute cluster, we need to create a little helper shell script that calls our R or Python script.

Run an R script on the job scheduler

random.R:

# This R script outputs a random number
# Author: Alice
# Date: 2024-11-04

cat("A random number in [0,1]:", runif(1), "\n")

To run this from the command line, we can use Rscript:

[alice@dev1]$ Rscript random.R
A random number in [0,1]: 0.09091985

Like shell scripts, we can run R scripts many times:

[alice@dev1]$ Rscript random.R
A random number in [0,1]: 0.12491344

Run an R script on the job scheduler via a shell script

random.sh:

#$ -cwd    # run in current working directory

# This shell script calls the random.R script
# Author: Alice
# Date: 2024-11-04

Rscript random.R

We can call this as:

[alice@dev1]$ bash random.sh
A random number in [0,1]: 0.4785869
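If the R script takes command-line arguments, the wrapper can forward whatever it receives by passing "$@" to Rscript; on the R side they are read with commandArgs(trailingOnly = TRUE). A sketch of such a wrapper (random.R as written ignores any arguments, so this is only illustrative):

# Run in the current working directory
#$ -cwd

# This shell script calls the random.R script,
# forwarding any command-line arguments to it
# Author: Alice

Rscript random.R "$@"

Arguments given after the script name, e.g. bash random.sh 10 or qsub random.sh 10, then end up in the R script’s commandArgs().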

Run an R script on the job scheduler

With the random.sh wrapper script, we can now run our R script random.R on the cluster.

For the sake of it, let’s submit it twice to the job scheduler:

[alice@dev1]$ qsub random.sh
Your job 8529912 ("random.sh") has been submitted

[alice@dev1]$ qsub random.sh
Your job 9529919 ("random.sh") has been submitted

As usual, it will take some time before the jobs start:

[alice@dev1]$ qstat
job-ID  prior   name     user   state submit/start at     queue              slots
----------------------------------------------------------------------------------
8529912 0.07069 random.sh alice qw    11/04/2024 11:28:02                    1        
9529919 0.07069 random.sh alice qw    11/04/2024 11:28:04                    1        

Run an R script on the job scheduler

But eventually, one starts:

[alice@dev1]$ qstat
job-ID  prior   name     user   state submit/start at     queue              slots
----------------------------------------------------------------------------------
8529912 0.07069 random.sh alice r     11/04/2024 11:28:32 long.q@qb3-id3     1        
9529919 0.07069 random.sh alice qw    11/04/2024 11:28:04                    1        

And then the other:

[alice@dev1]$ qstat
job-ID  prior   name     user   state submit/start at     queue              slots
----------------------------------------------------------------------------------
8529912 0.07069 random.sh alice r     11/04/2024 11:28:32 long.q@qb3-id3     1        
9529919 0.07069 random.sh alice r     11/04/2024 11:28:54 long.q@msg-id19    1        

Note how the two jobs run on two different machines.

Run an R script on the job scheduler

When the jobs are finished, any output generated by the script is stored in the corresponding random.sh.o<job_id> output file.

[alice@dev1]$ ls random.sh.*
random.sh.o8529912
random.sh.o9529919


[alice@dev1]$ cat random.sh.o8529912
A random number in [0,1]: 0.1704674


[alice@dev1]$ cat random.sh.o9529919
A random number in [0,1]: 0.9229252 
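Submitting the same script many times by hand quickly gets tedious. SGE’s array jobs do this in one go: qsub -t 1-5 submits five tasks of the same script, and inside each task the environment variable SGE_TASK_ID holds that task’s index (a sketch; check that your cluster supports array jobs and how it names their output files):

[alice@dev1]$ qsub -t 1-5 random.sh

Each task runs independently, possibly on different machines, and writes its own output file (on SGE, named random.sh.o<job_id>.<task_id>).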

How to ask for help

  • Community Slack (faster)

  • Email to the cluster team (slower, because fewer people to answer questions)

  • Rule #1: There is no such thing as “a stupid question”!!!

  • Help the helper help you

  • Better too much detail than too little

  • It is really hard to destroy things for others
    (⚠️ but, you can rm your own files and files in shared folders)

What makes a system administrator grumpy

  • Overloading the development nodes, or using more resources than your jobs requested

  • Being sloppy or lazy - do read the online documentation before asking, etc.

  • More seriously, sysadmins are great folks too, so never hesitate to ask for help or pointers

Even better is to ask your peers

  • Wynton has 1,400 users and C4 has 300 users - many of them on Slack

  • The Slack channel is welcoming and friendly

  • Lots of users are willing to help out

  • There is almost certainly at least one other user who has the same question as you

  • Sooner than you think, you might be the one helping someone else out

  • Again, there are no “silly” questions - it’s only silly if you don’t ask!

Appendix (Random Slides)

A1. Scripts - R

Example: An R script hello.R containing:

cat("Hello world\n")
cat("How are you?\n")

We can run these two lines using:

[alice@dev1]$ Rscript hello.R
Hello world
How are you?

[alice@dev1]$

A1. Scripts - Python

Example: A Python script hello.py containing:

print("Hello world\n")
print("How are you?\n")

We can run these two lines using:

[alice@dev1]$ python3 hello.py
Hello world
How are you?

[alice@dev1]$

A1. The filename extension reveals the scripting language

Common filename extensions for scripting languages:

[alice@dev1]$ ls -1 hello.*
hello.sh
hello.R
hello.py
hello.pl

The extension is not critical - it’s only there to help us humans keep track. A script could be named just hello, but then we have to peek into the file to figure out which scripting language is used before we can call it:

[alice@dev1]$ bash hello
Hello world
How are you?
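One way to peek is to print the first line of the file, for example with head (here assuming hello contains the two echo commands from the earlier shell script):

[alice@dev1]$ head -n 1 hello
echo "Hello world"

The shebang described in A2 below puts the name of the interpreter on that first line, which makes the file self-describing.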

A2. Shebangs - make your script look like any other program

A shebang is a script comment on the first line with a specific format:

#!/usr/bin/env bash

echo "Hello world"
echo "How are you?"

The “-bang” in “shebang” is because ! is pronounced “bang” in the computing world.

The script still works as usual (because it is just a comment):

[alice@dev1]$ bash hello.sh
Hello world
How are you?

A2. Shebangs - make your script look like any other program

Now, if we set the executable flag (x) in this file:

[alice@dev1]$ ls -l hello.sh
-rw-r--r-- 1 alice alice 59 Oct  1 22:35 hello.sh

[alice@dev1]$ chmod ugo+x hello.sh

[alice@dev1]$ ls -l hello.sh
-rwxr-xr-x 1 alice alice 59 Oct  1 22:35 hello.sh

we can call the script as:

[alice@dev1]$ ./hello.sh
Hello world
How are you?
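The same trick works for scripts in other languages, as long as the shebang points at the right interpreter. For example, we can add an Rscript shebang to the hello.R script from A1 (a sketch, assuming Rscript is on the PATH):

#!/usr/bin/env Rscript

cat("Hello world\n")
cat("How are you?\n")

After chmod ugo+x hello.R, it too can be called as ./hello.R.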

A3. When run via the scheduler, script output is written to files

  • Wynton HPC: By default, output files are saved to your home directory (~)

To write them to the current working directory instead, use -cwd.

[alice@dev1]$ qsub -cwd hello.sh
Your job 8522746 ("hello.sh") has been submitted

Then we keep checking qstat to see when it’s done.

[alice@dev1]$ ls
hello.sh
hello.sh.o8522746

[alice@dev1]$ cat hello.sh.o8522746
Hello world
How are you?
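The scheduler also writes anything the script prints to standard error to a separate file, hello.sh.e<job_id>. If you prefer a single log file per job, SGE’s -j y option merges the error stream into the output file (a sketch; it can also be set as a #$ directive inside the script):

[alice@dev1]$ qsub -cwd -j y hello.sh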

A4. Put your common scripts in ~/bin/

[alice@dev1]$ mkdir ~/bin
[alice@dev1]$ cp hello.sh ~/bin
[alice@dev1]$ ls -l ~/bin/hello.sh
-rwxr-xr-x 1 alice alice 59 Oct  1 22:35 hello.sh

We can now call the script as:

[alice@dev1]$ bash ~/bin/hello.sh
Hello world
How are you?

And, because of the shebang, we don’t have to call it via bash:

[alice@dev1]$ ~/bin/hello.sh
Hello world
How are you?