100 changes: 99 additions & 1 deletion docs/clusters/alpine/alpine-hardware.md
@@ -27,6 +27,8 @@ All Alpine nodes are available to all users. For full details about node access,
| {{ alpine_ucb_total_64_core_256GB_cpu_nodes_atesting }} Milan CPU test nodes; pulls from CU amilan pool | atesting | x86_64 AMD Milan | 1 or 2 | 64 | 1 | 3.8 | N/A | 0 | 416G SSD | HDR-100 InfiniBand (200Gb inter-node fabric) | RHEL 8.4 |
| {{ alpine_ucb_total_atesting_a100_gpu_nodes }} Milan NVIDIA GPU testing node | atesting_a100 | x86_64 AMD Milan | 2 | 64 | 1 | 3.8 | NVIDIA A100 | 3 (each split by MIG) | 416G SSD | 2x25 Gb Ethernet +RoCE | RHEL 8.4 |
| {{ alpine_ucb_total_atesting_mi100_gpu_nodes }} Milan AMD GPU testing nodes; pulls from ami100 pool | atesting_mi100 | x86_64 AMD Milan | 2 | 64 | 1 | 3.8 | AMD MI100 | 3 | 416G SSD | 2x25 Gb Ethernet +RoCE | RHEL 8.4 |
| {{ alpine_ucb_total_dtn_nodes }} data transfer nodes (DTNs) | dtn | x86_64 Intel Haswell | 2 | 24 | 1 | 3.8 | N/A | 0 | N/A | 2x100 Gb Ethernet | RHEL 8.10 |


:::

@@ -204,7 +206,7 @@ All users, regardless of institution, should specify partitions as follows:

#### Special-Purpose Partitions

To help users test out their workflows, CURC provides several special-purpose partitions on Alpine. These partitions enable users to quickly test or compile code on CPU and GPU compute nodes. To ensure equal access to these special-purpose partitions, the amount of resources (such as CPUs, GPUs, and runtime) are limited.
To help users test out their workflows, CURC provides several special-purpose partitions on Alpine. These partitions enable users to quickly test or compile code on CPU and GPU compute nodes, and to transfer data to/from CURC systems. To ensure equal access to these special-purpose partitions, the amount of resources (such as CPUs, GPUs, and runtime) is limited.

```{important}
Compiling and testing partitions are, as their name implies, only meant for compiling code and testing workflows. They are not to be used outside of compiling or testing. Please utilize the appropriate partitions when running code.
@@ -321,5 +323,101 @@ acompile --ntasks=2 --time=02:00:00

`````

##### `dtn` usage examples:

The `dtn` partition provides access to the CURC data transfer nodes (DTNs) for conducting performant data transfers with command-line tools, including `rsync`, `Rclone`, `scp`, `sftp`, `globus-cli`, `curl`, and `wget`.

__Typical Use Cases:__

* When you need to integrate performant data transfers as dependencies of computational jobs using the Slurm `--dependency` flag.
* When automated (e.g., daily or monthly) performant data transfers are required (see the `scrontab` sketch after this list).
* When you want the convenience of scheduling a long-running data transfer as a batch job that won’t time out due to a spotty `ssh` connection.
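
For the automated case, one option is Slurm's `scrontab` facility, which submits batch jobs on a cron-like schedule. Below is a minimal sketch, assuming `scrontab` is available on CURC's Slurm installation and that `/projects/ralphie/daily_transfer.sh` is a plain shell script containing your transfer commands (the path and schedule are placeholders):

```bash
# Edit your Slurm crontab with:  scrontab -e
# Lines beginning with #SCRON supply sbatch options for the job defined by the
# cron-format line that follows. This runs the transfer script daily at 02:00
# on the dtn partition with a single core:

#SCRON --partition=dtn
#SCRON --qos=dtn
#SCRON --ntasks=1
#SCRON --time=04:00:00
0 2 * * * /projects/ralphie/daily_transfer.sh
```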

__Types of data transfers supported:__

* _Onsite-to-onsite_ transfers from one filesystem to another (e.g., copying a folder from scratch to PetaLibrary with `rsync`)
* _Downloading_ large datasets from websites, ftp servers, cloud providers, etc. (e.g., downloading an LLM from OpenAI with `wget` or `curl`).
* _Onsite-to-offsite_ transfers that allow external access via ssh-based protocols (e.g., a transfer from your PetaLibrary allocation to your lab’s Linux server with `scp`, or to your lab's AWS S3 bucket with `rclone`).
```{note}
If the offsite machine does not support external ssh access (e.g., your Mac or Windows laptop), you can conduct the transfer on the DTNs using the [Globus Command Line Interface (CLI)](https://docs.globus.org/cli/) via the `globus` command _(requires a [Globus Connect Personal](https://www.globus.org/globus-connect-personal) endpoint on the offsite machine)_.
```
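
As a rough sketch of such a transfer with the Globus CLI (the endpoint UUIDs, paths, and label below are placeholders, and you must have authenticated beforehand, e.g., with `globus login`):

```bash
# Placeholder collection/endpoint UUIDs -- substitute the CURC collection ID and
# the ID of the Globus Connect Personal endpoint on your offsite machine.
SRC_EP="aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
DST_EP="11111111-2222-3333-4444-555555555555"

# Recursively transfer a folder from PetaLibrary to the personal endpoint.
globus transfer --recursive --label "bigfiles to laptop" \
    "$SRC_EP:/pl/active/buffsfan/ralphie/bigfiles/" \
    "$DST_EP:/~/bigfiles/"
```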

__Resource constraints:__

Users may consume a maximum of four cores across all simultaneous jobs on the `dtn` partition. For example, a user may run four simultaneous single-core data transfer jobs, or one four-core job. We recommend requesting a single core per job for most transfers, as most data transfer protocols can only take advantage of a single core (only `globus-cli` and `curl` can parallelize transfers across multiple cores).
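
To see how many cores your current `dtn` jobs are already consuming before submitting another, a quick `squeue` query along these lines should work (the format string is just one reasonable choice):

```bash
# List your own jobs on the dtn partition with their CPU counts:
squeue --user=$USER --partition=dtn --format="%.12i %.10P %.4C %.10M %.10T"
```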

(tabset-ref-dtn-use)=
`````{tab-set}
:sync-group: tabset-dtn-use

````{tab-item} Example 1
:sync: dtn-use-ex1

**Request a one-core `dtn` job for 12 hours.**

```bash
sinteractive --partition=dtn --qos=dtn --nodes=1 --ntasks=1 --time=12:00:00
```

````


````{tab-item} Example 2
:sync: dtn-use-ex2

**Create a job script that requests a one-core job for 12 hours and conducts an `rsync` transfer and a `wget` download.**

Create a job script called `mytransfer.sh`:

```bash
#!/bin/bash

#SBATCH --partition=dtn
#SBATCH --qos=dtn
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --job-name=transfer
#SBATCH --time=12:00:00
#SBATCH --output=transfer.%j.out
#SBATCH --mail-type=ALL
#SBATCH --mail-user=[email protected]

# transfer the "bigfiles" folder from my /scratch directory to my petalibrary allocation:

rsync -r /scratch/alpine/ralphie/bigfiles/ /pl/active/buffsfan/ralphie/

# now download an additional file to the "bigfiles" folder:

cd /pl/active/buffsfan/ralphie/bigfiles
wget https://cubuffs.org/another_huge_file.zip
```
Now schedule the job:

```bash
sbatch mytransfer.sh
```

````

````{tab-item} Example 3
:sync: dtn-use-ex3

**Schedule a transfer that is a dependency of another job.**

Use the `--dependency` flag if you need to ensure your data transfer runs after a related job completes. For example, to run `mytransfer.sh` (see Example 2) only after a previously scheduled batch job with ID `73798236` has successfully completed, schedule it as follows:

```bash
sbatch --dependency=afterok:73798236 mytransfer.sh
```
You can learn more about `--dependency` in Slurm's [documentation for sbatch](https://slurm.schedmd.com/sbatch.html).

````

`````

```{warning}
Due to the manner in which the DTNs are configured, the Slurm `salloc` command cannot be used to start jobs on the `dtn` partition. Please instead use the `sbatch` or `sinteractive` commands per the examples above.
```

Alpine is jointly funded by the University of Colorado Boulder, the University of Colorado Anschutz, Colorado State University, and the National Science Foundation (award 2201538).

28 changes: 15 additions & 13 deletions docs/clusters/blanca/blanca.md
@@ -292,7 +292,7 @@ The interactive job won't start until the resources that it needs are available,

## Blanca Preemptable QOS

Each partner group has its own high-priority QoS (`blanca-<group identifier>`) for jobs that will run on nodes that it has contributed. High-priority jobs can run for up to 7 days. All partners also have access to a low-priority QoS (“preemptable”) that can run on any Blanca nodes that are not already in use by the partners who contributed them. Low-priority jobs will have a maximum time limit of 24 hours, and can be preempted at any time by high-priority jobs that request the same compute resources being used by the low-priority job. To facilitate equitable access to preemptable resources, at any given time each user is limited to consuming a maximum of 2000 cores (roughly 25% of all resoures on Blanca) across all of their preemptable jobs. The preemption process will terminate the low-priority job with a grace period of up to 120-seconds. Preempted low-priority jobs will then be requeued by default. Additional details follow.
Each partner group has its own high-priority QoS (`blanca-<group identifier>`) for jobs that will run on nodes that it has contributed. High-priority jobs can run for up to 7 days. All partners also have access to a low-priority QoS (“preemptable”) that can run on any Blanca nodes that are not already in use by the partners who contributed them. Low-priority jobs will have a maximum time limit of 24 hours, and can be preempted at any time by high-priority jobs that request the same compute resources being used by the low-priority job. To facilitate equitable access to preemptable resources, at any given time each user is limited to consuming a maximum of 2000 cores (roughly 25% of all resources on Blanca) across all of their preemptable jobs. The preemption process will terminate the low-priority job with a grace period of up to 120 seconds. Preempted low-priority jobs will then be requeued by default. Additional details follow.

### Usage

@@ -332,13 +332,24 @@ Batch jobs that are preempted will automatically requeue if the exit code is non

Interactive jobs will not requeue if preempted.

### Best practices
## Using the Data Transfer Nodes in Blanca jobs

CURC provides data transfer nodes (DTNs) to facilitate performant file transfers to/from CURC, transfers across CURC filesystems, and downloads from remote data providers. The `dtn` partition is part of the Alpine cluster; however, it is accessible from jobs running on Blanca nodes by preceding the `sinteractive` or `sbatch` commands in [dtn partition usage examples 1 and 2](../alpine/alpine-hardware.md#dtn-usage-examples) with the Alpine `SLURM_CONF` environment variable, e.g.,

```bash
SLURM_CONF=/curc/slurm/alpine/etc/slurm.conf sbatch mytransfer.sh
```
```{note}
It is not possible to schedule a data transfer job on the Alpine `dtn` partition as a direct _dependency_ of a previous Blanca job, because the `--dependency` flag must reference a job on the same cluster. A workaround, sketched below, is to schedule a second Blanca job that is a dependency of the previous Blanca job, and within that second job, schedule the Alpine `dtn` data transfer job per the example shown above.
```
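
As a rough sketch of that workaround (the QoS, job name, and job ID below are placeholders; add whatever other directives your Blanca group requires), a small "bridge" job script could submit the Alpine transfer once the first Blanca job finishes:

```bash
#!/bin/bash
# bridge_transfer.sh -- a minimal Blanca job whose only task is to submit the
# Alpine dtn transfer job (mytransfer.sh from the Alpine examples).
#SBATCH --qos=blanca-<group identifier>
#SBATCH --ntasks=1
#SBATCH --time=00:05:00
#SBATCH --job-name=bridge-transfer

SLURM_CONF=/curc/slurm/alpine/etc/slurm.conf sbatch mytransfer.sh
```

Schedule the bridge job as a dependency of the original Blanca job (placeholder job ID):

```bash
sbatch --dependency=afterok:12345678 bridge_transfer.sh
```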

## Best practices

Checkpointing: Given that preemptable jobs can request wall times up to 24 hours in duration, there is the possibility that users may lose results if they do not checkpoint. Checkpointing is the practice of incrementally saving computed results such that -- if a job is preempted, killed, canceled or crashes -- a given software package or model can continue from the most recent checkpoint in a subsequent job, rather than starting over from the beginning. For example, if a user implements hourly checkpointing and their 24 hour simulation job is preempted after 22.5 hours, they will be able to continue their simulation from the most recent checkpoint data that was written out at 22 hours, rather than starting over. Checkpointing is an application-dependent process, not something that can be automated on the system end; many popular software packages have checkpointing built in (e.g., ‘restart’ files). In summary, users of the preemptable QoS should implement checkpointing if at all possible to ensure they can pick up where they left off in the event their job is preempted.
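
For illustration only (the application name, its flags, and the checkpoint file are hypothetical, and partition/account directives are omitted), a preemptable job script with a restart check might look like:

```bash
#!/bin/bash
#SBATCH --qos=preemptable
#SBATCH --ntasks=1
#SBATCH --time=24:00:00

# Hypothetical application: resume from the most recent checkpoint if one
# exists, otherwise start a fresh run. The application writes checkpoint.dat
# periodically as it progresses.
if [ -f checkpoint.dat ]; then
    ./my_simulation --restart checkpoint.dat
else
    ./my_simulation --input input.dat
fi
```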

Requeuing: Users running jobs that do not require requeuing if preempted should specify the `--no-requeue` flag noted above to avoid unnecessary use of compute resources.

### Example Job Scripts
## Example Job Scripts

(tabset-ref-blanca-job-scripts)=
`````{tab-set}
Expand Down Expand Up @@ -414,22 +425,13 @@ python myscript.py

`````

### Other considerations
## Other considerations

Grace period upon preemption: When jobs are preempted, a 120 second grace period is available to enable users to save and exit their jobs should they have the ability to do so. The preempted job is immediately sent `SIGCONT` and `SIGTERM` signals by Slurm in order to provide notification of its imminent termination. This is followed by the `SIGCONT`, `SIGTERM`, and `SIGKILL` signal sequence upon reaching the end of the 120 second grace period. Users can monitor the job for the `SIGTERM` signal and, when it is detected, take advantage of this 120 second grace period to save and exit their jobs.
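
One way to do this in a batch script is with a shell `trap`; a minimal sketch (the workload command and the save destination are hypothetical) might look like:

```bash
#!/bin/bash
#SBATCH --qos=preemptable
#SBATCH --time=24:00:00

# When Slurm signals imminent preemption, save state and exit cleanly.
save_and_exit() {
    echo "SIGTERM received; saving state within the 120 second grace period"
    cp -r ./work_in_progress /pl/active/yourgroup/saved_state/   # hypothetical path
    kill "$WORK_PID" 2>/dev/null
    exit 0
}
trap save_and_exit SIGTERM

# Run the workload in the background and wait on it, so the trap can fire as
# soon as the signal arrives.
./my_long_running_task &
WORK_PID=$!
wait "$WORK_PID"
```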




## FAQ

### How would going from 480GB to 2TB SSD affect the price?

Commonly, additional RAM will increase pricing substantially, whereas increased local SSD will do so only slightly (perhaps by a few hundred dollars). However, we recommend using RC’s dedicated storage resources rather than adding persistent storage local to a single node. This increases functionality, redundancy, and our capacity to recover data in the event of issues.

### Can you tell us more about the Service Level Agreement (SLA) this hardware comes with?

The hardware comes with a 5-year warranty that covers all hardware-related issues during that period. No additional node-related costs will be incurred by the owner during this period. After the 5-year period, if the owner wishes to continue using the nodes, and if RC can still support the hardware, the owner will be responsible for purchasing hardware if/when it fails (e.g., SSDs, RAM, etc.).

### Do you offer a percent uptime guarantee for the duration of the SLA?

14 changes: 11 additions & 3 deletions docs/compute/data-transfer.md
@@ -110,8 +110,12 @@ scp <path-to-file> <username>@dtn.rc.colorado.edu:<target-path>
scp <username>@dtn.rc.colorado.edu:<path-to-file> <target-path>
```

Once you've typed the `scp` command, press Enter. If prompted, enter your password and then accept your Duo 2-Factor notification. If the connection succeeds you will see your transfer begin.

```{note}
Windows users can access scp through PowerShell or using a GUI
application like [WinSCP](https://winscp.net/eng/docs/protocols).
```

For more information on secure copy take a [look at some of our listed
resources](#more-reading) or consult the scp manual page.
@@ -143,10 +147,14 @@ rsync -r <path-to-directory> <username>@dtn.rc.colorado.edu:<target-path>
rsync -r <username>@dtn.rc.colorado.edu:<path-to-directory> <target-path>
```

Once you've typed the `rsync` command, press Enter. If prompted, enter your password and then accept your Duo 2-Factor notification. If the connection succeeds you will see your transfer begin.

```{note}
rsync is not available on Windows by default, but [may be installed
individually](https://www.itefix.net/cwrsync) or as part of [Windows
Subsystem for Linux
(WSL)](https://docs.microsoft.com/en-us/windows/wsl/install-win10).
```

For more information on rsync [check out some of our listed
resources](#more-reading) or consult the rsync manual page.
@@ -248,7 +256,7 @@ scp -v ./myfile23.txt dtn.rc.colorado.edu:/pl/active/crdds/myfile.txt # usi

## Rclone

Rclone is a command line program to manage files on cloud storage. It is a feature rich alternative to cloud vendors' web storage interfaces. [Over 40 cloud storage products](https://rclone.org/#providers) support rclone including S3 object stores, business & consumer file storage services, as well as standard transfer protocols. Rclone has powerful cloud equivalents to the unix commands rsync, cp, mv, mount, ls, ncdu, tree, rm, and cat. Rclone's familiar syntax includes shell pipeline support, and `--dry-run` protection. It can be used at the command line, in scripts or via its [API](https://rclone.org/rc/).
Rclone is a command line program to manage files on cloud storage. It is a feature rich alternative to cloud vendors' web storage interfaces. [Over 40 cloud storage products](https://rclone.org/#providers) support rclone including S3 object stores, business & consumer file storage services, as well as standard transfer protocols. Rclone has powerful cloud equivalents to the unix commands `rsync`, `cp`, `mv`, `mount`, `ls`, `ncdu`, `tree`, `rm`, and `cat`. Rclone's familiar syntax includes shell pipeline support, and `--dry-run` protection. It can be used at the command line, in scripts or via its [API](https://rclone.org/rc/).
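
For instance, assuming a remote named `mys3` has already been configured with `rclone config` (the remote and bucket names here are placeholders), a copy can be previewed with `--dry-run` and then run:

```bash
# Preview what would be copied, without transferring anything:
rclone copy --dry-run /pl/active/buffsfan/ralphie/bigfiles mys3:mybucket/bigfiles

# Run the transfer with progress reporting:
rclone copy --progress /pl/active/buffsfan/ralphie/bigfiles mys3:mybucket/bigfiles
```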

### What can rclone do for you?

@@ -258,7 +266,7 @@ Rclone is a command line program to manage files on cloud storage. It is a featu
- Mirror cloud data to other cloud services or locally
- Migrate data to the cloud, or between cloud storage vendors
- Mount multiple, encrypted, cached or diverse cloud storage as a disk
- Analyse and account for data held on cloud storage using [lsf](https://rclone.org/commands/rclone_lsf/), [ljson](https://rclone.org/commands/rclone_lsjson/), [size](https://rclone.org/commands/rclone_size/), [ncdu](https://rclone.org/commands/rclone_ncdu/)
- Analyze and account for data held on cloud storage using [lsf](https://rclone.org/commands/rclone_lsf/), [ljson](https://rclone.org/commands/rclone_lsjson/), [size](https://rclone.org/commands/rclone_size/), [ncdu](https://rclone.org/commands/rclone_ncdu/)
- [Union](https://rclone.org/union/) file systems together to present multiple local and/or cloud file systems as one

### Features
@@ -404,7 +412,7 @@ Enter a value. Press Enter to leave empty.
secret_access_key> <SECRET-KEY>
```

- You will then be prompted to enter your region, here we are going to enter the value of ‘1' (you can set your appropreate region):
- You will then be prompted to enter your region; here we are going to enter the value `1` (you can set your appropriate region):
```
Option region.
Region to connect to.
3 changes: 2 additions & 1 deletion docs/compute/node-types.md
@@ -33,5 +33,6 @@ This is where jobs are executed after being passed to the scheduler.

## Data Transfer Nodes
* Data Transfer Nodes (DTNs) are nodes which support [data transfer](data-transfer.md#data-transfer) on CURC systems.
* When transferring files using `scp`, `sftp`, or `ssh`, you can choose to host your transfers on a DTN.
* When transferring data to/from CURC using `scp`, `sftp`, `rsync`, or `Rclone`, you can choose to conduct your transfers on a DTN, regardless of whether you initiate the transfer from outside or from within CURC.
* If you are working within CURC, the Alpine [dtn partition](../clusters/alpine/alpine-hardware.md#dtn-usage-examples) provides a convenient way to start data transfer jobs on the DTNs from CURC clusters.

3 changes: 2 additions & 1 deletion docs/conf.py
@@ -42,6 +42,7 @@
'alpine_ucb_total_64_core_256GB_cpu_nodes_atesting': '2',
'alpine_ucb_total_atesting_a100_gpu_nodes': '1',
'alpine_ucb_total_atesting_mi100_gpu_nodes': '1',
'alpine_ucb_total_dtn_nodes': '4',
## AMC contributions
'alpine_amc_total_64_core_256GB_cpu_nodes': '26',
'alpine_amc_total_64_core_1TB_cpu_nodes': '2',
@@ -65,7 +66,7 @@
'alpine_total_atesting_cpu_nodes': '2',
'alpine_total_atesting_a100_nodes': '1',
'alpine_total_atesting_mi100_nodes': '1',

'alpine_total_dtn_nodes': '4',
# Alpine Array Jobs
'alpine_max_number_array_jobs' : '1,000'
}