
HPC-Inference

Batch inference solution for large-scale image datasets on HPC.

About

Problem: Many batch inference workflows are bottlenecked by I/O and sequential processing, leaving GPUs underutilized and stretching out processing times.

Key Bottlenecks:

  • Slow sequential large file loading (Disk → RAM)
  • Single-threaded image preprocessing
  • Data transfer delays (CPU ↔ GPU)
  • GPU idle time waiting for data
  • Sequential output writing

HPC-Inference addresses these bottlenecks with:

  • Parallel data loading: Eliminates disk I/O bottlenecks with optimized dataset loaders
  • Asynchronous preprocessing: Keeps GPUs fed with continuous data queues
  • SLURM integration: Deploy seamlessly on HPC clusters
  • Multi-GPU distribution: Scales across HPC nodes for maximum throughput
  • Resource profiling: Logs timing metrics and CPU/GPU usage rates to help optimize your configuration
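
Under the hood this is the standard producer/consumer pattern: CPU workers load and preprocess batches ahead of the GPU, and host-to-device copies overlap with compute. A minimal sketch of the general idea in plain PyTorch (illustrative only, not hpc-inference's own API):

import torch
from torch.utils.data import DataLoader

def batched_inference(model, dataset, batch_size=256):
    loader = DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=8,      # CPU workers load and preprocess in parallel
        pin_memory=True,    # page-locked buffers enable async CPU -> GPU copies
        prefetch_factor=4,  # each worker keeps batches queued ahead of the GPU
    )
    model = model.eval().cuda()
    with torch.inference_mode():
        for images in loader:
            images = images.cuda(non_blocking=True)  # overlaps with next-batch loading
            yield model(images).cpu()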

Getting Started

Setup with uv

The hpc_inference package's core functionality centers on two customized PyTorch datasets:

  • ParquetImageDataset for image data stored as compressed binary columns across multiple large Parquet files.
  • ImageFolderDataset for image data stored as individual files in a folder, using open formats such as JPG, PNG, or TIFF.

# Clone Repo
git clone https://github.com/Imageomics/hpc-inference.git
cd hpc-inference

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install package
uv venv hpc-inference-env
source hpc-inference-env/bin/activate

# Install base package
uv pip install -e .

Verify installation:

# Test the installation
python -c "import hpc_inference; print('✓ HPC-Inference installed successfully')"

The package also ships with a suite of ready-to-use job scripts for efficient batch inference with pretrained models on HPC systems. To use these scripts, install the optional dependencies for your use case:

# Check installation status and available features
python -c "from hpc_inference import print_installation_guide; print_installation_guide()"

uv pip install -e ".[openclip]"     # For CLIP embedding
uv pip install -e ".[detection]"    # For face/animal detection  
uv pip install -e ".[all]"          # Install dependency for all use cases

Use Cases Guide

ImageFolderDataset

For a comprehensive tutorial on using the ImageFolderDataset class, please see this notebook: ImageFolderDataset Guide.

This guide demonstrates working with the NEON Beetle dataset and covers basic usage, validation, multi-model preprocessing, distributed processing, and performance optimization.
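
For orientation before opening the notebook, basic usage might look like the following (the constructor arguments are again assumptions; the notebook shows the real API):

import torchvision.transforms as T
from torch.utils.data import DataLoader

from hpc_inference.datasets import ImageFolderDataset  # assumed import path

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])

# Hypothetical call: point the dataset at a folder of JPG/PNG/TIFF files
dataset = ImageFolderDataset("data/neon_beetles/", preprocess=preprocess)
loader = DataLoader(dataset, batch_size=128, num_workers=8, pin_memory=True)

for batch in loader:
    ...  # run the model of your choice on each preprocessed batch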

Use case 1:

  • Image Folder Dataset
  • Parquet Dataset
  • Self-specified task

Use case 2:

  • Large-scale CLIP embedding

Use case 3:

  • Large-scale face detection

Use case 4:

  • Large-scale animal detection with MegaDetector

Use case 5:

  • Grid search profiling

Project Structure

Acknowledgement

This project is a joint effort between the Imageomics Institute and the ABC Global Center.
