Commit cb4c693

Merge branch 'master' into master
2 parents a24ca62 + 7d61b4e commit cb4c693

7 files changed: +127 -1 lines changed

.github/actions/spelling/allow/terms.txt

Lines changed: 5 additions & 0 deletions
@@ -1,4 +1,5 @@
 AARCH
+BGZF
 CINT
 CMSSW
 Cppyy
@@ -11,13 +12,16 @@ JIT'd
 Jacobians
 LLVM
 NVIDIA
+NVMe
 PTX
+Slib
 Softsusy
 Superbuilds
 TFormula
 TTree
 aarch
 bioinformatics
+blogs
 consteval
 cppyy
 cytokine
@@ -27,6 +31,7 @@ gsoc
 linkedin
 microenvironments
 pythonized
+ramview
 samtools
 sitemap
 softsusy

.github/actions/spelling/expect.txt

Lines changed: 1 addition & 0 deletions
@@ -1,4 +1,5 @@
 Reoptimization
+genomics
 reoptimization
 sustainability
 transitioning

_data/contributors.yml

Lines changed: 15 additions & 1 deletion
@@ -158,8 +158,22 @@
   github: "https://github.com/SahilPatidar"
   active: 1
   projects:
-    - title: "Out-Of-Process execution for Clang-Repl"
+    - title: "Advanced symbol resolution and reoptimization for clang-repl"
       status: Ongoing
+      description: |
+        This project aims to enhance Clang-Repl, an interactive C++ interpreter built
+        on top of LLVM’s ORC JIT infrastructure. Currently, Clang-Repl lacks a
+        mechanism to automatically load dynamic libraries when encountering unresolved
+        symbols. As a result, users must manually load the appropriate libraries if a
+        symbol used in their code resides in a specific dynamic library. To address
+        this limitation, we propose a solution that enables automatic library loading
+        for unresolved symbols. Additionally, the second goal of this project is to
+        introduce support for re-optimization within Clang-Repl, allowing code to
+        benefit from improved performance through dynamic optimization techniques.
+      mentors: Vassil Vassilev
+      proposal: /assets/docs/SahilPatidar_GSoC2025_Proposal.pdf
+    - title: "Out-Of-Process execution for Clang-Repl"
+      status: Completed
       description: |
         This project focuses on enhancing Clang-Repl, an interactive C++ interpreter
         that leverages LLVM's JIT infrastructure. The current in-process execution model
Lines changed: 106 additions & 0 deletions
@@ -0,0 +1,106 @@
---
title: "Using ROOT in the field of genome sequencing"
layout: post
excerpt: "A GSoC 2025 project aiming to advance genomic data management by implementing ROOT's next-generation RNTuple format for sequence alignment storage."
sitemap: false
author: Aditya Pandey
permalink: blogs/gsoc25_aditya_pandey_introduction_blog/
banner_image: /images/blog/genome_project_banner.jpeg
date: 2025-05-13
tags: gsoc root genome bioinformatics
---

### Introduction

I am Aditya Pandey, currently a Bachelor of Technology student with experience in C++, Python,
and algorithm optimization. During Google Summer of Code 2025, I'll be working on the project
"Using ROOT in the field of genome sequencing" with CERN-HSF.

**Mentors**: Martin Vassilev, Jonas Rembser, Fons Rademakers, Vassil Vassilev

### The Challenge of Genomic Data

Genomic sequencing data volumes are growing exponentially, creating performance bottlenecks in
traditional storage formats. A single human genome sequencing project can generate files of
10 to 30 GB, and large-scale initiatives involve thousands of samples. This data tsunami requires
more efficient storage and query solutions than traditional formats like BAM and CRAM can provide.
Previous work with the GeneROOT project (a CERN initiative to use ROOT for genomics) has shown
promising results with the TTree format, demonstrating approximately 4x performance improvements.
My project aims to build on this foundation by implementing the next-generation RNTuple format,
which promises even greater efficiency.

### Why RNTuple for Genomics?

RNTuple is ROOT's successor to the TTree columnar data storage format, offering several advantages for genomic data:

- Improved Memory Efficiency: RNTuple's design allows uncompressed data to be directly mapped to memory without further copies due to the clear separation between offset/index data and payload data. This matches the in-memory layout on modern architectures and reduces RAM requirements when processing large genomic datasets.
- Type Safety: RNTuple provides compile-time type-safe interfaces through the use of templates, reducing common programming errors in genomic data processing. This is particularly valuable when handling the complex nested data types common in genomic sequence information.
- Enhanced Storage Efficiency: Recent benchmarks show RNTuple achieving 20-35% storage space savings compared to TTree, which already outperforms traditional genomic formats. This translates to significant storage cost reductions for large-scale genomic datasets.
- Optimized Performance: RNTuple demonstrates read throughput several times higher than TTree, along with better write performance and multicore scalability. It can fully harness the performance of modern NVMe drives and object stores.
- Columnar Access Pattern: The columnar structure is ideal for genomic region queries that often only access chromosome and position information, avoiding unnecessary data loading. This is particularly important for genomic data, where analysts frequently need to examine specific regions rather than entire sequences.

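To make the columnar access pattern concrete, here is a minimal sketch in plain Python. It does not use ROOT or the RNTuple API; the lists stand in for columns, and the record fields (`chrom`, `pos`, `seq`) and the `region_query` helper are assumptions made up for illustration. The point is only that a region query can scan two small columns and fetch the sequence payload for matching entries alone.

```python
# Hypothetical toy records: (chromosome, position, read sequence).
ROWS = [
    ("chr1", 100, "ACGTACGT"),
    ("chr1", 250, "TTGACCAA"),
    ("chr2", 120, "GGCATCGA"),
    ("chr2", 900, "CATGCATG"),
]

# Columnar layout: one list per field, aligned by entry index.
COLUMNS = {
    "chrom": [row[0] for row in ROWS],
    "pos": [row[1] for row in ROWS],
    "seq": [row[2] for row in ROWS],
}


def region_query(columns, chrom, start, end):
    """Return entry indices on `chrom` with start <= pos < end.

    Only the `chrom` and `pos` columns are read; `seq` is never touched here.
    """
    return [
        i
        for i, (c, p) in enumerate(zip(columns["chrom"], columns["pos"]))
        if c == chrom and start <= p < end
    ]


hits = region_query(COLUMNS, "chr2", 100, 500)
# The (large) sequence payload is fetched only for the matching entries.
sequences = [COLUMNS["seq"][i] for i in hits]
```

In a row-oriented layout the same query would have to read every full record, sequences included, just to inspect the two coordinate fields.
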
### Project Description

My project extends GeneROOT by implementing ROOT's next-generation RNTuple format for genomic data storage and analysis through two main stages:

#### Stage 1: Reproduction and Baseline Establishment

- Reproduce and validate previous GeneROOT benchmarks showing 4x performance gains with TTree
- Establish reliable baseline metrics for comparison
- Identify and address performance bottlenecks in the current implementation
- Optimize the existing code before transitioning to RNTuple
- Analyze and compare compression strategies from Samtools/HTSlib and ROOT

#### Stage 2: RNTuple Implementation

- Implement a genomic data model using RNTuple's templated field system
- Develop efficient converters between standard genomic formats (BAM/CRAM) and RNTuple
- Create advanced file splitting strategies (by chromosome, region, or read group)
- Implement high-performance query tools leveraging RNTuple's columnar structure

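The data model and the "split by chromosome" strategy from Stage 2 can be sketched as follows. This is a language-neutral illustration in Python, not the planned RNTuple implementation: the `AlignmentRecord` class, its field names, and `split_by_chromosome` are hypothetical and cover only a simplified subset of what an alignment record contains.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentRecord:
    """Hypothetical, simplified subset of the fields of an alignment record."""

    read_name: str
    chrom: str
    pos: int
    mapq: int
    sequence: str


def split_by_chromosome(records):
    """Group records per chromosome, each group sorted by position.

    Each group could then be written to its own file or cluster.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec.chrom].append(rec)
    return {
        chrom: sorted(recs, key=lambda r: r.pos)
        for chrom, recs in groups.items()
    }


records = [
    AlignmentRecord("r1", "chr2", 900, 60, "CATG"),
    AlignmentRecord("r2", "chr1", 250, 30, "TTGA"),
    AlignmentRecord("r3", "chr1", 100, 60, "ACGT"),
]
parts = split_by_chromosome(records)
```

Keeping each group sorted by position is what later allows a region query to use a search instead of a full scan.
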
### Compression Strategy Analysis

A key component of this project involves analyzing the compression techniques used by Samtools/HTSlib and comparing them with ROOT's compression capabilities:

#### BGZF (Blocked GZIP Format) in BAM Files

- Study the 64 KB block architecture that enables random access while maintaining gzip compatibility
- Test the nine compression levels (1-9) to determine optimal settings for genomic data
- Analyze the multi-threading implementation for parallel compression/decompression

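The idea behind block-wise compression with random access can be sketched with Python's standard `zlib` module. This is not the BGZF on-disk format (real BGZF blocks are gzip members with their own header fields and virtual offsets); the `compress_blocks` and `read_at` helpers and the in-memory index are assumptions for illustration only.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # uncompressed bytes per block, mirroring the 64 KB limit


def compress_blocks(data, level=6):
    """Compress `data` as independent zlib blocks; return (blob, index).

    index[i] holds (offset, length) of compressed block i inside blob.
    """
    blob = bytearray()
    index = []
    for start in range(0, len(data), BLOCK_SIZE):
        block = zlib.compress(data[start:start + BLOCK_SIZE], level)
        index.append((len(blob), len(block)))
        blob += block
    return bytes(blob), index


def read_at(blob, index, offset, size):
    """Read `size` bytes at an uncompressed offset, inflating only the blocks needed."""
    out = bytearray()
    while size > 0:
        block_no, within = divmod(offset, BLOCK_SIZE)
        if block_no >= len(index):
            break  # offset is past the end of the data
        comp_off, comp_len = index[block_no]
        block = zlib.decompress(blob[comp_off:comp_off + comp_len])
        chunk = block[within:within + size]
        if not chunk:
            break
        out += chunk
        offset += len(chunk)
        size -= len(chunk)
    return bytes(out)


data = b"ACGT" * 50_000  # 200,000 bytes, which spans four blocks
blob, index = compress_blocks(data)
window = read_at(blob, index, 65_534, 8)  # crosses the first block boundary
```

Because every block is compressed independently, blocks can also be compressed or decompressed in parallel, which is the property the multi-threading bullet above refers to.
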
#### CRAM Advanced Codecs

- Investigate rANS (range Asymmetric Numeral Systems) implementations
- Examine CRAM transforms including interleaving, RLE, bit-packing, and striped encoding
- Analyze integration techniques for external codecs like bzip2 and LZMA

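Two of the transforms named above, RLE and bit-packing, are simple enough to sketch directly. The code below is a generic illustration and does not follow CRAM's actual codec layout; the 2-bit base mapping and all function names are assumptions chosen for this example.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# Assumed mapping for illustration: 2 bits per base, A/C/G/T only.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def pack_bases(seq):
    """Pack an A/C/G/T string into bytes, four bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        packed[i // 4] |= BASE_TO_BITS[base] << (2 * (i % 4))
    return bytes(packed)


def unpack_bases(packed, length):
    """Recover `length` bases from bytes produced by pack_bases."""
    return "".join(
        BITS_TO_BASE[(packed[i // 4] >> (2 * (i % 4))) & 3]
        for i in range(length)
    )


qualities = [30, 30, 30, 40, 40]
runs = rle_encode(qualities)
packed = pack_bases("ACGTTGCA")
```

RLE pays off on columns with long runs, such as repeated quality scores, while bit-packing helps wherever the alphabet is much smaller than a byte. Real data also contains ambiguous bases such as `N`, which this 2-bit mapping does not handle.
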
#### Implementation Strategy

The findings from this analysis will inform the implementation of:

- Codec library integration with HTSlib's compression libraries where possible
- ROOT-native implementations of key algorithms where direct integration isn't possible
- Reference-based compression similar to CRAM
- Adaptive selection of optimal compression methods based on data characteristics

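Adaptive codec selection can be sketched with the codecs that ship in Python's standard library. This is a rough illustration of the idea, not the project's design: the `CODECS` table and the `compress_adaptive` and `decompress` helpers are hypothetical, and the selection rule used here (smallest output wins) is only one possible criterion.

```python
import bz2
import lzma
import zlib

# Candidate codecs as (compress, decompress) pairs from the standard library.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}


def compress_adaptive(payload):
    """Try every codec and keep the smallest output; return (codec_name, blob)."""
    results = {name: enc(payload) for name, (enc, _dec) in CODECS.items()}
    best = min(results, key=lambda name: len(results[name]))
    return best, results[best]


def decompress(codec_name, blob):
    """Invert compress_adaptive using the recorded codec name."""
    return CODECS[codec_name][1](blob)


payload = b"ACGT" * 10_000
codec_name, blob = compress_adaptive(payload)
restored = decompress(codec_name, blob)
```

Compressing every payload with every codec is expensive, so a practical version would more likely decide from a small sample or from the column type, and would weigh decompression speed as well as size.
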
### Project Architecture
<embed src="/images/blog/genome_sequencing.pdf" type="application/pdf" style="display: block; margin-left: auto; margin-right: auto;" width="100%" height="600px" />

### Implementation Progress

I have already made progress optimizing the existing GeneROOT codebase. My initial work on ramview.C has shown performance gains through:

- Replacing linear search with a two-phase approach combining exponential and binary search
- Implementing dynamic batch processing to reduce I/O operations
- Adding selective branch management to focus resources on necessary data
- Implementing resource optimization that scales based on file size

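The two-phase search mentioned in the first bullet can be sketched as below. This is a generic Python version of exponential search followed by binary search, not the code in ramview.C; the function name and the sample positions are made up for illustration.

```python
from bisect import bisect_left


def exponential_search(sorted_positions, target):
    """Return the index of the first element >= target.

    Phase 1 doubles an upper bound until it reaches or passes the target.
    Phase 2 runs a binary search inside the bracketed window.
    """
    n = len(sorted_positions)
    if n == 0 or sorted_positions[0] >= target:
        return 0
    bound = 1
    while bound < n and sorted_positions[bound] < target:
        bound *= 2
    # The element at bound // 2 is known to be < target, so the answer
    # lies in the window [bound // 2, min(bound, n)].
    return bisect_left(sorted_positions, target, bound // 2, min(bound, n))


positions = [10, 20, 20, 35, 50, 80, 130]
first_at_or_after_36 = exponential_search(positions, 36)
```

Compared with a plain binary search over the whole array, the exponential phase keeps the cost proportional to the logarithm of the answer's index, which helps when the requested region sits near the start of a very large sorted file.
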
### Expected Benefits

This project will deliver tools for handling rapidly growing genomic datasets with significantly improved performance:

- Faster genomic region queries through RNTuple's columnar structure
- Better memory efficiency when processing large genomic files
- Enhanced type safety through RNTuple's templated interfaces
- Optimized storage through specialized compression and splitting strategies
- A potential new standard for high-performance genomic data analysis
523 KB
Binary file not shown.
98.3 KB

images/blog/genome_sequencing.pdf

64.2 KB
Binary file not shown.
