
Conversation

@timosachsenberg timosachsenberg commented Jan 22, 2026

Summary

This PR adds comprehensive multi-file support to the onsite package, enabling processing of multiple mzML/idXML file pairs with all three algorithms (AScore, PhosphoRS, LucXor) and global FLR calculation across all files.

New Features

  • onsite batch command - Process multiple file pairs in a single command
  • File validation - Comprehensive sanity checks for input/output files
  • Global FLR calculation - Cross-file false localization rate estimation using KDE
  • Summary reports - TSV reports with per-file and aggregate statistics
  • FLR curve plotting - Cumulative FLR curve visualization (optional matplotlib)
  • Cross-file FLR metadata - Inject global FLR values back into idXML files

Usage Guide

Basic Multi-File Processing

Process multiple file pairs with all three algorithms:

onsite batch \
  -in sample1.mzML -id sample1.idXML \
  -in sample2.mzML -id sample2.idXML \
  -in sample3.mzML -id sample3.idXML

This will:

  1. Validate all input files exist and are readable
  2. Generate output files as <input>_localized.idXML
  3. Run AScore, PhosphoRS, and LucXor on each file pair
  4. Merge results into a single output per file

Specifying Output Files

onsite batch \
  -in sample1.mzML -id sample1.idXML -out results/sample1_scored.idXML \
  -in sample2.mzML -id sample2.idXML -out results/sample2_scored.idXML

With Global FLR Calculation

Enable FLR estimation with decoys and generate a summary report:

onsite batch \
  -in sample1.mzML -id sample1.idXML \
  -in sample2.mzML -id sample2.idXML \
  -in sample3.mzML -id sample3.idXML \
  --add-decoys \
  --global-flr-report flr_summary.tsv \
  --global-flr-threshold 0.01

With FLR Curve Plot

Generate a cumulative FLR curve (requires matplotlib):

# Install matplotlib first
pip install matplotlib

onsite batch \
  -in sample1.mzML -id sample1.idXML \
  -in sample2.mzML -id sample2.idXML \
  --add-decoys \
  --global-flr-report flr_summary.tsv \
  --global-flr-plot flr_curve.png

Update idXML Files with Cross-File FLR

Inject the calculated cross-file global FLR values back into the output idXML files:

onsite batch \
  -in sample1.mzML -id sample1.idXML \
  -in sample2.mzML -id sample2.idXML \
  --add-decoys \
  --update-with-global-flr

This adds two new metadata fields to each peptide hit:

  • Luciphor_cross_file_global_flr
  • Luciphor_cross_file_local_flr

Full Example with All Options

onsite batch \
  -in file1.mzML -id file1.idXML -out output/file1.idXML \
  -in file2.mzML -id file2.idXML -out output/file2.idXML \
  -in file3.mzML -id file3.idXML -out output/file3.idXML \
  --fragment-mass-tolerance 0.05 \
  --fragment-mass-unit Da \
  --threads 4 \
  --add-decoys \
  --global-flr-report summary.tsv \
  --global-flr-threshold 0.01 \
  --global-flr-plot flr_curve.png \
  --update-with-global-flr \
  --debug

Command Line Options

  -in, --in-file PATH
      Input mzML file(s). Use multiple times for batch processing. Required.
  -id, --id-file PATH
      Input idXML file(s). Use multiple times for batch processing. Required.
  -out, --out-file PATH
      Output idXML file(s). If not specified, generates <input>_localized.idXML.
  --global-flr-report PATH
      Output path for the global FLR summary report (TSV format).
  --global-flr-threshold FLOAT
      Global FLR threshold for filtering (default: 0.01).
  --global-flr-plot PATH
      Output path for the cumulative FLR curve plot (PNG/PDF, requires matplotlib).
  --update-with-global-flr
      Update output idXML files with cross-file global FLR metadata.
  --fragment-mass-tolerance FLOAT
      Fragment mass tolerance value (default: 0.05).
  --fragment-mass-unit [Da|ppm]
      Tolerance unit (default: Da).
  --threads INTEGER
      Number of parallel threads (default: 1).
  --add-decoys
      Include A (PhosphoDecoy) as a potential phosphorylation site for FLR estimation.
  --debug
      Enable debug output.

Global FLR Report Format

The TSV report contains:

# Global FLR Summary Report
# Files processed: 3
# Total PSMs: 15000
# Total target PSMs: 14500
#
# Per-file Summary
file_path       total_psms  target_psms  decoy_psms  passing_0.001  passing_0.005  passing_0.010  passing_0.050  passing_0.100
sample1.idXML   5000        4800         200         3500           4000           4200           4500           4600
sample2.idXML   5000        4850         150         3600           4100           4300           4600           4700
sample3.idXML   5000        4850         150         3550           4050           4250           4550           4650
TOTAL           15000       14500        500         10650          12150          12750          13650          13950

# Threshold Analysis
threshold   total_passing   percentage
0.001       10650           73.4%
0.005       12150           83.8%
0.010       12750           87.9%
0.050       13650           94.1%
0.100       13950           96.2%
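The Threshold Analysis rows can be reproduced from per-PSM global FLR values. A minimal sketch (the counts in the report above come from real data; the example values here are made up purely for illustration):

```python
def threshold_analysis(global_flrs, thresholds=(0.001, 0.005, 0.01, 0.05, 0.1)):
    """For each threshold, count PSMs whose global FLR passes it.

    `global_flrs` is a list of per-PSM global FLR estimates; returns
    (threshold, total_passing, percentage) tuples matching the report rows.
    """
    total = len(global_flrs)
    rows = []
    for t in thresholds:
        passing = sum(1 for f in global_flrs if f <= t)
        rows.append((t, passing, 100.0 * passing / total if total else 0.0))
    return rows
```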

File Validation

The batch command performs comprehensive validation:

  1. File count matching - Equal number of mzML and idXML files
  2. File existence - All input files exist
  3. File readability - All input files are readable
  4. File extensions - Correct extensions (.mzML, .idXML)
  5. PyOpenMS validation - Files can be parsed by PyOpenMS
  6. Output path validation - Parent directories exist and are writable

Example error messages:

Error: Mismatched file counts: 3 mzML files vs 2 idXML files. Each mzML file must have a corresponding idXML file.

Error: File pair 2: mzML file not found: /path/to/missing.mzML

Error: File pair 1: Cannot read idXML file with PyOpenMS: /path/to/corrupted.idXML
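Checks 1 through 4 can be sketched as plain-Python sanity checks. This is a simplified illustration, not the actual FileValidator implementation, which additionally performs the PyOpenMS parse (check 5) and output-path validation (check 6):

```python
from pathlib import Path

class FileValidationError(Exception):
    """Raised when a batch input fails a sanity check."""

def validate_pairs(mzml_files, idxml_files):
    """Simplified sketch of checks 1-4 above (hypothetical helper)."""
    # Check 1: equal file counts
    if len(mzml_files) != len(idxml_files):
        raise FileValidationError(
            f"Mismatched file counts: {len(mzml_files)} mzML files vs "
            f"{len(idxml_files)} idXML files. Each mzML file must have "
            f"a corresponding idXML file."
        )
    for i, (mzml, idxml) in enumerate(zip(mzml_files, idxml_files), 1):
        for path, ext, kind in ((mzml, ".mzml", "mzML"), (idxml, ".idxml", "idXML")):
            p = Path(path)
            # Checks 2-3: existence and readability (is_file covers existence)
            if not p.is_file():
                raise FileValidationError(f"File pair {i}: {kind} file not found: {path}")
            # Check 4: expected extension (case-insensitive)
            if p.suffix.lower() != ext:
                raise FileValidationError(f"File pair {i}: unexpected {kind} extension: {path}")
```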

Installation

The plotting feature requires matplotlib as an optional dependency:

# Install with plotting support
pip install pyonsite[plot]

# Or install matplotlib separately
pip install matplotlib

Files Changed

  onsite/validation.py   New: FileValidator class with sanity checks
  onsite/global_flr.py   New: GlobalFLRCalculator for cross-file FLR
  onsite/onsitec.py      Modified: added batch command
  pyproject.toml         Modified: added optional matplotlib dependency

Test Plan

  • Verify onsite batch --help shows all options correctly
  • Verify validation module imports successfully
  • Verify global_flr module imports successfully
  • Verify all existing tests pass
  • Test with real multi-file dataset
  • Test global FLR report generation
  • Test FLR curve plotting (with matplotlib)
  • Test cross-file FLR metadata injection

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added a batch processing command for running localization across multiple spectrum/ID file pairs.
    • Built cross-file global FLR analysis with cumulative FLR visualization, threshold reports, CSV/Excel export, and a summary report.
    • Input file validation for spectra and ID files before processing.
    • Experimental support to embed cross-file FLR metadata into output files.
  • Chores

    • Added plotting as an optional dependency (matplotlib).


This adds support for processing multiple mzML/idXML file pairs in a single
command, with cross-file global FLR calculation capabilities.

New features:
- `onsite batch` command for multi-file processing
- File validation module with sanity checks
- Global FLR calculator for cross-file analysis
- TSV summary report generation
- Optional cumulative FLR curve plotting (requires matplotlib)
- Cross-file FLR metadata injection into idXML files

New files:
- onsite/validation.py: FileValidator class for input/output validation
- onsite/global_flr.py: GlobalFLRCalculator for cross-file FLR

Modified files:
- onsite/onsitec.py: Added batch command
- pyproject.toml: Added optional matplotlib dependency

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

coderabbitai bot commented Jan 22, 2026

📝 Walkthrough

Walkthrough

Adds cross-file global FLR computation, a batch CLI to run localization algorithms across mzML/idXML pairs, and a file validation utility. New plotting/export/reporting and idXML metadata update hooks are included; optional plotting dependency added to pyproject.toml.

Changes

Cohort / File(s) Summary
Global FLR Calculation
onsite/global_flr.py
New module providing PeptideRecord and FileSummary dataclasses and GlobalFLRCalculator for extracting peptide/PSM data from idXML, computing cumulative/global FLR (ordering, monotonicity, smoothing), exporting CSV/Excel, plotting (matplotlib optional), generating TSV summary reports, threshold queries, and a placeholder for writing global FLR metadata back to idXML files.
Batch CLI Command
onsite/onsitec.py
Adds batch() CLI entry to process multiple -in/-id pairs: validates pairs, runs AScore/PhosphoRS/LucXor per pair in temporary workspaces, merges per-file results, and optionally computes global FLR (reports, exports, plots, idXML updates). Registers command on CLI group.
File Validation
onsite/validation.py
New FileValidator and FileValidationError to validate input mzML/idXML pairs (existence, extension, readability, PyOpenMS header/load checks) and to validate or synthesize output idXML paths with structured errors/warnings.
Dependency Configuration
pyproject.toml
Adds optional dependency group plot = ["matplotlib>=3.5.0"] to enable plotting features.
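Based on the summary above, the new optional-dependency group in pyproject.toml presumably looks like this (exact placement in the file may differ):

```toml
[project.optional-dependencies]
plot = ["matplotlib>=3.5.0"]
```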

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant CLI as Batch CLI
    participant Validator as FileValidator
    participant AlgoRunner as AScore/PhosphoRS/LucXor
    participant GlobalFLR as GlobalFLRCalculator
    participant idXML as idXML Files

    User->>CLI: invoke batch(mzML/idXML pairs, options)
    CLI->>Validator: validate_file_pairs(in_files, id_files)
    Validator-->>CLI: validated pairs
    CLI->>Validator: validate_output_paths(out_files, in_files)
    Validator-->>CLI: resolved outputs

    loop per pair
        CLI->>AlgoRunner: run localization algorithms (temp workspace)
        AlgoRunner-->>CLI: per-file localized idXML
        CLI->>idXML: merge per-file results into output
    end

    opt global_flr enabled
        CLI->>GlobalFLR: init (min_sites, options)
        loop per idXML
            CLI->>GlobalFLR: extract_from_idxml(idxml_path)
            GlobalFLR-->>CLI: peptide/PSM records
        end
        CLI->>GlobalFLR: calculate_global_flr()
        GlobalFLR-->>CLI: cumulative FLR data
        CLI->>GlobalFLR: generate_summary_report(), export_flr_data(), plot_cumulative_flr_curve()
        GlobalFLR-->>CLI: report/plot/files
        CLI->>GlobalFLR: update_idxml_files(idxml_paths)
        GlobalFLR-->>idXML: write global FLR metadata
    end

    CLI-->>User: processing complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Suggested reviewers

  • ypriverol

Poem

🐰 I hopped through idXML rows with care,
Counting sites from here and there,
Batch by batch the files aligned,
Global FLR stats now combined,
A carrot-sized plot to celebrate the share! 🥕📈

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding a new batch processing capability with global FLR calculation across multiple files.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.





@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@onsite/global_flr.py`:
- Around line 464-491: The code builds file_records mapping psm_records by float
delta_score which is fragile and can drop duplicates or fail due to float
precision; instead change matching to use a stable composite key (for example
spectrum reference and hit rank or peptide sequence + charge) rather than
delta_score alone: when iterating pep_id and its hit (pep_id.getHits()[0])
compute the same composite key from the hit (e.g.,
hit.getMetaValue("spectrum_reference") and hit.getMetaValue("rank") or use
pep_id properties) and use that to look up the corresponding record in
psm_records (do not rely on float equality), or alternatively iterate
psm_records for matches by comparing spectrum reference/hit id to reliably
assign record to variable record. Ensure references to psm_records,
file_records, pep_id, getHits, Luciphor_delta_score and idxml_path are updated
to use the composite key or direct matching logic.
- Around line 370-383: The generate_summary_report method currently ignores the
threshold parameter; update it so the supplied threshold is included as the
"primary" FLR to report instead of being unused: inside generate_summary_report
(look for the thresholds variable) incorporate the threshold argument by adding
it to the thresholds list (ensuring no duplicates and preserving desired order,
e.g., put threshold first or merge and sort) or validate/remove it from the
signature if you prefer not to support customization; adjust any downstream code
that relies on the hardcoded thresholds accordingly.
🧹 Nitpick comments (9)
onsite/validation.py (4)

22-24: Consider using frozenset for immutable constants.

The extension sets are intended to be constants but are mutable. Using frozenset makes the immutability explicit and avoids the Ruff warning about mutable class attributes.

♻️ Suggested fix
-    # Supported file extensions
-    SPECTRUM_EXTENSIONS = {'.mzml'}
-    ID_EXTENSIONS = {'.idxml'}
+    # Supported file extensions (immutable)
+    SPECTRUM_EXTENSIONS: frozenset[str] = frozenset({'.mzml'})
+    ID_EXTENSIONS: frozenset[str] = frozenset({'.idxml'})

66-66: Consider adding strict=True to zip().

While the length check above ensures equal counts, using strict=True adds a defensive layer and is more explicit about the expected behavior.

♻️ Suggested fix
-        for i, (mzml_path, idxml_path) in enumerate(zip(in_files, id_files), 1):
+        for i, (mzml_path, idxml_path) in enumerate(zip(in_files, id_files, strict=True), 1):

119-123: Chain the exception for better debugging context.

Using raise ... from e preserves the original traceback, which helps diagnose PyOpenMS parsing failures.

♻️ Suggested fix
         except Exception as e:
             raise FileValidationError(
                 f"File pair {pair_index}: Cannot read mzML file with PyOpenMS: "
-                f"{file_path}. Error: {str(e)}"
-            )
+                f"{file_path}. Error: {e!s}"
+            ) from e

167-171: Chain the exception for better debugging context.

Same as above: use raise ... from e to preserve the original traceback.

♻️ Suggested fix
         except Exception as e:
             raise FileValidationError(
                 f"File pair {pair_index}: Cannot read idXML file with PyOpenMS: "
-                f"{file_path}. Error: {str(e)}"
-            )
+                f"{file_path}. Error: {e!s}"
+            ) from e
onsite/global_flr.py (3)

174-178: Remove extraneous f prefix from strings without placeholders.

Lines 175-176 use f-strings but contain no placeholders.

♻️ Suggested fix
             logger.warning(
-                f"Insufficient PSMs for global FLR calculation "
-                f"(need at least 2 targets and 2 decoys)"
+                "Insufficient PSMs for global FLR calculation "
+                "(need at least 2 targets and 2 decoys)"
             )

313-319: Chain the original ImportError for better diagnostics.

When re-raising, preserve the original exception context.

♻️ Suggested fix
         try:
             import matplotlib.pyplot as plt
-        except ImportError:
+        except ImportError as err:
             raise ImportError(
                 "matplotlib is required for plotting. "
                 "Install it with: pip install matplotlib"
-            )
+            ) from err

345-345: Prefix unused variable with underscore.

The fig variable from plt.subplots() is unused; prefix it with _ to indicate this is intentional.

♻️ Suggested fix
-        fig, ax = plt.subplots(figsize=(10, 6))
+        _fig, ax = plt.subplots(figsize=(10, 6))
onsite/onsitec.py (2)

570-576: Clarify that --threads applies per-algorithm, not for parallel file processing.

The --threads option is passed to individual algorithms but files are processed sequentially. Users might expect parallel file processing. Consider updating the help text to clarify this behavior.

♻️ Suggested help text update
 `@click.option`(
     "--threads",
     "threads",
     type=int,
     default=1,
-    help="Number of parallel threads (default: 1).",
+    help="Number of threads for each algorithm (default: 1). Files are processed sequentially.",
 )

656-657: Consider adding strict=True to zip() for defensive coding.

Although the lengths are validated above, using strict=True makes the invariant explicit.

♻️ Suggested fix
-        for i, ((mzml_path, idxml_path), out_path) in enumerate(
-            zip(file_pairs, output_paths), 1
+        for i, ((mzml_path, idxml_path), out_path) in enumerate(
+            zip(file_pairs, output_paths, strict=True), 1
         ):

Comment on lines 370 to 383
    def generate_summary_report(
        self,
        output_path: str,
        threshold: float = 0.01
    ) -> None:
        """
        Generate a TSV summary report with per-file and total counts.

        Args:
            output_path: Path to save the report (TSV)
            threshold: Primary FLR threshold to report (default: 0.01)
        """
        # Calculate counts at multiple thresholds
        thresholds = [0.001, 0.005, 0.01, 0.05, 0.1]

⚠️ Potential issue | 🟡 Minor

The threshold parameter is unused.

The threshold argument is documented as "Primary FLR threshold to report" but is never used in the method body. The method hardcodes thresholds = [0.001, 0.005, 0.01, 0.05, 0.1] instead. Consider either using the parameter to customize the report or removing it from the signature.

🔧 Possible fix: include the user's threshold in the report
     def generate_summary_report(
         self,
         output_path: str,
         threshold: float = 0.01
     ) -> None:
         ...
         # Calculate counts at multiple thresholds
-        thresholds = [0.001, 0.005, 0.01, 0.05, 0.1]
+        thresholds = sorted(set([0.001, 0.005, 0.01, 0.05, 0.1, threshold]))
🧰 Tools
🪛 Ruff (0.14.13)

373-373: Unused method argument: threshold

(ARG002)



github-actions bot commented Jan 22, 2026

Algorithm Comparison Test Results

============================= test session starts ==============================
platform linux -- Python 3.11.14, pytest-9.0.2, pluggy-1.6.0 -- /opt/hostedtoolcache/Python/3.11.14/x64/bin/python
cachedir: .pytest_cache
rootdir: /home/runner/work/onsite/onsite
configfile: pyproject.toml
plugins: cov-7.0.0
collecting ... collected 3 items

tests/test_algorithm_comparison.py::TestAlgorithmComparison::test_lucxor_comparison 
================================================================================
LucXor Comparison Results (q-value < 0.01)
================================================================================

STRICT (Local FLR < 0.01):
  New results: 883
  Reference results: 848
  Overlap: 787 (92.8%)
  Recall: 92.8% (new found 787/848 reference sites)
  Gain rate: 10.9% (96 new-only sites)
  Lost sites: 61
  Count ratio: 1.04x

MODERATE (Local FLR < 0.05):
  New results: 1093
  Reference results: 1075
  Overlap: 1026 (95.4%)
  Recall: 95.4% (new found 1026/1075 reference sites)
  Gain rate: 6.1% (67 new-only sites)
  Lost sites: 49
  Count ratio: 1.02x

LENIENT (Local FLR < 0.1):
  New results: 1108
  Reference results: 1089
  Overlap: 1042 (95.7%)
  Recall: 95.7% (new found 1042/1089 reference sites)
  Gain rate: 6.0% (66 new-only sites)
  Lost sites: 47
  Count ratio: 1.02x
PASSED
tests/test_algorithm_comparison.py::TestAlgorithmComparison::test_ascore_comparison 
================================================================================
AScore Comparison Results (q-value < 0.01)
================================================================================

STRICT (AScore >= 20):
  New results: 914
  Reference results: 914
  Overlap: 885 (96.8%)
  Recall: 96.8% (new found 885/914 reference sites)
  Gain rate: 3.2% (29 new-only sites)
  Lost sites: 29
  Count ratio: 1.00x

MODERATE (AScore >= 15):
  New results: 1015
  Reference results: 1015
  Overlap: 986 (97.1%)
  Recall: 97.1% (new found 986/1015 reference sites)
  Gain rate: 2.9% (29 new-only sites)
  Lost sites: 29
  Count ratio: 1.00x

LENIENT (AScore >= 3):
  New results: 1067
  Reference results: 1067
  Overlap: 1038 (97.3%)
  Recall: 97.3% (new found 1038/1067 reference sites)
  Gain rate: 2.7% (29 new-only sites)
  Lost sites: 29
  Count ratio: 1.00x
PASSED
tests/test_algorithm_comparison.py::TestAlgorithmComparison::test_phosphors_comparison 
================================================================================
PhosphoRS Comparison Results (q-value < 0.01)
================================================================================

STRICT (Site probability > 99%):
  New results: 1241
  Reference results: 1234
  Overlap: 1056 (85.6%)
  Recall: 85.6% (new found 1056/1234 reference sites)
  Gain rate: 14.9% (185 new-only sites)
  Lost sites: 178
  Count ratio: 1.01x

MODERATE (Site probability > 90%):
  New results: 1241
  Reference results: 1234
  Overlap: 1056 (85.6%)
  Recall: 85.6% (new found 1056/1234 reference sites)
  Gain rate: 14.9% (185 new-only sites)
  Lost sites: 178
  Count ratio: 1.01x

LENIENT (Site probability > 75%):
  New results: 1241
  Reference results: 1234
  Overlap: 1056 (85.6%)
  Recall: 85.6% (new found 1056/1234 reference sites)
  Gain rate: 14.9% (185 new-only sites)
  Lost sites: 178
  Count ratio: 1.01x
PASSED

============================== 3 passed in 40.65s ==============================

Key changes:
- Use direct (Phospho)/(PhosphoDecoy) counting from sequences
- Process peptides in order: 1-site -> 2-site -> 3-site
- Add unambiguous site detection (exclude peptides where all STYA sites are modified)
- Add sophisticated smoothing algorithms (early surge, Gaussian filter)
- Add Nature journal styling for publication-quality plots
- Export both PNG and PDF plot formats
- Export CSV and Excel FLR data files

FLR calculation: PhosphoDecoy / (Phospho + PhosphoDecoy)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
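The FLR formula quoted above extends naturally to a cumulative curve over a ranked PSM list. A minimal sketch, deliberately ignoring the ordering-by-site-count and smoothing steps that the key changes describe:

```python
def cumulative_flr(labels):
    """Cumulative FLR over a ranked PSM list.

    `labels` is an iterable of "Phospho" / "PhosphoDecoy" strings,
    already sorted by descending localization score.  At each rank:
        FLR = PhosphoDecoy / (Phospho + PhosphoDecoy)
    """
    phospho = decoy = 0
    curve = []
    for label in labels:
        if label == "PhosphoDecoy":
            decoy += 1
        else:
            phospho += 1
        curve.append(decoy / (phospho + decoy))
    return curve
```

Monotonicity enforcement and smoothing (e.g., the Gaussian filter mentioned in the key changes) would be applied on top of this raw curve.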

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@onsite/global_flr.py`:
- Around line 680-704: The update_idxml_files method currently logs a warning
and silently does nothing when sequence tracking is missing; instead fail fast
by raising a clear error (e.g., NotImplementedError or RuntimeError) from
update_idxml_files so callers immediately know FLR injection is unsupported
until sequence tracking is added. In the update_idxml_files implementation
(referencing update_idxml_files, cumulative_data, and
calculate_cumulative_flr_ordered) check for the missing sequence-tracking state
after calling calculate_cumulative_flr_ordered and raise with a descriptive
message instructing to implement sequence tracking or use the summary report;
this makes the missing implementation explicit rather than a no-op.
- Around line 185-220: The function calculate_cumulative_flr_ordered currently
never applies the configured min_sites and silently skips peptides with
total_sites > 3 during grouping; before building peptides_phospho /
peptides_phosphodecoy / peptides_mixed, filter records using self.min_sites
(e.g., records = [r for r in records if r.total_sites >= self.min_sites]) and
separately detect/log/track how many records were excluded because total_sites <
min_sites or total_sites > 3 so reporting/cumulative counts remain consistent;
update the logger message to include these exclusion counts and ensure the
grouping loop only sees the already-filtered records.

In `@onsite/onsitec.py`:
- Around line 538-555: The CLI help text is misleading: option
global_flr_threshold is only used for reporting (not for filtering/updating) and
update_with_global_flr is not implemented; update the click.option help strings
for "global_flr_threshold", "global_flr_plot", and "update_with_global_flr" to
accurately describe behavior (e.g., global_flr_threshold: "Threshold used for
reporting only — does not filter results"; global_flr_plot: "Output path for
cumulative FLR curve plot (reporting only, requires matplotlib)";
update_with_global_flr: either remove the flag or change its help to "Not
implemented" and add a runtime check in the code that raises a clear error or
warning when update_with_global_flr is passed). Reference the click.option
declarations for global_flr_threshold, global_flr_plot, and
update_with_global_flr when making these changes.
- Around line 784-790: The code builds flr_data_path with
global_flr_report.replace('.tsv', '_data.csv') which fails when the report
filename doesn't end with '.tsv' and can overwrite the summary; change the logic
around global_flr_report and flr_data_path (the block that calls
global_flr_calc.generate_summary_report and global_flr_calc.export_flr_data) to
derive the CSV name using a safe split-extension approach (e.g.,
os.path.splitext or pathlib.Path.stem/suffix) so you always append '_data.csv'
to the base name instead of relying on replace; update references to
flr_data_path accordingly and ensure imports (os or pathlib) are present.
♻️ Duplicate comments (1)
onsite/global_flr.py (1)

582-595: threshold parameter is still unused.

The report ignores the provided threshold and always uses hardcoded values.

🔧 Include the provided threshold in the report
-        thresholds = [0.001, 0.005, 0.01, 0.05, 0.1]
+        thresholds = sorted(set([threshold, 0.001, 0.005, 0.01, 0.05, 0.1]))

Comment on lines +185 to +220
    def calculate_cumulative_flr_ordered(self, exclude_unambiguous: bool = True) -> Tuple[List, List, List, List]:
        """
        Calculate cumulative FLR ordered by phosphorylation site count.

        Processing order:
        1. All 1-site Phospho peptides
        2. All 2-site Phospho peptides
        3. All 3-site Phospho peptides
        4. Mixed peptides (both Phospho and PhosphoDecoy)
        5. All 1-site PhosphoDecoy peptides
        6. All 2-site PhosphoDecoy peptides
        7. All 3-site PhosphoDecoy peptides

        Args:
            exclude_unambiguous: Whether to exclude unambiguous peptides

        Returns:
            Tuple of (site_counts, cumulative_flr, cumulative_phospho, cumulative_phosphodecoy)
        """
        logger.info("Calculating cumulative FLR ordered by phosphorylation site count...")

        # Filter records
        records = self.peptide_records
        if exclude_unambiguous:
            records = [r for r in records if not r.is_unambiguous]
            logger.info(f"Excluded {len(self.peptide_records) - len(records)} unambiguous peptides")

        # Group peptides by type and site count
        peptides_phospho = {1: [], 2: [], 3: []}       # Pure Phospho peptides
        peptides_phosphodecoy = {1: [], 2: [], 3: []}  # Pure PhosphoDecoy peptides
        peptides_mixed = []                            # Peptides with both Phospho and PhosphoDecoy

        for record in records:
            if record.total_sites not in [1, 2, 3]:
                continue


⚠️ Potential issue | 🟠 Major

Apply min_sites (and handle >3-site peptides) before grouping.

min_sites is documented but never applied, and peptides with >3 sites are silently skipped later. That can skew totals vs. cumulative FLR outputs. Please filter upfront (and log/track excluded counts) so reporting is consistent.

🔧 Proposed fix (apply min_sites + explicit filtering)
-        records = self.peptide_records
-        if exclude_unambiguous:
-            records = [r for r in records if not r.is_unambiguous]
+        records = self.peptide_records
+        if self.min_sites > 0:
+            records = [r for r in records if r.total_sites >= self.min_sites]
+        if exclude_unambiguous:
+            records = [r for r in records if not r.is_unambiguous]
+
+        unsupported = [r for r in records if r.total_sites not in (1, 2, 3)]
+        if unsupported:
+            logger.warning("Excluded %d peptides with >3 sites from global FLR", len(unsupported))
+        records = [r for r in records if r.total_sites in (1, 2, 3)]
🤖 Prompt for AI Agents
In `@onsite/global_flr.py` around lines 185 - 220, The function
calculate_cumulative_flr_ordered currently never applies the configured
min_sites and silently skips peptides with total_sites > 3 during grouping;
before building peptides_phospho / peptides_phosphodecoy / peptides_mixed,
filter records using self.min_sites (e.g., records = [r for r in records if
r.total_sites >= self.min_sites]) and separately detect/log/track how many
records were excluded because total_sites < min_sites or total_sites > 3 so
reporting/cumulative counts remain consistent; update the logger message to
include these exclusion counts and ensure the grouping loop only sees the
already-filtered records.
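The proposed filtering can be exercised in isolation. The sketch below uses a hypothetical `PeptideRecord` stand-in (the real record class lives in `onsite/global_flr.py` and carries more fields) to show how `min_sites`, the unambiguous filter, and the 1–3 site cap compose while tracking exclusion counts:

```python
import logging
from dataclasses import dataclass
from typing import List

logger = logging.getLogger("global_flr")

@dataclass
class PeptideRecord:
    # Minimal stand-in: only the fields the filter touches.
    total_sites: int
    is_unambiguous: bool = False

def filter_records(records: List[PeptideRecord], min_sites: int,
                   exclude_unambiguous: bool = True) -> List[PeptideRecord]:
    """Apply min_sites and the 1-3 site cap, logging exclusion counts."""
    kept = [r for r in records if r.total_sites >= min_sites]
    below_min = len(records) - len(kept)
    if exclude_unambiguous:
        kept = [r for r in kept if not r.is_unambiguous]
    unsupported = [r for r in kept if r.total_sites not in (1, 2, 3)]
    if below_min or unsupported:
        # Tracking both counts keeps reported totals consistent with
        # what the cumulative FLR loop actually processes.
        logger.warning("Excluded %d peptides below min_sites and %d with >3 sites",
                       below_min, len(unsupported))
    return [r for r in kept if r.total_sites in (1, 2, 3)]

records = [PeptideRecord(0), PeptideRecord(1), PeptideRecord(2),
           PeptideRecord(4), PeptideRecord(3, is_unambiguous=True)]
print(len(filter_records(records, min_sites=1)))  # 2
```

With `min_sites=1`, the 0-site record is dropped by the threshold, the unambiguous 3-site record by the ambiguity filter, and the 4-site record by the cap, leaving the 1- and 2-site records.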

Comment on lines +680 to +704
```python
    def update_idxml_files(self, idxml_paths: List[str]) -> None:
        """
        Update idXML files with cross-file global FLR metadata.

        Note: This calculates FLR based on the cumulative position of each peptide
        in the ordered processing.

        Args:
            idxml_paths: List of idXML file paths to update
        """
        from pyopenms import IdXMLFile, PeptideIdentificationList

        if not self.cumulative_data:
            self.calculate_cumulative_flr_ordered()

        # Build a lookup from sequence to FLR
        # Use the final FLR value for each unique sequence
        seq_to_flr = {}
        for data in self.cumulative_data:
            # We don't have sequence here, so we'll use a different approach
            pass

        logger.warning("update_idxml_files: Cross-file FLR injection requires sequence tracking. "
                       "Consider using the summary report instead.")
```

⚠️ Potential issue | 🟡 Minor

Avoid a silent no‑op in update_idxml_files.

The method advertises updates but only warns and returns. Consider failing fast until sequence tracking is implemented.

🔧 Make the missing implementation explicit
```diff
-        if not self.cumulative_data:
-            self.calculate_cumulative_flr_ordered()
-
-        # Build a lookup from sequence to FLR
-        # Use the final FLR value for each unique sequence
-        seq_to_flr = {}
-        for data in self.cumulative_data:
-            # We don't have sequence here, so we'll use a different approach
-            pass
-
-        logger.warning("update_idxml_files: Cross-file FLR injection requires sequence tracking. "
-                      "Consider using the summary report instead.")
+        raise NotImplementedError(
+            "update_idxml_files is not implemented yet. "
+            "Use generate_summary_report for cross-file FLR analysis for now."
+        )
```
📝 Committable suggestion


Suggested change

```diff
     def update_idxml_files(self, idxml_paths: List[str]) -> None:
         """
         Update idXML files with cross-file global FLR metadata.
         Note: This calculates FLR based on the cumulative position of each peptide
         in the ordered processing.
         Args:
             idxml_paths: List of idXML file paths to update
         """
         from pyopenms import IdXMLFile, PeptideIdentificationList
-        if not self.cumulative_data:
-            self.calculate_cumulative_flr_ordered()
-        # Build a lookup from sequence to FLR
-        # Use the final FLR value for each unique sequence
-        seq_to_flr = {}
-        for data in self.cumulative_data:
-            # We don't have sequence here, so we'll use a different approach
-            pass
-        logger.warning("update_idxml_files: Cross-file FLR injection requires sequence tracking. "
-                       "Consider using the summary report instead.")
+        raise NotImplementedError(
+            "update_idxml_files is not implemented yet. "
+            "Use generate_summary_report for cross-file FLR analysis for now."
+        )
```
🧰 Tools
🪛 Ruff (0.14.13)

680-680: Unused method argument: idxml_paths

(ARG002)


697-697: Local variable seq_to_flr is assigned to but never used

Remove assignment to unused variable seq_to_flr

(F841)


698-698: Loop control variable data not used within loop body

Rename unused data to _data

(B007)

🤖 Prompt for AI Agents
In `@onsite/global_flr.py` around lines 680 - 704, The update_idxml_files method
currently logs a warning and silently does nothing when sequence tracking is
missing; instead fail fast by raising a clear error (e.g., NotImplementedError
or RuntimeError) from update_idxml_files so callers immediately know FLR
injection is unsupported until sequence tracking is added. In the
update_idxml_files implementation (referencing update_idxml_files,
cumulative_data, and calculate_cumulative_flr_ordered) check for the missing
sequence-tracking state after calling calculate_cumulative_flr_ordered and raise
with a descriptive message instructing to implement sequence tracking or use the
summary report; this makes the missing implementation explicit rather than a
no-op.
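The fail-fast pattern suggested above is easy to demonstrate in isolation. The class below is an illustrative stand-in, not the PR's actual calculator; only the method name and the error message mirror the suggestion:

```python
class GlobalFLRCalculator:
    """Illustrative stand-in for the PR's calculator class."""

    def update_idxml_files(self, idxml_paths):
        # Fail fast instead of warning and silently returning, so callers
        # learn immediately that cross-file FLR injection is unsupported.
        raise NotImplementedError(
            "update_idxml_files is not implemented yet; "
            "use generate_summary_report for cross-file FLR analysis."
        )

calc = GlobalFLRCalculator()
try:
    calc.update_idxml_files(["sample1_localized.idXML"])
except NotImplementedError as exc:
    print(f"caught: {exc}")
```

A caller that previously believed the files were updated now gets an exception it can handle explicitly, which is the whole point of removing the silent no-op.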

Comment on lines +538 to +555
```python
    "--global-flr-threshold",
    "global_flr_threshold",
    type=float,
    default=0.01,
    help="Global FLR threshold for filtering (default: 0.01).",
)
@click.option(
    "--global-flr-plot",
    "global_flr_plot",
    type=click.Path(),
    help="Output path for cumulative FLR curve plot (requires matplotlib).",
)
@click.option(
    "--update-with-global-flr",
    "update_with_global_flr",
    is_flag=True,
    help="Update idXML files with cross-file global FLR metadata.",
)
```
⚠️ Potential issue | 🟡 Minor

Align CLI help text with behavior.

--global-flr-threshold is used for reporting only, and --update-with-global-flr is not implemented. The help text currently implies filtering/updates.

✏️ Suggested help text clarification
```diff
-    help="Global FLR threshold for filtering (default: 0.01).",
+    help="Global FLR threshold for reporting (default: 0.01).",
@@
-    help="Update idXML files with cross-file global FLR metadata.",
+    help="(Not yet supported) Intended to update idXML files with cross-file global FLR metadata.",
```
🤖 Prompt for AI Agents
In `@onsite/onsitec.py` around lines 538 - 555, The CLI help text is misleading:
option global_flr_threshold is only used for reporting (not for
filtering/updating) and update_with_global_flr is not implemented; update the
click.option help strings for "global_flr_threshold", "global_flr_plot", and
"update_with_global_flr" to accurately describe behavior (e.g.,
global_flr_threshold: "Threshold used for reporting only — does not filter
results"; global_flr_plot: "Output path for cumulative FLR curve plot (reporting
only, requires matplotlib)"; update_with_global_flr: either remove the flag or
change its help to "Not implemented" and add a runtime check in the code that
raises a clear error or warning when update_with_global_flr is passed).
Reference the click.option declarations for global_flr_threshold,
global_flr_plot, and update_with_global_flr when making these changes.
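One way to keep the flag visible while making its status explicit is a runtime guard, sketched here with click's test runner. The option name matches the PR; the guard itself is an assumption about how the authors might wire it, using a minimal hypothetical `batch` command:

```python
import click
from click.testing import CliRunner

@click.command()
@click.option(
    "--update-with-global-flr",
    "update_with_global_flr",
    is_flag=True,
    help="(Not yet supported) Update idXML files with cross-file global FLR metadata.",
)
def batch(update_with_global_flr):
    if update_with_global_flr:
        # Surface the missing feature instead of silently ignoring the flag.
        raise click.UsageError("--update-with-global-flr is not implemented yet.")
    click.echo("batch completed")

runner = CliRunner()
result = runner.invoke(batch, ["--update-with-global-flr"])
print(result.exit_code)  # 2: click maps UsageError to exit code 2
```

Invoking without the flag exits 0 as before, so existing workflows are unaffected until the feature lands.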

Comment on lines +784 to +790
```python
            if global_flr_report:
                click.echo(f"\n Generating FLR report: {global_flr_report}")
                global_flr_calc.generate_summary_report(global_flr_report, global_flr_threshold)

                # Also export raw FLR data
                flr_data_path = global_flr_report.replace('.tsv', '_data.csv')
                global_flr_calc.export_flr_data(flr_data_path)
```
⚠️ Potential issue | 🟠 Major

Avoid overwriting the summary when the report name isn’t .tsv.

replace('.tsv', ...) leaves the path unchanged if the extension differs, so the CSV export can overwrite the summary report.

🔧 Use splitext to build a safe data filename
```diff
-                flr_data_path = global_flr_report.replace('.tsv', '_data.csv')
+                report_root, _ = os.path.splitext(global_flr_report)
+                flr_data_path = f"{report_root}_data.csv"
```
📝 Committable suggestion


Suggested change

```diff
             if global_flr_report:
                 click.echo(f"\n Generating FLR report: {global_flr_report}")
                 global_flr_calc.generate_summary_report(global_flr_report, global_flr_threshold)
                 # Also export raw FLR data
-                flr_data_path = global_flr_report.replace('.tsv', '_data.csv')
+                report_root, _ = os.path.splitext(global_flr_report)
+                flr_data_path = f"{report_root}_data.csv"
                 global_flr_calc.export_flr_data(flr_data_path)
```
🤖 Prompt for AI Agents
In `@onsite/onsitec.py` around lines 784 - 790, The code builds flr_data_path with
global_flr_report.replace('.tsv', '_data.csv') which fails when the report
filename doesn't end with '.tsv' and can overwrite the summary; change the logic
around global_flr_report and flr_data_path (the block that calls
global_flr_calc.generate_summary_report and global_flr_calc.export_flr_data) to
derive the CSV name using a safe split-extension approach (e.g.,
os.path.splitext or pathlib.Path.stem/suffix) so you always append '_data.csv'
to the base name instead of relying on replace; update references to
flr_data_path accordingly and ensure imports (os or pathlib) are present.
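The failure mode is easy to demonstrate: `str.replace` is a silent no-op when the substring is absent, while `os.path.splitext` handles any extension. The helper name below is illustrative:

```python
import os

def data_path_for(report_path: str) -> str:
    """Derive '<base>_data.csv' next to the report, whatever its extension."""
    root, _ext = os.path.splitext(report_path)
    return f"{root}_data.csv"

# str.replace leaves the path unchanged when it doesn't contain '.tsv',
# so the CSV export would be written to the summary report's own path:
print("flr_summary.txt".replace('.tsv', '_data.csv'))  # flr_summary.txt
print(data_path_for("flr_summary.txt"))                # flr_summary_data.csv
print(data_path_for("flr_summary.tsv"))                # flr_summary_data.csv
```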
