Reducing mgf file size #5

@antortjim

Hi Niels

I was wondering why the output mgf files are so big compared to the raw files. If I convert the raw file from #4, which is 2.8 GB, I get a 13 GB mgf output file. I checked that most of the rows correspond to peaks with intensity equal to 0.0000000000, and removing them with grep

grep -v 0.0000000000 PD7505-GDTHP1-A_C2.mgf > PD7505-GDTHP1-A_C2_non_zero.mgf

leaves a 3.2 GB file.
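
For reference, the share of zero-intensity peaks can also be checked by parsing the intensity column rather than matching text, since a plain match on 0.0000000000 would also hit a line whose m/z or intensity is, for example, 10.0000000000. The Python sketch below is only an illustration: `count_peaks` is a hypothetical helper name, and it assumes each peak line is a whitespace-separated "m/z intensity" pair while every other line (BEGIN IONS, TITLE=..., PEPMASS=..., END IONS) is a header.

```python
def count_peaks(lines):
    """Return (total_peaks, zero_intensity_peaks) for an iterable of MGF lines.

    Assumption: a peak line starts with two numeric fields (m/z, intensity);
    anything that does not parse that way is treated as a header and skipped.
    """
    total = 0
    zero = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line or single-token header such as CHARGE=2+
        try:
            float(fields[0])               # m/z must be numeric
            intensity = float(fields[1])   # intensity must be numeric
        except ValueError:
            continue  # header line, e.g. "BEGIN IONS" or "PEPMASS=445.12 1000.0"
        total += 1
        if intensity == 0.0:
            zero += 1
    return total, zero


sample = [
    "BEGIN IONS",
    "TITLE=scan 1",
    "PEPMASS=445.1200 1000.0",
    "100.0000000000 0.0000000000",
    "101.5000000000 10.0000000000",
    "102.2500000000 0.0000000277",
    "END IONS",
]
print(count_peaks(sample))  # (3, 1)
```

Passing an open file handle instead of the list works the same way, because the function only iterates over lines.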
There are still many peaks with very low intensities, such as 0.0000000277, which I am guessing are noise and make the file unnecessarily big. I tried running msconvert on it:

msconvert PD7505-GDTHP1-A_C2_non_zero.mgf --filter "peakPicking true [2,3] zeroSamples removeExtra" --mgf -o msconvert_output

but the file size remains the same.

If I further trim the mgf file with awk to keep only intensities > 0.1, printing lines that start with a capital letter (to keep the spectrum headers) or whose second field is > 0.1, I get a 1.7 GB file.

awk '{ if ($2 > 0.1 || $1 ~ /^[A-Z]/) {print} }' msconvert_output/PD7505-GDTHP1-A_C2_non_zero.mgf > trimmed.mgf
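
The same trimming step can be sketched in Python, which makes the header-versus-peak decision explicit instead of relying on the first character of the line. This is a minimal sketch under assumptions: `filter_mgf` and the `min_intensity` parameter are hypothetical names, peak lines are assumed to start with numeric m/z and intensity fields, and every other line is passed through unchanged.

```python
def filter_mgf(lines, min_intensity=0.1):
    """Yield MGF lines, dropping peaks whose intensity is <= min_intensity.

    Assumption: a line whose first two whitespace-separated fields are both
    numeric is a peak; all other lines (headers, blank lines) are kept as-is.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            yield line  # blank line or single-token header
            continue
        try:
            float(fields[0])               # m/z
            intensity = float(fields[1])   # intensity
        except ValueError:
            yield line  # header such as "BEGIN IONS" or "PEPMASS=445.12 1000.0"
            continue
        if intensity > min_intensity:
            yield line


def filter_file(src_path, dst_path, min_intensity=0.1):
    """Stream src_path to dst_path line by line so large files fit in memory."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(filter_mgf(src, min_intensity))
```

Because `filter_mgf` is a generator, a multi-gigabyte file is processed one line at a time; `filter_file("in.mgf", "out.mgf")` would be a typical call, with both paths being placeholders.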

While this reduces the file size considerably, it is still much bigger than what one gets by running msconvertGUI on Windows with the same raw file. If I run the program with these filters:

Filter        Parameter
peakPicking   vendor msLevel=1-
zeroSamples   removeExtra 1-

I get a 297 MB file.

From ThermoRawFileParser's README.md:

It takes a thermo RAW file as input and outputs a metadata file and the MS2 spectra (centroided) in MGF format.

Could you please provide more information about how the tool centroids the spectra and how exactly to emulate the output one would get by running msconvert on Windows? Ideally, mgf files with sizes similar to those produced by standard msconvert calls should be easy to produce, either with ThermoRawFileParser alone or in combination with msconvert filtering.

Thank you very much for your help!

Cheers

Antonio
