Hi Niels
I was wondering why the output MGF files are so big compared to the raw files. If I convert the raw file from #4, which is 2.8 GB, I get an MGF output file of 13 GB. I checked that most of the rows correspond to peaks with an intensity of 0.0000000000, and removing them with grep
grep -v 0.0000000000 PD7505-GDTHP1-A_C2.mgf > PD7505-GDTHP1-A_C2_non_zero.mgf
leaves a 3.2 GB file.
There are still many peaks with very low intensities, such as 0.0000000277, which I am guessing are really noise and make the file unnecessarily big. I tried running msconvert on it:
msconvert PD7505-GDTHP1-A_C2_non_zero.mgf --filter "peakPicking true [2,3] zeroSamples removeExtra" --mgf -o msconvert_output
but the file size remains the same.
If I further trim the MGF file with awk to keep only intensities > 0.1, printing lines that start with a capital letter (to keep the spectrum headers) or whose second field is > 0.1, I get a 1.7 GB file:
awk '{ if ($2 > 0.1 || $1 ~ /^[A-Z]/) {print} }' msconvert_output/PD7505-GDTHP1-A_C2_non_zero.mgf > trimmed.mgf
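For reference, the same trimming could be sketched in Python. This is only a rough equivalent of the grep and awk steps above, not anything ThermoRawFileParser or msconvert does internally; the names `filter_mgf_peaks` and `filter_mgf_file` and the default threshold are my own assumptions. It parses the intensity as a number instead of matching text, so a peak with an intensity such as 10.0000000000 is not dropped by accident.

```python
def filter_mgf_peaks(lines, min_intensity=0.1):
    """Yield MGF lines, skipping peak lines with intensity <= min_intensity.

    Hypothetical helper: a peak line is assumed to be "<m/z> <intensity>",
    with both fields parseable as floats. Every other line (BEGIN IONS,
    TITLE=..., PEPMASS=..., END IONS, blank lines) is passed through.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            # Blank lines and single-token header lines are kept unchanged.
            yield line
            continue
        try:
            float(fields[0])              # m/z must be numeric
            intensity = float(fields[1])  # intensity must be numeric
        except ValueError:
            # Not a peak line (e.g. "BEGIN IONS", "PEPMASS=500.25 1000.0").
            yield line
            continue
        if intensity > min_intensity:
            yield line


def filter_mgf_file(src_path, dst_path, min_intensity=0.1):
    """Stream src_path to dst_path line by line, so large files fit in memory."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(filter_mgf_peaks(src, min_intensity))
```

Calling `filter_mgf_file("PD7505-GDTHP1-A_C2.mgf", "trimmed.mgf")` would do the zero removal and the 0.1 cutoff in a single pass over the original file.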
Although this reduces the file size considerably, the result is still much bigger than what one gets by running msconvertGUI on Windows with the same raw file. If I run msconvertGUI with these filters:
| Filter | Parameter |
|---|---|
| peakPicking | vendor msLevel=1- |
| zeroSamples | removeExtra 1- |
I get a 297 MB file.
From ThermoRawFileParser's README.md:
> It takes a thermo RAW file as input and outputs a metadata file and the MS2 spectra (centroided) in MGF format.
Could you please provide more information about how the tool centroids the spectra, and how exactly to emulate the output one would get by running msconvert on Windows? Ideally, MGF files with sizes similar to those produced by standard msconvert calls should be easy to produce, either with ThermoRawFileParser alone or in combination with msconvert filtering.
Thank you very much for your help!
Cheers
Antonio