Reduce memory usage for very large files (high page count and large file size) #115

jbarlow83 opened this issue Dec 9, 2016 · 8 comments

jbarlow83 (Collaborator) commented Dec 9, 2016

ocrmypdf uses excessive temporary storage for files with very high page counts (hundreds of pages), enough that it might consume all available space on small devices; e.g. a 100 MB PDF produces >2 GB of intermediates.

In the worst case we need O(number of pages × number of intermediates per page) temporary files. Currently we get major savings because some intermediates are just soft links to other intermediates.

Opportunities for savings:

  • PDF page splitting is breadth-first, producing one file per page long before most of those files are needed. It would be reasonable to split in groups of 25 pages instead. Implemented in v7.
  • Intermediate temporary files could be deleted as soon as every task interested in them has consumed them. Each object produced by the pipeline could carry a reference count: when a task finishes, ask how many downstream tasks want each of its output files and set the count accordingly; whenever a consuming task finishes, decrement the count of each of its inputs and delete the file once it reaches zero. (A minimal sketch follows this list.)
  • Prioritize depth over breadth when a worker process is free to select a new task, if ruffus or its replacement doesn't already do this. Depth-first topological ordering might get this for free.
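
A minimal sketch of the reference-counting idea in the second bullet, assuming a hypothetical IntermediateRegistry that the pipeline would call when a task produces a file and again when each consumer finishes with it (the names are illustrative, not ocrmypdf code):

    import os
    from collections import Counter

    class IntermediateRegistry:
        """Delete an intermediate file once every interested task has consumed it."""

        def __init__(self):
            self._refcounts = Counter()

        def register(self, path, consumers):
            # Called when a task produces `path`; `consumers` is the number of
            # downstream tasks that list it as an input.
            self._refcounts[path] += consumers

        def release(self, path):
            # Called each time a consuming task finishes with `path`.
            self._refcounts[path] -= 1
            if self._refcounts[path] <= 0:
                del self._refcounts[path]
                try:
                    os.unlink(path)
                except FileNotFoundError:
                    pass

    registry = IntermediateRegistry()
    registry.register("/tmp/ocrmypdf-job/0001.page.pdf", consumers=2)
    registry.release("/tmp/ocrmypdf-job/0001.page.pdf")  # OCR task finished
    registry.release("/tmp/ocrmypdf-job/0001.page.pdf")  # assembly finished; file deleted

In a multi-process pipeline the counts would have to live in shared state (a manager process, or the filesystem itself, which is what the hard-link idea in the next comment exploits).
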
@jbarlow83 (Collaborator Author)

Reference counting is probably the easiest way to get better performance here without complications.

One option would be to hard-link the input files each time a task is run and delete them when it is done. That delegates reference counting to the filesystem. Could fall back to cp if ln does not work. (A sketch follows the layout below.)

/tmp/ocrmypdf-somefile/
    triage/
         origin
         origin.pdf
    repair/
         input.pdf  # hardlinked to origin.pdf
         input.repaired.pdf
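
A sketch of the hard-link-with-copy-fallback idea using only standard-library calls (the helper name is illustrative, not ocrmypdf's actual code):

    import os
    import shutil

    def stage_input(src, dst):
        # Hard-link `src` into the task's directory so the filesystem's link
        # count acts as the reference count; deleting `dst` when the task is
        # done never touches the original. Fall back to a real copy if hard
        # links are not possible (different filesystem, unsupported FS, ...).
        try:
            os.link(src, dst)
        except OSError:
            shutil.copy2(src, dst)

    # e.g. staging the repair step's input from the triage step's output:
    stage_input("/tmp/ocrmypdf-somefile/triage/origin.pdf",
                "/tmp/ocrmypdf-somefile/repair/input.pdf")
    # ... run the repair task, then:
    os.unlink("/tmp/ocrmypdf-somefile/repair/input.pdf")
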

@jbarlow83 (Collaborator Author)

Ruffus is not capable of a depth-first/greedy exploration of the pipeline DAG.

@jbarlow83 (Collaborator Author)

Improved for v7

@jbarlow83 (Collaborator Author)

Improved again for v8.4/v9

@installgentoo

Why does ocrmypdf create all the PDF files in one batch, and then not clean up its working directory until it's done?

I would expect it to extract pages as they are consumed by threads, and to delete everything besides the finalized page *.pdf as the page finishes processing.

As a result of this, if you have /tmp mounted in RAM (tmpfs), you absolutely can OOM because of too many temporary files.

bertholdm commented May 31, 2023 via email

@jbarlow83 (Collaborator Author)

@installgentoo
The working directory is definitely cleaned up at the end of processing. However, some intermediate resources are not deleted as soon as they could be. This hasn't been implemented because it mainly affects people with low temporary storage who are processing a lot of files, and in most cases adding more storage is a good-enough workaround. As @bertholdm mentions, the introduction of plugins has made a lot of changes more complex.

Andres-Chandia commented Apr 26, 2025

I had to solve this in the following way:

    cd "device with more space"
    mkdir tmp
    sudo su
    cd /
    mv tmp origtmp
    ln -s /root/to/"device with more space"/tmp

When the process is finished:

    cd /
    rm tmp
    mv origtmp tmp
    cd "device with more space"
    rm -rf tmp

I hope it helps!!
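
A possibly lighter-weight variant of the same workaround, assuming ocrmypdf's working directory comes from Python's tempfile module (which honors the TMPDIR environment variable): point TMPDIR at the larger device instead of replacing /tmp system-wide. The mount path below is hypothetical.

    import os

    # Must be set before the temporary directory is first resolved.
    os.environ["TMPDIR"] = "/mnt/bigdisk/tmp"  # hypothetical device with more space

    import ocrmypdf

    ocrmypdf.ocr("input.pdf", "output.pdf")

From a shell, the equivalent would be running the ocrmypdf command with TMPDIR pointed at that directory.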
