Reduce memory usage for very large files (high page count and large file size) #115

jbarlow83 opened this issue Dec 9, 2016 · 8 comments

jbarlow83 (Collaborator) commented Dec 9, 2016

ocrmypdf uses excessive temporary storage for files with very high page counts (hundreds of pages), enough that it might consume all available space on small devices; e.g. a 100 MB PDF produces >2 GB of intermediates.

In the worst case we need O(number of pages × number of intermediates per page) temporary files. Currently we get major savings because some intermediates are just soft links to other intermediates.

Opportunities for savings:

  • PDF page splitting is breadth-first, producing one file per page long before most of those files are needed. It would be reasonable to split in groups of 25 pages instead. Implemented in v7.
  • Intermediate temporary files could be deleted as soon as every task interested in them has consumed them. Each object produced by the pipeline could carry a reference count: when a task finishes, ask how many downstream tasks want each of its output files and set the count accordingly; whenever a consuming task finishes, decrement the count of each of its inputs and delete the file once it reaches zero. (A minimal sketch follows this list.)
  • Prioritize depth over breadth when a worker process is free to select a new task, if ruffus or its replacement doesn't already do this. Depth-first topological ordering might get this for free.
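
A minimal sketch of the reference-counting idea in the second bullet, assuming a hypothetical IntermediateRegistry that the pipeline would call when a task produces a file and again when each consumer finishes with it (the names are illustrative, not ocrmypdf code):

    import os
    from collections import Counter

    class IntermediateRegistry:
        """Delete an intermediate file once every interested task has consumed it."""

        def __init__(self):
            self._refcounts = Counter()

        def register(self, path, consumers):
            # Called when a task produces `path`; `consumers` is the number of
            # downstream tasks that list it as an input.
            self._refcounts[path] += consumers

        def release(self, path):
            # Called each time a consuming task finishes with `path`.
            self._refcounts[path] -= 1
            if self._refcounts[path] <= 0:
                del self._refcounts[path]
                try:
                    os.unlink(path)
                except FileNotFoundError:
                    pass

    registry = IntermediateRegistry()
    registry.register("/tmp/ocrmypdf-job/0001.page.pdf", consumers=2)
    registry.release("/tmp/ocrmypdf-job/0001.page.pdf")  # OCR task finished
    registry.release("/tmp/ocrmypdf-job/0001.page.pdf")  # assembly finished; file deleted

In a multi-process pipeline the counts would have to live in shared state (a manager process, or the filesystem itself, which is what the hard-link idea in the next comment exploits).
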
@jbarlow83 (Collaborator Author)

Reference counting is probably the easiest way to get better performance here without complications.

One option would be to hard-link the input files each time a task is run and delete them when it is done. That delegates reference counting to the filesystem. Could fall back to cp if ln does not work. (A sketch follows the layout below.)

/tmp/ocrmypdf-somefile/
    triage/
         origin
         origin.pdf
    repair/
         input.pdf  # hardlinked to origin.pdf
         input.repaired.pdf
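
A sketch of the hard-link-with-copy-fallback idea using only standard-library calls (the helper name is illustrative, not ocrmypdf's actual code):

    import os
    import shutil

    def stage_input(src, dst):
        # Hard-link `src` into the task's directory so the filesystem's link
        # count acts as the reference count; deleting `dst` when the task is
        # done never touches the original. Fall back to a real copy if hard
        # links are not possible (different filesystem, unsupported FS, ...).
        try:
            os.link(src, dst)
        except OSError:
            shutil.copy2(src, dst)

    # e.g. staging the repair step's input from the triage step's output:
    stage_input("/tmp/ocrmypdf-somefile/triage/origin.pdf",
                "/tmp/ocrmypdf-somefile/repair/input.pdf")
    # ... run the repair task, then:
    os.unlink("/tmp/ocrmypdf-somefile/repair/input.pdf")
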

@jbarlow83 (Collaborator Author)

Ruffus is not capable of a depth-first/greedy exploration of the pipeline DAG.

@jbarlow83 (Collaborator Author)

Improved for v7

@jbarlow83 (Collaborator Author)

Improved again for v8.4/v9

@installgentoo

Why does ocrmypdf create all the PDF files in one batch, and then not clean up its working directory until it's done?

I would expect it to extract pages as they are consumed by threads, and to delete everything besides the finalized page *.pdf as the page finishes processing.

As a result of this, if you have /tmp mounted in RAM (tmpfs), you absolutely can OOM because of too many temporary files.

bertholdm commented May 31, 2023 via email

@jbarlow83 (Collaborator Author)

@installgentoo
The working directory is definitely cleaned up at the end of processing. However, some intermediate resources are not deleted as soon as they could be. This hasn't been implemented because it mainly affects people with low temporary storage who are processing a lot of files, and in most cases adding more storage is a good-enough workaround. As @bertholdm mentions, the introduction of plugins has made a lot of changes more complex.

Andres-Chandia commented Apr 26, 2025

I had to solve this in the following way:

    cd "device with more space"
    mkdir tmp
    sudo su
    cd /
    mv tmp origtmp
    ln -s /root/to/"device with more space"/tmp

When the process is finished:

    cd /
    rm tmp
    mv origtmp tmp
    cd "device with more space"
    rm -rf tmp

I hope it helps!!
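
A possibly lighter-weight variant of the same workaround, assuming ocrmypdf's working directory comes from Python's tempfile module (which honors the TMPDIR environment variable): point TMPDIR at the larger device instead of replacing /tmp system-wide. The mount path below is hypothetical.

    import os

    # Must be set before the temporary directory is first resolved.
    os.environ["TMPDIR"] = "/mnt/bigdisk/tmp"  # hypothetical device with more space

    import ocrmypdf

    ocrmypdf.ocr("input.pdf", "output.pdf")

From a shell, the equivalent would be running the ocrmypdf command with TMPDIR pointed at that directory.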
