Reduce memory usage for very large files (high page count and large file size) #115
Comments
Reference counting is probably the easiest way to get better performance here without complications. One option would be to hard-link the input files each time a task runs and delete the links when the task is done. That delegates refcounting to the filesystem. We could fall back to cp when ln does not work.
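A minimal sketch of that idea (the helper names `claim_input`/`release_input` are hypothetical, not ocrmypdf API): hard links make the filesystem's link count act as the reference count, with a real copy as the fallback when linking fails, e.g. across filesystems.

```python
import os
import shutil

def claim_input(src: str, workdir: str) -> str:
    """Make `src` available in `workdir` without duplicating its data."""
    dst = os.path.join(workdir, os.path.basename(src))
    try:
        os.link(src, dst)  # hard link: link count++ at the filesystem level
    except OSError:
        shutil.copy2(src, dst)  # fallback: real copy when ln is unsupported
    return dst

def release_input(path: str) -> None:
    """Drop this task's reference; data is freed when the last link goes."""
    os.unlink(path)
```

Deleting a task's link only frees the underlying data once the last link is gone, which is exactly the refcounting behaviour described above.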
Ruffus is not capable of a depth-first/greedy exploration of the pipeline DAG.
Improved for v7.
Improved again for v8.4/v9.
Why does OCRmyPDF create all the page PDFs in one batch? And it doesn't clean up its working directory until it's done. I would expect it to extract pages as they are consumed by threads, and to delete everything besides the finalized page *.pdf as each page finishes processing. Because of this, if you have /tmp mounted in RAM, you absolutely can OOM from too many temporary files.
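For illustration, here is a rough sketch of that depth-first behaviour (not ocrmypdf's actual pipeline; `extract_page` and `ocr_page` are placeholder stubs): a page is split out only when a worker is about to consume it, and its intermediate is deleted as soon as the finalized page PDF exists, so at most a handful of pages' temp files exist at once.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from threading import BoundedSemaphore

MAX_IN_FLIGHT = 4  # caps the temp-file footprint at a few pages

def extract_page(n: int, workdir: str) -> str:  # placeholder splitter stage
    path = os.path.join(workdir, f"page-{n:06d}.pdf")
    open(path, "wb").close()
    return path

def ocr_page(page_pdf: str) -> str:  # placeholder OCR stage
    out = page_pdf.replace(".pdf", ".ocr.pdf")
    open(out, "wb").close()
    return out

def process_page(n: int, workdir: str) -> str:
    page_pdf = extract_page(n, workdir)  # intermediate appears here...
    final_pdf = ocr_page(page_pdf)
    os.unlink(page_pdf)                  # ...and is removed immediately
    return final_pdf

def run(num_pages: int, workdir: str) -> list:
    gate = BoundedSemaphore(MAX_IN_FLIGHT)
    with ThreadPoolExecutor(max_workers=MAX_IN_FLIGHT) as pool:
        def task(n: int) -> str:
            try:
                return process_page(n, workdir)
            finally:
                gate.release()  # free a slot for the next page
        futures = []
        for n in range(num_pages):
            gate.acquire()      # block instead of splitting every page up front
            futures.append(pool.submit(task, n))
        return [f.result() for f in futures]
```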
That would not be good for me. Since I'm using OCRmyPDF in a (for now private) plugin for Calibre, with a proofreading step after the OCR process on scanned PDF documents, I depend on all the temporary hOCR files existing until I delete them myself after proofreading.
@installgentoo |
I had to solve this in the following way:
ocrmypdf uses excessive memory for files with very high page counts (hundreds of pages): enough that it might consume all available temporary storage on small devices, e.g. a 100 MB PDF produces >2 GB of intermediates.
In the maximal case we need O(number of pages * number of intermediates per page) files. Currently we get major savings because some intermediates are just soft links to other intermediates.
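As an illustration of that soft-link saving (a sketch only; `publish` and its `changed` flag are hypothetical, not the actual code): when a stage leaves a page unchanged, its output can be a symlink to the input rather than a copy, so the intermediate costs essentially no extra storage.

```python
import os

def publish(src: str, dst: str, changed: bool) -> None:
    if changed:
        os.replace(src, dst)  # a genuinely new intermediate
    else:
        # no-op stage: alias the input instead of duplicating it
        os.symlink(os.path.abspath(src), dst)
```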
Opportunities for savings:
PDF page split is breadth first, producing one file per page before any of those files are needed. It would be reasonable to split in groups of 25 pages instead (see the sketch below). Implemented in v7.
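A rough sketch of such grouped splitting, using pikepdf (which ocrmypdf builds on); the function name and the group size of 25 are illustrative, not the actual v7 implementation:

```python
import os
import pikepdf

GROUP = 25

def split_in_groups(src_path: str, workdir: str):
    """Yield lists of single-page PDFs, GROUP pages at a time."""
    with pikepdf.open(src_path) as src:
        total = len(src.pages)
        for start in range(0, total, GROUP):
            group_files = []
            for n in range(start, min(start + GROUP, total)):
                out = os.path.join(workdir, f"{n + 1:06d}.page.pdf")
                single = pikepdf.Pdf.new()
                single.pages.append(src.pages[n])  # copy one page across
                single.save(out)
                group_files.append(out)
            # consumers process (and delete) this group before the next
            # group of intermediates is materialized
            yield group_files
```

Yielding one group at a time keeps the splitter only about 25 pages ahead of the consumers, which bounds the number of intermediates on disk instead of writing one file per page for the whole document up front.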