parallel_bulk leaks memory and retries forever while still consuming the input iterator #1077
Comments
A TransportError or BulkIndexError should be raised if a 503 is returned. Perhaps you are catching this elsewhere in your code and retrying?
Sorry for the late reply. What happens if the server does not respond at all? What is the default timeout setting for the bulk_insert method? I did not have any retry logic as far as I remember, but I did start multiple parallel_bulk actions in 4 separate processes.
The default is driven by the client instance passed in as the first parameter. The default timeout for that is 10 seconds, after which an operation will time out and the helper, by default, will raise an exception.
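For illustration, a minimal sketch of how that per-request timeout can be changed on the client object that is handed to the helper (the host URL and the 30-second value here are arbitrary, not defaults from this issue):

```python
from elasticsearch import Elasticsearch

# Hypothetical example: the request timeout used by the bulk helpers comes from
# the client instance itself; 30 seconds is an arbitrary, non-default value.
es = Elasticsearch(["http://localhost:9200"], timeout=30)
```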
Description
Since there is no `max_retry` configuration on `helpers.parallel_bulk` (#645), the default behavior seems to be to retry forever and never stop. This is a very surprising default and caused my batch processing script to run out of memory. The situation arises when the Elasticsearch database runs out of storage space, which I easily reproduced by filling a default Elasticsearch Docker container with documents until my partition was full.
That led to the following error with basic curl insertion.
Since the `parallel_bulk` method retries on 503 responses, and it retries an unlimited number of times, this becomes a major issue. In my case I need to ingest a large number of small documents into ES on a periodic schedule. To do this quickly I increased the `queue_size` and `chunk_size` of the `parallel_bulk` call according to the Elasticsearch documentation. In my case an optimal configuration looked like this:
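(The original snippet did not survive the report; the following is only a hypothetical sketch of such a call, with a made-up index name, generator, and illustrative `queue_size`/`chunk_size` values.)

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

def generate_actions():
    # Hypothetical generator producing a large number of small documents.
    for i in range(10_000_000):
        yield {"_index": "my-index", "_source": {"value": i}}

# Illustrative tuning only: larger-than-default queue_size and chunk_size,
# in the spirit of the Elasticsearch bulk indexing guidance.
for ok, item in helpers.parallel_bulk(
    es,
    generate_actions(),
    thread_count=8,
    queue_size=8,
    chunk_size=5000,
    raise_on_exception=True,
    raise_on_error=True,
):
    if not ok:
        print("failed:", item)
```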
Despite having `raise_on_exception` and `raise_on_error` set to true, this call continues to consume my iterator and fill up my memory even though every single insertion attempt is stuck on infinite retries. I especially did not expect the iterator to keep being consumed in such a situation.

Environment:
- Linux - Debian Stretch
- Python 3.7 with `pip install elasticsearch==7.1.0`
- Elasticsearch database running with `docker run --rm -p 9200:9200 elasticsearch:7.4.2`
Expected outcome
That the default configuration would retry a limited number of times, and that the iterator would stop being consumed until the insertion is either aborted or starts working again. I would also very much appreciate getting #645 fixed so that we have control over the number of retries.
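As an interim workaround (this wrapper is hypothetical and not part of elasticsearch-py), the unbounded buffering can be limited by feeding the helper one bounded slice of the iterator at a time, for example:

```python
import itertools
from elasticsearch import Elasticsearch, helpers

def bounded_parallel_bulk(client, actions, batch_size=50_000, **kwargs):
    """Hypothetical wrapper: hand parallel_bulk one bounded slice at a time so a
    stalled or failing batch cannot cause the rest of the iterator to pile up in memory."""
    actions = iter(actions)
    while True:
        batch = list(itertools.islice(actions, batch_size))
        if not batch:
            break
        # If a batch raises (e.g. with raise_on_exception), consumption of the
        # source iterator stops here instead of running ahead indefinitely.
        for ok, item in helpers.parallel_bulk(client, batch, **kwargs):
            if not ok:
                print("failed:", item)

# Example usage with a made-up index and document stream.
es = Elasticsearch(["http://localhost:9200"])
docs = ({"_index": "my-index", "_source": {"value": i}} for i in range(1_000_000))
bounded_parallel_bulk(es, docs, raise_on_error=True)
```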