
The throughput of ingester causes a lot of remote write latency #3093


Closed
storyicon opened this issue Aug 26, 2020 · 5 comments

Comments

@storyicon
Contributor

storyicon commented Aug 26, 2020

I have been running Cortex in a production environment, where it was stable, but after adding some new remote writes from Prometheus there is a lot of latency.

I use the following expression to calculate the remote write delay:

(time() - prometheus_remote_storage_queue_highest_sent_timestamp_seconds{instance_name=~"$instance_name"}) < 100000

It's measured in seconds, and it looks terrible:
[screenshot: remote write delay in seconds, per instance]

Through monitoring, I have good reason to believe that the distributor and ingester have not hit any resource bottlenecks (CPU/memory).
From the distributor's logs I can see that a large number of push requests take more than 5 seconds, sometimes even more than 40 seconds, which surprised me. As far as I know, when the ingester receives data it only writes to memory, without any I/O; data persistence happens periodically and asynchronously, so it should be very fast.
I did a detailed timing analysis (by recording the duration of the core steps in the request context and passing them down layer by layer). In the end I found that 90% of a push request's processing time (for example, 8s out of 10s) is spent in the following code:
https://github.com/cortexproject/cortex/blob/master/pkg/ingester/index/index.go#L118-L151

The time spent in this code is mainly concentrated in two places:

  1. Lock
    https://github.com/cortexproject/cortex/blob/master/pkg/ingester/index/index.go#L119
  2. Copy
    https://github.com/cortexproject/cortex/blob/master/pkg/ingester/index/index.go#L144

In one measured Push containing 10,000 series, the request took 30s in total, of which Copy accounted for 14,427ms and Lock for 15,313ms.
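
To make the pattern concrete, here is a minimal sketch (a simplified stand-in I wrote, not the actual Cortex code) of what that section roughly does: each series' fingerprint is inserted into per-label sorted slices while a per-shard mutex is held, so the mutex wait shows up as the Lock cost and the slice shift as the Copy cost.

package main

import (
	"sort"
	"sync"
)

type Fingerprint uint64

// indexShard is a hypothetical, stripped-down stand-in for the ingester's
// inverted index shard: label name -> label value -> sorted fingerprints.
type indexShard struct {
	mtx sync.Mutex
	idx map[string]map[string][]Fingerprint
}

// add registers fp under every (name, value) pair of a series. Each per-value
// slice is kept sorted by shifting elements with copy(), all while the
// shard-wide mutex is held.
func (s *indexShard) add(lbls map[string]string, fp Fingerprint) {
	s.mtx.Lock() // "Lock": every series hashing to this shard serializes here.
	defer s.mtx.Unlock()

	for name, value := range lbls {
		values, ok := s.idx[name]
		if !ok {
			values = map[string][]Fingerprint{}
			s.idx[name] = values
		}
		fps := values[value]
		i := sort.Search(len(fps), func(i int) bool { return fps[i] >= fp })
		fps = append(fps, 0)
		copy(fps[i+1:], fps[i:]) // "Copy": O(n) shift to keep the slice sorted.
		fps[i] = fp
		values[value] = fps
	}
}

func main() {
	shard := &indexShard{idx: map[string]map[string][]Fingerprint{}}
	shard.add(map[string]string{"__name__": "up", "job": "node"}, 42)
}

With 10,000 series per push and about 10 labels per series, that is on the order of 100,000 such insertions per request, spread over the shards but serialized within each one.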

For a time series database, throughput is a core metric, so this problem should be solved.


My remote write rate is 830,000 series/s; in every request each series contains only one sample and about 10 labels.

@pracucci
Contributor

I use the following expression to calculate the remote write delay:

Unrelated to this issue, but please be aware that the expected way to accurately calculate it is:

(max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m]) - on(job, instance) group_right() max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m]))
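
In other words, compare the newest sample Prometheus has ingested with the newest sample it has successfully sent to the remote endpoint; unlike the time()-based expression, this does not get inflated when no new samples are arriving.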

@storyicon
Contributor Author

Thanks @pracucci! I will try the new formula you provided. I am about to submit a PR to fix this issue; to be honest, I spent quite some time on it.

@pracucci
Contributor

From the distributor's logs I can see that a large number of push requests take more than 5 seconds, sometimes even more than 40 seconds, which surprised me.

A few random thoughts:

  1. Do you see the same looking at cortex_request_duration_seconds_bucket exposed by ingesters?
  2. Are you sure ingester CPU is not the bottleneck? Are the ingesters running in a pod with a CPU limit? If so, is the CPU throttled? Have you tried removing the CPU limit entirely?

I did a detailed timing analysis (by recording the duration of the core steps in the request context and passing them down layer by layer)

I would suggest using pprof. Cortex exposes CPU profiling at /debug/pprof/cpu?seconds=<duration>. For example, if your ingester runs on localhost:8080 you can profile 10 seconds of CPU execution with:

go tool pprof -http localhost:7000 http://localhost:8080/debug/pprof/cpu?seconds=10
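
go tool pprof -http fetches the profile and then serves an interactive web UI (including a flame graph view) at localhost:7000.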

@storyicon
Contributor Author

They do take so much time that CPU overload is the obvious thing to suspect.
I used 5 machines, each with 252 GB of memory and 48 cores, to handle this level of data. Monitoring shows that CPU usage never exceeded 60% (I did not apply any cgroup restrictions to Cortex; it runs on physical machines rather than Kubernetes), so I was also surprised by the timing results.

@stale

stale bot commented Nov 2, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 2, 2020
@stale stale bot closed this as completed Nov 18, 2020