The throughput of ingester causes a lot of remote write latency #3093
Comments
Unrelated to this issue, but please be aware that the expected way to accurately calculate it is:
Thanks @pracucci! I will try the new formula you provided. I am about to submit a PR to fix this issue. To be honest, I have spent quite some time on it.
A few random thoughts:
I would suggest using pprof. Cortex exposes a CPU profiling endpoint over HTTP.
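As an illustration of how such a profile could be collected, here is a minimal sketch, assuming the ingester's HTTP server exposes the standard net/http/pprof handlers; the host, port, and path below are hypothetical and should be adjusted to the actual deployment:

```go
// Hypothetical helper (not part of Cortex): fetch a 30s CPU profile from a
// running ingester so it can be inspected with `go tool pprof`.
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Host, port and path are assumptions; adjust to your Cortex setup.
	resp, err := http.Get("http://ingester:8080/debug/pprof/profile?seconds=30")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("ingester-cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote ingester-cpu.pprof")
}
```

The resulting file can then be inspected with `go tool pprof ingester-cpu.pprof`.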
They do take up so much time that it is easy to mistake the problem for CPU overload.
This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.
I use Cortex in a production environment and it has been stable, but there has been a lot of latency after adding some new remote writes from Prometheus.
I use the following expression to calculate the remote write delay. It is measured in seconds, and it looks terrible:
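For reference, a commonly used expression for this kind of delay (an illustration only; it may differ from the exact expression used here and from the one suggested above) is the gap between the newest sample Prometheus has queued for remote write and the newest sample successfully sent:

```
# Label matching (e.g. ignoring per-queue labels) may need adjusting for your setup.
prometheus_remote_storage_highest_timestamp_in_seconds
  - prometheus_remote_storage_queue_highest_sent_timestamp_seconds
```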

Through monitoring, I have good reason to believe that the Distributor and Ingester have not hit any resource bottlenecks (CPU/memory).
From the distributor's logs, I can see that a large number of push requests take more than 5 seconds, or even more than 40 seconds, which surprised me. As far as I know, when the ingester receives data it only writes to memory, without any IO; data persistence is done periodically and asynchronously, so it should be very fast.
I did a detailed timing analysis, by recording the duration of the core steps in the context and passing them up layer by layer (see the sketch at the end of this post). In the end, I found that 90% of the processing time of a push request (for example, 8s out of 10s) is spent in the following code:
https://github.com/cortexproject/cortex/blob/master/pkg/ingester/index/index.go#L118-L151
The time spent in this code is mainly concentrated in two places (a simplified sketch of the pattern follows below):
- Lock: https://github.com/cortexproject/cortex/blob/master/pkg/ingester/index/index.go#L119
- Copy: https://github.com/cortexproject/cortex/blob/master/pkg/ingester/index/index.go#L144
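To make the pattern concrete, here is a minimal, self-contained sketch (my own simplification, not the actual Cortex code) of an inverted-index shard that takes a mutex on every add and inserts each fingerprint into a sorted slice with copy. With many series per push, the lock serializes the work and the copy makes each insert O(n):

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

type fingerprint uint64

// indexShard is a simplified stand-in for the shard structure in
// pkg/ingester/index/index.go: a mutex plus label value -> sorted fingerprints.
type indexShard struct {
	mtx      sync.Mutex
	postings map[string][]fingerprint
}

// add inserts fp into the sorted postings list for labelValue.
func (s *indexShard) add(labelValue string, fp fingerprint) {
	s.mtx.Lock() // hot spot 1: every series in a push serializes on this lock
	defer s.mtx.Unlock()

	fps := s.postings[labelValue]
	i := sort.Search(len(fps), func(j int) bool { return fps[j] >= fp })

	// hot spot 2: O(n) insertion, shifting the tail of the slice with copy.
	fps = append(fps, 0)
	copy(fps[i+1:], fps[i:])
	fps[i] = fp
	s.postings[labelValue] = fps
}

func main() {
	s := &indexShard{postings: map[string][]fingerprint{}}
	// Worst case: descending fingerprints force every insert to shift the whole slice.
	for fp := fingerprint(10000); fp > 0; fp-- {
		s.add(`job="api"`, fp)
	}
	fmt.Println("series indexed:", len(s.postings[`job="api"`]))
}
```

With thousands of series per push, both the lock hold time and the amount of memory moved by copy grow quickly, which matches the Lock and Copy times reported below.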
According to an actual measurement, a single Push of 10,000 series takes 30s, of which Copy takes 14,427ms and Lock takes 15,313ms. As a time series database, throughput is a core indicator for Cortex, so this problem should be solved.
My remote write rate is 830,000 timeseries/s; in every request, each series contains only one sample and has about 10 labels.
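As for how the per-step timings above could be collected, here is a minimal sketch (an assumption about the approach, not the instrumentation actually used for these numbers) of recording step durations in a context.Context and passing them up layer by layer:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// stepTimings accumulates the elapsed time of each named step.
// Note: not goroutine-safe; a real implementation would need locking.
type stepTimings map[string]time.Duration

type timingsKey struct{}

// withTimings attaches a mutable timing map to the context.
func withTimings(ctx context.Context) (context.Context, stepTimings) {
	t := stepTimings{}
	return context.WithValue(ctx, timingsKey{}, t), t
}

// record adds the elapsed time of one step, if a timing map is present.
func record(ctx context.Context, step string, start time.Time) {
	if t, ok := ctx.Value(timingsKey{}).(stepTimings); ok {
		t[step] += time.Since(start)
	}
}

// indexAdd stands in for a deep call such as the index update in the ingester.
func indexAdd(ctx context.Context) {
	defer record(ctx, "index_lock_and_copy", time.Now())
	time.Sleep(2 * time.Millisecond) // stand-in for the real work
}

func main() {
	ctx, timings := withTimings(context.Background())
	for i := 0; i < 5; i++ {
		indexAdd(ctx)
	}
	fmt.Println(timings)
}
```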