Description
I run Cortex in a production environment and it had been stable, but after adding some new remote writes from Prometheus, the remote write latency increased a lot.
I use the following expression to calculate the remote write delay:

```
(time() - prometheus_remote_storage_queue_highest_sent_timestamp_seconds{instance_name=~"$instance_name"}) < 100000
```

The result is in seconds, and it looks terrible:
Based on monitoring, I have good reason to believe that the distributors and ingesters have not hit any resource bottlenecks (CPU/memory).
From the distributor's logs, I can see that a large number of push requests take more than 5 seconds, some even more than 40 seconds, which surprises me. As far as I know, when an ingester receives data it only writes to memory, without any I/O; data persistence is done periodically and asynchronously, so it should be very fast.
I did a detailed latency analysis (by recording the elapsed time of the core steps in the request context and passing it up layer by layer). I found that 90% of the processing time of a push request (for example, 8s out of 10s) is spent in the following code:
https://github.com/cortexproject/cortex/blob/master/pkg/ingester/index/index.go#L118-L151
The time spent in this code is mainly concentrated in two places:
- Lock: https://github.com/cortexproject/cortex/blob/master/pkg/ingester/index/index.go#L119
- Copy: https://github.com/cortexproject/cortex/blob/master/pkg/ingester/index/index.go#L144
According to actual measurements, one push request taking 30s (containing 10,000 series) spent 14427ms in Copy and 15313ms in Lock.
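For reference, the per-step timings above were gathered roughly along these lines: a duration map is carried in the request context and each layer adds its own elapsed time to it. This is only an illustrative sketch with made-up names (withTimings, recordStep), not the exact instrumentation I used.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type ctxKey struct{}

// withTimings attaches an empty timing map to the request context.
func withTimings(ctx context.Context) context.Context {
	return context.WithValue(ctx, ctxKey{}, map[string]time.Duration{})
}

// recordStep measures one step and accumulates its duration under name.
// Not safe for concurrent steps; fine for a sequential push path.
func recordStep(ctx context.Context, name string, fn func()) {
	start := time.Now()
	fn()
	if m, ok := ctx.Value(ctxKey{}).(map[string]time.Duration); ok {
		m[name] += time.Since(start)
	}
}

func main() {
	ctx := withTimings(context.Background())

	// Simulated push: the same pattern wraps the index lock/copy steps.
	recordStep(ctx, "index_lock", func() { time.Sleep(10 * time.Millisecond) })
	recordStep(ctx, "label_copy", func() { time.Sleep(5 * time.Millisecond) })

	fmt.Println(ctx.Value(ctxKey{}).(map[string]time.Duration))
}
```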
For a time series database, throughput is a core metric, so this problem should be solved.
My remote write rate is 830,000 series/s; in each request, every series contains only one sample and about 10 labels.