
The throughput of ingester causes a lot of remote write latency #3093


Closed
storyicon opened this issue Aug 26, 2020 · 5 comments

Comments

@storyicon
Contributor

storyicon commented Aug 26, 2020

I have been running Cortex in a production environment, where it was stable, but after adding some new remote writes from Prometheus there is a lot of latency.

I use the following expression to calculate the remote write delay:

(time() - prometheus_remote_storage_queue_highest_sent_timestamp_seconds{instance_name=~"$instance_name"}) < 100000

It's measured in seconds, and it looks terrible:
[screenshot: remote write delay in seconds, per instance]

Through monitoring, I have good reason to believe that the distributor and ingester have not hit any resource bottlenecks (CPU/memory).
From the distributor's logs I can see that a large number of push requests take more than 5 seconds, sometimes even more than 40 seconds, which surprised me. As far as I know, when the ingester receives data it only writes to memory, without any I/O; data persistence happens periodically and asynchronously, so it should be very fast.
I did a detailed timing analysis (by recording the duration of the core steps in the request context and passing them down layer by layer). In the end I found that 90% of a push request's processing time (for example, 8s out of 10s) is spent in the following code:
https://github.com/cortexproject/cortex/blob/master/pkg/ingester/index/index.go#L118-L151

The time spent in this code is mainly concentrated in two places:

  1. Lock
    https://github.com/cortexproject/cortex/blob/master/pkg/ingester/index/index.go#L119
  2. Copy
    https://github.com/cortexproject/cortex/blob/master/pkg/ingester/index/index.go#L144

In one measured Push containing 10,000 series, the request took 30s in total, of which Copy accounted for 14,427ms and Lock for 15,313ms.
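
To make the pattern concrete, here is a minimal sketch (a simplified stand-in I wrote, not the actual Cortex code) of what that section roughly does: each series' fingerprint is inserted into per-label sorted slices while a per-shard mutex is held, so the mutex wait shows up as the Lock cost and the slice shift as the Copy cost.

package main

import (
	"sort"
	"sync"
)

type Fingerprint uint64

// indexShard is a hypothetical, stripped-down stand-in for the ingester's
// inverted index shard: label name -> label value -> sorted fingerprints.
type indexShard struct {
	mtx sync.Mutex
	idx map[string]map[string][]Fingerprint
}

// add registers fp under every (name, value) pair of a series. Each per-value
// slice is kept sorted by shifting elements with copy(), all while the
// shard-wide mutex is held.
func (s *indexShard) add(lbls map[string]string, fp Fingerprint) {
	s.mtx.Lock() // "Lock": every series hashing to this shard serializes here.
	defer s.mtx.Unlock()

	for name, value := range lbls {
		values, ok := s.idx[name]
		if !ok {
			values = map[string][]Fingerprint{}
			s.idx[name] = values
		}
		fps := values[value]
		i := sort.Search(len(fps), func(i int) bool { return fps[i] >= fp })
		fps = append(fps, 0)
		copy(fps[i+1:], fps[i:]) // "Copy": O(n) shift to keep the slice sorted.
		fps[i] = fp
		values[value] = fps
	}
}

func main() {
	shard := &indexShard{idx: map[string]map[string][]Fingerprint{}}
	shard.add(map[string]string{"__name__": "up", "job": "node"}, 42)
}

With 10,000 series per push and about 10 labels per series, that is on the order of 100,000 such insertions per request, spread over the shards but serialized within each one.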

For a time series database, throughput is a core metric, so this problem should be solved.


My remote write rate is 830,000 series/s; in every request each series contains only one sample and about 10 labels.

@pracucci
Contributor

I use the following expression to calculate the remote write delay:

Unrelated to this issue, but please be aware that the expected way to accurately calculate it is:

(max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m]) - on(job, instance) group_right() max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m]))
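
In other words, compare the newest sample Prometheus has ingested with the newest sample it has successfully sent to the remote endpoint; unlike the time()-based expression, this does not get inflated when no new samples are arriving.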

@storyicon
Contributor Author

Thanks @pracucci! I will try the new formula you provided. I am about to submit a PR to fix this issue; to be honest, I spent quite some time on it.

@pracucci
Contributor

From the distributor's logs I can see that a large number of push requests take more than 5 seconds, sometimes even more than 40 seconds, which surprised me.

A few random thoughts:

  1. Do you see the same looking at cortex_request_duration_seconds_bucket exposed by ingesters?
  2. Are you sure ingester CPU is not the bottleneck? Are the ingesters running in a pod with a CPU limit? If so, is the CPU throttled? Have you tried removing the CPU limit entirely?

I did a detailed timing analysis (by recording the duration of the core steps in the request context and passing them down layer by layer)

I would suggest using pprof. Cortex exposes CPU profiling at /debug/pprof/cpu?seconds=<duration>. For example, if your ingester runs on localhost:8080 you can profile 10 seconds of CPU execution with:

go tool pprof -http localhost:7000 http://localhost:8080/debug/pprof/cpu?seconds=10
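
go tool pprof -http fetches the profile and then serves an interactive web UI (including a flame graph view) at localhost:7000.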

@storyicon
Contributor Author

They do take so much time that CPU overload is the obvious thing to suspect.
I used 5 machines, each with 252 GB of memory and 48 cores, to handle this level of data. Monitoring shows that CPU usage never exceeded 60% (I did not apply any cgroup restrictions to Cortex; it runs on physical machines rather than Kubernetes), so I was also surprised by the timing results.

@stale

stale bot commented Nov 2, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Nov 2, 2020
@stale stale bot closed this as completed Nov 18, 2020