Skip to content

Conversation

kjnilsson
Copy link
Contributor

@kjnilsson kjnilsson commented Nov 3, 2023

Listing queues with the HTTP API when there are many (1000s) of
quorum queues could be excessively slow compared to the same scenario
with classic queues.

This optimises various aspects of HTTP API queue listings.
For QQs it removes the expensive cluster wide rpcs used to get the
"online" status of each quorum queue. This was previously done before
paging and thus would perform a cluster-wide query for each quorum queue in
the vhost/system. This accounted for most of the slowness compared to
classic queues.

Secondly the query to separate the running from the down queues
consisted of two separate queries that later were combined when a single
query would have sufficed.

This commit also includes a variety of other improvements and minor
fixes discovered during testing and optimisation.

MINOR BREAKING CHANGE: quorum queues would previously only display one
of two states: running or down. Now there is a new state called minority
which is emitted when the queue has at least one member running but
cannot commit entries due to lack of quorum.

Also the quorum queue may transiently enter the down state when a node
goes down and before its elected a new leader.

@kjnilsson kjnilsson changed the title Optimise mgmt HTTP API /queues endpoint Optimise HTTP API /queues endpoint Nov 3, 2023
@kjnilsson kjnilsson force-pushed the opt-mgmt-queue-listings branch from ea0adf8 to f2ab9b4 Compare November 6, 2023 14:44
@kjnilsson kjnilsson marked this pull request as ready for review November 6, 2023 14:44
@kjnilsson kjnilsson force-pushed the opt-mgmt-queue-listings branch from f2ab9b4 to 7935e4d Compare November 6, 2023 14:44
@kjnilsson kjnilsson added this to the 3.13.0 milestone Nov 6, 2023
@kjnilsson
Copy link
Contributor Author

kjnilsson commented Nov 6, 2023

5k quorum queues on a 3 node dev box.

On this branch:

karl@crake:~$ time curl 'http://crake:15672/api/queues?page=25&page_size=100&name=&use_regex=false&pagination=true' -H 'authorization: Basic Z3Vlc3Q6Z3Vlc3Q=' &>/dev/null

real	0m0.080s
user	0m0.003s
sys	0m0.007s

On main:

karl@crake:~$ time curl 'http://crake:15672/api/queues?page=25&page_size=100&name=&use_regex=false&pagination=true' -H 'authorization: Basic Z3Vlc3Q6Z3Vlc3Q=' &>/dev/null

real	0m1.231s
user	0m0.000s
sys	0m0.006s

I expect on a system with a real network and some load to benefit even more.

@kjnilsson kjnilsson force-pushed the opt-mgmt-queue-listings branch from 7935e4d to 64494e6 Compare November 6, 2023 15:33
Listing queues with the HTTP API when there are many (1000s) of
quorum queues could be excessively slow compared to the same scenario
with classic queues.

This optimises various aspects of HTTP API queue listings.
For QQs it removes the expensive cluster wide rpcs used to get the
"online" status of each quorum queue. This was previously done _before_
paging and thus would perform a cluster-wide query for _each_ quorum queue in
the vhost/system. This accounted for most of the slowness compared to
classic queues.

Secondly the query to separate the running from the down queues
consisted of two separate queries that later were combined when a single
query would have sufficed.

This commit also includes a variety of other improvements and minor
fixes discovered during testing and optimisation.

MINOR BREAKING CHANGE: quorum queues would previously only display one
of two states: running or down. Now there is a new state called minority
which is emitted when the queue has at least one member running but
cannot commit entries due to lack of quorum.

Also the quorum queue may transiently enter the down state when a node
goes down and before its elected a new leader.
@kjnilsson kjnilsson force-pushed the opt-mgmt-queue-listings branch from 64494e6 to c2cd60b Compare November 6, 2023 15:34
To allow callers to specify a subset of fields they'd like.
There is no need to list all queues to check if the vhost
exists.
@tubemeister
Copy link

Thanks, that should make a big difference. A 10s timeout is workable, a 1m timeout isn't.

Any chance of this getting a backport to 3.12?

(PS: I've been locked out of my own bugreport. I've been getting more responses there from both of you but can't respond.)

@michaelklishin michaelklishin merged commit e527720 into main Nov 7, 2023
@michaelklishin michaelklishin deleted the opt-mgmt-queue-listings branch November 7, 2023 19:30
@michaelklishin
Copy link
Collaborator

There is no consensus on whether we want to backport this. #9892 is ready for review but we make no promises.

@michaelklishin
Copy link
Collaborator

It was decided that this should be a 3.13-specific change.

@tubemeister
Copy link

Yeah I suspected as much from the "breaking change" aspect, that makes sense.

Thanks anyway, something to look forward to in 3.13. ;-)

@tubemeister
Copy link

Can confirm, the long timeout when a node is offline is now fixed in 3.13.0

@tubemeister
Copy link

Might be useful to note this as fixed this in #9522 as well. I'd do it myself but I've been locked out of my own bugreport. ;-)

@mkuratczyk
Copy link
Contributor

done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants