Description
(This bug has morphed a couple of times, but it has mostly been about optimizing the filtering pipeline in the sonarlog and db layers. What remains now is making use of available parallelism; this is not high priority, since perf is already pretty good.)
For the heavy-users script, we end up running the profile command for every job that passes the initial filtering, and the profile command is actually pretty slow. I don't know precisely why, but there is one main loop that is quadratic-ish (though there's an assertion that "the inner loop will tend to be very short"), and then when bucketing is requested (which I do request here) there's a bucketing loop that is cubic-ish. One would want to profile this. File I/O should not be a factor, because this is with the caching daemon.
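As a concrete starting point for that profiling, and assuming the profile command is implemented in Go (the entry-point name below is a hypothetical stand-in), a CPU profile taken around the whole command with the standard runtime/pprof package would show whether the quadratic-ish main loop or the cubic-ish bucketing loop dominates. A minimal sketch:

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// Write a CPU profile covering the whole command; inspect it
	// afterwards with `go tool pprof cpu.prof`.
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	runProfileCommand() // hypothetical stand-in for the real `profile` entry point
}

func runProfileCommand() {
	// ... the quadratic-ish main loop and cubic-ish bucketing loop live here ...
}
```

Inspecting the result with `go tool pprof cpu.prof` (e.g. the `top` and `list` commands) would attribute time to the two loops directly.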
I'm running this on four weeks of fox data: `heavy-users.py fox 4w`, which covers roughly May 24 through June 21. There are 38 output lines, unless I miscounted.
- Move the filter earlier in the pipeline
- Avoid allocating a tremendously long array for all the samples (the first sketch after this list shows the shape of these two items together)
- Avoid computing bounds when they are not needed; this is expensive, and it must be done before filtering
- Maybe: optimize the filter by specializing it for common cases (one value of one attribute + a time window); see below, the branch larstha-526-better-filter for WIP, and the second sketch after this list. This looks hard in general, and the best we can do may be to implement special logic for particularly important cases.
- Maybe: push filtering down into the reading pipeline so that it runs concurrently (the third sketch after this list)
- Maybe: parallelize postprocessing, though this matters much less than efficient reading and filtering
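For the first two items above, a minimal sketch, assuming Go, of what "filter while reading" could look like; `Sample`, `readFiltered`, and the callbacks are hypothetical stand-ins for the real sonarlog types. The point is that the predicate is applied as samples are produced, so the only array ever allocated holds the survivors:

```go
package main

import "fmt"

// Sample is a hypothetical stand-in for the real sonarlog sample type.
type Sample struct {
	Job  uint32
	Time int64
}

// readFiltered applies the predicate while samples are being produced,
// so the only slice ever allocated holds the survivors, never the full
// unfiltered sample stream.
func readFiltered(next func() (Sample, bool), keep func(*Sample) bool) []Sample {
	var out []Sample
	for {
		s, ok := next()
		if !ok {
			return out
		}
		if keep(&s) {
			out = append(out, s)
		}
	}
}

func main() {
	// Toy source: three samples, keep only job 7.
	src := []Sample{{7, 100}, {8, 101}, {7, 102}}
	i := 0
	next := func() (Sample, bool) {
		if i == len(src) {
			return Sample{}, false
		}
		s := src[i]
		i++
		return s, true
	}
	kept := readFiltered(next, func(s *Sample) bool { return s.Job == 7 })
	fmt.Println(len(kept)) // 2
}
```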
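For the specialization item, a sketch of the fast-path idea under the same assumptions (hypothetical names, not the actual WIP on the larstha-526-better-filter branch): detect the common "one job + time window" query up front and compile it to a tight closure, falling back to a general check otherwise:

```go
package main

import "fmt"

// Filter is a hypothetical general predicate over (job, time).
type Filter func(job uint32, time int64) bool

// compileFilter recognizes the common "one job + time window" query and
// returns a specialized closure for it; other queries fall back to a
// set-based check (standing in for a general predicate evaluator).
func compileFilter(jobs []uint32, from, to int64) Filter {
	if len(jobs) == 1 {
		job := jobs[0]
		// Fast path: three comparisons, no map lookup or predicate-tree walk.
		return func(j uint32, t int64) bool {
			return j == job && from <= t && t <= to
		}
	}
	jobSet := make(map[uint32]bool, len(jobs))
	for _, j := range jobs {
		jobSet[j] = true
	}
	return func(j uint32, t int64) bool {
		return jobSet[j] && from <= t && t <= to
	}
}

func main() {
	f := compileFilter([]uint32{42}, 1000, 2000)
	fmt.Println(f(42, 1500), f(42, 2500), f(43, 1500)) // true false false
}
```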
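For the last two items, a sketch of pushing the filter into a concurrent reading pipeline, one goroutine per input file, with the consumer merging survivors; the same fan-out/merge shape would apply to parallel postprocessing. Again Go is assumed, and `readFile` is a hypothetical stand-in for the real per-file decoder:

```go
package main

import (
	"fmt"
	"sync"
)

// Sample is, as above, a hypothetical stand-in for the real type.
type Sample struct {
	Job  uint32
	Time int64
}

// readFile is a hypothetical per-file decoder standing in for the real
// sonarlog reader.
func readFile(path string) []Sample { return nil }

// readAllFiltered fans out one goroutine per input file; each decodes
// its file, filters as it goes, and sends only the survivors, so
// filtering overlaps with I/O and decoding across files.
func readAllFiltered(files []string, keep func(*Sample) bool) []Sample {
	ch := make(chan []Sample, len(files))
	var wg sync.WaitGroup
	for _, f := range files {
		wg.Add(1)
		go func(f string) {
			defer wg.Done()
			var kept []Sample
			for _, s := range readFile(f) {
				if keep(&s) {
					kept = append(kept, s)
				}
			}
			ch <- kept
		}(f)
	}
	wg.Wait()
	close(ch)
	var out []Sample
	for kept := range ch {
		out = append(out, kept...)
	}
	return out
}

func main() {
	out := readAllFiltered([]string{"a.csv", "b.csv"},
		func(s *Sample) bool { return s.Job == 7 })
	fmt.Println(len(out))
}
```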