Implement concurrent filtering and postprocessing #526

@lars-t-hansen

Description

(This bug has morphed a couple of times, but it has mostly been about optimizing the filtering pipeline in the sonarlog and db layers. What remains now is making use of available parallelism; this is not high priority, since performance is already pretty good.)

For the heavy-users script, we end up running the profile command for every job that passes the initial filtering. The profile command is actually pretty slow. I don't know precisely why, but there is one main loop that is quadratic-ish (though there's an assertion that "the inner loop will tend to be very short"), and when we do bucketing, as I do here, there's a bucketing loop that is cubic-ish. One would want to profile this. File I/O should not be a factor, because this is with the caching daemon.

I'm running this on four weeks of fox data: heavy-users.py fox 4w, which covers roughly May 24 through June 21. There are 38 output lines, unless I miscounted.

  • Move filter earlier in the pipeline
  • Avoid allocating a tremendously long array for all the samples
  • Avoid computing bounds when they are not needed, as this is expensive (it must be done before filtering)
  • Maybe: optimize the filter by specializing it for common cases (one value of one attribute + time); see the first sketch after this list, plus branch larstha-526-better-filter for WIP. This looks hard in general, and the best we can do may be to implement special logic for particularly important cases.
  • Maybe: push filtering down into the reading pipeline so it runs concurrently with reading (see the second sketch after this list)
  • Maybe: parallelize postprocessing, though this matters a lot less than efficient reading and filtering
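
To make the "specialize the filter" idea concrete, here is a minimal sketch in Go. The Sample, Clause, and Query types are hypothetical stand-ins, not the actual sonarlog representations; the point is only the shape of the optimization: detect the common query "one equality test on one attribute plus a time window" and compile it to a tight closure, instead of walking a generic clause list for every record.

```go
package filter

import "time"

// Sample is a hypothetical stand-in for a sonarlog sample record.
type Sample struct {
	Timestamp time.Time
	User      string
	Host      string
}

// Clause is a generic (attribute, value) equality test in a query.
type Clause struct {
	Attr  string // e.g. "user" or "host"
	Value string
}

// Query is a conjunction of clauses plus a time window.
type Query struct {
	Clauses  []Clause
	From, To time.Time
}

// Compile returns a predicate for q. For the common case of a single
// equality clause plus the time window, it returns a specialized closure
// that avoids the generic clause loop entirely.
func Compile(q Query) func(*Sample) bool {
	if len(q.Clauses) == 1 {
		c := q.Clauses[0]
		switch c.Attr {
		case "user":
			return func(s *Sample) bool {
				return s.User == c.Value &&
					!s.Timestamp.Before(q.From) && !s.Timestamp.After(q.To)
			}
		case "host":
			return func(s *Sample) bool {
				return s.Host == c.Value &&
					!s.Timestamp.Before(q.From) && !s.Timestamp.After(q.To)
			}
		}
	}
	// General case: walk all clauses.
	return func(s *Sample) bool {
		if s.Timestamp.Before(q.From) || s.Timestamp.After(q.To) {
			return false
		}
		for _, c := range q.Clauses {
			switch c.Attr {
			case "user":
				if s.User != c.Value {
					return false
				}
			case "host":
				if s.Host != c.Value {
					return false
				}
			default:
				return false // unknown attribute: treat as non-matching
			}
		}
		return true
	}
}
```

Most queries from the scripts should hit one of the specialized arms, so even if the general case stays slow, the important paths get the fast closure.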
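For the "push filtering into the reading pipeline" item, a minimal sketch of the fan-out/fan-in shape, again with a hypothetical Sample type and a caller-supplied batch source; none of these names come from the actual sonarlog code. Reading produces batches on a channel, one filter worker per CPU consumes them as they arrive, and survivors are collected at the end.

```go
package pipeline

import (
	"runtime"
	"sync"
)

// Sample is a hypothetical stand-in for a parsed sonarlog record.
type Sample struct{ User string }

// filterConcurrently fans batches of freshly read samples out to one
// filter worker per CPU and collects the survivors. Batching keeps the
// per-sample channel overhead small.
func filterConcurrently(batches <-chan []*Sample, keep func(*Sample) bool) []*Sample {
	out := make(chan []*Sample)
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for batch := range batches {
				// Filter in place; the batch is owned by this worker.
				kept := batch[:0]
				for _, s := range batch {
					if keep(s) {
						kept = append(kept, s)
					}
				}
				out <- kept
			}
		}()
	}
	go func() {
		wg.Wait()
		close(out)
	}()
	var result []*Sample
	for kept := range out {
		result = append(result, kept...)
	}
	return result
}
```

The same worker-pool shape would serve for parallelizing postprocessing, and if the consumer aggregates the filtered batches as they arrive instead of concatenating them, it also avoids allocating the tremendously long all-samples array.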
