Fix performance issues with some complex queries #23

This is an example of a query that times out on LmdbStore, but not on NativeStore, when run with the `full` repository:

```
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix fip: <https://w3id.org/fair/fip/terms/>
prefix dct: <http://purl.org/dc/terms/>
prefix dce: <http://purl.org/dc/elements/1.1/>
prefix npa: <http://purl.org/nanopub/admin/>
prefix npx: <http://purl.org/nanopub/x/>
prefix np: <http://www.nanopub.org/nschema#>

select ?fip_index ?fip_title ?decl_np where {
  graph npa:graph {
    ?fip_index npx:hasNanopubType npx:NanopubIndex .
    ?fip_index npa:hasValidSignatureForPublicKey ?pubkey .
    filter not exists { ?index_np_x npx:invalidates ?fip_index ; npa:hasValidSignatureForPublicKey ?pubkey . }
    ?fip_index np:hasAssertion ?index_a .
    ?fip_index rdfs:label ?fip_title .
    ?fip_index dct:created ?index_date .
    ?decl_np npa:hasValidSignatureForPublicKey ?decl_pubkey .
    filter not exists { ?decl_np_x npx:invalidates ?decl_np ; npa:hasValidSignatureForPublicKey ?decl_pubkey . }
    ?decl_np npx:hasNanopubType fip:FIP-Declaration .
    ?decl_np dct:created ?date .
  }
  graph ?index_a {
    ?fip_index npx:includesElement ?decl_np .
  }
  filter not exists {
    graph npa:graph {
      ?fip_newer_index npa:hasValidSignatureForPublicKey ?pubkey .
      filter not exists { ?fip_newer_index_x npx:invalidates ?fip_newer_index ; npa:hasValidSignatureForPublicKey ?pubkey . }
      ?fip_newer_index dct:created ?newer_date .
      # Matching on the title string is an ugly hack:
      ?fip_newer_index rdfs:label ?fip_title .
    }
    filter(?newer_date > ?index_date).
  }
}
```

As @tkuhn pointed out, this can be fixed by moving `?fip_newer_index rdfs:label ?fip_title .` out of the `graph npa:graph` clause. With that change, the query completes in a second or so.
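For reference, here is a sketch of how the last part of the query might look with that change applied. The exact position of the moved pattern inside the outer `filter not exists` is an assumption; the suggestion only says it was taken out of the `graph npa:graph` clause. Also note that outside a `graph` clause the pattern is matched against the dataset's default graph, so whether the results stay identical depends on how the repository defines that graph.

```sparql
  filter not exists {
    graph npa:graph {
      ?fip_newer_index npa:hasValidSignatureForPublicKey ?pubkey .
      filter not exists { ?fip_newer_index_x npx:invalidates ?fip_newer_index ; npa:hasValidSignatureForPublicKey ?pubkey . }
      ?fip_newer_index dct:created ?newer_date .
    }
    # Moved out of the graph clause (placement here is an assumption).
    # Matching on the title string is an ugly hack:
    ?fip_newer_index rdfs:label ?fip_title .
    filter(?newer_date > ?index_date).
  }
```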

I initially thought that this was an issue with LmdbStore picking the wrong index, but no, the selected index is fine (cspo). The problem is that RDF4J does this index lookup in a loop, buried deep in the execution plan, instead of doing a join right away. In fact, if you replace the last part of the query with:

```
  filter not exists {
    # Matching on the title string is an ugly hack:
    graph npa:graph {
      ?fip_newer_index rdfs:label ?fip_title . 
    }
    graph npa:graph {
      ?fip_newer_index npa:hasValidSignatureForPublicKey ?pubkey .
      filter not exists { ?fip_newer_index_x npx:invalidates ?fip_newer_index ; npa:hasValidSignatureForPublicKey ?pubkey . }
      ?fip_newer_index dct:created ?newer_date .
    }
    filter(?newer_date > ?index_date).
  }
```

then it also completes in a second or so, using an index starting with `c`.

I think the issue lies somewhere in join order estimation. For some reason, LmdbStore produces worse cardinality estimates for some statement patterns, resulting in a suboptimal join order. Unfortunately, this is not as easy to fix as I initially thought. Unless it's an obvious bug (it might be, but who knows), it would require tuning the cardinality estimation algorithm, which is pretty delicate. Correcting it for this query may break it for others. :( Some DBs (like the now-defunct Blazegraph) had a dynamic optimizer that profiled the query as it ran and reordered joins. I don't think RDF4J has that.

I will try to look into this further, but I can't make any promises. I see a general workaround pattern, though: **factor out highly selective triple patterns to separate query blocks, to force running them first.** So it's not that bad.
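A generic sketch of that pattern, using a made-up `ex:` vocabulary and graph name (none of these terms come from the actual repository). It relies on the behaviour observed above, where a separate block is evaluated on its own; the SPARQL specification does not guarantee any particular evaluation order, so treat this as a heuristic rather than a rule.

```sparql
# Hypothetical vocabulary and graph name, for illustration only.
prefix ex: <http://example.org/>

select ?item ?date where {
  # Highly selective pattern, factored out into its own block:
  graph ex:graph {
    ?item ex:label "some rare title" .
  }
  # Less selective patterns stay in a separate block:
  graph ex:graph {
    ?item ex:signedBy ?key .
    ?item ex:created ?date .
  }
}
```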
