This is an example of a query that times out on LmdbStore, but not on NativeStore, when run with the full repository:
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix fip: <https://w3id.org/fair/fip/terms/>
prefix dct: <http://purl.org/dc/terms/>
prefix dce: <http://purl.org/dc/elements/1.1/>
prefix npa: <http://purl.org/nanopub/admin/>
prefix npx: <http://purl.org/nanopub/x/>
prefix np: <http://www.nanopub.org/nschema#>
select ?fip_index ?fip_title ?decl_np where {
  graph npa:graph {
    ?fip_index npx:hasNanopubType npx:NanopubIndex .
    ?fip_index npa:hasValidSignatureForPublicKey ?pubkey .
    filter not exists { ?index_np_x npx:invalidates ?fip_index ; npa:hasValidSignatureForPublicKey ?pubkey . }
    ?fip_index np:hasAssertion ?index_a .
    ?fip_index rdfs:label ?fip_title .
    ?fip_index dct:created ?index_date .
    ?decl_np npa:hasValidSignatureForPublicKey ?decl_pubkey .
    filter not exists { ?decl_np_x npx:invalidates ?decl_np ; npa:hasValidSignatureForPublicKey ?decl_pubkey . }
    ?decl_np npx:hasNanopubType fip:FIP-Declaration .
    ?decl_np dct:created ?date .
  }
  graph ?index_a {
    ?fip_index npx:includesElement ?decl_np .
  }
  filter not exists {
    graph npa:graph {
      ?fip_newer_index npa:hasValidSignatureForPublicKey ?pubkey .
      filter not exists { ?fip_newer_index_x npx:invalidates ?fip_newer_index ; npa:hasValidSignatureForPublicKey ?pubkey . }
      ?fip_newer_index dct:created ?newer_date .
      # Matching on the title string is an ugly hack:
      ?fip_newer_index rdfs:label ?fip_title .
    }
    filter(?newer_date > ?index_date).
  }
}
As @tkuhn pointed out, this can be fixed by moving the pattern ?fip_newer_index rdfs:label ?fip_title . out of the graph npa:graph clause. With that change, the query completes in a second or so.
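For illustration, here is one possible reading of that fix, written as a replacement for the final filter not exists block of the query above. The exact placement of the moved pattern is an assumption on my part: it is shown at the outer level of the filter not exists block, where it is matched against the query's default graph rather than npa:graph. Whether that returns the same results depends on how the repository's default graph is configured.

```sparql
# Fragment: replaces the last "filter not exists { ... }" block of the query above.
filter not exists {
  graph npa:graph {
    ?fip_newer_index npa:hasValidSignatureForPublicKey ?pubkey .
    filter not exists { ?fip_newer_index_x npx:invalidates ?fip_newer_index ; npa:hasValidSignatureForPublicKey ?pubkey . }
    ?fip_newer_index dct:created ?newer_date .
  }
  # Assumed placement: the label pattern now sits outside graph npa:graph,
  # so it is evaluated against the default graph instead.
  # Matching on the title string is an ugly hack:
  ?fip_newer_index rdfs:label ?fip_title .
  filter(?newer_date > ?index_date).
}
```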
I initially thought this was an issue with LmdbStore picking the wrong index, but no, the selected index is fine (cspo). The problem is that RDF4J performs this index lookup in a loop, buried deep in the execution plan, instead of doing a join right away. In fact, if you replace the last part of the query with:
filter not exists {
  # Matching on the title string is an ugly hack:
  graph npa:graph {
    ?fip_newer_index rdfs:label ?fip_title .
  }
  graph npa:graph {
    ?fip_newer_index npa:hasValidSignatureForPublicKey ?pubkey .
    filter not exists { ?fip_newer_index_x npx:invalidates ?fip_newer_index ; npa:hasValidSignatureForPublicKey ?pubkey . }
    ?fip_newer_index dct:created ?newer_date .
  }
  filter(?newer_date > ?index_date).
}
This version also completes in a second or so, using an index starting with c.
I think the issue lies somewhere in join order estimation. For some reason, LmdbStore produces worse cardinality estimates for some statement patterns, resulting in a suboptimal join order. Unfortunately, this is not as easy to fix as I initially thought. Unless it's an obvious bug (it might be, but who knows), it would require tuning the cardinality estimation algorithm, which is pretty delicate. Correcting it for this case may break it elsewhere. :( Some DBs (like the now-defunct Blazegraph) had a dynamic optimizer that profiled the query as it ran and reordered joins. I don't think RDF4J has that.
I will try to look into this further, but I can't make any promises. I do see a general workaround pattern, though: factor out highly selective triple patterns into separate query blocks, to force them to run first. So it's not that bad.
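As a rough sketch of that workaround pattern in isolation: the ex: vocabulary, the graph name, and the title literal below are all hypothetical, chosen only to show the shape. The idea is simply that the one pattern expected to match very few statements gets its own graph block, separate from the broader patterns that share its subject. Whether the optimizer actually evaluates it first is not guaranteed and would need checking against the query plan.

```sparql
# Hypothetical vocabulary and graph name, for illustration only.
prefix ex: <http://example.org/>

select ?item ?date where {
  # Highly selective pattern, isolated in its own block.
  graph ex:admin {
    ?item ex:title "Some exact title" .
  }
  # Broader patterns on the same subject, kept in a separate block.
  graph ex:admin {
    ?item ex:signedWith ?key .
    ?item ex:created ?date .
  }
}
```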