-
Notifications
You must be signed in to change notification settings - Fork 3
perf: only descend into hamt subshard that has the target entry #10
perf: only descend into hamt subshard that has the target entry #10
Conversation
Since hash results are stable based on the file input, we can predict which subshard will contain a given file based on how deep we are in the hamt, so there's no need to traverse the entire shard to find one file. Fixes #9
return { | ||
depth: p ? depth + 1 : depth, | ||
name: p, | ||
path: pp, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lol, variable names.
src/dir-hamt-sharded.js
Outdated
return addLinksToHamtBucket(node.links, options.lastBucket, options.rootBucket, cb) | ||
}, | ||
(cb) => findPosition(targetFile, options.rootBucket, cb), | ||
({ position }, cb) => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
destructing is great except when it hides away completely what else could be inside the object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's useful to limit the scope of the handler. In this case it was unnecessary, have removed.
@@ -6,14 +6,19 @@ chai.use(require('dirty-chai')) | |||
const expect = chai.expect | |||
const IPLD = require('ipld') | |||
const UnixFS = require('ipfs-unixfs') | |||
const pull = require('pull-stream') | |||
const pull = require('pull-stream/pull') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean require('pull-stream/pull')
instead of require('pull-stream')
?
If so: https://github.com/pull-stream/pull-stream#minimal-bundle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! We should do this everywhere then!
test/exporter-sharded.spec.js
Outdated
|
||
exported.forEach(file => expect(file.type).to.equal('file')) | ||
|
||
cb() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't call callbacks inside try/catch blocks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the time I found it made the tests fail faster. Of course now it doesn't seem to make any difference, have removed them all.
0af3b42
to
b530f53
Compare
if (link.name.length === 2) { | ||
const pos = parseInt(link.name, 16) | ||
|
||
return bucket._putObjectAt(pos, new Bucket({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By convention underscore prefixed properties are not normally part of the public API, and are likely to change without notice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, but we control that module. I can make them all public, but they aren't really of use unless you're trying to predict shard entry indices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH all of this code used to be in the same module but we decomposed it. It's all really tightly coupled still, could definitely be improved in future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood, but the exact problem I want to avoid happened with ipfs and dag-pb when the _multihash
property was removed - we control both of those modules too!
If you want to deal with it in a separate PR that's cool, but lets at least make sure it does finally get done, maybe raise an issue?
callback(err) | ||
callback = null | ||
}) | ||
.then(() => callback && callback()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason behind this unusual structure? A promise can only resolve/reject once, why would this not work in this case?:
Promise.all(/*... */)
.then(callback)
.catch(callback)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If .then
is invoked and callback
throws, .catch
is invoked which calls callback
. If callback
is from async.js
it'll be wrapped in once
which throws another error saying callback already invoked!
and smothering the actual error which is not terribly helpful.
All this will go away when we start using async/await/iterators everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, gotcha
} | ||
|
||
const findPosition = (file, bucket, cb) => { | ||
bucket._findNewBucketAndPos(file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another "private" method!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See previous comment.
} | ||
|
||
cb(null, position) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another unusual promise chain for calling a callback!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See previous comment.
const path = [] | ||
|
||
while (bucket._parent) { | ||
path.push(bucket) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: unshift
so you don't have to reverse afterwards?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be slower: https://jsperf.com/push-reverse-vs-unshift (95% slower when I run it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That link crashes my chrome for some reason! On firefox it's faster. I found a different jsperf when I commented and it was faster also, that's the only reason for the suggestion. It's not important so lets forget this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah? Weird - in FF using that link unshift is 94% slower for me.
It took a couple of goes to get Chrome to finish the test 🤷♂️
const values = require('pull-stream/sources/values') | ||
const filter = require('pull-stream/throughs/filter') | ||
const map = require('pull-stream/throughs/map') | ||
const cat = require('pull-cat') | ||
const Bucket = require('hamt-sharding/src/bucket') | ||
const DirSharded = require('ipfs-unixfs-importer/src/importer/dir-sharded') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: pull out as shared dependency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, on the todo list as a future improvement.
delete options.rootBucket | ||
delete options.lastBucket | ||
delete options.hamtDepth | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait so...is the change of the values of these options
depended on outside of shardedDirExporter
? That doesn't sound good...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, but each invocation of the shardedDirExporter
is invoked independently of the others. The hashing algorithm used by HAMT needs to know about more than the current level because the buckets aren't related to directories so we need some way of passing that information along during a graph traversal.
Once we've found the target node we can remove the data as we're done with the traversal. We want to remove it in case we traverse from a HAMT to a normal dir and then back to a HAMT.
Since a HAMT is only made up of sub shards, files and directories, once we hit a file or directory we are done. That is, there's no chance of us going from one HAMT directly to another HAMT in a traversal, even if the directory we resolve to happens to be a HAMT (since it's a directory and not a sub shard).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok understood - do you think there's an improvement to be made here in terms of understandability? Can we raise an issue for a future resolution?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's kind of bad form to modify the incoming arguments anyway. When we do the refactor in #7 it'd be a good time to revisit this as I don't think the original approach expected any traversals to require state.
This PR decreases the time it takes to read a file from the HAMT shard that backs our npm-on-ipfs registry from about three minutes (😱) down to about a second, so I'm going to merge this and address the outstanding issues in further PRs. |
Since bucket indexes are stable based on the file input and current tree state, we can predict which subshard will contain a given file based on how deep we are in the hamt, so there's no need to traverse the entire shard to find one file.
Fixes #9