Skip to content
This repository was archived by the owner on Apr 29, 2020. It is now read-only.

perf: only descend into hamt subshard that has the target entry #10

Merged

Conversation

achingbrain
Copy link
Collaborator

@achingbrain achingbrain commented Dec 3, 2018

Since bucket indexes are stable based on the file input and current tree state, we can predict which subshard will contain a given file based on how deep we are in the hamt, so there's no need to traverse the entire shard to find one file.

Fixes #9

Since hash results are stable based on the file input, we can predict
which subshard will contain a given file based on how deep we are
in the hamt, so there's no need to traverse the entire shard to
find one file.

Fixes #9
@ghost ghost assigned achingbrain Dec 3, 2018
@ghost ghost added the in progress label Dec 3, 2018
@daviddias daviddias requested a review from pgte December 3, 2018 08:26
return {
depth: p ? depth + 1 : depth,
name: p,
path: pp,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lol, variable names.

return addLinksToHamtBucket(node.links, options.lastBucket, options.rootBucket, cb)
},
(cb) => findPosition(targetFile, options.rootBucket, cb),
({ position }, cb) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

destructing is great except when it hides away completely what else could be inside the object.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's useful to limit the scope of the handler. In this case it was unnecessary, have removed.

@@ -6,14 +6,19 @@ chai.use(require('dirty-chai'))
const expect = chai.expect
const IPLD = require('ipld')
const UnixFS = require('ipfs-unixfs')
const pull = require('pull-stream')
const pull = require('pull-stream/pull')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean require('pull-stream/pull') instead of require('pull-stream')?

If so: https://github.com/pull-stream/pull-stream#minimal-bundle

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! We should do this everywhere then!


exported.forEach(file => expect(file.type).to.equal('file'))

cb()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't call callbacks inside try/catch blocks

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the time I found it made the tests fail faster. Of course now it doesn't seem to make any difference, have removed them all.

@achingbrain achingbrain force-pushed the only-descend-into-hamt-shards-that-have-our-entry branch from 0af3b42 to b530f53 Compare December 3, 2018 09:33
if (link.name.length === 2) {
const pos = parseInt(link.name, 16)

return bucket._putObjectAt(pos, new Bucket({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By convention underscore prefixed properties are not normally part of the public API, and are likely to change without notice.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, but we control that module. I can make them all public, but they aren't really of use unless you're trying to predict shard entry indices.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH all of this code used to be in the same module but we decomposed it. It's all really tightly coupled still, could definitely be improved in future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, but the exact problem I want to avoid happened with ipfs and dag-pb when the _multihash property was removed - we control both of those modules too!

If you want to deal with it in a separate PR that's cool, but lets at least make sure it does finally get done, maybe raise an issue?

callback(err)
callback = null
})
.then(() => callback && callback())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason behind this unusual structure? A promise can only resolve/reject once, why would this not work in this case?:

Promise.all(/*... */)
   .then(callback)
   .catch(callback)

Copy link
Collaborator Author

@achingbrain achingbrain Dec 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If .then is invoked and callback throws, .catch is invoked which calls callback. If callback is from async.js it'll be wrapped in once which throws another error saying callback already invoked! and smothering the actual error which is not terribly helpful.

All this will go away when we start using async/await/iterators everywhere.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, gotcha

}

const findPosition = (file, bucket, cb) => {
bucket._findNewBucketAndPos(file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another "private" method!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment.

}

cb(null, position)
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another unusual promise chain for calling a callback!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See previous comment.

const path = []

while (bucket._parent) {
path.push(bucket)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: unshift so you don't have to reverse afterwards?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be slower: https://jsperf.com/push-reverse-vs-unshift (95% slower when I run it)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That link crashes my chrome for some reason! On firefox it's faster. I found a different jsperf when I commented and it was faster also, that's the only reason for the suggestion. It's not important so lets forget this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah? Weird - in FF using that link unshift is 94% slower for me.

It took a couple of goes to get Chrome to finish the test 🤷‍♂️

const values = require('pull-stream/sources/values')
const filter = require('pull-stream/throughs/filter')
const map = require('pull-stream/throughs/map')
const cat = require('pull-cat')
const Bucket = require('hamt-sharding/src/bucket')
const DirSharded = require('ipfs-unixfs-importer/src/importer/dir-sharded')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: pull out as shared dependency?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, on the todo list as a future improvement.

delete options.rootBucket
delete options.lastBucket
delete options.hamtDepth
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait so...is the change of the values of these options depended on outside of shardedDirExporter? That doesn't sound good...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, but each invocation of the shardedDirExporter is invoked independently of the others. The hashing algorithm used by HAMT needs to know about more than the current level because the buckets aren't related to directories so we need some way of passing that information along during a graph traversal.

Once we've found the target node we can remove the data as we're done with the traversal. We want to remove it in case we traverse from a HAMT to a normal dir and then back to a HAMT.

Since a HAMT is only made up of sub shards, files and directories, once we hit a file or directory we are done. That is, there's no chance of us going from one HAMT directly to another HAMT in a traversal, even if the directory we resolve to happens to be a HAMT (since it's a directory and not a sub shard).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok understood - do you think there's an improvement to be made here in terms of understandability? Can we raise an issue for a future resolution?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's kind of bad form to modify the incoming arguments anyway. When we do the refactor in #7 it'd be a good time to revisit this as I don't think the original approach expected any traversals to require state.

@achingbrain
Copy link
Collaborator Author

achingbrain commented Dec 3, 2018

This PR decreases the time it takes to read a file from the HAMT shard that backs our npm-on-ipfs registry from about three minutes (😱) down to about a second, so I'm going to merge this and address the outstanding issues in further PRs.

@achingbrain achingbrain merged commit aeede0b into master Dec 3, 2018
@achingbrain achingbrain deleted the only-descend-into-hamt-shards-that-have-our-entry branch December 3, 2018 12:56
@ghost ghost removed the in progress label Dec 3, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants