DBMStore #186

Merged
merged 7 commits into from
Nov 16, 2017

alimanfoo
Member

This PR adds a DBMStore class, a compatibility wrapper around any DBM-style database object. That includes the DBM-style objects available from the standard library, as well as Berkeley DB and others. Resolves #133.
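
To illustrate what "compatibility wrapper" means here, the sketch below wraps a standard-library DBM database in a `MutableMapping`, the dict-like interface a zarr store is expected to provide. This is not the code from this PR: the class name `SimpleDBMStore` and all of its details are assumptions for illustration, and `dbm.dumb` is used only because that backend is available on every platform.

```python
import dbm.dumb
import os
import tempfile
from collections.abc import MutableMapping


class SimpleDBMStore(MutableMapping):
    """Hypothetical dict-like wrapper around a DBM-style database."""

    def __init__(self, path, flag="c", open=dbm.dumb.open):
        # `open` is any callable that returns a DBM-style object.
        self.db = open(path, flag)

    @staticmethod
    def _key(key):
        # DBM databases work with bytes, so encode str keys.
        return key.encode("ascii") if isinstance(key, str) else key

    def __getitem__(self, key):
        return self.db[self._key(key)]

    def __setitem__(self, key, value):
        self.db[self._key(key)] = value

    def __delitem__(self, key):
        del self.db[self._key(key)]

    def __iter__(self):
        # Decode keys back to str for callers.
        return (k.decode("ascii") for k in self.db.keys())

    def __len__(self):
        return len(self.db)

    def close(self):
        self.db.close()


tmpdir = tempfile.mkdtemp()
store = SimpleDBMStore(os.path.join(tmpdir, "example.db"))
store["foo/.zarray"] = b"{}"
store["foo/0"] = b"\x00\x01"
store["foo/0"] = b"\x02\x03"  # writing an existing key simply replaces the entry
keys = sorted(store)
value = store["foo/0"]
store.close()
```

Passing a different `open` callable is what lets the same wrapper sit on top of other DBM-style backends.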

@alimanfoo alimanfoo added this to the v2.2 milestone Nov 16, 2017
@alimanfoo
Member Author

Added tests against Berkeley DB. These run on Travis (Linux) only; I don't think it's worth trying to get bsddb3 built on AppVeyor.

@alimanfoo
Member Author

OK, I think this is ready to go.

@alimanfoo
Member Author

Here's an example using Berkeley DB B-tree:

In [1]: import zarr

In [2]: import bsddb3

In [3]: store = zarr.DBMStore('example.bdb', open=bsddb3.btopen)

In [4]: grp = zarr.group(store)

In [5]: z = grp.create_dataset('foo', shape=100000000, dtype='i8')

In [7]: import numpy as np

In [8]: z[:] = np.arange(z.shape[0])

In [9]: z[:]
Out[9]: array([       0,        1,        2, ..., 99999997, 99999998, 99999999])

In [10]: store.close()

cc @jeromekelleher, @jakirkham: this PR adds support for storing data in any DBM-style database, including Berkeley DB. It should provide an alternative to zip files, without the issues around replacing existing entries. I haven't figured out if or how this works under parallel reads or parallel writes to an array. I know Berkeley DB supports various concurrency options, but I don't know which is enabled by default or which is most appropriate for use with zarr. In any case, all the tests pass, so I will probably merge this and add some caveats to the docs about the unknowns for parallel usage. I'd be interested to hear how it goes if you do try it.

@jakirkham
Member

Sounds like a good idea. Don't have time to review it, but do like the idea of having this option.

@alimanfoo
Member Author

alimanfoo commented Nov 16, 2017 via email

@jakirkham
Member

Indeed. It is interesting. Thanks for the ping. :)

@alimanfoo alimanfoo merged commit 0a0fb1a into master Nov 16, 2017
@alimanfoo alimanfoo deleted the dbm branch November 16, 2017 20:31
@alimanfoo alimanfoo mentioned this pull request Nov 16, 2017
4 tasks
@jeromekelleher
Member

Thanks @alimanfoo, I'll have a play with this when I get a chance and let you know how it goes.

@jeromekelleher
Member

FYI, I'm trying this out instead of Zip containers. Working great so far!

@alimanfoo
Member Author

alimanfoo commented Nov 17, 2017 via email

@jeromekelleher
Member

Yeah, I'm doing both concurrent reads and writes and it seems to work fine. I'm not sure how bsddb3 supports concurrency in BDB. As far as I know, it's pretty solid once it goes through the correct DBEnv object, though.

@alimanfoo
Member Author

Good to know. FWIW, it looks like if you use one of the shortcut functions such as bsddb3.btopen, it uses a DBEnv with the locking subsystem initialized (DB_INIT_LOCK). I gather that means it is safe to attempt concurrent writes, as long as there's some way to detect deadlocks, which bsddb3 appears to try to do (_DeadlockWrap is used everywhere). The next interesting question is whether you actually get some concurrent throughput, i.e., whether you see multiple CPUs in use during concurrent writes, or whether the database locking subsystem prevents that altogether.

@jeromekelleher
Member

It's hard to know in my case, as I'm doing a lot of compression alongside the concurrent writes, so that's dominating my CPU time. I have 4 cores doing compression and one core feeding them, and it's all looking like it should. As far as I can tell, the locking is pretty fine-grained and is letting everything go ahead nicely.
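
The workload shape described above (one feeder, four compression workers, short serialised writes) can be sketched roughly as follows. This is only an illustration under stated assumptions: a plain `dict` stands in for the DBM-backed store, a `threading.Lock` stands in for the database's own locking, and the function name `compress_and_write` is hypothetical. It does not exercise Berkeley DB or bsddb3 at all.

```python
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor

store = {}                     # stand-in for a DBM-backed store (assumption)
store_lock = threading.Lock()  # stand-in for the database's locking (assumption)


def compress_and_write(index, raw):
    # Compression is the expensive step and runs outside the lock,
    # so workers can overlap here.
    data = zlib.compress(raw)
    # Only the short write to the shared store is serialised.
    with store_lock:
        store[f"foo/{index}"] = data
    return index


# The main thread feeds chunks to four compression workers.
chunks = [bytes([i % 256]) * 100_000 for i in range(16)]
with ThreadPoolExecutor(max_workers=4) as pool:
    done = list(pool.map(compress_and_write, range(len(chunks)), chunks))

# Check that every chunk can be read back and decompressed intact.
roundtrip_ok = all(
    zlib.decompress(store[f"foo/{i}"]) == chunks[i] for i in range(len(chunks))
)
```

If compression dominates, as in the case described, the time spent holding the lock is small relative to the work done outside it, which would make it hard to tell from CPU usage alone how much the store's locking limits throughput.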

@alimanfoo
Member Author

alimanfoo commented Nov 17, 2017 via email

@alimanfoo alimanfoo added enhancement New features or improvements release notes done Automatically applied to PRs which have release notes. labels Nov 20, 2017
@jakirkham jakirkham mentioned this pull request Jul 15, 2023