DBMStore #186
Added tests against Berkeley DB. These run on Travis (Linux) only; I don't think it's worth trying to get bsddb3 built on AppVeyor. |
OK, I think this is ready to go. |
Here's an example using Berkeley DB B-tree:
In [1]: import zarr
In [2]: import bsddb3
In [3]: store = zarr.DBMStore('example.bdb', open=bsddb3.btopen)
In [4]: grp = zarr.group(store)
In [5]: z = grp.create_dataset('foo', shape=100000000, dtype='i8')
In [7]: import numpy as np
In [8]: z[:] = np.arange(z.shape[0])
In [9]: z[:]
Out[9]: array([ 0, 1, 2, ..., 99999997, 99999998, 99999999])
In [10]: store.close()
cc @jeromekelleher, @jakirkham - this PR adds support for storing data in any DBM-style database, including Berkeley DB. It should provide an alternative to zip files, without the issues around replacing existing entries. I haven't figured out if/how this works under parallel reads or parallel writes to an array. I know Berkeley DB supports various concurrency options, but I don't know which is enabled by default or which is most appropriate for use with zarr. In any case, all the tests pass, so I will probably merge this and add some caveats to the docs around unknowns for parallel usage. I'd be interested to hear how it goes if you do try it. |
Sounds like a good idea. Don't have time to review it, but do like the idea of having this option. |
No problem, just thought you'd be interested.
--
Alistair Miles
Head of Epidemiological Informatics
Centre for Genomics and Global Health <http://cggh.org>
Big Data Institute Building
Old Road Campus
Roosevelt Drive
Oxford
OX3 7LF
United Kingdom
Phone: +44 (0)1865 743596
Email: [email protected]
Web: http://alimanfoo.github.io/ <http://purl.org/net/aliman>
Twitter: https://twitter.com/alimanfoo
|
Indeed. It is interesting. Thanks for the ping. :) |
Thanks @alimanfoo, I'll have a play with this when I get a chance and let you know how it goes. |
FYI, I'm trying this out instead of Zip containers. Working great so far! |
Cool! Have you tried any concurrent reads or writes? From what I've been
able to glean so far, there are various different locking modes supported
internally within Berkeley DB, but it's not immediately obvious how to use
them via bsddb3 Python API, and it would be great to know which (if any)
should be initialised when using with zarr if you're expecting to do
concurrent reads or concurrent writes to an array.
|
Yeah, doing both concurrent reads and writes and it seems to work fine. Not sure how bsddb3 supports concurrency in BDB. It's pretty solid once it goes through the correct DBEnv object though as far as I know. |
Good to know. FWIW it looks like if you use one of the shortcut functions like bsddb3.btopen then it uses a DBEnv with the locking subsystem initialized (DB_INIT_LOCK). I gather that means it is safe to attempt concurrent writes, as long as there's some way to detect deadlocks, which it looks like bsddb3 tries to do (_DeadlockWrap is used everywhere). The next interesting question is whether you actually get some concurrent throughput, i.e., whether you see more than one CPU being utilised while doing concurrent writes, or whether the database locking subsystem prevents that altogether. |
It's hard to know in my case as I'm doing a lot of compression with concurrent writes, so that's dominating my CPU time. I have 4 cores doing compression, and one core feeding them and it's all looking like it should. As far as I can tell the locking is pretty fine grained and allowing everything to go ahead pretty nicely. |
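The pattern described above (several workers compressing, with writes going into one store) can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard library's `dbm.dumb` and `zlib` rather than Berkeley DB and zarr's compressors, the helper name `write_chunks` is hypothetical, and the explicit `threading.Lock` is a stand-in for whatever locking the database provides. It says nothing about how Berkeley DB's own locking behaves.

```python
import dbm.dumb
import os
import tempfile
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor


def write_chunks(db, chunks, n_workers=4):
    """Compress chunks in worker threads and write them to a DBM-style object.

    Hypothetical helper: `db` is any DBM-style object with item assignment,
    `chunks` maps bytes keys to raw bytes.
    """
    lock = threading.Lock()  # assumption: the store itself is not thread-safe

    def compress_and_store(item):
        key, raw = item
        cdata = zlib.compress(raw)  # compression happens outside the lock
        with lock:                  # only the store write is serialized
            db[key] = cdata
        return key

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(compress_and_store, chunks.items()))


# Usage: eight made-up chunks written into a temporary dbm.dumb database.
path = os.path.join(tempfile.mkdtemp(), "example")
db = dbm.dumb.open(path, "c")
chunks = {("foo.%d" % i).encode("ascii"): bytes([i]) * 100000 for i in range(8)}
written = write_chunks(db, chunks)
db.close()
```

Because the lock is held only for the write, the compression work can overlap across workers; whether that yields real multi-CPU throughput depends on the compressor and the store in use.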
That's great to know, thanks.
|
This PR adds a DBMStore class, which is a compatibility wrapper around any DBM-style database object. This includes the DBM-style objects available from the standard library, as well as Berkeley DB and more. Resolves #133.
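To illustrate what a compatibility wrapper around a DBM-style object might look like, here is a minimal sketch using only the standard library. It is not the DBMStore implementation from this PR: the class name `SimpleDBMStore` and its details are hypothetical, and it assumes string keys that are ASCII-encodable and bytes values.

```python
import dbm.dumb
import os
import tempfile
from collections.abc import MutableMapping


class SimpleDBMStore(MutableMapping):
    """Hypothetical mapping wrapper around a DBM-style database object."""

    def __init__(self, path, flag="c", open=dbm.dumb.open):
        # `open` is any callable returning a DBM-style object; the parameter
        # name mirrors the `open=` argument in the example earlier in the thread.
        self.db = open(path, flag)

    @staticmethod
    def _key(key):
        # DBM-style databases store bytes keys; encode str keys on the way in.
        return key.encode("ascii") if isinstance(key, str) else key

    def __getitem__(self, key):
        return self.db[self._key(key)]

    def __setitem__(self, key, value):
        # Assigning to an existing key replaces the entry in place.
        self.db[self._key(key)] = value

    def __delitem__(self, key):
        del self.db[self._key(key)]

    def __iter__(self):
        return (k.decode("ascii") for k in self.db.keys())

    def __len__(self):
        return len(self.db.keys())

    def close(self):
        self.db.close()


# Usage: write a value, then replace it under the same key.
path = os.path.join(tempfile.mkdtemp(), "example")
store = SimpleDBMStore(path)
store["foo/0"] = b"chunk bytes"
store["foo/0"] = b"replaced chunk bytes"
store.close()
```

Passing a different `open` callable is the point of the design: the wrapper only relies on the mapping-like interface that DBM-style objects share.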