Proposing some tooling for datasets (ipfs-pack and stuff) #205
This proposes some tooling for large datasets. Warning! As soon as I wrote it, I already wanted to change it. In particular, I want to change the db thing to just be a normal ipfs repo. It would help with the serving, too. We just need to land making ipfs repos super fast with swappable datastores (right now we can't quite do that).
Proposal posted at https://gist.github.com/jbenet/deda429fae2e5af9a86a01b0cbb614f7 and reproduced below for those getting email.
I will update it with the db -> repo thoughts, and update the gist and the comment below. I will comment when I update it so people get a notification, at least.
IPFS Tooling for datasets
Background
We need some tooling for a certain set of use cases around archival and dataset management. This tooling should fit how people actually work with large files and large datasets.
Grounding Assumptions
Basic grounding assumptions here:
- datasets are "large" (from GB to EB in size)
- datasets should not be duplicated in the filesystem (eg into a .ipfs repo)
- datasets may have different versions
- datasets (at a particular version) are exactly determined (can be hashed)
- people prefer to read and manipulate the datasets in a "working directory" style
- an HTTP or RPC API is not enough; a POSIX filesystem API is essential
- datasets can be represented as a tree of POSIX files and directories
- datasets may be moved using non-ipfs tools
- it would be useful to easily replicate and back up the content (ipfs, ipfs-cluster)
- it would be useful to easily serve the content on the web (ipfs-gateway)
- it would be useful (but not necessary) to digitally sign manifests
Why current IPFS tooling is not enough
The current ipfs tooling assumes we can import all data into a .ipfs repository directory. There are ongoing efforts to build filestore to allow referencing content outside of that directory, but this is not yet finalized, and all metadata is stored in the .ipfs repository, not with the directory in question.
We have often discussed Certified ARchives (.car) as a replacement for tar. This could be a future replacement, along with a reliable way to mount the .cars, but this is not yet here either.
Other tooling examples
- BagIt - https://tools.ietf.org/html/draft-kunze-bagit-06#section-2.1.3
- WARC - https://en.wikipedia.org/wiki/Web_ARChive
- BitTorrent's "manifest-like" .torrent file
Tools for archiving websites:
- https://github.com/edgi-govdata-archiving
- The Internet Archive offers Brozzler as a tool for crawling and archiving sites.
- Web Recorder lets you create verifiable web archive files for submission to the Internet Archive or hosting on your own.
Proposed Tooling Additions
This document proposes the addition or adjustment of the following tools:
- dagger/dagify (or whatever is decided here) - a standalone tool that reads in a file or directory and outputs an (in-order) ipld graph, according to a given format string.
- ipfs-pack - a standalone tool that creates an "ipfs pack" (similar to WARCs, BagIt, and .torrent files, but with IPLD and importers magic).
- datadex or maybe gx-dataset - a tool to prepare and publish a dataset (as an ipfs-pack; it guides the user to add dataset metadata and license info, and publishes to a registry).
- car (still only a proposed tool) - creates certified archives (single-file hash-linked archive, like a hash-linked .tar), and will work closely with ipfs-pack.
- The ipfs repo filestore abstractions can leverage ipfs-packs to understand what is being tracked.
dagger/dagify
This tool (name discussion here) reads in a file or directory and outputs an (in-order) ipld graph, according to a given format string.
> dagger -fmt <fmt-string> -r foo/bar/baz
<ipld-object>
<ipld-object>
<ipld-object>
<ipld-object>
<ipld-object>
Where <fmt-string> is a format string that uniquely determines (forever) the whole dag structure, including chunking scheme, index layout, what is tracked in the index, what is left as raw nodes, etc. The idea is that this string (which ideally will be short) can uniquely describe a strategy for representing the source content as the output ipld graph, and that it can repeatably do so. Meaning that once a given fmt string produces one output, that output should never change (unless there is a major bug). This is because people must retain the ability to verify their content, and they need some primitive to do so.
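The determinism requirement above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: fixed-size chunking stands in for a real format string, plain SHA-256 hex digests stand in for real CIDs, and the names chunkHashes and rootHash are hypothetical rather than part of any proposed tool.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// chunkHashes splits data into fixed-size chunks and hashes each one, in order.
// Assumption: a fixed chunk size stands in for a real <fmt-string>, and
// SHA-256 hex stands in for a real CID.
func chunkHashes(data []byte, size int) []string {
	if size <= 0 {
		return nil
	}
	var out []string
	for start := 0; start < len(data); start += size {
		end := start + size
		if end > len(data) {
			end = len(data)
		}
		sum := sha256.Sum256(data[start:end])
		out = append(out, hex.EncodeToString(sum[:]))
	}
	return out
}

// rootHash hashes the ordered list of chunk hashes into a single root value.
func rootHash(hashes []string) string {
	h := sha256.New()
	for _, s := range hashes {
		h.Write([]byte(s))
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	data := []byte("hello dataset world")
	a := rootHash(chunkHashes(data, 4))
	b := rootHash(chunkHashes(data, 4))
	c := rootHash(chunkHashes(data, 8))
	fmt.Println("same strategy, same root:", a == b)
	fmt.Println("different strategy, same root:", a == c)
}
```

The point of the sketch is only that the strategy (here, the chunk size) is part of what gets verified: the same input with a different strategy yields a different root, so the strategy has to be pinned by the format string.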
dagger/dagify --only-cid --only-root
This tool will have an --only-cid flag that outputs only the cids:
> dagger -fmt <fmt-string> -r foo/bar/baz --only-cid
<ipld-object-cid>
<ipld-object-cid>
<ipld-object-cid>
<ipld-object-cid>
<ipld-object-cid>
And an --only-root flag that returns only the last (root) object or cid.
> dagger -fmt <fmt-string> -r foo/bar/baz --only-root
<last-ipld-object>
> dagger -fmt <fmt-string> -r foo/bar/baz --only-cid --only-root
<last-ipld-cid>
ipfs-pack
filesystem packing tool
The idea is that ipfs-pack is a filesystem packing tool that establishes the notion of a bundle, bag, or "pack" of files. We use pack to avoid confusing it with a Bag from BagIt, a very similar format (that ipfs-pack is compatible with). The way "packs" work is this:
- There MUST BE a pack root directory that defines the pack (eg at <path-to-pack-root>/). It contains all the pack contents and represents the pack in a filesystem.
- There MUST BE a pack manifest file that tracks the ipfs hashes of the pack contents (<pack-root>/PackManifest).
- There MAY BE a pack object database cache file or directory that stores metadata on all the ipld objects in the pack. This is ancillary and can be reconstructed from a pack root at any time.
Subcommands
> ipfs-pack -h
USAGE
ipfs-pack <subcommand> <arguments>
SUBCOMMANDS
make makes the package, overwriting the ipfs-pack manifest file.
verify verifies the ipfs-pack manifest file is correct.
db creates (or updates) a temporary ipfs object database `.ipfs-pack/db`
serve starts an ipfs node serving the pack's contents (to IPFS and/or HTTP).
bag create BagIt spec-compliant bag from a pack.
car create a `.car` certified archive from a pack.
Usage Example
> pwd
/home/jbenet/myPack
> ls
someJSON.json
someXML.xml
moreData/
> ipfs-pack make
> ipfs-pack make -v
wrote PackManifest
> ls
someJSON.json
someXML.xml
moreData/
PackManifest
> cat PackManifest
QmVP2aaAWFe21QjUujMw5hwYRKD1eGx3yYWEBbMtuxpqXs moreData/0
QmV7eDE2WXuwQnvccsoXSzK5CQGXdFfay1LSadZCwyfbDV moreData/1
QmaMY7h9pmTcA5w9S2dsQT5eGLEQ1CwYQ32HwMTXAev5gQ moreData/2
QmQjYU5PscpCHadDbL1fDvTK4P9eXirSwD8hzJbAyrd5mf moreData/3
QmRErwActoLmffucXq7HPtefBC19MjWUcj1DdBoaAnMm6p moreData/4
QmeWvL929Tdhzw27CS5ZVHD73NQ9TT1xvLvCaXCgi7a9YB moreData/5
QmXbzZeh44jJEUueWjFxEiLcfAfzoaKYEy1fMHygkSD3hm moreData/6
QmYL17nYZrZsAhJut5v7ooD9hmz2rBotC1tqC9ZPxzCfer moreData/7
QmPKkidoUYX12PyCuKzehQuhEJofUJ9PPaX2Gc2iYd4GRs moreData/8
QmQAubXA3Gji5v5oaJhMbvmbGbiuwDf1u9sYsN125mcqrn moreData/9
QmYbYduoHMZAUMB5mjHoJHgJ9WndrdWkTCzuQ6yHkbgqkU someJSON.json
QmeWiZD5cdyiJoS3b7h87Cs9G21uQ1sLmeKrunTae9h5qG someXML.xml
QmVizQ5fUceForgWogbb2m2v5RRrE8xEm8uSkbkyNB4Rdm moreData
QmZ7iEGqahTHdUWGGZMUxYRXPwSM3UjBouneLcCmj9e6q6 .
> ipfs-pack db make
> ipfs-pack db make -v
wrote .ipfs-pack/db
> ls -a
./
../
.ipfs-pack/
someJSON.json
someXML.xml
moreData/
PackManifest
> find .ipfs-pack/
.ipfs-pack/
.ipfs-pack/db
ipfs-pack make
create (or update) a pack manifest
This command creates (or updates) the pack's manifest file.
ipfs-pack make
# wrote PackManifest
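As a rough sketch of how a tool might read the manifest shown in the usage example, the following parses lines of the form "<hash> <path>". The line layout is inferred from the example output only; the Entry type and parseManifest function are hypothetical names, and no hash validation is attempted.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Entry is one manifest line: a content hash and the path it describes.
// Hypothetical type, inferred from the example PackManifest output.
type Entry struct {
	Hash string
	Path string
}

// parseManifest reads "<hash> <path>" lines. It splits at the first space
// only, so paths containing spaces are kept whole.
func parseManifest(text string) ([]Entry, error) {
	var entries []Entry
	sc := bufio.NewScanner(strings.NewReader(text))
	for n := 1; sc.Scan(); n++ {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue // tolerate blank lines
		}
		parts := strings.SplitN(line, " ", 2)
		if len(parts) != 2 || strings.TrimSpace(parts[1]) == "" {
			return nil, fmt.Errorf("line %d: expected \"<hash> <path>\"", n)
		}
		entries = append(entries, Entry{Hash: parts[0], Path: strings.TrimSpace(parts[1])})
	}
	return entries, sc.Err()
}

func main() {
	manifest := "QmYbYduoHMZAUMB5mjHoJHgJ9WndrdWkTCzuQ6yHkbgqkU someJSON.json\n" +
		"QmZ7iEGqahTHdUWGGZMUxYRXPwSM3UjBouneLcCmj9e6q6 .\n"
	entries, err := parseManifest(manifest)
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		fmt.Printf("%s -> %s\n", e.Path, e.Hash)
	}
}
```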
ipfs-pack verify
checks whether a pack matches its manifest
This command checks whether a pack matches its PackManifest.
# errors when there is no manifest
> random-files foo
> cd foo
> ipfs-pack verify
error: no PackManifest found
# succeeds when manifest and pack match
> ipfs-pack make
> ipfs-pack verify
# errors when manifest and pack do not match
> echo "QmVizQ5fUceForgWogbb2m2v5RRrE8xEm8uSkbkyNB4Rdm non-existent-file1" >>PackManifest
> echo "QmVizQ5fUceForgWogbb2m2v5RRrE8xEm8uSkbkyNB4Rdm non-existent-file2" >>PackManifest
> touch non-manifest-file3
> ipfs-pack verify
error: in manifest, missing from pack: non-existent-file1
error: in manifest, missing from pack: non-existent-file2
error: in pack, missing from manifest: non-manifest-file3
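The mismatch reporting in the last case could be sketched as a two-way set difference over paths. This is an illustration under assumptions: diffPack is a hypothetical helper, it compares path presence only (a real verify would also re-hash the content), and the caller is assumed to have already excluded PackManifest and .ipfs-pack/ from the pack listing.

```go
package main

import (
	"fmt"
	"sort"
)

// diffPack compares paths listed in the manifest against paths found in the
// pack directory and returns error lines in the style of the example above.
// Assumption: presence check only; content hashes are not recomputed here.
func diffPack(manifest, pack []string) []string {
	inManifest := make(map[string]bool, len(manifest))
	for _, p := range manifest {
		inManifest[p] = true
	}
	inPack := make(map[string]bool, len(pack))
	for _, p := range pack {
		inPack[p] = true
	}

	var missingFromPack, missingFromManifest []string
	for p := range inManifest {
		if !inPack[p] {
			missingFromPack = append(missingFromPack, p)
		}
	}
	for p := range inPack {
		if !inManifest[p] {
			missingFromManifest = append(missingFromManifest, p)
		}
	}
	// Sort so the report is stable regardless of map iteration order.
	sort.Strings(missingFromPack)
	sort.Strings(missingFromManifest)

	var errs []string
	for _, p := range missingFromPack {
		errs = append(errs, "error: in manifest, missing from pack: "+p)
	}
	for _, p := range missingFromManifest {
		errs = append(errs, "error: in pack, missing from manifest: "+p)
	}
	return errs
}

func main() {
	manifest := []string{"a.json", "non-existent-file1", "non-existent-file2"}
	pack := []string{"a.json", "non-manifest-file3"}
	for _, e := range diffPack(manifest, pack) {
		fmt.Println(e)
	}
}
```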
ipfs-pack db
creates (or updates) a temporary ipfs object database
This command creates (or updates) a temporary ipfs object database (eg at .ipfs-pack/db). This object database contains positional metadata for all IPLD objects contained in the pack. (It follows the ipfs repo filestore metadata concerns). It MAY be a different, simpler object-db format, or be a full-fledged ipfs node repo using filestore.
The db is a simple key-value store that:
- maps { <ipld-cid> : <filestore-descriptor> }
- supports: list() []<ipld-cid> to show all cids in db
- supports: put(<ipld-object>) <ipld-cid>
- supports: get(<ipld-cid>) <ipld-object>
- supports: putDescriptor(<ipld-cid>, <filestore-descriptor>)
- supports: getDescriptor(<ipld-cid>) <filestore-descriptor>
- supports: delete() to remove itself from disk
Notes:
- <filestore-descriptor> is the metadata necessary to reconstruct the entire object from data in the pack.
- {get,put} should be able to add or retrieve the objects from db or from the data in the pack.
- {get,put}Descriptor should be able to add or retrieve file descriptors for objects stored in the pack.
- Intermediate ipld objects (eg intermediate objects in a file, which are not raw data nodes) may need to be stored in the db.
This database basically implements:
type PackObjectDB interface {
	// Make creates or updates a pack-db at packdbPath,
	// with data for all the objects in the pack at packPath.
	Make(packPath string, packdbPath string) error

	// Put associates the given filestore.Descriptor with the given ipld.CID.
	// If filestore.Descriptor is nil, Put removes the entry for ipld.CID (rm).
	Put(ipld.CID, filestore.Descriptor) error

	// Get retrieves the filestore.Descriptor associated with the given ipld.CID.
	Get(ipld.CID) (filestore.Descriptor, error)

	// List returns all ipld.CID stored in the database.
	List() (<-chan ipld.CID, error)

	// Delete deletes all the database contents and clears all files.
	Delete() error
}
It does so both through a programmatic interface (some go package) and via cli tooling:
> ipfs-pack-db --help
USAGE
ipfs-pack-db <subcommand> <arguments>
SUBCOMMANDS
make creates (or updates) the pack-db for a pack directory
list lists all cids in the pack-db
put adds a (cid, filestore-descriptor) entry.
get retrieves the filestore-descriptor for a given cid.
delete removes all files representing the pack-db (destructive)
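For a feel of the Put/Get/List/Delete semantics, an in-memory stand-in might look like the sketch below. It is not the proposed on-disk db: CID and Descriptor are hypothetical placeholder types standing in for ipld.CID and filestore.Descriptor, Make is omitted because it would need to walk and chunk a real pack, and memDB is an illustrative name.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"sync"
)

// CID and Descriptor are placeholder types (assumptions), standing in for
// ipld.CID and filestore.Descriptor from the interface above.
type CID string

type Descriptor struct {
	Path   string
	Offset int64
	Size   int64
}

var ErrNotFound = errors.New("cid not found")

// memDB keeps the cid -> descriptor mapping in memory only.
type memDB struct {
	mu sync.Mutex
	m  map[CID]Descriptor
}

func newMemDB() *memDB { return &memDB{m: make(map[CID]Descriptor)} }

// Put stores the descriptor for c; a nil descriptor removes the entry.
func (db *memDB) Put(c CID, d *Descriptor) error {
	db.mu.Lock()
	defer db.mu.Unlock()
	if d == nil {
		delete(db.m, c)
		return nil
	}
	db.m[c] = *d
	return nil
}

// Get returns a copy of the descriptor stored for c.
func (db *memDB) Get(c CID) (*Descriptor, error) {
	db.mu.Lock()
	defer db.mu.Unlock()
	d, ok := db.m[c]
	if !ok {
		return nil, ErrNotFound
	}
	return &d, nil
}

// List returns all stored cids, sorted, on an already-closed buffered channel.
func (db *memDB) List() (<-chan CID, error) {
	db.mu.Lock()
	defer db.mu.Unlock()
	keys := make([]CID, 0, len(db.m))
	for k := range db.m {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool { return keys[i] < keys[j] })
	ch := make(chan CID, len(keys))
	for _, k := range keys {
		ch <- k
	}
	close(ch)
	return ch, nil
}

// Delete drops every entry.
func (db *memDB) Delete() error {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.m = make(map[CID]Descriptor)
	return nil
}

func main() {
	db := newMemDB()
	db.Put("QmB", &Descriptor{Path: "moreData/1", Size: 10})
	db.Put("QmA", &Descriptor{Path: "moreData/0", Size: 10})
	cids, _ := db.List()
	for c := range cids {
		d, _ := db.Get(c)
		fmt.Println(c, d.Path)
	}
}
```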
ipfs-pack serve
starts an ipfs node serving the pack's contents (to IPFS and/or HTTP).
This command starts an ipfs node serving the pack's contents (to IPFS and/or HTTP). This command MAY require a full go-ipfs installation to exist. It MAY be a standalone binary (ipfs-pack-serve). It MUST use an ephemeral node or a one-off node whose id would be stored locally, in the pack, at <pack-root>/.ipfs-pack/repo.
> ipfs-pack serve --http
Serving pack at /ip4/0.0.0.0/tcp/1234/http - http://127.0.0.1:1234
> ipfs-pack serve --ipfs
Serving pack at /ip4/0.0.0.0/tcp/1234/ipfs/QmPVUA4rJgckcf1ifrZF5KvwV1Uib5SGjJ7Z5BskEpTaSE
ipfs-pack bag
convert to and from BagIt (spec-compliant) bags.
This command converts between packs and BagIt (spec-compliant) bags, a commonly used archiving format very similar to ipfs-pack. It works like this:
> ipfs-pack bag --help
USAGE
ipfs-pack-bag <src-pack> <dst-bag>
ipfs-pack-bag <src-bag> <dst-pack>
# convert from pack to bag
> ipfs-pack bag path/to/mypack path/to/mybag
# convert from bag to pack
> ipfs-pack bag path/to/mybag path/to/mypack
ipfs-pack car
convert to and from a car (certified archive).
This command converts between packs and cars (certified archives). It works like this:
> ipfs-pack car --help
USAGE
ipfs-pack-car <src-pack> <dst-car>
ipfs-pack-car <src-car> <dst-pack>
# convert from pack to car
> ipfs-pack car path/to/mypack path/to/mycar.car
# convert from car to pack
> ipfs-pack car path/to/mycar.car path/to/mypack
datadex or maybe gx-dataset
WIP
a tool to prepare and publish a dataset (as an ipfs-pack, guides user to add dataset metadata and license info, and publishes to a registry)
car - certified archives
WIP
cars would interop with packs.
The ipfs repo filestore
WIP
Maybe the ipfs repo filestore abstractions can leverage ipfs-packs to understand what is being tracked in a given directory, particularly if those packs have up-to-date local dbs of all their objects.