Proposing some tooling for datasets (ipfs-pack and stuff) #205

This proposes some tooling for large datasets. **Warning!** As soon as I wrote it, I already wanted to change it. In particular, I want to change the `db` thing to just be a normal ipfs repo, which would help with the serving, too. We just need to land making ipfs repos super fast with swappable datastores (right now we can't quite do that).

Proposal posted at https://gist.github.com/jbenet/deda429fae2e5af9a86a01b0cbb614f7 and reproduced below for those getting email.

I will update it with the `db -> repo` thoughts, and update the gist and the comment below. I will comment when I update it so people get a notification, at least.


# IPFS Tooling for datasets

## Background

We need some tooling for a certain set of use cases around archival and dataset management. This tooling should fit how people work with large files and large datasets.

### Grounding Assumptions

Basic grounding assumptions here:
- datasets are "large" (From GB to EB in size)
- datasets should not be duplicated in the filesystem (eg into a .ipfs repo)
- datasets may have different versions
- datasets (at a particular version) are exactly determined (can be hashed)
- people prefer to read and manipulate the datasets in a "working directory" style
- it is not enough to have an HTTP or RPC API, but rather a POSIX filesystem api is essential
- datasets can be represented as a tree of POSIX files and directories
- datasets may be moved using non-ipfs tools
- it would be useful to easily replicate and back up the content (ipfs, ipfs-cluster)
- it would be useful to easily serve the content on the web (ipfs-gateway)
- it would be useful (but not necessary) to digitally sign manifests

### Why current IPFS tooling is not enough

The current ipfs tooling assumes we can import all data into a `.ipfs` repository directory. There are ongoing efforts to build `filestore` to allow referencing content outside of that directory, but this is not yet finalized, and all metadata is stored in the `.ipfs` repository, not with the directory in question.

We have often discussed Certified ARchives (`.car`) as a replacement for tar. This could be a future replacement, along with a reliable way to mount `.car` files, but this is not yet here either.

### Other tooling examples

- BagIt - https://tools.ietf.org/html/draft-kunze-bagit-06#section-2.1.3
- WARC - https://en.wikipedia.org/wiki/Web_ARChive
- BitTorrent's "manifest-like" `.torrent` file

Tools for archiving websites:
* https://github.com/edgi-govdata-archiving
* The Internet Archive offers [Brozzler](https://github.com/internetarchive/brozzler) as a tool for crawling and archiving sites.
* [Web Recorder](https://webrecorder.io/) lets you create verifiable web archive files for submission to the Internet Archive or hosting on your own.

## Proposed Tooling Additions

This document proposes the addition or adjustment of the following tools:

- `dagger/dagify` (or [whatever is decided here](https://github.com/ipfs/notes/issues/204)) - a _standalone_ tool that reads in a file or directory and outputs an (in-order) ipld graph, according to a given format string.
- `ipfs-pack` - a _standalone_ tool that creates an "ipfs pack" (similar to WARCs, BagIt, and .torrent files, but with IPLD and importers magic). 
- `datadex` or maybe `gx-dataset` - a tool to prepare and publish a dataset as an ipfs-pack (guides the user to add dataset metadata and license info, and publishes to a registry).
- `car` (still only a proposed tool) - creates certified archives (a single-file hash-linked archive, like a hash-linked .tar); it will work closely with ipfs-pack.
- The `ipfs repo filestore` abstractions can leverage `ipfs-packs` to understand what is being tracked.

### `dagger/dagify` 

This tool ([name discussion here](https://github.com/ipfs/notes/issues/204)) reads in a file or directory and outputs an (in-order) ipld graph, according to a given format string. 

```
> dagger -fmt <fmt-string> -r foo/bar/baz
<ipld-object>
<ipld-object>
<ipld-object>
<ipld-object>
<ipld-object>
```

Where `<fmt-string>` is a format string that uniquely _determines_ (forever) the whole dag structure, including chunking scheme, index layout, what is tracked in the index, what is left as raw nodes, etc. The idea is that this string (which ideally will be short) can uniquely describe a strategy for representing the source content as the output ipld graph, and that it can do so repeatably. Once a given fmt string produces one output for a given input, that output should never change (unless there is a major bug). This is because people must retain the ability to verify their content, and they need some primitive to do so.
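To make the "same fmt string, same output" property concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `fixed-<bytes>` format syntax, the function names, and the use of hex SHA-256 digests as stand-ins for real CIDs. It is not how `dagger` would actually lay out a graph.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// Format is a hypothetical parsed format string. Only a fixed chunk size is
// modelled here; a real format would also pin index layout, hashing, etc.
type Format struct {
	ChunkSize int
}

// parseFormat parses strings of the made-up form "fixed-<bytes>".
func parseFormat(s string) (Format, error) {
	if !strings.HasPrefix(s, "fixed-") {
		return Format{}, fmt.Errorf("unknown format %q", s)
	}
	n, err := strconv.Atoi(strings.TrimPrefix(s, "fixed-"))
	if err != nil || n <= 0 {
		return Format{}, fmt.Errorf("bad chunk size in %q", s)
	}
	return Format{ChunkSize: n}, nil
}

// hashHex returns the hex SHA-256 of b (a stand-in for a real CID).
func hashHex(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// buildDAG returns node hashes in order: every leaf chunk first, root last.
func buildDAG(data []byte, f Format) []string {
	var out []string
	var links []byte
	for start := 0; start < len(data); start += f.ChunkSize {
		end := start + f.ChunkSize
		if end > len(data) {
			end = len(data)
		}
		h := hashHex(data[start:end])
		out = append(out, h)
		links = append(links, h...)
	}
	// The root commits to the ordered list of leaf hashes.
	return append(out, hashHex(links))
}

func main() {
	f, err := parseFormat("fixed-4")
	if err != nil {
		panic(err)
	}
	for _, h := range buildDAG([]byte("hello world"), f) {
		fmt.Println(h)
	}
}
```

Running it twice with the same format string prints the same hashes; changing the chunk size changes the root, which is why the format string has to be recorded alongside the root for later verification.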

#### `dagger/dagify --only-cid --only-root`

This tool will have an `--only-cid` flag that outputs only the cids:

```
> dagger -fmt <fmt-string> -r foo/bar/baz --only-cid
<ipld-object-cid>
<ipld-object-cid>
<ipld-object-cid>
<ipld-object-cid>
<ipld-object-cid>
```

And an `--only-root` flag that returns only the last (root) object or cid.

```
> dagger -fmt <fmt-string> -r foo/bar/baz --only-root
<last-ipld-object>

> dagger -fmt <fmt-string> -r foo/bar/baz --only-cid --only-root
<last-ipld-cid>
```
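How the two flags combine can be sketched as a filter over the in-order node list, whose last element is the root. The `Node` type and `render` function are hypothetical names for illustration, not part of any existing tool.

```go
package main

import "fmt"

// Node is a stand-in for an IPLD object: its CID plus a serialized body.
type Node struct {
	CID  string
	Body string
}

// render applies the two flags to an in-order node list whose last
// element is the root, and returns the lines that would be printed.
func render(nodes []Node, onlyCID, onlyRoot bool) []string {
	if onlyRoot && len(nodes) > 0 {
		nodes = nodes[len(nodes)-1:]
	}
	lines := make([]string, 0, len(nodes))
	for _, n := range nodes {
		if onlyCID {
			lines = append(lines, n.CID)
		} else {
			lines = append(lines, n.Body)
		}
	}
	return lines
}

func main() {
	nodes := []Node{
		{CID: "cid-leaf-1", Body: "<leaf 1>"},
		{CID: "cid-leaf-2", Body: "<leaf 2>"},
		{CID: "cid-root", Body: "<root>"},
	}
	fmt.Println(render(nodes, true, false)) // every cid
	fmt.Println(render(nodes, false, true)) // root object only
	fmt.Println(render(nodes, true, true))  // root cid only
}
```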


### `ipfs-pack` filesystem packing tool

The idea is that `ipfs-pack` is a filesystem packing tool that establishes the notion of a bundle, bag, or "pack" of files. We use `pack` to avoid confusing it with a Bag from BagIt, a very similar format (that `ipfs-pack` is compatible with). The way "packs" work is this:

- There MUST BE a _pack root_ directory that defines the pack. (eg at `<path-to-pack-root>/`) It contains all the pack contents and represents the pack in a filesystem.
- There MUST BE a _pack manifest_ file that tracks the ipfs hashes of the pack contents. (`<pack-root>/PackManifest`)
- There MAY BE a _pack object database_ cache file or directory that stores metadata on all the ipld objects in the pack. This is ancillary and can be reconstructed from a _pack root_ at any time.


#### Subcommands

```
> ipfs-pack -h
USAGE
    ipfs-pack <subcommand> <arguments>

SUBCOMMANDS
    make     makes the package, overwriting the ipfs-pack manifest file.
    verify   verifies the ipfs-pack manifest file is correct.
    db       creates (or updates) a temporary ipfs object database `.ipfs-pack/db`
    serve    starts an ipfs node serving the pack's contents (to IPFS and/or HTTP).
    bag      create BagIt spec-compliant bag from a pack.
    car      create a `.car` certified archive from a pack.
```

#### Usage Example

```
> pwd
/home/jbenet/myPack

> ls
someJSON.json
someXML.xml
moreData/

> ipfs-pack make
> ipfs-pack make -v
wrote PackManifest

> ls
someJSON.json
someXML.xml
moreData/
PackManifest

> cat PackManifest
QmVP2aaAWFe21QjUujMw5hwYRKD1eGx3yYWEBbMtuxpqXs moreData/0
QmV7eDE2WXuwQnvccsoXSzK5CQGXdFfay1LSadZCwyfbDV moreData/1
QmaMY7h9pmTcA5w9S2dsQT5eGLEQ1CwYQ32HwMTXAev5gQ moreData/2
QmQjYU5PscpCHadDbL1fDvTK4P9eXirSwD8hzJbAyrd5mf moreData/3
QmRErwActoLmffucXq7HPtefBC19MjWUcj1DdBoaAnMm6p moreData/4
QmeWvL929Tdhzw27CS5ZVHD73NQ9TT1xvLvCaXCgi7a9YB moreData/5
QmXbzZeh44jJEUueWjFxEiLcfAfzoaKYEy1fMHygkSD3hm moreData/6
QmYL17nYZrZsAhJut5v7ooD9hmz2rBotC1tqC9ZPxzCfer moreData/7
QmPKkidoUYX12PyCuKzehQuhEJofUJ9PPaX2Gc2iYd4GRs moreData/8
QmQAubXA3Gji5v5oaJhMbvmbGbiuwDf1u9sYsN125mcqrn moreData/9
QmYbYduoHMZAUMB5mjHoJHgJ9WndrdWkTCzuQ6yHkbgqkU someJSON.json
QmeWiZD5cdyiJoS3b7h87Cs9G21uQ1sLmeKrunTae9h5qG someXML.xml
QmVizQ5fUceForgWogbb2m2v5RRrE8xEm8uSkbkyNB4Rdm moreData
QmZ7iEGqahTHdUWGGZMUxYRXPwSM3UjBouneLcCmj9e6q6 .

> ipfs-pack db make
> ipfs-pack db make -v
wrote .ipfs-pack/db

> ls -a
./
../
.ipfs-pack/
someJSON.json
someXML.xml
moreData/
PackManifest

> find .ipfs-pack/
.ipfs-pack/
.ipfs-pack/db
```

#### `ipfs-pack make` create (or update) a pack manifest

This command creates (or updates) the pack's manifest file.

```
ipfs-pack make
# wrote PackManifest
```
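As a rough sketch of what producing the manifest involves: walk the pack root, hash every file, hash every directory from its children, and emit `<hash> <path>` lines with the root (`.`) last. The assumptions here are mine: hex SHA-256 stands in for real ipfs hashes, children are listed before their parent directory, and `PackManifest` and `.ipfs-pack` are skipped so the manifest does not describe itself. The in-memory filesystem only keeps the example runnable without touching disk.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io/fs"
	"path"
	"testing/fstest"
)

// hashHex returns the hex SHA-256 of b (a stand-in for an ipfs hash).
func hashHex(b []byte) string {
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// hashDir appends one manifest line per entry under dir (children first),
// then a line for dir itself, and returns dir's hash.
func hashDir(fsys fs.FS, dir string, out *[]string) (string, error) {
	entries, err := fs.ReadDir(fsys, dir) // sorted by name
	if err != nil {
		return "", err
	}
	h := sha256.New()
	for _, e := range entries {
		// Assumption: the manifest and the db cache are not part of the pack.
		if e.Name() == "PackManifest" || e.Name() == ".ipfs-pack" {
			continue
		}
		p := path.Join(dir, e.Name())
		var sum string
		if e.IsDir() {
			sum, err = hashDir(fsys, p, out)
			if err != nil {
				return "", err
			}
		} else {
			data, err := fs.ReadFile(fsys, p)
			if err != nil {
				return "", err
			}
			sum = hashHex(data)
			*out = append(*out, sum+" "+p)
		}
		// A directory's hash commits to each child's name and hash.
		fmt.Fprintf(h, "%s %s\n", sum, e.Name())
	}
	dirSum := hex.EncodeToString(h.Sum(nil))
	*out = append(*out, dirSum+" "+dir)
	return dirSum, nil
}

// buildManifest returns the manifest lines for the pack rooted at fsys.
func buildManifest(fsys fs.FS) ([]string, error) {
	var out []string
	_, err := hashDir(fsys, ".", &out)
	return out, err
}

// sampleFS is an in-memory pack used only for this example.
func sampleFS() fs.FS {
	return fstest.MapFS{
		"someJSON.json": &fstest.MapFile{Data: []byte(`{"a":1}`)},
		"moreData/0":    &fstest.MapFile{Data: []byte("zero")},
		"PackManifest":  &fstest.MapFile{Data: []byte("stale")},
	}
}

func main() {
	lines, err := buildManifest(sampleFS())
	if err != nil {
		panic(err)
	}
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

To run it against a real directory, `os.DirFS("/path/to/pack")` could be passed to `buildManifest` instead of `sampleFS()`.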

#### `ipfs-pack verify` checks whether a pack matches its manifest

This command checks whether a pack matches its `PackManifest`.

```
# errors when there is no manifest
> random-files foo
> cd foo
> ipfs-pack verify
error: no PackManifest found

# succeeds when manifest and pack match
> ipfs-pack make
> ipfs-pack verify

# errors when manifest and pack do not match
> echo "QmVizQ5fUceForgWogbb2m2v5RRrE8xEm8uSkbkyNB4Rdm non-existent-file1" >>PackManifest
> echo "QmVizQ5fUceForgWogbb2m2v5RRrE8xEm8uSkbkyNB4Rdm non-existent-file2" >>PackManifest
> touch non-manifest-file3
> ipfs-pack verify
error: in manifest, missing from pack: non-existent-file1
error: in manifest, missing from pack: non-existent-file2
error: in pack, missing from manifest: non-manifest-file3
```
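The path comparison behind those error messages can be sketched as a two-way set difference. This is only the path check; a real `verify` would also re-hash each file and compare against the manifest entry. The function name is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// verifyPaths compares the paths listed in the manifest with the paths
// found in the pack and returns one error line per mismatch.
func verifyPaths(inManifest, inPack []string) []string {
	manifest := make(map[string]bool, len(inManifest))
	for _, p := range inManifest {
		manifest[p] = true
	}
	pack := make(map[string]bool, len(inPack))
	for _, p := range inPack {
		pack[p] = true
	}

	var missingFromPack, missingFromManifest []string
	for p := range manifest {
		if !pack[p] {
			missingFromPack = append(missingFromPack, p)
		}
	}
	for p := range pack {
		if !manifest[p] {
			missingFromManifest = append(missingFromManifest, p)
		}
	}
	// Map iteration order is random, so sort for stable output.
	sort.Strings(missingFromPack)
	sort.Strings(missingFromManifest)

	var errs []string
	for _, p := range missingFromPack {
		errs = append(errs, "error: in manifest, missing from pack: "+p)
	}
	for _, p := range missingFromManifest {
		errs = append(errs, "error: in pack, missing from manifest: "+p)
	}
	return errs
}

func main() {
	manifest := []string{"data.bin", "non-existent-file1", "non-existent-file2"}
	pack := []string{"data.bin", "non-manifest-file3"}
	for _, e := range verifyPaths(manifest, pack) {
		fmt.Println(e)
	}
}
```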

#### `ipfs-pack db` creates (or updates) a temporary ipfs object database

This command creates (or updates) a temporary ipfs object database (eg at `.ipfs-pack/db`). This object database contains positional metadata for all IPLD objects contained in the pack. (It follows the ipfs repo filestore metadata concerns). It MAY be a different, simpler object-db format, or be a full-fledged ipfs node repo using filestore.

The db is a simple key-value store that:

- maps `{ <ipld-cid> : <filestore-descriptor> }`
- supports `list() []<ipld-cid>` to show all cids in the db
- supports `put(<ipld-object>) <ipld-cid>`
- supports `get(<ipld-cid>) <ipld-object>`
- supports `putDescriptor(<ipld-cid>, <filestore-descriptor>)`
- supports `getDescriptor(<ipld-cid>) <filestore-descriptor>`
- supports `delete()` to remove itself from disk

Notes:

- `<filestore-descriptor>` is the metadata necessary to reconstruct the entire object from data in the pack.
- `{get,put}` should be able to add or retrieve the objects from db or from the data in the pack.
- `{get,put}Descriptor` should be able to add or retrieve file descriptors for objects stored in the pack.
- Intermediate ipld objects (eg intermediate objects in a file, which are not raw data nodes) may need to be stored in the db.
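For a raw data node, the descriptor could be as small as a path plus a byte range. The sketch below assumes exactly that shape (`Path`, `Offset`, `Size`), which is a guess and not the filestore's actual format; `Pack` and `memPack` are hypothetical helpers that keep the example in memory.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// Descriptor is an assumed shape for a filestore descriptor: where in the
// pack the bytes of a raw object live.
type Descriptor struct {
	Path   string
	Offset int64
	Size   int64
}

// Pack gives random access to files in a pack (hypothetical interface).
type Pack interface {
	Open(path string) (io.ReaderAt, error)
}

// memPack is an in-memory Pack used only for this example.
type memPack map[string]string

func (p memPack) Open(path string) (io.ReaderAt, error) {
	data, ok := p[path]
	if !ok {
		return nil, fmt.Errorf("not in pack: %s", path)
	}
	return strings.NewReader(data), nil
}

// readObject reconstructs an object's bytes from the pack data that the
// descriptor points at.
func readObject(p Pack, d Descriptor) ([]byte, error) {
	r, err := p.Open(d.Path)
	if err != nil {
		return nil, err
	}
	buf, err := io.ReadAll(io.NewSectionReader(r, d.Offset, d.Size))
	if err != nil {
		return nil, err
	}
	if int64(len(buf)) != d.Size {
		return nil, fmt.Errorf("short read: want %d bytes, got %d", d.Size, len(buf))
	}
	return buf, nil
}

func main() {
	pack := memPack{"moreData/0": "0123456789"}
	obj, err := readObject(pack, Descriptor{Path: "moreData/0", Offset: 3, Size: 4})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", obj)
}
```

Because the bytes stay in the original file, nothing is duplicated into the db; only intermediate objects that are not raw data would need to be stored there.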

This database basically implements:
```go
type PackObjectDB interface {
  // Make creates or updates a pack-db at packdbPath,
  // with data for all the objects in the pack at packPath.
  Make(packPath string, packdbPath string) error

  // Put associates the given filestore.Descriptor with the given ipld.CID.
  // If the filestore.Descriptor is nil, Put removes the entry for ipld.CID (rm).
  Put(ipld.CID, filestore.Descriptor) error

  // Get retrieves the filestore.Descriptor associated with the given ipld.CID.
  Get(ipld.CID) (filestore.Descriptor, error)

  // List returns all ipld.CID stored in the database
  List() (<-chan ipld.CID, error)

  // Delete deletes all the database contents and clears all files
  Delete() error
}
```
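A minimal in-memory sketch of the `Put`/`Get`/`List`/`Delete` part of that interface might look like the following. `CID` and `Descriptor` are local stand-ins (not the real `ipld` or `filestore` types), `Make` is left out because it depends on the dag-building step, and a real pack-db would persist to `.ipfs-pack/db` rather than live in a map.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// CID and Descriptor are stand-ins for ipld.CID and filestore.Descriptor.
type CID string

type Descriptor struct {
	Path   string
	Offset int64
	Size   int64
}

// memDB is an in-memory pack object db (illustration only).
type memDB struct {
	mu      sync.Mutex
	entries map[CID]*Descriptor
}

func newMemDB() *memDB {
	return &memDB{entries: make(map[CID]*Descriptor)}
}

// Put stores d under c; a nil descriptor removes the entry.
func (db *memDB) Put(c CID, d *Descriptor) error {
	db.mu.Lock()
	defer db.mu.Unlock()
	if d == nil {
		delete(db.entries, c)
		return nil
	}
	db.entries[c] = d
	return nil
}

// Get returns the descriptor stored under c.
func (db *memDB) Get(c CID) (*Descriptor, error) {
	db.mu.Lock()
	defer db.mu.Unlock()
	d, ok := db.entries[c]
	if !ok {
		return nil, fmt.Errorf("cid not in pack-db: %s", c)
	}
	return d, nil
}

// List sends every stored CID, in sorted order, on a closed channel.
func (db *memDB) List() (<-chan CID, error) {
	db.mu.Lock()
	keys := make([]string, 0, len(db.entries))
	for c := range db.entries {
		keys = append(keys, string(c))
	}
	db.mu.Unlock()
	sort.Strings(keys)
	ch := make(chan CID, len(keys))
	for _, k := range keys {
		ch <- CID(k)
	}
	close(ch)
	return ch, nil
}

// Delete drops every entry.
func (db *memDB) Delete() error {
	db.mu.Lock()
	defer db.mu.Unlock()
	db.entries = make(map[CID]*Descriptor)
	return nil
}

func main() {
	db := newMemDB()
	_ = db.Put("cidA", &Descriptor{Path: "moreData/0", Offset: 0, Size: 4})
	_ = db.Put("cidB", &Descriptor{Path: "someJSON.json", Offset: 0, Size: 7})
	ch, _ := db.List()
	for c := range ch {
		d, _ := db.Get(c)
		fmt.Println(c, d.Path, d.Offset, d.Size)
	}
}
```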

It does so both through a programmatic interface (some go package) and via cli tooling:

```
> ipfs-pack-db --help
USAGE
    ipfs-pack-db <subcommand> <arguments>

SUBCOMMANDS
    make     creates (or updates) the pack-db for a pack directory
    list     lists all cids in the pack-db
    put      adds a (cid, filestore-descriptor) entry.
    get      retrieves the filestore-descriptor for a given cid.
    delete   removes all files representing the pack-db (destructive)
```


#### `ipfs-pack serve` starts an ipfs node serving the pack's contents (to IPFS and/or HTTP).

This command starts an ipfs node serving the pack's contents (to IPFS and/or HTTP). This command MAY require a full go-ipfs installation to exist. It MAY be a standalone binary (`ipfs-pack-serve`). It MUST use an ephemeral node or a one-off node whose id would be stored locally, in the pack, at `<pack-root>/.ipfs-pack/repo`.

```
> ipfs-pack serve --http
Serving pack at /ip4/0.0.0.0/tcp/1234/http - http://127.0.0.1:1234

> ipfs-pack serve --ipfs
Serving pack at /ip4/0.0.0.0/tcp/1234/ipfs/QmPVUA4rJgckcf1ifrZF5KvwV1Uib5SGjJ7Z5BskEpTaSE
```

#### `ipfs-pack bag` convert to and from BagIt (spec-compliant) bags.

This command converts between packs and BagIt (spec-compliant) bags, a commonly used [archiving format](https://tools.ietf.org/html/draft-kunze-bagit-06#section-2.1.3) very similar to `ipfs-pack`. It works like this:


```
> ipfs-pack bag --help
USAGE
  ipfs-pack-bag <src-pack> <dst-bag>
  ipfs-pack-bag <src-bag> <dst-pack>

# convert from pack to bag
> ipfs-pack bag path/to/mypack path/to/mybag

# convert from bag to pack
> ipfs-pack bag path/to/mybag path/to/mypack
```

#### `ipfs-pack car` convert to and from a car (certified archive).

This command converts between packs and cars (certified archives). It works like this:


```
> ipfs-pack car --help
USAGE
  ipfs-pack-car <src-pack> <dst-car>
  ipfs-pack-car <src-car> <dst-pack>

# convert from pack to car
> ipfs-pack car path/to/mypack path/to/mycar.car

# convert from car to pack
> ipfs-pack car path/to/mycar.car path/to/mypack
```

### `datadex` or maybe `gx-dataset`

WIP

A tool to prepare and publish a dataset as an ipfs-pack. It guides the user to add dataset metadata and license info, and publishes to a registry.

### `car` - certified archives

WIP

Cars would interoperate with packs.

### The `ipfs repo filestore` 

WIP

Maybe the `ipfs repo filestore` abstractions can leverage `ipfs-packs` to understand what is being tracked in a given directory, particularly if those packs have up-to-date local dbs of all their objects.

