23 changes: 16 additions & 7 deletions README.md
@@ -51,6 +51,9 @@ MDC_DOWNLOAD_PATH=~/.mozdata/datasets # change to where you want to download dat
client.get_dataset('mdc-dataset-id')
```

> [!TIP]
> You can find the `mdc-dataset-id` by looking at the URL of the dataset's page on the MDC platform. The ID is the unique string of characters at the very end of the URL, after the `/datasets/` path. For example, for the URL `https://datacollective.mozillafoundation.org/datasets/cmflnuzw6lrt9e6ui4kwcshvn`, the dataset ID is `cmflnuzw6lrt9e6ui4kwcshvn`.
Contributor: maybe "after /datasets/ in the path"


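If you want to extract the ID programmatically, a minimal sketch using plain string handling (no MDC-specific API assumed) looks like this:

```python
# Sketch: pull the dataset ID out of an MDC dataset page URL.
# Assumes the ID is everything after the last "/datasets/" segment.
url = "https://datacollective.mozillafoundation.org/datasets/cmflnuzw6lrt9e6ui4kwcshvn"
dataset_id = url.rstrip("/").rsplit("/datasets/", 1)[-1]
print(dataset_id)  # cmflnuzw6lrt9e6ui4kwcshvn
```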
## Configuration

The client loads configuration from environment variables or `.env` files:
@@ -70,7 +73,8 @@ MDC_API_URL=https://datacollective.mozillafoundation.org/api
MDC_DOWNLOAD_PATH=~/.mozdata/datasets
```

**Note:** Never commit `.env` files to version control as they contain sensitive information.
> [!TIP]
> Never commit `.env` files to version control as they contain sensitive information.
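
If you prefer not to keep a `.env` file at all, the same settings can be exported from Python before the client is created. This is only a sketch using the variable names shown above; whether the client re-reads them at construction time is an assumption here.

```python
import os
from datacollective import DataCollective

# Sketch: set the documented variables in the process environment
# (same names as the .env example above) before constructing the client.
os.environ["MDC_API_URL"] = "https://datacollective.mozillafoundation.org/api"
os.environ["MDC_DOWNLOAD_PATH"] = os.path.expanduser("~/.mozdata/datasets")

client = DataCollective()  # assumed to pick up the environment configuration
```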

## Basic Usage

@@ -90,18 +94,23 @@ dataset = client.get_dataset('your-dataset-id')

## Load and query datasets

**note:** today, this feature only works with Mozilla Common Voice datasets
```
> [!NOTE]
> Today, this feature only works with Mozilla Common Voice datasets.

```python
from datacollective import DataCollective

client = DataCollective()

dataset = client.load_dataset("<dataset-id>") # Load dasaset into memory
df = dataset.to_pandas() # Convert to pandas for queryable form
dataset.splits # A list of all splits available in the dataset
# Load dataset into memory
dataset = client.load_dataset("<dataset-id>")
# Convert to pandas for queryable form
df = dataset.to_pandas()
# A list of all splits available in the dataset
dataset.splits
>> ["dev", "train"]
```
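
The available columns depend on the dataset, so the sketch below sticks to generic pandas inspection calls rather than assuming any particular schema:

```python
# Sketch: query the loaded data with ordinary pandas operations.
print(df.columns.tolist())  # which fields this dataset exposes
print(len(df))              # number of rows loaded
print(df.head(3))           # peek at the first few records
```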


## Multiple Environments

You can use different environment configurations: