
Conversation

filipecosta90

This dataset is based on the LAION-400-MILLION OPEN DATASET.

We use the CLIP embeddings as input to produce the required hdf5 files.
The CLIP embeddings are stored in NPY files. Each NPY file holds 1M samples and uses around 1 GB of disk space. There are 400 such files in total.

By using create_laion_ds.py we can specify the train and test sizes, up to 400M vectors.
By default, it produces a 1M train set and a 10K test set.
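For reference, a minimal sketch of how the NPY shards could be assembled into an hdf5 train/test split. The shard file names, dataset keys, and output file name below are assumptions for illustration, not the exact layout produced by create_laion_ds.py:

import numpy as np
import h5py

# Hypothetical shard names; each NPY shard holds ~1M 512-d CLIP embeddings.
SHARDS = [f"img_emb_{i}.npy" for i in range(2)]
TRAIN_SIZE, TEST_SIZE = 1_000_000, 10_000  # the defaults mentioned above

# Concatenate just enough shards to cover train + test.
emb = np.concatenate([np.asarray(np.load(p, mmap_mode="r")) for p in SHARDS])
emb = emb[: TRAIN_SIZE + TEST_SIZE]

# Hypothetical output name and dataset keys.
with h5py.File("laion-400m-clip512.hdf5", "w") as f:
    f.create_dataset("train", data=emb[:TRAIN_SIZE])
    f.create_dataset("test", data=emb[TRAIN_SIZE:])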

From requirements.txt:

scikit-learn
jinja2==2.10.1
pandas==1.1.5
datasets==2.14.5


There are 2 packages missing from the requirements.txt file:

  • click
  • wget



def calc(bf, test, neighbors, distances, count):
    Parallel(n_jobs=multiprocessing.cpu_count(), require="sharedmem")(


While creating hdf5 files I encountered an issue:

train size:    100000 *  512
test size:      10000 *  512
0/10000...
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
…
Segmentation fault (core dumped)

The system I was using is Ubuntu 22.04 with a cpu_count of 144 and Python 3.8.10. I was able to work around this error by hardcoding n_jobs to 64.
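One way to make that workaround explicit in the script would be to cap the degree of parallelism rather than hardcoding it. This is only a sketch: the MAX_JOBS value of 64 is simply what worked on this machine, and some_work is a hypothetical placeholder for the per-query ground-truth computation done inside calc():

import multiprocessing
from joblib import Parallel, delayed

# Cap joblib workers so OpenBLAS does not run out of memory regions
# on very large machines (144 CPUs in the report above).
MAX_JOBS = 64  # value taken from the workaround described above
n_jobs = min(multiprocessing.cpu_count(), MAX_JOBS)

def some_work(i):
    # Placeholder for the per-query work; stands in for calc()'s body.
    return i * i

results = Parallel(n_jobs=n_jobs, require="sharedmem")(
    delayed(some_work)(i) for i in range(n_jobs)
)

Another common mitigation is to pin BLAS to a single thread per worker (e.g. setting OPENBLAS_NUM_THREADS=1) so the worker count and the BLAS thread count do not multiply.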

