
Conversation

filipecosta90

This dataset is based on the LAION-400-MILLION OPEN DATASET.

We use the CLIP embeddings as input to produce the required hdf5 files.
The CLIP embeddings are stored in NPY files. Each NPY file holds 1M samples and uses around 1 GB of disk space. There are 400 such files in total.

By using create_laion_ds.py we can specify the train and test sizes, up to 400M vectors.
By default, it produces a 1M train set and a 10K test set.
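For reference, a minimal sketch of how the NPY shards could be assembled into an hdf5 train/test split. The shard file names, dataset keys, and output file name below are assumptions for illustration, not the exact layout produced by create_laion_ds.py:

import numpy as np
import h5py

# Hypothetical shard names; each NPY shard holds ~1M 512-d CLIP embeddings.
SHARDS = [f"img_emb_{i}.npy" for i in range(2)]
TRAIN_SIZE, TEST_SIZE = 1_000_000, 10_000  # the defaults mentioned above

# Concatenate just enough shards to cover train + test.
emb = np.concatenate([np.asarray(np.load(p, mmap_mode="r")) for p in SHARDS])
emb = emb[: TRAIN_SIZE + TEST_SIZE]

# Hypothetical output name and dataset keys.
with h5py.File("laion-400m-clip512.hdf5", "w") as f:
    f.create_dataset("train", data=emb[:TRAIN_SIZE])
    f.create_dataset("test", data=emb[TRAIN_SIZE:])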

From requirements.txt:

scikit-learn
jinja2==2.10.1
pandas==1.1.5
datasets==2.14.5


There are 2 packages missing from the requirements.txt file:

  • click
  • wget



def calc(bf, test, neighbors, distances, count):
    Parallel(n_jobs=multiprocessing.cpu_count(), require="sharedmem")(


While creating hdf5 files I encountered an issue:

train size:    100000 *  512
test size:      10000 *  512
0/10000...
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
BLAS : Program is Terminated. Because you tried to allocate too many memory regions.
…
Segmentation fault (core dumped)

The system I was using is Ubuntu 22.04 with a cpu_count of 144 and Python 3.8.10. I was able to work around this error by hardcoding n_jobs to 64.
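One way to make that workaround explicit in the script would be to cap the degree of parallelism rather than hardcoding it. This is only a sketch: the MAX_JOBS value of 64 is simply what worked on this machine, and some_work is a hypothetical placeholder for the per-query ground-truth computation done inside calc():

import multiprocessing
from joblib import Parallel, delayed

# Cap joblib workers so OpenBLAS does not run out of memory regions
# on very large machines (144 CPUs in the report above).
MAX_JOBS = 64  # value taken from the workaround described above
n_jobs = min(multiprocessing.cpu_count(), MAX_JOBS)

def some_work(i):
    # Placeholder for the per-query work; stands in for calc()'s body.
    return i * i

results = Parallel(n_jobs=n_jobs, require="sharedmem")(
    delayed(some_work)(i) for i in range(n_jobs)
)

Another common mitigation is to pin BLAS to a single thread per worker (e.g. setting OPENBLAS_NUM_THREADS=1) so the worker count and the BLAS thread count do not multiply.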

