
Update redisearch base client to include timeout. Minor improvements on general latency and exported metrics #1


Open · wants to merge 225 commits into base: master

Commits (225)
17ef49e
Update redisearch base client to include timeout. Extended latency me…
filipecosta90 Jul 4, 2023
1e8e719
Included batch_size and parallel info in upload results
filipecosta90 Jul 4, 2023
084a68a
Enable passing env variables for url and api_key qdrant parameters
filipecosta90 Jul 5, 2023
d890fa1
Print ef info if available in config
filipecosta90 Jul 5, 2023
dcdfe21
waiting on full index before return qdrant
filipecosta90 Jul 5, 2023
a1ba1b1
Revert "waiting on full index before return qdrant"
filipecosta90 Jul 5, 2023
87b0330
Added REDIS_QUERY_TIMEOUT config parameter
filipecosta90 Jul 10, 2023
435038a
Introduced MILVUS_USER and MILVUS_PASS env vars
filipecosta90 Jul 11, 2023
f4e49be
Introduce MILVUS_PORT env variable
filipecosta90 Jul 11, 2023
42133d0
Simplify milvus port settings
filipecosta90 Jul 11, 2023
ee7cb9b
Added missing milvus env variables on configure file
filipecosta90 Jul 11, 2023
dfefb49
specify milvus uri when http is detected
filipecosta90 Jul 11, 2023
9766207
wrap milvus upload_batch with backoff
filipecosta90 Jul 11, 2023
9ad6022
Enabled using the API_KEY of weaviate and detecting secure connection…
filipecosta90 Jul 11, 2023
4e1408a
Only printing ef in search parameters if it's available
filipecosta90 Jul 11, 2023
824f17a
Simplified weviate client construct
filipecosta90 Jul 11, 2023
31b2a18
fix NameError on weviate search.py
filipecosta90 Jul 11, 2023
909594d
Update weviate client to include multi-tenant schema check fix
filipecosta90 Jul 11, 2023
42fc051
Specify batch_size (and reduce from 64 to 32) on milvus configs.
filipecosta90 Jul 11, 2023
228b10d
Updating milvus upload params from 16 to 4 concurrent clients doing b…
filipecosta90 Jul 12, 2023
55f91c2
Cleanup on milvus client
filipecosta90 Jul 12, 2023
9641080
Change recall computation to be as ann-benchmarks'
alonre24 Jul 12, 2023
666de49
Merge branch 'update.redisearch' of https://github.com/filipecosta90/…
alonre24 Jul 12, 2023
e108874
Reducing weaviate batch size due to recurrent ingest errors. ensuring…
filipecosta90 Jul 12, 2023
60360ad
Reducing milvus batch size to ensure a steady ingestion
filipecosta90 Jul 12, 2023
832fc86
Increasing timeout config for weaviate. Reduce parallel count on inge…
filipecosta90 Jul 12, 2023
f8b4b7a
Added single client weviate config
filipecosta90 Jul 12, 2023
c10df6d
Added single client weviate config
filipecosta90 Jul 12, 2023
32b4c6e
Revert "Change recall computation to be as ann-benchmarks'"
filipecosta90 Jul 12, 2023
0559dd4
Included single segment config for qdrant
filipecosta90 Jul 13, 2023
e93f273
Verbose info about collection cleaning. Checking if setting is now op…
filipecosta90 Jul 13, 2023
9504e1f
Setting default_segment_number in qdrant-single-node-single-segment
filipecosta90 Jul 13, 2023
0b1f9ee
Remove config assert on qdrant
filipecosta90 Jul 13, 2023
cb01f60
Reverted milvus to original settings
filipecosta90 Jul 14, 2023
64042e7
Bumping weaviate to the latest stable version (1.20.1)
filipecosta90 Jul 17, 2023
3b699b5
Bumping qdrant to latest stable: qdrant/qdrant:v1.3.2
filipecosta90 Jul 17, 2023
575e811
Updated weaviate config timeout
filipecosta90 Jul 17, 2023
03a1829
Using Milvus latest stable v2.2.11
filipecosta90 Jul 18, 2023
f7caa32
Increase qdrant single node configs to avoid timeouts and ensure all …
filipecosta90 Jul 18, 2023
6aabb6d
Added option to run solely specific parallel count
filipecosta90 Jul 18, 2023
bf3300d
changes for running hybrid benchmarks in redis
alonre24 Jul 20, 2023
1769f94
Add patch to support arxiv-titles dataset
alonre24 Jul 26, 2023
1634988
Merge branch 'update.redisearch' into update.redisearch-hybrid_benchm…
filipecosta90 Aug 9, 2023
a01ccbd
Merge pull request #2 from filipecosta90/update.redisearch-hybrid_ben…
filipecosta90 Aug 9, 2023
75d7700
Enabling key prefix ingestion on Redis. More control over clean stage…
filipecosta90 Aug 11, 2023
a7f1c9d
Ensure search results are properly processed when using REDIS_KEY_PREFIX
filipecosta90 Aug 11, 2023
a7c3f32
Ensure that the indexed payload fields are sortable
filipecosta90 Sep 3, 2023
3ebdb3c
Merge remote-tracking branch 'origin/master' into update.redisearch
filipecosta90 Sep 8, 2023
e2f6059
Ensuring when metadata is None we can run the benchmark
filipecosta90 Sep 8, 2023
e2b0c16
Updated client to use REDIS_AUTH
filipecosta90 Sep 11, 2023
d5c9703
Updated client to use REDIS_AUTH
filipecosta90 Sep 11, 2023
6f3aa7c
Merge remote-tracking branch 'origin/master' into update.redisearch
filipecosta90 Oct 19, 2023
1323bf5
Added support of OSS Cluster API
filipecosta90 Oct 26, 2023
bc1ec1a
Add 1M,10M,20M,40M,100M,200M laion datasets
mpozniak95 Nov 6, 2023
904c42c
Merge pull request #3 from mpozniak95/add_laion_dataset
filipecosta90 Nov 9, 2023
70cafeb
Fixed the distance on laion datasets angular==cosine
filipecosta90 Nov 9, 2023
6d8ef9f
Merge remote-tracking branch 'origin/master' into update.redisearch
filipecosta90 Nov 9, 2023
64497f5
Included LAION-400 100K vectors dataset. Showing progress bar on down…
filipecosta90 Nov 9, 2023
1f67607
Merge branch 'master' into update.redisearch
filipecosta90 Nov 22, 2023
94499c1
Fixed sorting order on the search query
filipecosta90 Nov 22, 2023
7772d15
Merge branch 'update.redis' into update.redisearch
filipecosta90 Nov 22, 2023
bcd5057
Ensuring every oss shard has ft.create/ft.drop
filipecosta90 Nov 30, 2023
5992b73
Ensuring every oss shard has ft.create/ft.drop
filipecosta90 Dec 1, 2023
8b41d8a
Ensuring every oss shard has ft.create/ft.drop
filipecosta90 Dec 1, 2023
886dd08
Add laion 400M dataset to datasets.json file
mpozniak95 Dec 6, 2023
c20ee3f
Merge pull request #4 from mpozniak95/add_400M_laion_dataset
filipecosta90 Dec 6, 2023
dfed4ce
Applied black to changed files
filipecosta90 Dec 7, 2023
285d7e7
Merge branch 'update.redis' into update.redisearch
filipecosta90 Dec 7, 2023
4d6ba07
Using random primary node on search_one for Redis
filipecosta90 Dec 17, 2023
96536c5
Using random primary node on search_one for Redis
filipecosta90 Dec 17, 2023
139ee86
Merge branch 'update.redis' into update.redisearch
filipecosta90 Dec 17, 2023
b19349b
Include context on how we distribute search query load on redis in ca…
filipecosta90 Dec 17, 2023
afd73b2
Merge remote-tracking branch 'origin/update.redis' into update.redise…
filipecosta90 Dec 21, 2023
fb2ecc7
Include a large scale redis config
filipecosta90 Dec 21, 2023
ef8bf08
Updated laion-400m angular to cosine reference on dataset
filipecosta90 Jan 25, 2024
3cb1987
Merge remote-tracking branch 'origin/master' into update.redisearch
filipecosta90 Feb 7, 2024
10d7260
Enabled passing postgres connection details
filipecosta90 Feb 7, 2024
fc32ba8
reducing changes to sync with origin
filipecosta90 Feb 15, 2024
2e79be8
General improvements. Extended exposed metrics in results file. Added…
filipecosta90 Feb 15, 2024
b338e95
Merge branch 'general.improvements' into update.redisearch
filipecosta90 Feb 15, 2024
49b6b0d
General improvements. Extended exposed metrics in results file. Added…
filipecosta90 Feb 15, 2024
358484f
Fix: in case on multiple multiple datasets using the same config para…
filipecosta90 Feb 15, 2024
8ec0387
Fix: in case on multiple multiple datasets using the same config para…
filipecosta90 Feb 15, 2024
55a8219
Fix: in case on multiple multiple datasets using the same config para…
filipecosta90 Feb 15, 2024
497aa90
Enabled api key elastic connections
filipecosta90 Mar 4, 2024
beb119f
Extra logging on which auth method we use on elastic
filipecosta90 Mar 4, 2024
100bf97
Enabled api key elastic connections
filipecosta90 Mar 4, 2024
6549035
Fixes per pre-commit hook
filipecosta90 Mar 4, 2024
3f314d8
popping the parallel config from a deep copy of search_params
filipecosta90 Mar 4, 2024
738d98b
Merge branch 'elastic.cloud' into update.redisearch
filipecosta90 Mar 4, 2024
8b652f6
Merge remote-tracking branch 'qdrant/master' into update.redisearch
filipecosta90 Mar 13, 2024
213e6eb
Merge remote-tracking branch 'qdrant/master' into elastic.cloud
filipecosta90 Mar 21, 2024
33bb5cc
Merge branch 'elastic.cloud' into update.redisearch
filipecosta90 Mar 21, 2024
bef25e6
ensuring client connection is available
filipecosta90 Mar 21, 2024
8b6a0c1
dsiabled urlib warnings
filipecosta90 Mar 21, 2024
b682396
dsiabled urlib warnings
filipecosta90 Mar 21, 2024
f8bb088
Disable deprecation warnings on weaviate
filipecosta90 Mar 21, 2024
cc8a9b6
Disable deprecation warnings on weaviate
filipecosta90 Mar 21, 2024
6e194ab
Properly handle Api Error from Elastic
filipecosta90 Mar 22, 2024
c6cdf57
Included elastic index timeout
filipecosta90 Mar 22, 2024
172208b
waiting for ES green status
filipecosta90 Mar 22, 2024
eb28dfb
waiting for ES green status
filipecosta90 Mar 22, 2024
99ef832
Fixed weaviate deprecation warning for apikey
filipecosta90 Mar 22, 2024
738a991
Included repetitions
filipecosta90 Mar 23, 2024
cead3e1
Included repetitions
filipecosta90 Mar 23, 2024
1dcb421
waiting for yellow state in ES
filipecosta90 Mar 23, 2024
145ce7b
Ensuring a proper clean DB at start for memorydb and memorystore. Ens…
filipecosta90 Mar 28, 2024
0f3e576
Add laion-img-emb-768-1G-cosine to datasets
mpozniak95 Apr 3, 2024
77b4ca0
Extracting memory info from redis compatible DBs
filipecosta90 Apr 9, 2024
e31c73c
Included simple plot script
filipecosta90 Apr 11, 2024
1879049
Allowing to specify ALGO
filipecosta90 Apr 19, 2024
8cbc9d4
avoiding call to ft.info on IVF indices
filipecosta90 Apr 19, 2024
f628e32
Adjusted search parameters for non-hnsw queries
filipecosta90 Apr 25, 2024
da9e926
adjusting search parameters for non-hnsw runs
filipecosta90 May 2, 2024
bdd305c
included ivf experiments
filipecosta90 May 6, 2024
fd76a9b
Added specific experiment configurations for IVF and IVF-PQ
filipecosta90 May 8, 2024
a8a9a48
Added specific experiment configurations for IVF and IVF-PQ
filipecosta90 May 8, 2024
9fa1f37
Added specific experiment configurations for IVF and IVF-PQ
filipecosta90 May 8, 2024
b3a0f4e
Added specific experiment configurations for IVF and IVF-PQ
filipecosta90 May 8, 2024
8c222d9
Fixed location of algorithm in search_params
filipecosta90 May 8, 2024
313b5c0
Fixed IVF experiment creation script
filipecosta90 May 9, 2024
5ed2110
Added the ability to query GPU status on redis client (included wrapp…
filipecosta90 May 9, 2024
83210a2
opensearch improvements
filipecosta90 May 16, 2024
b7a2c30
Fixes per PR linter
filipecosta90 May 16, 2024
26b5f8c
Added redis-intel-hnsw* config
filipecosta90 May 28, 2024
ec516a0
Retrieving cloud usage when possible from qdrant
filipecosta90 Jun 7, 2024
f928095
Fixed qdrant get_collection usage()
filipecosta90 Jun 7, 2024
bce646f
Fixed qdrant get_collection usage()
filipecosta90 Jun 7, 2024
4d5ddfb
Fixed qdrant get_collection info()
filipecosta90 Jun 7, 2024
daac642
Added timeout parameter to collection_params. Included all combinatio…
filipecosta90 Jun 11, 2024
562bf89
Included all combinations for redis m=[16,32,64] and ef=[128,256,512]
filipecosta90 Jun 11, 2024
4fe7c00
Added EF-64 variation for Redis
filipecosta90 Jun 12, 2024
56021e8
Enable optimization threads config on qdrant
filipecosta90 Jun 12, 2024
f3d54f7
included retry wrapper on recreate_collection
filipecosta90 Jun 12, 2024
bac6112
Included 200 and 400 client variations on redis
filipecosta90 Jun 14, 2024
47c9d66
Included m=4 and m=10 configs for redis
filipecosta90 Jun 19, 2024
8371d26
Fixed search params setting on redis
filipecosta90 Jun 19, 2024
b137ba0
Fixed search params setting on redis
filipecosta90 Jun 19, 2024
d025ab9
Added m-64-ef-* changes
filipecosta90 Jun 19, 2024
570d7c9
Added m-64-ef-* changes
filipecosta90 Jun 19, 2024
c0ecb6f
Added m-64-ef-* changes
filipecosta90 Jun 19, 2024
3d45ecb
Added m-64-ef-* changes
filipecosta90 Jun 19, 2024
2e39d77
Added m-16-ef-* changes
filipecosta90 Jun 19, 2024
6ef1e94
Added m-16-ef-* changes
filipecosta90 Jun 19, 2024
1dd987b
Added support for BFLOAT16 and FLOAT16 in Redis
filipecosta90 Jul 8, 2024
6e6da52
Added FLOAT64 data type for redis. fixed search: TypeError: data type…
filipecosta90 Jul 9, 2024
aa48d4e
Added FLOAT64 data type for redis. fixed search: TypeError: data type…
filipecosta90 Jul 9, 2024
410527f
Include float64 redis data
filipecosta90 Jul 9, 2024
78d447a
Included extra EF_CONSTRUCT 32, and extra EF_RUNTIME 16, 32
filipecosta90 Jul 11, 2024
5be637a
Included extra EF_CONSTRUCT 32, and extra EF_RUNTIME 16, 32
filipecosta90 Jul 11, 2024
1403f06
temporary glove-100 index size
filipecosta90 Jul 11, 2024
a3284f8
Added more lower M configs for redis variations
filipecosta90 Jul 12, 2024
fd57559
Added FLOAT16 intel config
filipecosta90 Jul 16, 2024
f5efe5b
m-16-ef-64 for milvus
filipecosta90 Jul 18, 2024
a0c6ec8
Varying M/EF for intel/laion setup
filipecosta90 Jul 24, 2024
47c08c9
Add REDIS_KEEP_DOCUMENTS and --only-configure flag
mpozniak95 Aug 7, 2024
4a566f1
Merge pull request #6 from mpozniak95/add_1B_dataset
filipecosta90 Sep 24, 2024
4d1b39c
Added h5-multi file reader
filipecosta90 Sep 29, 2024
7e140c5
Fixed positional arg issue
filipecosta90 Oct 2, 2024
2d8e250
Fixed some typos on loader script
filipecosta90 Oct 2, 2024
6f50710
using pipeline 1 and batch size 1 on CE upload
filipecosta90 Oct 2, 2024
f9e4e2d
Fixed laion-1b queries and dimension
filipecosta90 Oct 3, 2024
c0a5ad3
control the number of concurrent screen sessions on the 1b loader
filipecosta90 Oct 3, 2024
1448b72
split hdf5 file into chunks for reads with less overhead
filipecosta90 Oct 3, 2024
0b01eb4
5 clients per loader on CE loader
filipecosta90 Oct 3, 2024
5e3b711
Using 50 concurrent loaders on laion-1b
filipecosta90 Oct 3, 2024
88552eb
saving outputs of each loader
filipecosta90 Oct 3, 2024
74d0cbb
Merge branch 'update.redisearch' into keep_documents
filipecosta90 Oct 3, 2024
aa701aa
Merge pull request #8 from mpozniak95/keep_documents
filipecosta90 Oct 3, 2024
dca16d5
Adjust code for re-indexing
filipecosta90 Oct 3, 2024
7d53dab
increase client count per runner
filipecosta90 Oct 3, 2024
23e2301
updated upload runner
filipecosta90 Oct 3, 2024
38fb0c6
updated upload runner
filipecosta90 Oct 3, 2024
6ba28d4
using hsetnx to speedup load in case of vector present
filipecosta90 Oct 3, 2024
e3386be
logging missing keys
filipecosta90 Oct 4, 2024
946ff60
Setting REDIS_KEEP_DOCUMENTS=1 as default
filipecosta90 Oct 4, 2024
32b3843
Setting REDIS_KEEP_DOCUMENTS=1 as default
filipecosta90 Oct 4, 2024
2617cb4
Added MAX_QUERIES feature
filipecosta90 Oct 12, 2024
03b742b
fixed MAX_QUERIES
filipecosta90 Oct 12, 2024
353e125
Added REDIS_JUST_INDEX config to avoid resending duplicate data on th…
filipecosta90 Oct 13, 2024
db1896d
Add *args, **kwargs to read data method in reader interface (#10)
mpozniak95 Oct 15, 2024
2f9753a
If boto3 raises no credentials exception use urlib for downloading th…
mpozniak95 Dec 12, 2024
a47d6d3
Update dockerfile and add single case scenario test (#12)
mpozniak95 Dec 12, 2024
aa9e3e2
vector sets
DvirDukhan Mar 19, 2025
6e2b112
git ignore vevn
DvirDukhan Mar 20, 2025
4195d16
fixed fp32 experiment file
DvirDukhan Mar 20, 2025
a1765b9
run script
DvirDukhan Mar 20, 2025
c7ca3e3
Updated weaviate client to use grpc
filipecosta90 Mar 26, 2025
8f033f0
using WEAVIATE GRPC and HTTP port configs
filipecosta90 Mar 26, 2025
99b72f6
ensuring the __del__ method only uses client when it's defined on wea…
filipecosta90 Mar 26, 2025
2c405b9
Updated upload and search steps on weaviate client to match latest cl…
filipecosta90 Mar 26, 2025
41d91bb
Updated upload and search steps on weaviate client to match latest cl…
filipecosta90 Mar 26, 2025
3b91047
use uuids on ingest
filipecosta90 Mar 26, 2025
92b4c84
Allow specifying the hybrid policy
filipecosta90 Mar 31, 2025
e4d9390
catching 2.10 error message. setting default to flushall to mimic com…
filipecosta90 Mar 31, 2025
0e9f9ad
fixed missing } on hybrid policy
filipecosta90 Mar 31, 2025
4aef017
fixed missing } on hybrid policy
filipecosta90 Mar 31, 2025
92dee85
fixed missing } on hybrid policy
filipecosta90 Mar 31, 2025
312547d
adding laion dataset'
Apr 7, 2025
0c36709
adding laion dataset 1M
Apr 7, 2025
3b5e2c9
chunk up the iterable before starting the processes
Apr 7, 2025
92b4ddb
replace arbitrary 15 sec wait with barrier
slice4e Apr 7, 2025
deaf5ab
fixed chunk to correct size
slice4e Apr 8, 2025
4c1d080
implemented custom process management , instead of using pool
slice4e Apr 8, 2025
0d513c2
measure time only during the critical work
slice4e Apr 8, 2025
12868aa
Merge branch 'update.redisearch' into fix-sync
slice4e Apr 9, 2025
2c592a0
Add itertools before islice function
mpozniak95 Apr 9, 2025
83e3f3e
update vset tests
DvirDukhan Apr 21, 2025
2f1a6fb
adding q8
DvirDukhan Apr 21, 2025
ba175b1
Merge pull request #17 from redis-performance/dvirdu_vectorsets
fcostaoliveira Apr 22, 2025
85a6bc7
Merge pull request #16 from mpozniak95/fix-sync
fcostaoliveira Apr 22, 2025
20cea00
Revert "Merge pull request #16 from mpozniak95/fix-sync"
filipecosta90 Apr 22, 2025
8a21508
Optimize client imports to only load engines specified in command line
fcostaoliveira May 6, 2025
075c5c0
Add --queries option to specify number of queries to run
fcostaoliveira May 6, 2025
a8d26cd
Add --ef-runtime option to filter search experiments by ef values
fcostaoliveira May 6, 2025
0209d91
cd /home/fco/redislabs/vector-db-benchmark && git status
Apr 7, 2025
534de8c
Restore performance optimizations from PR #16 (85a6bc7)
fcostaoliveira May 6, 2025
71c1f3b
Add real-time progress bar with throughput display for parallel execu…
fcostaoliveira May 6, 2025
2eda5b7
Add test script for multiprocessing with progress bar
fcostaoliveira May 6, 2025
6127e84
Optimize query cycling for large query counts using generators
fcostaoliveira May 6, 2025
343e906
Update test script to test query cycling optimization
fcostaoliveira May 6, 2025
25ca5ac
Merge pull request #18 from redis-performance/restore-performance-opt…
fcostaoliveira May 6, 2025
7640df0
Add SVS support (#23)
mpozniak95 Jun 12, 2025
fd59cd8
Add reporting memory usage to vectorsets upload, fix running vectorse…
mpozniak95 Jun 12, 2025
878f46d
Record start time before query processing starts (#21)
mihaic Jun 13, 2025
3 changes: 3 additions & 0 deletions .gitignore
@@ -6,3 +6,6 @@ NOTES.md

results/*
tools/custom/data.json

*.png
venv/
5 changes: 5 additions & 0 deletions Dockerfile
@@ -8,6 +8,9 @@ ENV PYTHONFAULTHANDLER=1 \
PIP_DEFAULT_TIMEOUT=100 \
POETRY_VERSION=1.5.1

RUN apt update
RUN apt install -y wget

RUN pip install "poetry==$POETRY_VERSION"

# Copy only requirements to cache them in docker layer
@@ -21,5 +24,7 @@ RUN poetry config virtualenvs.create false \
# Creating folders, and files for a project:
COPY . /code

RUN pip install "boto3"

CMD ["python"]

15 changes: 15 additions & 0 deletions README.md
@@ -16,6 +16,21 @@ scenario against which it should be tested. A specific scenario may assume
running the server in a single or distributed mode, a different client
implementation and the number of client instances.

## Data sets

We have a number of precomputed data sets. All data sets have been pre-split into train/test and include ground truth data for the top-100 nearest neighbors.

| Dataset | Dimensions | Train size | Test size | Neighbors | Distance |
| ----------------------------------------------------------------------------------------------------------- | ---------: | ---------: | --------: | --------: | --------- |
| [LAION-1M: subset of LAION 400M English (image embeddings)](https://laion.ai/blog/laion-400-open-dataset/) | 512 | 1,000,000 | 10,000 | 100 | Angular |
| [LAION-10M: subset of LAION 400M English (image embeddings)](https://laion.ai/blog/laion-400-open-dataset/) | 512 | 10,000,000 | 10,000 | 100 | Angular |
| [LAION-20M: subset of LAION 400M English (image embeddings)](https://laion.ai/blog/laion-400-open-dataset/) | 512 | 20,000,000 | 10,000 | 100 | Angular |
| [LAION-40M: subset of LAION 400M English (image embeddings)](https://laion.ai/blog/laion-400-open-dataset/) | 512 | 40,000,000 | 10,000 | 100 | Angular |
| [LAION-100M: subset of LAION 400M English (image embeddings)](https://laion.ai/blog/laion-400-open-dataset/) | 512 | 100,000,000 | 10,000 | 100 | Angular |
| [LAION-200M: subset of LAION 400M English (image embeddings)](https://laion.ai/blog/laion-400-open-dataset/) | 512 | 200,000,000 | 10,000 | 100 | Angular |
| [LAION-400M: from LAION 400M English (image embeddings)](https://laion.ai/blog/laion-400-open-dataset/) | 512 | 400,000,000 | 10,000 | 100 | Angular |
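Because every dataset ships its ground-truth top-100 neighbor ids, recall can be computed directly from the engine's results. A minimal sketch with synthetic stand-in arrays (numpy only; the real harness reads the `neighbors` array from the HDF5 files instead):

```python
import numpy as np

# Synthetic stand-ins: 5 queries, ground-truth top-10 ids, and fake engine results.
rng = np.random.default_rng(0)
ground_truth = rng.choice(1000, size=(5, 10), replace=False)  # plays the role of "neighbors"
results = ground_truth.copy()
results[:, -2:] = 1000 + results[:, -2:]  # engine "misses" 2 of 10 ids per query

def recall_at_k(truth: np.ndarray, found: np.ndarray, k: int) -> float:
    """Fraction of the true top-k ids that the engine returned, averaged over queries."""
    hits = sum(len(set(t[:k]) & set(f[:k])) for t, f in zip(truth, found))
    return hits / (k * len(truth))

print(recall_at_k(ground_truth, results, k=10))  # 0.8
```

With 8 of the 10 true ids returned per query, recall@10 comes out to 0.8.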


## How to run a benchmark?

Benchmarks are implemented in server-client mode, meaning that the server is
247 changes: 214 additions & 33 deletions benchmark/dataset.py
@@ -3,13 +3,17 @@
import tarfile
import urllib.request
from dataclasses import dataclass, field
from typing import Dict, Optional

from typing import Dict, List, Optional, Union
import boto3
import botocore.exceptions
from benchmark import DATASETS_DIR
from dataset_reader.ann_compound_reader import AnnCompoundReader
from dataset_reader.ann_h5_reader import AnnH5Reader
from dataset_reader.ann_h5_multi_reader import AnnH5MultiReader
from dataset_reader.base_reader import BaseReader
from dataset_reader.json_reader import JSONReader
from tqdm import tqdm
from pathlib import Path


@dataclass
@@ -18,59 +22,236 @@ class DatasetConfig:
distance: str
name: str
type: str
path: str
link: Optional[str] = None
path: Dict[
str, List[Dict[str, str]]
] # Now path is expected to handle multi-file structure for h5-multi
link: Optional[Dict[str, List[Dict[str, str]]]] = None
schema: Optional[Dict[str, str]] = field(default_factory=dict)


READER_TYPE = {"h5": AnnH5Reader, "jsonl": JSONReader, "tar": AnnCompoundReader}
READER_TYPE = {
"h5": AnnH5Reader,
"h5-multi": AnnH5MultiReader,
"jsonl": JSONReader,
"tar": AnnCompoundReader,
}


# Progress bar for urllib downloads
def show_progress(block_num, block_size, total_size):
percent = round(block_num * block_size / total_size * 100, 2)
print(f"{percent} %", end="\r")


# Progress handler for S3 downloads
class S3Progress(tqdm):
def __init__(self, total_size):
super().__init__(
total=total_size, unit="B", unit_scale=True, desc="Downloading from S3"
)

def __call__(self, bytes_amount):
self.update(bytes_amount)


class Dataset:
def __init__(self, config: dict):
def __init__(
self,
config: dict,
skip_upload: bool,
skip_search: bool,
upload_start_idx: int,
upload_end_idx: int,
):
self.config = DatasetConfig(**config)
self.skip_upload = skip_upload
self.skip_search = skip_search
self.upload_start_idx = upload_start_idx
self.upload_end_idx = upload_end_idx

def download(self):
target_path = DATASETS_DIR / self.config.path
if isinstance(self.config.path, dict): # Handle multi-file datasets
if self.skip_search is False:
# Download query files
for query in self.config.path.get("queries", []):
self._download_file(query["path"], query["link"])
else:
print(
f"skipping to download query file given skip_search={self.skip_search}"
)
if self.skip_upload is False:
# Download data files
for data in self.config.path.get("data", []):
start_idx = data["start_idx"]
end_idx = data["end_idx"]
data_path = data["path"]
data_link = data["link"]
if self.upload_start_idx >= end_idx:
print(
f"skipping downloading {data_path} from {data_link} given {self.upload_start_idx}>{end_idx}"
)
continue
if self.upload_end_idx < start_idx:
print(
f"skipping downloading {data_path} from {data_link} given {self.upload_end_idx}<{start_idx}"
)
continue
self._download_file(data["path"], data["link"])
else:
print(
f"skipping to download data/upload files given skip_upload={self.skip_upload}"
)

else: # Handle single-file datasets
target_path = DATASETS_DIR / self.config.path

if target_path.exists():
print(f"{target_path} already exists")
return

if self.config.link:
downloaded_withboto = False
if is_s3_link(self.config.link):
print("Use boto3 to download from S3. Faster!")
try:
self._download_from_s3(self.config.link, target_path)
downloaded_withboto = True
except botocore.exceptions.NoCredentialsError:
print("Credentials not found, downloading without boto3")
if not downloaded_withboto:
print(f"Downloading from URL {self.config.link}...")
tmp_path, _ = urllib.request.urlretrieve(
self.config.link, None, show_progress
)
self._extract_or_move_file(tmp_path, target_path)

def _download_file(self, relative_path: str, url: str):
target_path = DATASETS_DIR / relative_path
if target_path.exists():
print(f"{target_path} already exists")
return

if self.config.link:
print(f"Downloading {self.config.link}...")
tmp_path, _ = urllib.request.urlretrieve(self.config.link)
print(f"Downloading from {url} to {target_path}")
tmp_path, _ = urllib.request.urlretrieve(url, None, show_progress)
self._extract_or_move_file(tmp_path, target_path)

if self.config.link.endswith(".tgz") or self.config.link.endswith(
".tar.gz"
):
print(f"Extracting: {tmp_path} -> {target_path}")
(DATASETS_DIR / self.config.path).mkdir(exist_ok=True, parents=True)
file = tarfile.open(tmp_path)
def _extract_or_move_file(self, tmp_path, target_path):
if tmp_path.endswith(".tgz") or tmp_path.endswith(".tar.gz"):
print(f"Extracting: {tmp_path} -> {target_path}")
(DATASETS_DIR / self.config.path).mkdir(exist_ok=True, parents=True)
with tarfile.open(tmp_path) as file:
file.extractall(target_path)
file.close()
os.remove(tmp_path)
else:
print(f"Moving: {tmp_path} -> {target_path}")
(DATASETS_DIR / self.config.path).parent.mkdir(exist_ok=True)
shutil.copy2(tmp_path, target_path)
os.remove(tmp_path)
os.remove(tmp_path)
else:
print(f"Moving: {tmp_path} -> {target_path}")
Path(target_path).parent.mkdir(exist_ok=True)
shutil.copy2(tmp_path, target_path)
os.remove(tmp_path)

def _download_from_s3(self, link, target_path):
s3 = boto3.client("s3")
bucket_name, s3_key = parse_s3_url(link)
tmp_path = f"/tmp/{os.path.basename(s3_key)}"

print(
f"Downloading from S3: {link}... bucket_name={bucket_name}, s3_key={s3_key}"
)
object_info = s3.head_object(Bucket=bucket_name, Key=s3_key)
total_size = object_info["ContentLength"]

with open(tmp_path, "wb") as f:
progress = S3Progress(total_size)
s3.download_fileobj(bucket_name, s3_key, f, Callback=progress)

self._extract_or_move_file(tmp_path, target_path)

def get_reader(self, normalize: bool) -> BaseReader:
reader_class = READER_TYPE[self.config.type]
return reader_class(DATASETS_DIR / self.config.path, normalize=normalize)

if self.config.type == "h5-multi":
# For h5-multi, we need to pass both data files and query file
data_files = self.config.path["data"]
for data_file_dict in data_files:
data_file_dict["path"] = DATASETS_DIR / data_file_dict["path"]
query_file = DATASETS_DIR / self.config.path["queries"][0]["path"]
return reader_class(
data_files=data_files,
query_file=query_file,
normalize=normalize,
skip_upload=self.skip_upload,
skip_search=self.skip_search,
)
else:
# For single-file datasets
return reader_class(DATASETS_DIR / self.config.path, normalize=normalize)


def is_s3_link(link):
return link.startswith("s3://") or "s3.amazonaws.com" in link


def parse_s3_url(s3_url):
if s3_url.startswith("s3://"):
s3_parts = s3_url.replace("s3://", "").split("/", 1)
bucket_name = s3_parts[0]
s3_key = s3_parts[1] if len(s3_parts) > 1 else ""
else:
s3_parts = s3_url.replace("http://", "").replace("https://", "").split("/", 1)

if ".s3.amazonaws.com" in s3_parts[0]:
bucket_name = s3_parts[0].split(".s3.amazonaws.com")[0]
s3_key = s3_parts[1] if len(s3_parts) > 1 else ""
else:
bucket_name = s3_parts[0]
s3_key = s3_parts[1] if len(s3_parts) > 1 else ""

return bucket_name, s3_key
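The two URL forms this helper accepts can be illustrated with a standalone copy of the parsing logic (duplicated here so the snippet runs on its own; the canonical version is the `parse_s3_url` above in `benchmark/dataset.py`):

```python
def parse_s3_url(s3_url):
    # s3://bucket/key form
    if s3_url.startswith("s3://"):
        parts = s3_url.replace("s3://", "").split("/", 1)
        return parts[0], parts[1] if len(parts) > 1 else ""
    # virtual-hosted-style http(s)://bucket.s3.amazonaws.com/key form
    parts = s3_url.replace("http://", "").replace("https://", "").split("/", 1)
    host = parts[0]
    key = parts[1] if len(parts) > 1 else ""
    if ".s3.amazonaws.com" in host:
        return host.split(".s3.amazonaws.com")[0], key
    # Note: path-style s3.amazonaws.com/bucket/key falls through to here,
    # so the whole host would be treated as the bucket name.
    return host, key

print(parse_s3_url("s3://benchmarks.redislabs/vecsim/laion-1b/queries.hdf5"))
print(parse_s3_url("http://benchmarks.redislabs.s3.amazonaws.com/vecsim/laion-1b/queries.hdf5"))
# both print ('benchmarks.redislabs', 'vecsim/laion-1b/queries.hdf5')
```

Both spellings of the same object resolve to the same `(bucket, key)` pair, which is what lets `_download_from_s3` work regardless of how the dataset link is written in `datasets.json`.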


if __name__ == "__main__":
dataset = Dataset(
dataset_s3_split = Dataset(
{
"name": "glove-25-angular",
"vector_size": 25,
"distance": "Cosine",
"type": "h5",
"path": "glove-25-angular/glove-25-angular.hdf5",
"link": "http://ann-benchmarks.com/glove-25-angular.hdf5",
}
"name": "laion-img-emb-768d-1Billion-cosine",
"vector_size": 768,
"distance": "cosine",
"type": "h5-multi",
"path": {
"data": [
{
"file_number": 1,
"path": "laion-1b/data/laion-img-emb-768d-1Billion-cosine-data-part1-0_to_10000000.hdf5",
"link": "http://benchmarks.redislabs.s3.amazonaws.com/vecsim/laion-1b/laion-img-emb-768d-1Billion-cosine-data-part1-0_to_10000000.hdf5",
"vector_range": "0-10000000",
"file_size": "30.7 GB",
},
{
"file_number": 2,
"path": "laion-1b/data/laion-img-emb-768d-1Billion-cosine-data-part10-90000000_to_100000000.hdf5",
"link": "http://benchmarks.redislabs.s3.amazonaws.com/vecsim/laion-1b/laion-img-emb-768d-1Billion-cosine-data-part10-90000000_to_100000000.hdf5",
"vector_range": "90000000-100000000",
"file_size": "30.7 GB",
},
{
"file_number": 3,
"path": "laion-1b/data/laion-img-emb-768d-1Billion-cosine-data-part100-990000000_to_1000000000.hdf5",
"link": "http://benchmarks.redislabs.s3.amazonaws.com/vecsim/laion-1b/laion-img-emb-768d-1Billion-cosine-data-part100-990000000_to_1000000000.hdf5",
"vector_range": "990000000-1000000000",
"file_size": "30.7 GB",
},
],
"queries": [
{
"path": "laion-1b/laion-img-emb-768d-1Billion-cosine-queries.hdf5",
"link": "http://benchmarks.redislabs.s3.amazonaws.com/vecsim/laion-1b/laion-img-emb-768d-1Billion-cosine-queries.hdf5",
"file_size": "38.7 MB",
},
],
},
},
skip_upload=True,
skip_search=False,
)

dataset.download()
dataset_s3_split.download()
reader = dataset_s3_split.get_reader(normalize=False)
print(reader) # Outputs the AnnH5MultiReader instance