Commit e3a97e9

Author: Sam Partee
Full Fledged CLI and Library (#1)
Move to Library + CLI approach

While a generic script is useful, there was a lot of opportunity to stop duplicating the same functions we've written many times. Split the functions in the script out into modules and introduce the library structure.

The expectation for saved datasets has now been changed to require that a column in a pickled dataframe already contain byte vectors. This removes the need for the user to specify a column for the vector field. The goal is to eventually have helper functions in the library to make this easy.

Lastly, a bunch of project helpers and standard Python project items were added.
1 parent c803620 commit e3a97e9

29 files changed: 820 additions, 349 deletions
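
The commit message says saved datasets must now carry a column of ready-made byte vectors. A minimal sketch of what preparing such a column could involve, assuming float32 values (matching `datatype: "float32"` in the README schema) and native byte order. The helper names `to_float32_bytes` and `from_float32_bytes` and the example rows are hypothetical and not part of redisvl:

```python
from array import array


def to_float32_bytes(values):
    """Pack a sequence of numbers into raw float32 bytes (native byte order)."""
    return array("f", values).tobytes()


def from_float32_bytes(blob):
    """Inverse of to_float32_bytes, handy for sanity checks."""
    out = array("f")
    out.frombytes(blob)
    return list(out)


# Hypothetical rows: an "id" key column, a tag column, and the byte-vector column.
rows = [
    {"id": 1, "categories": "cs|math", "vector": to_float32_bytes([0.0, 0.5, 1.0])},
    {"id": 2, "categories": "physics", "vector": to_float32_bytes([1.0, 0.25, 0.0])},
]

# Each float32 takes 4 bytes, so a 3-dimensional vector is 12 bytes long.
assert all(len(row["vector"]) == 3 * 4 for row in rows)
```

Rows like these could then be placed in a pandas DataFrame and saved with `DataFrame.to_pickle`. With NumPy, `np.asarray(v, dtype=np.float32).tobytes()` should produce the same bytes.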

.gitignore

Lines changed: 2 additions & 1 deletion
@@ -1 +1,2 @@
-__pycache__/
+__pycache__/
+redisvl.egg-info/

Makefile

Lines changed: 73 additions & 0 deletions
@@ -0,0 +1,73 @@
+MAKEFLAGS += --no-print-directory
+
+# Do not remove this block. It is used by the 'help' rule when
+# constructing the help output.
+# help:
+# help: Developer Makefile
+# help:
+
+
+SHELL:=/bin/bash
+
+# help: help - display this makefile's help information
+.PHONY: help
+help:
+	@grep "^# help\:" Makefile | grep -v grep | sed 's/\# help\: //' | sed 's/\# help\://'
+
+
+# help:
+# help: Style
+# help: -------
+
+# help: style - Sort imports and format with black
+.PHONY: style
+style: sort-imports format
+
+
+# help: check-style - check code style compliance
+.PHONY: check-style
+check-style: check-sort-imports check-format
+
+
+# help: format - perform code style format
+.PHONY: format
+format:
+	@black ./redisvl ./tests/
+
+
+# help: sort-imports - apply import sort ordering
+.PHONY: sort-imports
+sort-imports:
+	@isort ./redisvl ./tests/ --profile black
+
+
+# help: check-lint - run static analysis checks
+.PHONY: check-lint
+check-lint:
+	@pylint --rcfile=.pylintrc ./redisvl
+
+
+# help:
+# help: Test
+# help: -------
+
+# help: test - Run all tests
+.PHONY: test
+test:
+	@python -m pytest
+
+# help: test-verbose - Run all tests verbosely
+.PHONY: test-verbose
+test-verbose:
+	@python -m pytest -vv -s
+
+# help: test-cov - Run all tests with coverage
+.PHONY: test-cov
+test-cov:
+	@python -m pytest -vv --cov=./redisvl
+
+# help: cov - generate html coverage report
+.PHONY: cov
+cov:
+	@coverage html
+	@echo if data was present, coverage report is in ./htmlcov/index.html
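
The `help` rule builds its output from the `# help:` comment lines with a `grep`/`sed` pipeline. As a rough illustration of what that pipeline does, here is a Python sketch of the same filtering. It is only an approximation for reading purposes; `extract_help` is a hypothetical name and nothing like it exists in the commit:

```python
import re


def extract_help(makefile_text):
    """Approximate `grep "^# help\\:" | grep -v grep | sed ... | sed ...`."""
    lines = []
    for line in makefile_text.splitlines():
        # First grep keeps lines starting with "# help:"; `grep -v grep`
        # then drops any line that mentions "grep".
        if line.startswith("# help:") and "grep" not in line:
            # The two sed calls strip "# help: " or, failing that, a bare "# help:".
            lines.append(re.sub(r"^# help: ?", "", line))
    return "\n".join(lines)


sample = (
    "# help:\n"
    "# help: Developer Makefile\n"
    "SHELL:=/bin/bash\n"
    "# help: test - Run all tests\n"
    "\t@python -m pytest\n"
)
print(extract_help(sample))
```

Keeping the help text next to each target, as the Makefile does, means a new target is documented by adding one `# help:` comment above it.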

README.md

Lines changed: 78 additions & 46 deletions
@@ -1,65 +1,97 @@
-# RediSearch Data Loader
-The purpose of this script is to assist in loading datasets to a RediSearch instance efficiently.
+# RedisVL
 
-The project is brand new and will undergo improvements over time.
+A CLI and Library to help with loading data into Redis specifically for
+usage with RediSearch and Redis Vector Search capabilities
 
-## Getting Started
+### Usage
 
-### Requirements
-Install the Python requirements listed in `requirements.txt`.
-
-```bash
-$ pip install -r requirements.txt
 ```
+usage: redisvl <command> [<args>]
 
-### Data
-In order to run the script you need to have a dataset that contains your vectors and metadata.
+Commands:
+  load     Load vector data into redis
+  index    Index manipulation (create, delete, etc.)
+  query    Query an existing index
 
->Currently, the data file must be a pickled pandas dataframe. Support for more data types will be included in future iterations.
+Redis Vector load CLI
 
-### Schema
-Along with the dataset, you must update the dataset schema for RediSearch in [`data/schema.py`](data/schema.py).
+positional arguments:
+  command     Subcommand to run
 
-### Running
-The `main.py` script provides an entrypoint with optional arguments to upload your dataset to a Redis server.
+optional arguments:
+  -h, --help  show this help message and exit
 
-#### Usage
-```
-python main.py
-
-  -h, --help            Show this help message and exit
-  --host                Redis host
-  -p, --port            Redis port
-  -a, --password        Redis password
-  -c , --concurrency    Amount of concurrency
-  -d , --data           Path to data file
-  --prefix              Key prefix for all hashes in the search index
-  -v , --vector         Vector field name in df
-  -i , --index          Index name
 ```
 
-#### Defaults
+For any of the above commands, you will need to have an index schema written
+into a yaml file for the cli to read. The format of the schema is as follows
+
+```yaml
+index:
+  name: sample # index name used for querying
+  storage_type: hash
+  key_field: "id" # column name to use for key in redis
+  prefix: vector # prefix used for all loaded docs
+
+# all fields to create index with
+# sub-items correspond to redis-py Field arguments
+fields:
+  tag:
+    categories: # name of a tag field used for queries
+      separator: "|"
+    year: # name of a tag field used for queries
+      separator: "|"
+  vector:
+    vector: # name of the vector field used for queries
+      datatype: "float32"
+      algorithm: "flat" # flat or HSNW
+      dims: 768
+      distance_metric: "cosine" # ip, L2, cosine
+```
 
-| Argument | Default |
-| ----------- | ----------- |
-| Host | `localhost` |
-| Port | `6379` |
-| Password | "" |
-| Concurrency | `50` |
-| Data (Path) | `data/embeddings.pkl` |
-| Prefix | `vector:` |
-| Vector (Field Name) | `vector` |
-| Index Name | `index` |
+#### Example Usage
 
+```bash
+# load in a pickled dataframe with
+redisvl load -s sample.yml -d embeddings.pkl
+```
 
-#### Examples
+```bash
+# load in a pickled dataframe to a specific address and port
+redisvl load -s sample.yml -d embeddings.pkl -h 127.0.0.1 -p 6379
+```
 
-Load to a local (default) redis server with a custom index name and with concurrency = 100:
 ```bash
-$ python main.py -d data/embeddings.pkl -i myIndex -c 100
+# load in a pickled dataframe to a specific
+# address and port and with password
+redisvl load -s sample.yml -d embeddings.pkl -h 127.0.0.1 -p 6379 -p supersecret
 ```
 
-Load to a cloud redis server with all other defaults:
+### Support
+
+#### Supported Index Fields
+
+- ``geo``
+- ``tag``
+- ``numeric``
+- ``vector``
+- ``text``
+#### Supported Data Types
+- Pandas DataFrame (pickled)
+#### Supported Redis Data Types
+- Hash
+- JSON (soon)
+
+### Install
+Install the Python requirements listed in `requirements.txt`.
+
 ```bash
-$ python main.py -h {redis-host} -p {redis-port} -a {redis-password}
-```
+git clone https://github.com/RedisVentures/data-loader.git
+cd redisvl
+pip install .
+```
+
+### Creating Input Data
+#### Pandas DataFrame
+
+more to come, see tests and sample-data for usage
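
To make the README's schema format more concrete, here is a rough sketch of how such a schema could translate into the tokens of a RediSearch `FT.CREATE` command. This is not taken from redisvl (the README says the sub-items map to redis-py Field arguments). The function `build_ft_create_args` and the plain-dict form of the schema are assumptions for illustration, and only the `tag` and `vector` field types from the example are handled. Note that RediSearch spells the graph-based algorithm `HNSW`, where the README comment has `HSNW`.

```python
# The README schema written as a plain dict (roughly what a YAML loader would return).
SCHEMA = {
    "index": {
        "name": "sample",
        "storage_type": "hash",
        "key_field": "id",
        "prefix": "vector",
    },
    "fields": {
        "tag": {
            "categories": {"separator": "|"},
            "year": {"separator": "|"},
        },
        "vector": {
            "vector": {
                "datatype": "float32",
                "algorithm": "flat",
                "dims": 768,
                "distance_metric": "cosine",
            },
        },
    },
}


def build_ft_create_args(schema):
    """Return an FT.CREATE command as a list of tokens (hypothetical helper)."""
    index = schema["index"]
    args = [
        "FT.CREATE", index["name"],
        "ON", index["storage_type"].upper(),
        "PREFIX", "1", index["prefix"],
        "SCHEMA",
    ]
    for name, opts in schema["fields"].get("tag", {}).items():
        args += [name, "TAG", "SEPARATOR", opts["separator"]]
    for name, opts in schema["fields"].get("vector", {}).items():
        attrs = [
            "TYPE", opts["datatype"].upper(),
            "DIM", str(opts["dims"]),
            "DISTANCE_METRIC", opts["distance_metric"].upper(),
        ]
        # VECTOR takes the algorithm, then the count of attribute tokens that follow.
        args += [name, "VECTOR", opts["algorithm"].upper(), str(len(attrs))] + attrs
    return args


print(" ".join(build_ft_create_args(SCHEMA)))
```

The `key_field` entry does not appear in the command: per the README it names the dataframe column used to build each Redis key, which matters at load time, not at index creation.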

conftest.py

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+import os
+import pytest
+
+from redisvl.utils.connection import get_async_redis_connection
+
+HOST = os.environ.get("REDIS_HOST", "localhost")
+PORT = os.environ.get("REDIS_PORT", 6379)
+USER = os.environ.get("REDIS_USER", "default")
+PASS = os.environ.get("REDIS_PASSWORD", "")
+
+@pytest.fixture
+def async_redis():
+    return get_async_redis_connection(HOST, PORT, PASS)
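
One detail in this fixture setup: `os.environ.get` returns a string whenever the variable is set, so `PORT` is the integer `6379` by default but a string such as `"6380"` when `REDIS_PORT` is exported. A small sketch of normalizing the value, assuming the connection helper would prefer an integer port (the diff does not show what `get_async_redis_connection` expects). `read_redis_settings` is a hypothetical name:

```python
import os


def read_redis_settings(env=None):
    """Collect Redis connection settings from an environment-style mapping."""
    env = os.environ if env is None else env
    return {
        "host": env.get("REDIS_HOST", "localhost"),
        # Cast so the port has the same type whether or not REDIS_PORT is set.
        "port": int(env.get("REDIS_PORT", 6379)),
        "user": env.get("REDIS_USER", "default"),
        "password": env.get("REDIS_PASSWORD", ""),
    }


settings = read_redis_settings({"REDIS_PORT": "6380"})
print(settings["host"], settings["port"])
```

Passing the mapping in as an argument also makes the defaults easy to check in a test without touching the real environment.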

data/schema.py

Lines changed: 0 additions & 23 deletions
This file was deleted.
