Full Fledged CLI and Library #1

Merged
merged 10 commits into from
Nov 11, 2022
3 changes: 2 additions & 1 deletion .gitignore
@@ -1 +1,2 @@
__pycache__/
__pycache__/
redisvl.egg-info/
73 changes: 73 additions & 0 deletions Makefile
@@ -0,0 +1,73 @@
MAKEFLAGS += --no-print-directory

# Do not remove this block. It is used by the 'help' rule when
# constructing the help output.
# help:
# help: Developer Makefile
# help:


SHELL:=/bin/bash

# help: help - display this makefile's help information
.PHONY: help
help:
	@grep "^# help\:" Makefile | grep -v grep | sed 's/\# help\: //' | sed 's/\# help\://'


# help:
# help: Style
# help: -------

# help: style - Sort imports and format with black
.PHONY: style
style: sort-imports format


# help: check-style - check code style compliance
.PHONY: check-style
check-style: check-sort-imports check-format


# help: format - perform code style format
.PHONY: format
format:
	@black ./redisvl ./tests/


# help: sort-imports - apply import sort ordering
.PHONY: sort-imports
sort-imports:
	@isort ./redisvl ./tests/ --profile black


# help: check-lint - run static analysis checks
.PHONY: check-lint
check-lint:
	@pylint --rcfile=.pylintrc ./redisvl


# help:
# help: Test
# help: -------

# help: test - Run all tests
.PHONY: test
test:
	@python -m pytest

# help: test-verbose - Run all tests verbosely
.PHONY: test-verbose
test-verbose:
	@python -m pytest -vv -s

# help: test-cov - Run all tests with coverage
.PHONY: test-cov
test-cov:
	@python -m pytest -vv --cov=./redisvl

# help: cov - generate html coverage report
.PHONY: cov
cov:
	@coverage html
	@echo if data was present, coverage report is in ./htmlcov/index.html
124 changes: 78 additions & 46 deletions README.md
@@ -1,65 +1,97 @@
# RediSearch Data Loader
The purpose of this script is to assist in loading datasets to a RediSearch instance efficiently.
# RedisVL

The project is brand new and will undergo improvements over time.
A CLI and library to help load data into Redis, specifically for
use with RediSearch and Redis Vector Search capabilities.
Review comment (Collaborator): We should probably link back to our docs!


## Getting Started
### Usage

### Requirements
Install the Python requirements listed in `requirements.txt`.

```bash
$ pip install -r requirements.txt
```
usage: redisvl <command> [<args>]

### Data
In order to run the script you need to have a dataset that contains your vectors and metadata.
Commands:
load Load vector data into redis
index Index manipulation (create, delete, etc.)
query Query an existing index

>Currently, the data file must be a pickled pandas dataframe. Support for more data types will be included in future iterations.
Redis Vector load CLI

### Schema
Along with the dataset, you must update the dataset schema for RediSearch in [`data/schema.py`](data/schema.py).
positional arguments:
command Subcommand to run

### Running
The `main.py` script provides an entrypoint with optional arguments to upload your dataset to a Redis server.
optional arguments:
-h, --help show this help message and exit

#### Usage
```
python main.py

-h, --help Show this help message and exit
--host Redis host
-p, --port Redis port
-a, --password Redis password
-c , --concurrency Amount of concurrency
-d , --data Path to data file
--prefix Key prefix for all hashes in the search index
-v , --vector Vector field name in df
-i , --index Index name
```
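
The `redisvl <command> [<args>]` layout described in the new usage text can be sketched with `argparse` subparsers. This is only an illustration and not the code from this PR: the `-s` and `-d` flags are taken from the example commands, the defaults come from the old defaults table, and the `--host` and `-a/--password` spellings are assumptions carried over from the old `main.py` options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the usage text; not the PR's implementation."""
    parser = argparse.ArgumentParser(
        prog="redisvl", description="Redis Vector load CLI"
    )
    # One subparser per command listed in the usage text.
    subparsers = parser.add_subparsers(dest="command", required=True)

    load = subparsers.add_parser("load", help="Load vector data into redis")
    load.add_argument("-s", "--schema", required=True, help="Path to the YAML index schema")
    load.add_argument("-d", "--data", required=True, help="Path to the pickled DataFrame")
    # argparse reserves -h for --help, so this sketch only offers the long --host form.
    load.add_argument("--host", default="localhost", help="Redis host")
    load.add_argument("-p", "--port", type=int, default=6379, help="Redis port")
    load.add_argument("-a", "--password", default="", help="Redis password (assumed flag)")

    subparsers.add_parser("index", help="Index manipulation (create, delete, etc.)")
    subparsers.add_parser("query", help="Query an existing index")
    return parser


args = build_parser().parse_args(
    ["load", "-s", "sample.yml", "-d", "embeddings.pkl", "-p", "6379"]
)
print(args.command, args.schema, args.data, args.host, args.port)
```

Giving each option its own flag letter avoids the situation where a repeated flag silently overwrites an earlier value, since `argparse` keeps only the last occurrence of a plain option.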

#### Defaults
For any of the above commands, you will need an index schema written
to a YAML file for the CLI to read. The format of the schema is as follows:

```yaml
index:
  name: sample # index name used for querying
  storage_type: hash
  key_field: "id" # column name to use for key in redis
  prefix: vector # prefix used for all loaded docs

# all fields to create index with
# sub-items correspond to redis-py Field arguments
fields:
  tag:
    categories: # name of a tag field used for queries
      separator: "|"
    year: # name of a tag field used for queries
      separator: "|"
  vector:
    vector: # name of the vector field used for queries
      datatype: "float32"
      algorithm: "flat" # flat or HNSW
      dims: 768
      distance_metric: "cosine" # ip, L2, cosine
```
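
To make the schema format more concrete, here is a rough sketch of how such a schema, once parsed into a Python dict, could map onto the arguments of a RediSearch `FT.CREATE` command. The function name `build_ft_create_args` is hypothetical, only `tag` and `vector` fields are handled, and the library in this PR goes through redis-py `Field` objects rather than building raw arguments, so treat this as an illustration of the mapping only.

```python
from __future__ import annotations


def build_ft_create_args(schema: dict) -> list[str]:
    """Translate a parsed schema dict into FT.CREATE arguments (illustrative only)."""
    index = schema["index"]
    args = [
        "FT.CREATE", index["name"],
        "ON", index["storage_type"].upper(),
        "PREFIX", "1", index["prefix"],
        "SCHEMA",
    ]
    fields = schema.get("fields", {})
    # Tag fields: <name> TAG SEPARATOR <char>
    for name, opts in fields.get("tag", {}).items():
        args += [name, "TAG", "SEPARATOR", opts.get("separator", ",")]
    # Vector fields: <name> VECTOR <algorithm> <attribute count> <attributes...>
    for name, opts in fields.get("vector", {}).items():
        attrs = [
            "TYPE", opts["datatype"].upper(),
            "DIM", str(opts["dims"]),
            "DISTANCE_METRIC", opts["distance_metric"].upper(),
        ]
        args += [name, "VECTOR", opts["algorithm"].upper(), str(len(attrs)), *attrs]
    return args


# The sample schema from above, written out as the dict a YAML parser would return.
schema = {
    "index": {"name": "sample", "storage_type": "hash", "key_field": "id", "prefix": "vector"},
    "fields": {
        "tag": {"categories": {"separator": "|"}, "year": {"separator": "|"}},
        "vector": {
            "vector": {
                "datatype": "float32",
                "algorithm": "flat",
                "dims": 768,
                "distance_metric": "cosine",
            }
        },
    },
}
ft_create_args = build_ft_create_args(schema)
print(" ".join(ft_create_args))
```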

| Argument | Default |
| ----------- | ----------- |
| Host | `localhost` |
| Port | `6379` |
| Password | "" |
| Concurrency | `50` |
| Data (Path) | `data/embeddings.pkl` |
| Prefix | `vector:` |
| Vector (Field Name) | `vector` |
| Index Name | `index` |
#### Example Usage

```bash
# load in a pickled dataframe using the schema in sample.yml
redisvl load -s sample.yml -d embeddings.pkl
```

#### Examples
```bash
# load in a pickled dataframe to a specific address and port
redisvl load -s sample.yml -d embeddings.pkl -h 127.0.0.1 -p 6379
```

Load to a local (default) redis server with a custom index name and with concurrency = 100:
```bash
$ python main.py -d data/embeddings.pkl -i myIndex -c 100
# load in a pickled dataframe to a specific
# address and port and with password
redisvl load -s sample.yml -d embeddings.pkl -h 127.0.0.1 -p 6379 -a supersecret
```

Load to a cloud redis server with all other defaults:
### Support

#### Supported Index Fields

- ``geo``
- ``tag``
- ``numeric``
- ``vector``
- ``text``
#### Supported Data Types
- Pandas DataFrame (pickled)
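
As a rough sketch of what a pickled DataFrame input could look like, the snippet below builds a few rows whose columns match the sample schema (`id` as the key field, `categories` and `year` as tag fields, `vector` as the vector field). The helper names `make_records` and `write_pickle` are hypothetical, the values are random placeholders, and storing each vector as a plain list of floats is an assumption; check the tests and sample data for the layout the loader actually expects.

```python
from __future__ import annotations

import random

DIMS = 768  # matches `dims` in the sample schema


def make_records(n: int, dims: int = DIMS, seed: int = 0) -> list[dict]:
    """Build n placeholder rows with the columns named in the sample schema."""
    rng = random.Random(seed)
    records = []
    for i in range(n):
        records.append(
            {
                "id": str(i),                   # key_field in the schema
                "categories": "sports|news",    # tag values joined by the "|" separator
                "year": str(2020 + i % 3),      # second tag field
                "vector": [rng.random() for _ in range(dims)],  # placeholder embedding
            }
        )
    return records


def write_pickle(records: list[dict], path: str) -> None:
    """Write the rows as a pickled pandas DataFrame (requires pandas)."""
    import pandas as pd  # third-party; imported lazily so the rest runs without it

    pd.DataFrame(records).to_pickle(path)


records = make_records(3)
print(len(records), len(records[0]["vector"]))
```

Calling `write_pickle(records, "embeddings.pkl")` would then produce a file of the kind passed to `redisvl load -d`.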
#### Supported Redis Data Types
- Hash
- JSON (soon)

### Install
Install the Python requirements listed in `requirements.txt`.

```bash
$ python main.py -h {redis-host} -p {redis-port} -a {redis-password}
```
git clone https://github.com/RedisVentures/data-loader.git
cd data-loader
pip install .
```

### Creating Input Data
#### Pandas DataFrame

More to come; see the tests and sample-data for usage.
13 changes: 13 additions & 0 deletions conftest.py
@@ -0,0 +1,13 @@
import os
import pytest

from redisvl.utils.connection import get_async_redis_connection

HOST = os.environ.get("REDIS_HOST", "localhost")
PORT = os.environ.get("REDIS_PORT", 6379)
USER = os.environ.get("REDIS_USER", "default")
PASS = os.environ.get("REDIS_PASSWORD", "")

@pytest.fixture
def async_redis():
    return get_async_redis_connection(HOST, PORT, PASS)
23 changes: 0 additions & 23 deletions data/schema.py

This file was deleted.
