Add bindings to brute force index #332

Merged
merged 5 commits into from
Nov 16, 2021
91 changes: 91 additions & 0 deletions TESTING_RECALL.md
@@ -0,0 +1,91 @@
# Testing recall

The choice of HNSW parameters for a specific use case strongly affects search quality. One way to test the quality of a constructed index is to compare the HNSW search results with the exact results (i.e., the true `k` nearest neighbors).
For this purpose, the API provides a simple "brute-force" index in which vectors are stored as is, and a search for the `k` nearest neighbors of a query vector scans the entire index.
Comparing HNSW results with brute-force results can help in finding HNSW parameters that achieve a satisfactory recall for a given index size and data dimension.
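
Recall in this sense can be read as the fraction of the true `k` nearest neighbors that the approximate search also returned, taken over all queries. The following is a minimal pure-Python sketch of that metric; the function name `recall_at_k` and the toy label lists are illustrative assumptions and are not part of hnswlib.

```python
def recall_at_k(approx_labels, exact_labels):
    """Fraction of exact neighbors that also appear in the approximate results.

    Both arguments are sequences of per-query label sequences
    (hypothetical helper, not an hnswlib function).
    """
    if len(approx_labels) != len(exact_labels):
        raise ValueError("both result sets must cover the same queries")
    hits = 0
    total = 0
    for approx_row, exact_row in zip(approx_labels, exact_labels):
        exact_set = set(exact_row)
        # Count exact neighbors that the approximate search also found.
        hits += len(exact_set.intersection(approx_row))
        total += len(exact_set)
    return hits / total if total else 0.0


# Toy data: 2 queries, k = 3 (labels are made up for illustration).
approx = [[1, 2, 3], [7, 8, 9]]
exact = [[1, 2, 4], [7, 8, 9]]
print(recall_at_k(approx, exact))
```

Using sets keeps the comparison independent of the order in which each search returns its neighbors.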

### Brute force index API
`hnswlib.BFIndex(space, dim)` creates an uninitialized index in space `space` with integer dimension `dim`.

`hnswlib.BFIndex` methods:

`init_index(max_elements)` initializes the index with no elements.

`max_elements` defines the maximum number of elements that can be stored in the structure.

`add_items(data, ids)` inserts the data (numpy array of vectors, shape: `N*dim`) into the structure.
`ids` is an optional numpy array of N integer labels, one for each element in `data`.

`delete_vector(label)` deletes the element associated with the given `label` so that it is omitted from search results.

`knn_query(data, k = 1)` makes a batch query for the `k` closest elements to each element of
`data` (shape: `N*dim`). Returns a tuple of two numpy arrays, labels and distances, each of shape `N*k`.

`load_index(path_to_index, max_elements = 0)` loads a previously saved index from `path_to_index` into an uninitialized index.

`save_index(path_to_index)` saves the index to `path_to_index`.
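
To make the behaviour of `knn_query` on a brute-force index concrete, here is a small pure-Python sketch of an exhaustive search for a single query. It is an illustration only, not the library's implementation: the helper names `squared_l2` and `brute_force_knn` are hypothetical, and squared Euclidean distance is assumed as the `l2` metric.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_knn(vectors, labels, query, k=1):
    """Scan every stored vector and return the k closest as (labels, distances)."""
    scored = ((squared_l2(vec, query), label) for vec, label in zip(vectors, labels))
    nearest = heapq.nsmallest(k, scored)  # sorted by distance, closest first
    return [label for _, label in nearest], [dist for dist, _ in nearest]


# Toy index: four 2-d vectors with made-up labels.
stored = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
ids = [10, 11, 12, 13]
found_labels, found_distances = brute_force_knn(stored, ids, [0.9, 0.1], k=2)
print(found_labels, found_distances)
```

Every query touches every stored vector, which is why the brute-force index is exact but scales linearly with the index size.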

### Measuring recall example

```
import hnswlib
import numpy as np

dim = 32
num_elements = 100000
k = 10
nun_queries = 10

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))

# Declaring index
hnsw_index = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip
bf_index = hnswlib.BFIndex(space='l2', dim=dim)

# Initing both hnsw and brute force indices
# max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
# during insertion of an element.
# The capacity can be increased by saving/loading the index, see below.
#
# hnsw construction params:
# ef_construction - controls index search speed/build speed tradeoff
#
# M - is tightly connected with internal dimensionality of the data. Strongly affects the memory consumption (~M)
# Higher M leads to higher accuracy/run_time at fixed ef/efConstruction

hnsw_index.init_index(max_elements=num_elements, ef_construction=200, M=16)
bf_index.init_index(max_elements=num_elements)

# Controlling the recall for hnsw by setting ef:
# higher ef leads to better accuracy, but slower search
hnsw_index.set_ef(200)

# Set number of threads used during batch search/construction in hnsw
# By default using all available cores
hnsw_index.set_num_threads(1)

print("Adding batch of %d elements" % (len(data)))
hnsw_index.add_items(data)
bf_index.add_items(data)

print("Indices built")

# Generating query data
query_data = np.float32(np.random.random((nun_queries, dim)))

# Query the elements and measure recall:
labels_hnsw, distances_hnsw = hnsw_index.knn_query(query_data, k)
labels_bf, distances_bf = bf_index.knn_query(query_data, k)

# Measure recall
correct = 0
for i in range(nun_queries):
    for label in labels_hnsw[i]:
        for correct_label in labels_bf[i]:
            if label == correct_label:
                correct += 1
                break

print("recall is :", float(correct)/(k*nun_queries))
```
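
One way to use such a recall measurement for parameter selection is to try several `ef` values in increasing order and keep the first one that reaches a target recall. The sketch below assumes a caller-supplied function that maps an `ef` value to a measured recall; the name `smallest_ef_for_recall` is hypothetical, and the recall numbers in the stand-in table are made up so that the example runs without building an index.

```python
def smallest_ef_for_recall(measure_recall, ef_candidates, target):
    """Return (ef, recall) for the smallest candidate reaching target, else None.

    measure_recall is a caller-supplied function: ef -> recall in [0, 1].
    """
    for ef in sorted(ef_candidates):
        recall = measure_recall(ef)
        if recall >= target:
            return ef, recall
    return None


# Stand-in measurements (made-up numbers, NOT benchmark results). In practice
# the function would call hnsw_index.set_ef(ef), rerun knn_query and compare
# the labels against those of the brute-force index.
fake_recall_by_ef = {10: 0.71, 50: 0.93, 100: 0.98, 200: 0.995}

print(smallest_ef_for_recall(fake_recall_by_ef.get, [200, 10, 100, 50], 0.95))
```

Because a higher `ef` gives better accuracy but slower search, stopping at the smallest value that meets the target keeps query time as low as the recall requirement allows.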
Empty file added python_bindings/__init__.py
Empty file.
193 changes: 193 additions & 0 deletions python_bindings/bindings.cpp
@@ -652,6 +652,188 @@ class Index {

};

template<typename dist_t, typename data_t=float>
class BFIndex {
public:
    BFIndex(const std::string &space_name, const int dim) :
            space_name(space_name), dim(dim) {
        normalize = false;
        if (space_name == "l2") {
            space = new hnswlib::L2Space(dim);
        } else if (space_name == "ip") {
            space = new hnswlib::InnerProductSpace(dim);
        } else if (space_name == "cosine") {
            // Cosine is handled as inner product over normalized vectors.
            space = new hnswlib::InnerProductSpace(dim);
            normalize = true;
        } else {
            // Throw by value (not `throw new ...`) so that handlers catching
            // `const std::exception &` can see the error.
            throw std::runtime_error("Space name must be one of l2, ip, or cosine.");
        }
        alg = NULL;
        index_inited = false;
    }

    static const int ser_version = 1; // serialization version

    std::string space_name;
    int dim;
    bool index_inited;
    bool normalize;

    hnswlib::labeltype cur_l;
    hnswlib::BruteforceSearch<dist_t> *alg;
    hnswlib::SpaceInterface<float> *space;

    ~BFIndex() {
        delete space;
        if (alg)
            delete alg;
    }

    void init_new_index(const size_t maxElements) {
        if (alg) {
            // Throw by value so the exception can be caught as std::exception.
            throw std::runtime_error("The index is already initiated.");
        }
        cur_l = 0;
        alg = new hnswlib::BruteforceSearch<dist_t>(space, maxElements);
        index_inited = true;
    }

    void normalize_vector(float *data, float *norm_array) {
        float norm = 0.0f;
        for (int i = 0; i < dim; i++)
            norm += data[i] * data[i];
        // The small constant guards against division by zero for all-zero vectors.
        norm = 1.0f / (sqrtf(norm) + 1e-30f);
        for (int i = 0; i < dim; i++)
            norm_array[i] = data[i] * norm;
    }

    void addItems(py::object input, py::object ids_ = py::none()) {
        py::array_t < dist_t, py::array::c_style | py::array::forcecast > items(input);
        auto buffer = items.request();
        size_t rows, features;

        if (buffer.ndim != 2 && buffer.ndim != 1) throw std::runtime_error("data must be a 1d/2d array");
        if (buffer.ndim == 2) {
            rows = buffer.shape[0];
            features = buffer.shape[1];
        } else {
            rows = 1;
            features = buffer.shape[0];
        }

        if (features != dim)
            throw std::runtime_error("wrong dimensionality of the vectors");

        std::vector<size_t> ids;

        if (!ids_.is_none()) {
            // Named id_items so that it does not shadow the data array `items`.
            py::array_t < size_t, py::array::c_style | py::array::forcecast > id_items(ids_);
            auto ids_numpy = id_items.request();
            if (ids_numpy.ndim == 1 && ids_numpy.shape[0] == rows) {
                std::vector<size_t> ids1(ids_numpy.shape[0]);
                for (size_t i = 0; i < ids1.size(); i++) {
                    ids1[i] = id_items.data()[i];
                }
                ids.swap(ids1);
            } else if (ids_numpy.ndim == 0 && rows == 1) {
                ids.push_back(*id_items.data());
            } else
                throw std::runtime_error("wrong dimensionality of the labels");
        }

        // Scratch buffer for the cosine case; a std::vector replaces the
        // non-standard variable-length array `float normalized_vector[dim]`.
        std::vector<float> normalized_vector(normalize ? dim : 0);
        for (size_t row = 0; row < rows; row++) {
            // Use the caller's label if given, otherwise continue the running counter.
            size_t id = ids.size() ? ids.at(row) : cur_l + row;
            if (!normalize) {
                alg->addPoint((void *) items.data(row), (size_t) id);
            } else {
                normalize_vector((float *) items.data(row), normalized_vector.data());
                alg->addPoint((void *) normalized_vector.data(), (size_t) id);
            }
        }
        cur_l += rows;
    }

    void deleteVector(size_t label) {
        alg->removePoint(label);
    }

    void saveIndex(const std::string &path_to_index) {
        alg->saveIndex(path_to_index);
    }

    // Note: max_elements is accepted for API symmetry but is not used below.
    void loadIndex(const std::string &path_to_index, size_t max_elements) {
        if (alg) {
            std::cerr << "Warning: Calling load_index for an already inited index. Old index is being deallocated.";
            delete alg;
        }
        alg = new hnswlib::BruteforceSearch<dist_t>(space, path_to_index);
        cur_l = alg->cur_element_count;
        index_inited = true;
    }

    py::object knnQuery_return_numpy(py::object input, size_t k = 1) {

        py::array_t < dist_t, py::array::c_style | py::array::forcecast > items(input);
        auto buffer = items.request();
        hnswlib::labeltype *data_numpy_l;
        dist_t *data_numpy_d;
        size_t rows, features;
        {
            py::gil_scoped_release l;

            if (buffer.ndim != 2 && buffer.ndim != 1) throw std::runtime_error("data must be a 1d/2d array");
            if (buffer.ndim == 2) {
                rows = buffer.shape[0];
                features = buffer.shape[1];
            } else {
                rows = 1;
                features = buffer.shape[0];
            }

            data_numpy_l = new hnswlib::labeltype[rows * k];
            data_numpy_d = new dist_t[rows * k];

            for (size_t row = 0; row < rows; row++) {
                std::priority_queue<std::pair<dist_t, hnswlib::labeltype >> result = alg->searchKnn(
                        (void *) items.data(row), k);
                // The queue yields the farthest result first, so each output row
                // is filled back to front to end up ordered closest first.
                for (int i = k - 1; i >= 0; i--) {
                    auto &result_tuple = result.top();
                    data_numpy_d[row * k + i] = result_tuple.first;
                    data_numpy_l[row * k + i] = result_tuple.second;
                    result.pop();
                }
            }
        }

        // Cast back to the allocated type before delete[]; deleting through
        // a void* is undefined behavior.
        py::capsule free_when_done_l(data_numpy_l, [](void *f) {
            delete[] static_cast<hnswlib::labeltype *>(f);
        });
        py::capsule free_when_done_d(data_numpy_d, [](void *f) {
            delete[] static_cast<dist_t *>(f);
        });


        return py::make_tuple(
                py::array_t<hnswlib::labeltype>(
                        {rows, k}, // shape
                        {k * sizeof(hnswlib::labeltype),
                         sizeof(hnswlib::labeltype)}, // C-style contiguous strides for labeltype
                        data_numpy_l, // the data pointer
                        free_when_done_l),
                py::array_t<dist_t>(
                        {rows, k}, // shape
                        {k * sizeof(dist_t), sizeof(dist_t)}, // C-style contiguous strides for dist_t
                        data_numpy_d, // the data pointer
                        free_when_done_d));

    }

};


PYBIND11_PLUGIN(hnswlib) {
@@ -716,5 +898,16 @@ PYBIND11_PLUGIN(hnswlib) {
            return "<hnswlib.Index(space='" + a.space_name + "', dim="+std::to_string(a.dim)+")>";
        });

        py::class_<BFIndex<float>>(m, "BFIndex")
        .def(py::init<const std::string &, const int>(), py::arg("space"), py::arg("dim"))
        .def("init_index", &BFIndex<float>::init_new_index, py::arg("max_elements"))
        .def("knn_query", &BFIndex<float>::knnQuery_return_numpy, py::arg("data"), py::arg("k")=1)
        .def("add_items", &BFIndex<float>::addItems, py::arg("data"), py::arg("ids") = py::none())
        .def("delete_vector", &BFIndex<float>::deleteVector, py::arg("label"))
        .def("save_index", &BFIndex<float>::saveIndex, py::arg("path_to_index"))
        .def("load_index", &BFIndex<float>::loadIndex, py::arg("path_to_index"), py::arg("max_elements")=0)
        .def("__repr__", [](const BFIndex<float> &a) {
            return "<hnswlib.BFIndex(space='" + a.space_name + "', dim="+std::to_string(a.dim)+")>";
        });
        return m.ptr();
}
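
For the `cosine` space, the binding above builds an inner-product space and normalizes vectors in `add_items`. The pure-Python sketch below illustrates why that works: the inner product of two unit-length vectors equals their cosine similarity. The helper names are hypothetical, and only the `1e-30` guard is taken from `normalize_vector`.

```python
import math


def normalize(vec):
    """Scale vec to unit length, mirroring the 1e-30 guard in normalize_vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = 1.0 / (norm + 1e-30)
    return [x * scale for x in vec]


def inner_product(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


a, b = [3.0, 4.0], [4.0, 3.0]
# Cosine similarity computed directly from its definition.
cosine = inner_product(a, b) / (math.hypot(*a) * math.hypot(*b))
# The same value obtained as an inner product of normalized vectors.
via_normalized = inner_product(normalize(a), normalize(b))
print(cosine, via_normalized)
```

In the diff as shown, `knnQuery_return_numpy` passes query rows through without normalizing them. Scaling a query by a positive constant does not change the inner-product ordering of the stored vectors, so the returned neighbors should be unaffected, but the returned distances would presumably correspond to cosine distance only for unit-length queries.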
88 changes: 88 additions & 0 deletions python_bindings/tests/bindings_test_recall.py
@@ -0,0 +1,88 @@
import hnswlib
import numpy as np

dim = 32
num_elements = 100000
k = 10
nun_queries = 10

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))

# Declaring index
hnsw_index = hnswlib.Index(space='l2', dim=dim) # possible options are l2, cosine or ip
bf_index = hnswlib.BFIndex(space='l2', dim=dim)

# Initing both hnsw and brute force indices
# max_elements - the maximum number of elements (capacity). Will throw an exception if exceeded
# during insertion of an element.
# The capacity can be increased by saving/loading the index, see below.
#
# hnsw construction params:
# ef_construction - controls index search speed/build speed tradeoff
#
# M - is tightly connected with internal dimensionality of the data. Strongly affects the memory consumption (~M)
# Higher M leads to higher accuracy/run_time at fixed ef/efConstruction

hnsw_index.init_index(max_elements=num_elements, ef_construction=200, M=16)
bf_index.init_index(max_elements=num_elements)

# Controlling the recall for hnsw by setting ef:
# higher ef leads to better accuracy, but slower search
hnsw_index.set_ef(200)

# Set number of threads used during batch search/construction in hnsw
# By default using all available cores
hnsw_index.set_num_threads(1)

print("Adding batch of %d elements" % (len(data)))
hnsw_index.add_items(data)
bf_index.add_items(data)

print("Indices built")

# Generating query data
query_data = np.float32(np.random.random((nun_queries, dim)))

# Query the elements and measure recall:
labels_hnsw, distances_hnsw = hnsw_index.knn_query(query_data, k)
labels_bf, distances_bf = bf_index.knn_query(query_data, k)

# Measure recall
correct = 0
for i in range(nun_queries):
    for label in labels_hnsw[i]:
        for correct_label in labels_bf[i]:
            if label == correct_label:
                correct += 1
                break

print("recall is :", float(correct)/(k*nun_queries))

# test serializing the brute force index
index_path = 'bf_index.bin'
print("Saving index to '%s'" % index_path)
bf_index.save_index(index_path)
del bf_index

# Re-initiating, loading the index
bf_index = hnswlib.BFIndex(space='l2', dim=dim)

print("\nLoading index from '%s'\n" % index_path)
bf_index.load_index(index_path)

# Query the brute force index again to verify that we get the same results
labels_bf, distances_bf = bf_index.knn_query(query_data, k)

# Measure recall
correct = 0
for i in range(nun_queries):
    for label in labels_hnsw[i]:
        for correct_label in labels_bf[i]:
            if label == correct_label:
                correct += 1
                break

print("recall after reloading is :", float(correct)/(k*nun_queries))
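
The script checks the reloaded index by recomputing recall against the HNSW results. A more direct check of the save/load round trip would compare the brute-force labels returned before and after reloading. The sketch below shows one way to do that; the helper `same_neighbors` and the toy label lists are hypothetical, and the comparison is done per query on sets so that a different ordering among tied distances would not count as a mismatch.

```python
def same_neighbors(labels_a, labels_b):
    """True if every query returned the same set of labels in both runs."""
    if len(labels_a) != len(labels_b):
        return False
    return all(set(row_a) == set(row_b) for row_a, row_b in zip(labels_a, labels_b))


# Toy label arrays standing in for knn_query results (made up for illustration).
before = [[4, 1, 9], [2, 7, 5]]
after_same = [[1, 4, 9], [5, 2, 7]]  # same sets, different order
after_diff = [[1, 4, 8], [5, 2, 7]]  # first query differs
print(same_neighbors(before, after_same), same_neighbors(before, after_diff))
```

In the test above, this would mean keeping a copy of `labels_bf` from before `del bf_index` and comparing it with the labels returned by the reloaded index.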