Consistent qdrant point ids #1839

Merged
shanbady merged 6 commits into main from shanbady/qdrant-consistent-ids
Nov 22, 2024

Conversation

shanbady
Contributor

@shanbady shanbady commented Nov 20, 2024

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/6094

Description (What does it do?)

This PR generates reproducible UUIDs from the resource's readable id for vector points stored in Qdrant. This lets us directly reference, and check for, existing embeddings in Qdrant whenever we have a learning resource or content file. Currently, the vector similarity endpoint unnecessarily re-embeds the referenced document even though its embeddings already exist in Qdrant, which causes a slight delay when loading /api/v1/learning_resources/181/vector_similar/. This PR resolves that by re-using the existing embedding.
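A minimal sketch of how a reproducible point id could be derived from a readable id. The function name `point_id_for`, the choice of `uuid.uuid5`, the namespace, and the sample readable id are all assumptions for illustration and are not taken from this PR's diff.

```python
import uuid

# Assumption: any fixed namespace works, provided every environment uses the same one.
POINT_ID_NAMESPACE = uuid.NAMESPACE_DNS


def point_id_for(readable_id: str) -> str:
    """Return a deterministic UUID string for a resource's readable id.

    uuid5 hashes the namespace and the name, so the same readable id
    always yields the same UUID, on any machine.
    """
    return str(uuid.uuid5(POINT_ID_NAMESPACE, readable_id))


# Hypothetical readable id, used only to show the id is stable across calls.
first = point_id_for("example-course-readable-id")
second = point_id_for("example-course-readable-id")
print(first == second)  # True
```

Because the id is a pure function of the readable id, code that holds a learning resource can compute the point id and ask Qdrant for that point directly, without searching.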

How can this be tested?

  1. Check out main and make sure you have learning resources locally.
  2. Clear existing collections and generate the embeddings via python manage.py generate_embeddings --all --skip-contentfiles
  3. Find a learning resource id and load the vector similarity endpoint /api/v1/learning_resources/{resource id}/vector_similar/ and note the delay in loading.
  4. Check out this branch.
  5. Make sure you have learning resources locally.
  6. Clear existing collections and generate the embeddings via python manage.py generate_embeddings --all --skip-contentfiles
  7. Find a learning resource id and load the vector similarity endpoint /api/v1/learning_resources/{resource id}/vector_similar/ and note how much faster it loads.
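The speed-up these steps are meant to expose comes from looking up an existing vector by its deterministic id before embedding anything. The sketch below illustrates that pattern with a hypothetical in-memory store standing in for a Qdrant collection. `InMemoryVectorStore`, `vector_for_resource`, the embedder, and the fallback that embeds and stores on a miss are all assumptions, not code from this PR.

```python
import uuid
from typing import Callable, Dict, List, Optional

Vector = List[float]


def point_id_for(readable_id: str) -> str:
    """Deterministic point id (assumed scheme: uuid5 of the readable id)."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, readable_id))


class InMemoryVectorStore:
    """Hypothetical stand-in for a Qdrant collection keyed by point id."""

    def __init__(self) -> None:
        self._points: Dict[str, Vector] = {}

    def upsert(self, point_id: str, vector: Vector) -> None:
        self._points[point_id] = vector

    def get(self, point_id: str) -> Optional[Vector]:
        return self._points.get(point_id)


def vector_for_resource(
    readable_id: str,
    text: str,
    store: InMemoryVectorStore,
    embed: Callable[[str], Vector],
) -> Vector:
    """Reuse the stored vector when present; embed only on a miss."""
    point_id = point_id_for(readable_id)
    existing = store.get(point_id)
    if existing is not None:
        return existing  # no embedding call needed
    vector = embed(text)
    store.upsert(point_id, vector)
    return vector


# Usage: a fake embedder that counts how often it is called.
calls = {"count": 0}


def fake_embed(text: str) -> Vector:
    calls["count"] += 1
    return [float(len(text)), 0.0]


store = InMemoryVectorStore()
vector_for_resource("example-readable-id", "some document text", store, fake_embed)
vector_for_resource("example-readable-id", "some document text", store, fake_embed)
print(calls["count"])  # 1
```

With random point ids the second call would have no way to find the stored vector without a payload search, which is why the id scheme matters for this endpoint.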

Additional Context

  • We generate the UUID from the resource "readable_id" instead of the "id" so that a "master embeddings" snapshot, if we had one, could be re-used immediately in any environment.
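To illustrate why the readable id is the more portable key, the sketch below compares the same resource in two environments. The rows, the primary key values, and the uuid5 scheme are hypothetical. The only point is that auto-assigned database ids can differ between environments while a readable id does not.

```python
import uuid


def point_id_for(readable_id: str) -> str:
    """Deterministic point id (assumed scheme: uuid5 of the readable id)."""
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, readable_id))


# Hypothetical rows: the same resource, loaded into two different databases.
production = {"id": 181, "readable_id": "example-course-readable-id"}
local_dev = {"id": 7, "readable_id": "example-course-readable-id"}

# Primary keys are assigned per database, so they need not agree.
ids_match = production["id"] == local_dev["id"]

# Point ids derived from the readable id agree everywhere.
point_ids_match = point_id_for(production["readable_id"]) == point_id_for(
    local_dev["readable_id"]
)

print(ids_match, point_ids_match)  # False True
```

A snapshot of points keyed this way could be restored into either environment and still line up with the local resources.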

@shanbady shanbady added the Needs Review label (an open Pull Request that is ready for review) and removed the Work in Progress label Nov 21, 2024
@shanbady shanbady marked this pull request as ready for review November 21, 2024 15:31
@abeglova abeglova self-assigned this Nov 22, 2024
Contributor

@abeglova abeglova left a comment

lgtm

@shanbady shanbady merged commit 0370de1 into main Nov 22, 2024
11 checks passed
@odlbot odlbot mentioned this pull request Nov 25, 2024
19 tasks
mbertrand pushed a commit that referenced this pull request Dec 2, 2024
* adding util method for generating point id

* moving point id generation outside of model and adding to embed command

* fixing vector similarity endpoint

* adding test

* sorting ids in test

* updating hash key for contentfiles
@rhysyngsun rhysyngsun deleted the shanbady/qdrant-consistent-ids branch February 7, 2025 20:41
Labels
Needs Review An open Pull Request that is ready for review
2 participants