embedding content updates and removals #2217
This seems to work well, although I haven't tested it with a delete or update for existing vectors. I will figure out how to test that. In the meantime, here are some comments.
operations = [
    migrations.AddField(
        model_name="contentfile",
        name="archive_checksum",
Can you add a task to the migration to set archive_checksum to the current value of checksum and calculate the checksum from the content for existing contentfiles?
Good call.
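For illustration, the kind of data migration being asked for might look like this minimal sketch (the app label, field names, and hashing choice are assumptions drawn from the surrounding diff; the dependency name is a placeholder, not the PR's actual migration):

```python
# A minimal sketch of the requested backfill; the app label, migration
# dependency name, and hashing choice are assumptions, not the PR's code.
import hashlib

from django.db import migrations


def backfill_checksums(apps, schema_editor):
    ContentFile = apps.get_model("learning_resources", "ContentFile")
    for content_file in ContentFile.objects.iterator():
        # Preserve the existing (archive-level) checksum in the new field,
        # then recompute the checksum from the file content itself.
        content_file.archive_checksum = content_file.checksum
        content_file.checksum = hashlib.md5(
            (content_file.content or "").encode("utf-8")
        ).hexdigest()
        content_file.save(update_fields=["archive_checksum", "checksum"])


class Migration(migrations.Migration):
    dependencies = [
        # placeholder: the migration that adds the archive_checksum field
        ("learning_resources", "00XX_contentfile_archive_checksum"),
    ]

    operations = [
        migrations.RunPython(backfill_checksums, migrations.RunPython.noop),
    ]
```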
learning_resources/serializers.py (outdated)
@@ -1286,13 +1288,15 @@ class Meta:
    "uid",
    "title",
    "description",
    "archive_checksum",
You don't need archive_checksum in the serializer, since it's not useful information for users and isn't used for Elasticsearch/vector search.
unpublished_tasks = [
    tasks.bulk_deindex_learning_resources.si(ids, resource_type),
]
if django_settings.QDRANT_ENABLE_INDEXING_PLUGIN_HOOKS:
I think the Elasticsearch tasks should be chained before the vector embedding tasks, since the vector embedding tasks can potentially take a long time. It's better to bring the Elasticsearch index up to date sooner.
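For example, that ordering could be enforced with a Celery chain along these lines (a sketch only; `remove_qdrant_records` is a hypothetical name for the vector removal task and the import paths are assumed, not taken from the PR):

```python
# Sketch of the suggested ordering: Elasticsearch deindexing first, vector
# removal second. remove_qdrant_records and the module paths are assumptions.
from celery import chain
from django.conf import settings as django_settings

from learning_resources_search import tasks  # assumed module layout
from vector_search.tasks import remove_qdrant_records  # hypothetical task


def deindex_unpublished(ids, resource_type):
    # The fast Elasticsearch task runs first so search results are corrected
    # before the potentially slow embedding/vector work starts.
    unpublished_tasks = [
        tasks.bulk_deindex_learning_resources.si(ids, resource_type),
    ]
    if django_settings.QDRANT_ENABLE_INDEXING_PLUGIN_HOOKS:
        unpublished_tasks.append(remove_qdrant_records.si(ids, resource_type))
    # chain() executes the signatures strictly in the order given.
    return chain(*unpublished_tasks).apply_async()
```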
vector_search/utils.py (outdated)
    based on the resource_type (learning resource vs content file)
    """
    client = qdrant_client()
    if resource_type != CONTENT_FILE_TYPE:
Does it make sense to have a separate version of update_payload(), should_generate_embeddings(), _embedding_context(), etc. for learning resources and content files instead of having all the if statements? You should be able to know which version of the function to call in the task.
vector_search/utils.py (outdated)
)


def should_generate_embeddings(serialized_document, resource_type):
I think it makes sense to have separate content file and learning resource versions of this function instead of all the if statements.
You can consider separate content-file- and resource-specific versions of update_payload() and _embedding_context() too.
Good idea. I updated the code so they each have their own _embedding_context, update_payload, and should_generate_embeddings methods.
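Roughly, that split reads like the sketch below (the function bodies, field checks, and context strings are placeholders for illustration, not the actual implementation):

```python
# Placeholder sketch of per-type helpers replacing the resource_type branches;
# the field checks and context strings here are illustrative only.
CONTENT_FILE_TYPE = "content_file"


def _resource_embedding_context(serialized):
    # Text handed to the embedding model for a learning resource.
    return f"{serialized.get('title', '')} {serialized.get('description', '')}"


def _content_file_embedding_context(serialized):
    # Text handed to the embedding model for a content file.
    return serialized.get("content", "")


def should_generate_resource_embeddings(serialized):
    return bool(serialized.get("published"))


def should_generate_content_file_embeddings(serialized):
    return bool(serialized.get("content"))


def embedding_contexts(serialized_documents, resource_type):
    # The caller already knows the type, so it picks the right pair of helpers
    # once instead of branching inside every function.
    if resource_type == CONTENT_FILE_TYPE:
        context = _content_file_embedding_context
        should_embed = should_generate_content_file_embeddings
    else:
        context = _resource_embedding_context
        should_embed = should_generate_resource_embeddings
    return [context(doc) for doc in serialized_documents if should_embed(doc)]
```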
lgtm
What are the relevant tickets?
Closes https://github.com/mitodl/hq/issues/6754
Description (What does it do?)
This PR introduces updates/upserts and removals for embedding vectors.
The main changes:
For contentfiles
For resources
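For context, upserting and removing embedding points against Qdrant with the qdrant-client library looks roughly like the sketch below; the collection name, point IDs, payload, and local URL are assumptions, not this PR's actual code.

```python
# Rough sketch of vector upsert/removal with qdrant-client; names are assumed.
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumed local instance
COLLECTION = "learning_resources"  # hypothetical collection name


def upsert_embedding(point_id, vector, payload):
    # Upsert overwrites the point if the ID already exists, so inserts and
    # updates go through the same call.
    client.upsert(
        collection_name=COLLECTION,
        points=[models.PointStruct(id=point_id, vector=vector, payload=payload)],
    )


def remove_embeddings(point_ids):
    # Removal when a resource or contentfile is unpublished or deleted.
    client.delete(
        collection_name=COLLECTION,
        points_selector=models.PointIdsList(points=point_ids),
    )
```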
How can this be tested?
Testing the ETL hooks:
python manage.py backpopulate_micromasters_data
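After running the command, one way to spot-check that the hooks wrote embeddings is to count the points in the collection (assuming a local Qdrant instance; the URL and collection name are hypothetical):

```python
# Quick spot-check; the URL and collection name are assumptions.
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
print(client.count(collection_name="content_files", exact=True).count)
```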
For reference, here is a list of things that this PR (and functionality before it) should ensure: