
Conversation

Contributor

@shanbady shanbady commented Jan 29, 2025

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/6602

Description (What does it do?)

This PR adds the ability to toggle the use of the semantic chunker when embedding content files.

How can this be tested?

Testing interactively

To quickly play with the different params and see how the chunker works interactively:

  1. Check out this branch
  2. Assemble some long text to chunk (either from an existing content file or by copying and pasting several paragraphs from Wikipedia)
  3. In a Django shell, run the following:
from vector_search.utils import dense_encoder, _chunk_documents
from django.conf import settings

text = """
Put the content to chunk here
"""

# Override the chunking settings for this shell session
settings.CONTENT_FILE_EMBEDDING_CHUNK_OVERLAP = 10
settings.CONTENT_FILE_EMBEDDING_CHUNK_SIZE_OVERRIDE = 200
settings.SEMANTIC_CHUNKING_CONFIG['buffer_size'] = 3
settings.SEMANTIC_CHUNKING_CONFIG['breakpoint_threshold_type'] = 'gradient'
settings.SEMANTIC_CHUNKING_CONFIG['breakpoint_threshold_amount'] = 0.1

settings.CONTENT_FILE_EMBEDDING_SEMANTIC_CHUNKING_ENABLED = True
encoder = dense_encoder()

docs = _chunk_documents(encoder, [text], {})

# Print each chunk followed by a separator
for doc in docs:
    print(doc, "-----------\n")
  4. Try tweaking the settings and re-run to see how the chunks are affected (a sketch of how these settings map onto the chunker follows this list):
    • buffer_size - number of sentences before and after each sentence to include in the window when combining
    • breakpoint_threshold_type - ('percentile', 'standard_deviation', 'interquartile', 'gradient') the method used to detect chunk-boundary outliers
    • breakpoint_threshold_amount - the value used with breakpoint_threshold_type to filter outliers
    • number_of_chunks - the number of chunks to consider for merging
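For context, these parameters line up with the arguments of LangChain's SemanticChunker. Below is a minimal sketch of how the settings above might be wired into it; this assumes the implementation uses langchain_experimental, and build_semantic_splitter is a hypothetical helper name for illustration, not the PR's actual code:

from django.conf import settings
from langchain_experimental.text_splitter import SemanticChunker

def build_semantic_splitter(embeddings):
    # Hypothetical helper: maps SEMANTIC_CHUNKING_CONFIG onto SemanticChunker kwargs.
    config = settings.SEMANTIC_CHUNKING_CONFIG
    return SemanticChunker(
        embeddings,  # any LangChain-compatible embeddings object
        buffer_size=config.get("buffer_size", 1),
        breakpoint_threshold_type=config.get("breakpoint_threshold_type", "percentile"),
        breakpoint_threshold_amount=config.get("breakpoint_threshold_amount"),
        number_of_chunks=config.get("number_of_chunks"),
    )

# splitter = build_semantic_splitter(embeddings)
# chunks = splitter.split_text(text)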

Testing via embedding command

  1. Check out this branch
  2. Bring down celery and redis: docker compose down celery redis
  3. Set CONTENT_FILE_EMBEDDING_SEMANTIC_CHUNKING_ENABLED to True in your env.
  4. When using the local default encoder (fastembed), chunking with the semantic chunker is extremely slow. For testing, set QDRANT_CHUNK_SIZE to 1 so results populate in Qdrant faster (see the env snippet after this list).
  5. Bring celery and redis back up: docker compose up -d
  6. Find some learning resources that have contentfiles and pass their ids to the generate_embeddings command: python manage.py generate_embeddings --resource-ids 1,3,2
  7. Inspect the chunks in the contentfiles collection on the Qdrant dashboard - note that the chunks should look well formed compared to the non-semantic chunks on RC
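For reference, the environment overrides from steps 3-4 would look roughly like this in your .env file (values taken from the steps above; QDRANT_CHUNK_SIZE=1 is only for faster feedback while testing):

CONTENT_FILE_EMBEDDING_SEMANTIC_CHUNKING_ENABLED=True
QDRANT_CHUNK_SIZE=1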

@shanbady shanbady added the Needs Review An open Pull Request that is ready for review label Jan 29, 2025
@shanbady shanbady marked this pull request as ready for review January 29, 2025 21:26
@shanbady shanbady changed the title Shanbady/semantic chunking semantic chunking Jan 30, 2025
@shanbady shanbady changed the title semantic chunking Semantic Chunking of Content Files Jan 30, 2025
@abeglova abeglova self-assigned this Jan 30, 2025
Contributor

@abeglova abeglova left a comment


LGTM, but the task was extremely slow until I set QDRANT_ENCODER=vector_search.encoders.litellm.LiteLLMEncoder. I'm not sure how vectorizing all the content files is going to work.

@Ferdi

Ferdi commented Jan 30, 2025

QDRANT_ENCODER=vector_search.encoders.litellm.LiteLLMEncoder

What does this do?

@shanbady
Contributor Author

QDRANT_ENCODER=vector_search.encoders.litellm.LiteLLMEncoder

What does this do?

This is the setting that allows us to toggle between fastembed for local environments and litellm when deployed. Setting it to "vector_search.encoders.litellm.LiteLLMEncoder" routes all embedding requests through litellm.
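A rough sketch of how a dotted-path setting like this is typically resolved at runtime (this assumes Django's import_string is used; the actual loader in this repo may differ):

from django.conf import settings
from django.utils.module_loading import import_string

# Resolve the dotted path in QDRANT_ENCODER to an encoder class and instantiate it.
encoder_cls = import_string(settings.QDRANT_ENCODER)
encoder = encoder_cls()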

@Ferdi

Ferdi commented Jan 31, 2025

it routes all embedding requests through litellm

litellm is a proxy. What encoder does it use?

@shanbady shanbady merged commit 5f58f93 into main Jan 31, 2025
11 checks passed
@shanbady
Contributor Author

it routes all embedding requests through litellm

litellm is a proxy. What encoder does it use?

The embedding model is a separate setting configured via QDRANT_DENSE_MODEL; on RC and production this is currently set to text-embedding-3-large.
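Putting the two settings from this thread together, a deployed environment would be configured roughly like this (values taken from the conversation above; local defaults differ):

QDRANT_ENCODER=vector_search.encoders.litellm.LiteLLMEncoder
QDRANT_DENSE_MODEL=text-embedding-3-large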

@odlbot odlbot mentioned this pull request Jan 31, 2025
@shanbady shanbady deleted the shanbady/semantic-chunking branch January 31, 2025 17:58