
Conversation

Contributor

@shanbady shanbady commented Jan 29, 2025

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/6602

Description (What does it do?)

This PR adds the ability to toggle the use of the semantic chunker when embedding content files.

How can this be tested?

Testing interactively

To quickly play with the different params and see how the chunker works interactively:

  1. Check out this branch
  2. Assemble some long text to chunk (either from an existing content file or by copying and pasting several paragraphs from Wikipedia)
  3. In a Django shell, run the following:
from vector_search.utils import dense_encoder, _chunk_documents
from django.conf import settings

text = """
Put the content to chunk here
"""

# Override the chunking settings for this shell session
settings.CONTENT_FILE_EMBEDDING_CHUNK_OVERLAP = 10
settings.CONTENT_FILE_EMBEDDING_CHUNK_SIZE_OVERRIDE = 200
settings.SEMANTIC_CHUNKING_CONFIG['buffer_size'] = 3
settings.SEMANTIC_CHUNKING_CONFIG['breakpoint_threshold_type'] = 'gradient'
settings.SEMANTIC_CHUNKING_CONFIG['breakpoint_threshold_amount'] = 0.1

settings.CONTENT_FILE_EMBEDDING_SEMANTIC_CHUNKING_ENABLED = True
encoder = dense_encoder()

docs = _chunk_documents(encoder, [text], {})

# Print each chunk followed by a separator
for doc in docs:
    print(doc, "-----------\n")
  4. Try tweaking the settings and re-run to see how the chunks are affected (a sketch of how these settings map onto the chunker follows this list):
    • buffer_size - number of sentences before and after each sentence to include in the window when combining
    • breakpoint_threshold_type - ('percentile', 'standard_deviation', 'interquartile', 'gradient') the method used to detect chunk-boundary outliers
    • breakpoint_threshold_amount - the value used with breakpoint_threshold_type to filter outliers
    • number_of_chunks - the number of chunks to consider for merging
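For context, these parameters line up with the arguments of LangChain's SemanticChunker. Below is a minimal sketch of how the settings above might be wired into it; this assumes the implementation uses langchain_experimental, and build_semantic_splitter is a hypothetical helper name for illustration, not the PR's actual code:

from django.conf import settings
from langchain_experimental.text_splitter import SemanticChunker

def build_semantic_splitter(embeddings):
    # Hypothetical helper: maps SEMANTIC_CHUNKING_CONFIG onto SemanticChunker kwargs.
    config = settings.SEMANTIC_CHUNKING_CONFIG
    return SemanticChunker(
        embeddings,  # any LangChain-compatible embeddings object
        buffer_size=config.get("buffer_size", 1),
        breakpoint_threshold_type=config.get("breakpoint_threshold_type", "percentile"),
        breakpoint_threshold_amount=config.get("breakpoint_threshold_amount"),
        number_of_chunks=config.get("number_of_chunks"),
    )

# splitter = build_semantic_splitter(embeddings)
# chunks = splitter.split_text(text)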

Testing via embedding command

  1. Check out this branch
  2. Bring down celery and redis: docker compose down celery redis
  3. Set CONTENT_FILE_EMBEDDING_SEMANTIC_CHUNKING_ENABLED to True in your env.
  4. When using the local default encoder (fastembed), chunking with the semantic chunker is extremely slow. For testing, set QDRANT_CHUNK_SIZE to 1 so results populate in Qdrant faster (see the env snippet after this list).
  5. Bring celery and redis back up: docker compose up -d
  6. Find some learning resources that have contentfiles and pass their ids to the generate_embeddings command: python manage.py generate_embeddings --resource-ids 1,3,2
  7. Inspect the chunks in the contentfiles collection on the Qdrant dashboard - note that the chunks should look well formed compared to the non-semantic chunks on RC
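For reference, the environment overrides from steps 3-4 would look roughly like this in your .env file (values taken from the steps above; QDRANT_CHUNK_SIZE=1 is only for faster feedback while testing):

CONTENT_FILE_EMBEDDING_SEMANTIC_CHUNKING_ENABLED=True
QDRANT_CHUNK_SIZE=1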

@shanbady shanbady added the Needs Review An open Pull Request that is ready for review label Jan 29, 2025
@shanbady shanbady marked this pull request as ready for review January 29, 2025 21:26
@shanbady shanbady changed the title Shanbady/semantic chunking semantic chunking Jan 30, 2025
@shanbady shanbady changed the title semantic chunking Semantic Chunking of Content Files Jan 30, 2025
@abeglova abeglova self-assigned this Jan 30, 2025
Contributor

@abeglova abeglova left a comment


LGTM, but the task was extremely slow until I set QDRANT_ENCODER=vector_search.encoders.litellm.LiteLLMEncoder. I'm not sure how vectorizing all the content files is going to work.

@Ferdi

Ferdi commented Jan 30, 2025

QDRANT_ENCODER=vector_search.encoders.litellm.LiteLLMEncoder

What does this do?

@shanbady
Contributor Author

QDRANT_ENCODER=vector_search.encoders.litellm.LiteLLMEncoder

What does this do?

This is the setting that allows us to toggle between fastembed for local environments and litellm when deployed. Setting it to "vector_search.encoders.litellm.LiteLLMEncoder" routes all embedding requests through litellm.
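A rough sketch of how a dotted-path setting like this is typically resolved at runtime (this assumes Django's import_string is used; the actual loader in this repo may differ):

from django.conf import settings
from django.utils.module_loading import import_string

# Resolve the dotted path in QDRANT_ENCODER to an encoder class and instantiate it.
encoder_cls = import_string(settings.QDRANT_ENCODER)
encoder = encoder_cls()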

@Ferdi

Ferdi commented Jan 31, 2025

it routes all embedding requests through litellm

litellm is a proxy. What encoder does it use?

@shanbady shanbady merged commit 5f58f93 into main Jan 31, 2025
11 checks passed
@shanbady
Contributor Author

it routes all embedding requests through litellm

litellm is a proxy. What encoder does it use?

The embedding model is a separate setting configured via QDRANT_DENSE_MODEL; on RC and production this is currently set to text-embedding-3-large.
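Putting the two settings from this thread together, a deployed environment would be configured roughly like this (values taken from the conversation above; local defaults differ):

QDRANT_ENCODER=vector_search.encoders.litellm.LiteLLMEncoder
QDRANT_DENSE_MODEL=text-embedding-3-large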

@odlbot odlbot mentioned this pull request Jan 31, 2025
@shanbady shanbady deleted the shanbady/semantic-chunking branch January 31, 2025 17:58