feat: Add synthetic-data-kit for file_search doc conversion #2484
Conversation
Force-pushed from df09a85 to e56690a
This seems to work reasonably well. It's opened as a draft because I'm curious what others think about the approach here of using a tool and the existing tool_runtime API, with a well-known tool name, to convert various types of files to a text format for the file_search tool.

My plan would be to follow this up with another inline tool that uses Docling as a library, as well as a remote tool using Docling Serve. Docling handles more file types than meta-llama/synthetic-data-kit, but it's also a bit heavier, with more configuration knobs that the provider would expose so an admin can control things like which OCR engine is in use, whether to use vision models, and so forth. This initial implementation with meta-llama/synthetic-data-kit gives us a handful of useful parsers and file types while keeping the overall surface area low, at the expense of fewer file formats and less powerful conversion abilities.
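As an illustration of the well-known-tool-name idea, here is a minimal sketch of invoking it through the tool_runtime API, assuming the llama-stack Python client; the kwargs are hypothetical, and only the tool name comes from this PR:

```python
# A minimal sketch, assuming the llama-stack Python client; the argument
# shape is a guess, only the tool name comes from this PR.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

result = client.tool_runtime.invoke_tool(
    tool_name="builtin::document_conversion",
    kwargs={"file_id": "file-abc123"},  # hypothetical argument shape
)
print(result.content)  # converted text, ready for file_search chunking
```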
This adds a `builtin::document_conversion` tool for converting documents when used with file_search that uses meta-llama/synthetic-data-kit. I also have another local implementation that uses Docling, but I need to debug some segfault issues I'm hitting locally with that, so I'm pushing this first as a simpler reference implementation. Long-term I think we'll want a remote implementation here as well - perhaps docling-serve or unstructured.io - but I need to look more into that. This passes the existing `tests/verifications/openai_api/test_responses.py` but doesn't yet add any new tests for file types besides text and PDF. Signed-off-by: Ben Browning <[email protected]>
This is needed to get the filename of our file, even though we don't need its actual contents here anymore. Signed-off-by: Ben Browning <[email protected]>
This expands the file types tested with file_search to include Word documents (.docx), Markdown (.md), text (.txt), PDF (.pdf), and PowerPoint (.pptx) files. Python's mimetypes library doesn't actually recognize markdown docs as text, so we have to handle that case specifically instead of relying on mimetypes to get it right. Signed-off-by: Ben Browning <[email protected]>
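To make the mimetypes caveat concrete, here is a minimal sketch of that kind of special-casing; the helper is hypothetical, not the code from this PR:

```python
import mimetypes

def guess_mime_type(filename: str) -> str | None:
    # Hypothetical helper. Many Python versions ship no mapping for .md,
    # so mimetypes.guess_type() returns (None, None) for markdown files.
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None and filename.endswith(".md"):
        return "text/markdown"
    return mime_type

print(guess_mime_type("notes.md"))   # text/markdown
print(guess_mime_type("paper.pdf"))  # application/pdf
```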
Force-pushed from fb6763e to 1485f3b
How does this relate to #2311?
#2311 is adding a provider to the partially implemented Synthetic Data Generation API, as well as pushing that API closer to implementation. I'm not doing anything with synthetic data here at all; I'm just using synthetic-data-kit as a way to do some document conversion.
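To make that distinction concrete, this PR only exercises synthetic-data-kit's parsing stage. A minimal sketch, assuming the CLI documented in its README (stage names and flags may differ across versions):

```python
# Sketch: use synthetic-data-kit purely as a document parser. The
# actual synthetic-data stages ("create", "curate", ...) are never run.
import subprocess

# "ingest" parses a source document (PDF, DOCX, PPTX, HTML, ...) to text
# that can then be chunked and embedded for file_search.
subprocess.run(["synthetic-data-kit", "ingest", "report.pdf"], check=True)
```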
@bbrowning are you planning on pursuing this? Thanks!
This was just a prototype to get feedback on the overall idea/direction, but not something I feel strongly about pushing to completion unless there's specific interest in pluggable document conversion.
Curious if we should look at this one again, given that there was a demo of Llama Stack integrating well with Docling in the office hours this week. cc @franciscojavierarceo
So the approach the team and I are looking at will probably look more similar to what's mentioned in #4003 (contextual embeddings a la Anthropic), which is to say that we'd probably want a full API to process data. So it's similar, but I think the approach will look different.

We'll provide more info about this in a follow-up issue, but the short version is that a significant number of our enterprise customers are asking for advanced data extraction within their RAG applications (i.e., advanced forms of chunking/processing) for their proprietary data, and IMO we should be able to offer this as both a simple configuration option and a full API in the stack.
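For context, the contextual-embeddings approach referenced above boils down to prepending a short, LLM-generated description of where each chunk sits in its document before embedding it. A minimal sketch with entirely hypothetical names; situate() stands in for a real LLM call:

```python
# Hypothetical sketch of contextual chunking (a la Anthropic's contextual
# retrieval); not llama-stack API and not code from this thread.
def situate(document: str, chunk: str) -> str:
    # A real implementation would ask an LLM to describe how this chunk
    # fits into the overall document; a canned string stands in here.
    return f"[context: from a document starting {document[:40]!r}]"

def contextualize(document: str, chunks: list[str]) -> list[str]:
    # Prepend the generated context to each chunk before embedding.
    return [f"{situate(document, c)}\n\n{c}" for c in chunks]

doc = "Q2 report. Revenue grew 3% over the prior quarter..."
print(contextualize(doc, ["Revenue grew 3%.", "Costs fell 1%."])[0])
```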
I'm going to close this PR, instead of just leaving this here as a draft. That will clear the way for other implementations mentioned in the comments here, as this particular variant of document conversion never really went anywhere.
What does this PR do?
This adds a `builtin::document_conversion` tool for converting documents when used with file_search that uses meta-llama/synthetic-data-kit. I also have another local implementation that uses Docling, but I need to debug some segfault issues I'm hitting locally with that, so I'm pushing this first as a simpler reference implementation.
Long-term I think we'll want a remote implementation here as well - perhaps docling-serve or unstructured.io - but I need to look more into that.
Related to #2436
Test Plan
This passes the existing `tests/verifications/openai_api/test_responses.py` and adds additional file-type tests for .md, .txt, .docx, and .pptx files, in addition to the pre-existing .pdf and text-content-as-string tests.
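For reference, a sketch of how that file-type coverage might be parametrized in pytest; upload_and_search() is a hypothetical stand-in, not the actual helpers in tests/verifications/openai_api/test_responses.py:

```python
import pytest

def upload_and_search(filename: str, query: str) -> list[str]:
    # Hypothetical stand-in: upload `filename` to a vector store, run a
    # file_search query against it, and return the matching chunks.
    return [f"stub hit for {filename}"]

@pytest.mark.parametrize(
    "filename", ["doc.pdf", "doc.md", "doc.txt", "doc.docx", "doc.pptx"]
)
def test_file_search_file_types(filename):
    results = upload_and_search(filename, query="what does the doc say?")
    assert results, f"expected file_search results for {filename}"
```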