Conversation

@bbrowning
Collaborator

@bbrowning bbrowning commented Jun 20, 2025

What does this PR do?

This adds a builtin::document_conversion tool, backed by meta-llama/synthetic-data-kit, for converting documents when used with file_search. I also have another local implementation that uses Docling, but I need to debug some segfault issues I'm hitting locally with that, so I'm pushing this first as a simpler reference implementation.

Long-term I think we'll want a remote implementation here as well - perhaps docling-serve or unstructured.io - but I need to look more into that.

Related to #2436
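
To illustrate the shape of the conversion step (not the exact code in this PR), here is a minimal sketch of dispatching on file extension before handing plain text off to file_search. The dispatch and the `convert_to_text` helper are hypothetical placeholders; the real provider would delegate the binary formats to synthetic-data-kit's parsers.

```python
from pathlib import Path


def convert_to_text(path: Path) -> str:
    """Convert a document to plain text so file_search can chunk and embed it.

    Sketch only: binary formats would be delegated to synthetic-data-kit's
    parsers (or Docling in a follow-up provider); the dispatch below just
    shows the shape of that decision.
    """
    suffix = path.suffix.lower()
    if suffix in (".txt", ".md"):
        # Plain-text formats need no conversion.
        return path.read_text(encoding="utf-8")
    if suffix in (".pdf", ".docx", ".pptx"):
        # Placeholder: call into the document-conversion backend here.
        raise NotImplementedError(f"delegate {suffix} conversion to the backend parser")
    raise ValueError(f"Unsupported file type: {suffix}")
```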

Test Plan

This passes the existing
tests/verifications/openai_api/test_responses.py and adds file type tests for .md, .txt, .docx, and .pptx files, in addition to the pre-existing .pdf and text-content-as-string tests.

# Run Ollama

INFERENCE_MODEL="meta-llama/Llama-3.2-3B-Instruct" \
llama stack run llama_stack/templates/ollama/run.yaml

# vector store integration tests

LLAMA_STACK_CONFIG=http://localhost:8321 \
pytest -sv tests/integration/vector_io/test_openai_vector_stores.py \
  --embedding-model=all-MiniLM-L6-v2

# Responses API file_search verification tests

pytest -sv tests/verifications/openai_api/test_responses.py \
  -k'file_search' \
  --base-url=http://localhost:8321/v1/openai/v1 \
  --model=meta-llama/Llama-3.2-3B-Instruct

@facebook-github-bot added the CLA Signed label on Jun 20, 2025
@bbrowning force-pushed the document-convert branch 2 times, most recently from df09a85 to e56690a on June 20, 2025 at 22:55
@bbrowning
Collaborator Author

This seems to work reasonably well. It's opened as a draft because I'm curious what others think about the approach here: using a tool with a well-known name, via the existing tool_runtime API, to convert various types of files to a text format for the file_search RAG implementation.

My plan would be to follow this up with another inline tool that uses Docling as a library, as well as a remote tool using Docling Serve. Docling handles more file types than meta-llama/synthetic-data-kit, but it's also a bit heavier, with more configuration knobs for the provider to expose so an admin can control things like which OCR engine to use, whether to use vision models, and so forth. This initial implementation with meta-llama/synthetic-data-kit gives us a handful of useful parsers and file types while keeping the overall surface area low, at the expense of fewer file formats and less powerful conversion abilities.
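
For context on what "a well-known tool name on the existing tool_runtime API" would look like to a caller, here is a rough sketch. It assumes the llama-stack-client's `tool_runtime.invoke_tool` entry point; the argument key (`file_path`) and the result field used below are illustrative, not a settled contract.

```python
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# file_search (or any other caller) would invoke the conversion tool by its
# well-known name; the kwargs shown here are illustrative, not a fixed schema.
result = client.tool_runtime.invoke_tool(
    tool_name="builtin::document_conversion",
    kwargs={"file_path": "/tmp/quarterly-report.docx"},
)

# The converted plain text would then be chunked and embedded for RAG.
print(result.content)
```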

This adds a `builtin::document_conversion` tool for converting
documents when used with file_search that uses
meta-llama/synthetic-data-kit. I also have another local
implementation that uses Docling, but need to debug some segfault
issues I'm hitting locally with that so pushing this first as a
simpler reference implementation.

Long-term I think we'll want a remote implementation here as well - like
perhaps docling-serve or unstructured.io - but need to look more into
that.

This passes the existing
`tests/verifications/openai_api/test_responses.py` but doesn't yet add
any new tests for file types besides text and pdf.

Signed-off-by: Ben Browning <[email protected]>
This is needed to get the filename of our file, even though we don't
need its actual contents here anymore.

Signed-off-by: Ben Browning <[email protected]>
This expands the file types tested with file_search to include Word
documents (.docx), Markdown (.md), text (.txt), PDF (.pdf), and
PowerPoint (.pptx) files.

Python's mimetypes library doesn't actually recognize markdown docs as
text, so we have to handle that case specifically instead of relying
on mimetypes to get it right.

Signed-off-by: Ben Browning <[email protected]>
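
To make the mimetypes caveat concrete, here is a small illustration; the `.md` special-casing mirrors the idea described above, though the exact guard in the PR may differ.

```python
import mimetypes

# On many Python versions .md is absent from the mimetypes registry,
# so guess_type returns (None, None) rather than a text/* type.
mime_type, _ = mimetypes.guess_type("notes.md")
print(mime_type)  # typically None

# Handle Markdown explicitly instead of trusting mimetypes.
if mime_type is None and "notes.md".lower().endswith(".md"):
    mime_type = "text/markdown"
```
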
@leseb
Collaborator

leseb commented Jul 3, 2025

How does this relate to #2311?

@bbrowning
Collaborator Author

How does this relate to #2311?

#2311 is adding a provider to the partially implemented Synthetic Data Generation API, as well as pushing that API closer to implementation. I'm not doing anything with synthetic data here at all; I'm just using synthetic-data-kit as a way to do some document conversion.

@leseb
Collaborator

leseb commented Aug 28, 2025

@bbrowning are you planning on pursuing this? Thanks!

@bbrowning
Collaborator Author

This was just a prototype to get feedback on the overall idea/direction, but not something I feel strongly about pushing to completion unless there's specific interest in pluggable document conversion.

@raghotham
Contributor

Curious whether we should look at this one again, given that there was a demo of Llama Stack integrating well with Docling in this week's office hours. cc @franciscojavierarceo

@franciscojavierarceo
Collaborator

franciscojavierarceo commented Nov 1, 2025

The approach the team and I are looking at will probably be more similar to what's mentioned in #4003 (contextual embeddings à la Anthropic).

Which is to say that we'd probably want a full API for processing data. So the goal is similar, but I think the approach will look different.

We'll provide more info about this in a follow-up issue, but the short version is that a significant number of our enterprise customers are asking for advanced data extraction within their RAG applications (i.e., advanced forms of chunking/processing) for their proprietary data, and IMO we should be able to offer this as both a simple configuration option and a full API in the stack.
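
For readers unfamiliar with the contextual-embeddings idea referenced above, a minimal sketch of the technique (not the planned Llama Stack API) is to prepend a short, document-aware context string to each chunk before embedding it. The `summarize` helper below is a placeholder for an LLM call.

```python
# Sketch of Anthropic-style contextual retrieval: each chunk is embedded
# together with a short description situating it within the whole document.

def summarize(document: str, chunk: str) -> str:
    # Placeholder: in practice an LLM writes a 1-2 sentence context such as
    # "This chunk is from the Q3 revenue section of ACME's 2024 10-K."
    return f"From a document beginning: {document[:80]}..."


def contextualize_chunks(document: str, chunks: list[str]) -> list[str]:
    # Prepend the generated context to each chunk before embedding.
    return [f"{summarize(document, chunk)}\n\n{chunk}" for chunk in chunks]
```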

@bbrowning
Collaborator Author

I'm going to close this PR instead of leaving it here as a draft. That will clear the way for the other implementations mentioned in the comments, as this particular variant of document conversion never really went anywhere.

Labels

CLA Signed, new-in-tree-provider
