feat: Add synthetic-data-kit for file_search doc conversion #2484
Conversation
Force-pushed from df09a85 to e56690a
This seems to work reasonably well. It's opened as a draft because I'm curious what others think about the approach here of using a tool and the existing tool_runtime API, with a well-known tool name, to convert various types of files to a text format for the file_search tool.

My plan would be to follow this up with another inline tool that uses Docling as a library, as well as a remote tool using Docling Serve. Docling handles more file types than meta-llama/synthetic-data-kit, but it's also a bit heavier, with more configuration knobs that the provider would expose so an admin can control things like which OCR engine is in use, whether to use vision models, and so forth. This initial implementation with meta-llama/synthetic-data-kit gives us a handful of useful parsers and file types while keeping the overall surface area low, at the expense of fewer file formats and less powerful conversion abilities.
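As an illustration of the well-known-tool-name idea, here is a minimal sketch of invoking it through the tool_runtime API, assuming the llama-stack Python client; the kwargs are hypothetical, and only the tool name comes from this PR:

```python
# A minimal sketch, assuming the llama-stack Python client; the argument
# shape is a guess, only the tool name comes from this PR.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

result = client.tool_runtime.invoke_tool(
    tool_name="builtin::document_conversion",
    kwargs={"file_id": "file-abc123"},  # hypothetical argument shape
)
print(result.content)  # converted text, ready for file_search chunking
```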
This adds a `builtin::document_conversion` tool for converting documents when used with file_search that uses meta-llama/synthetic-data-kit. I also have another local implementation that uses Docling, but I need to debug some segfault issues I'm hitting locally with that, so I'm pushing this first as a simpler reference implementation. Long-term I think we'll want a remote implementation here as well - perhaps docling-serve or unstructured.io - but I need to look more into that. This passes the existing `tests/verifications/openai_api/test_responses.py` but doesn't yet add any new tests for file types besides text and PDF. Signed-off-by: Ben Browning <[email protected]>
This is needed to get the filename of our file, even though we don't need its actual contents here anymore. Signed-off-by: Ben Browning <[email protected]>
This expands the file types tested with file_search to include Word documents (.docx), Markdown (.md), text (.txt), PDF (.pdf), and PowerPoint (.pptx) files. Python's mimetypes library doesn't actually recognize markdown docs as text, so we have to handle that case specifically instead of relying on mimetypes to get it right. Signed-off-by: Ben Browning <[email protected]>
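To make the mimetypes caveat concrete, here is a minimal sketch of that kind of special-casing; the helper is hypothetical, not the code from this PR:

```python
import mimetypes

def guess_mime_type(filename: str) -> str | None:
    # Hypothetical helper. Many Python versions ship no mapping for .md,
    # so mimetypes.guess_type() returns (None, None) for markdown files.
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None and filename.endswith(".md"):
        return "text/markdown"
    return mime_type

print(guess_mime_type("notes.md"))   # text/markdown
print(guess_mime_type("paper.pdf"))  # application/pdf
```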
Force-pushed from fb6763e to 1485f3b
How does this relate to #2311?
#2311 is adding a provider to the partially implemented Synthetic Data Generation API, as well as pushing that API closer to implementation. I'm not doing anything with synthetic data here at all; I'm just using synthetic-data-kit as a way to do some document conversion.
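To make that distinction concrete, this PR only exercises synthetic-data-kit's parsing stage. A minimal sketch, assuming the CLI documented in its README (stage names and flags may differ across versions):

```python
# Sketch: use synthetic-data-kit purely as a document parser. The
# actual synthetic-data stages ("create", "curate", ...) are never run.
import subprocess

# "ingest" parses a source document (PDF, DOCX, PPTX, HTML, ...) to text
# that can then be chunked and embedded for file_search.
subprocess.run(["synthetic-data-kit", "ingest", "report.pdf"], check=True)
```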
@bbrowning are you planning on pursuing this? Thanks!
This was just a prototype to get feedback on the overall idea/direction, but not something I feel strongly about pushing to completion unless there's specific interest in pluggable document conversion.
Curious if we should look at this one again, given that there was a demo of Llama Stack integrating well with Docling in the office hours this week. cc @franciscojavierarceo
So the approach the team and I are looking at will probably look more similar to what's mentioned in #4003 (contextual embeddings a la Anthropic), which is to say that we'd probably want a full API to process data. So it's similar, but I think the approach will look different.

We'll provide more info about this in a follow-up issue, but the short version is that a significant number of our enterprise customers are asking for advanced data extraction within their RAG applications (i.e., advanced forms of chunking/processing) for their proprietary data, and IMO we should be able to offer this as both a simple configuration option and a full API in the stack.
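For context, the contextual-embeddings approach referenced above boils down to prepending a short, LLM-generated description of where each chunk sits in its document before embedding it. A minimal sketch with entirely hypothetical names; situate() stands in for a real LLM call:

```python
# Hypothetical sketch of contextual chunking (a la Anthropic's contextual
# retrieval); not llama-stack API and not code from this thread.
def situate(document: str, chunk: str) -> str:
    # A real implementation would ask an LLM to describe how this chunk
    # fits into the overall document; a canned string stands in here.
    return f"[context: from a document starting {document[:40]!r}]"

def contextualize(document: str, chunks: list[str]) -> list[str]:
    # Prepend the generated context to each chunk before embedding.
    return [f"{situate(document, c)}\n\n{c}" for c in chunks]

doc = "Q2 report. Revenue grew 3% over the prior quarter..."
print(contextualize(doc, ["Revenue grew 3%.", "Costs fell 1%."])[0])
```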
I'm going to close this PR, instead of just leaving this here as a draft. That will clear the way for other implementations mentioned in the comments here, as this particular variant of document conversion never really went anywhere.
What does this PR do?
This adds a `builtin::document_conversion` tool for converting documents when used with file_search that uses meta-llama/synthetic-data-kit. I also have another local implementation that uses Docling, but I need to debug some segfault issues I'm hitting locally with that, so I'm pushing this first as a simpler reference implementation.
Long-term I think we'll want a remote implementation here as well - perhaps docling-serve or unstructured.io - but I need to look more into that.
Related to #2436
Test Plan
This passes the existing `tests/verifications/openai_api/test_responses.py` and adds additional file-type tests for .md, .txt, .docx, and .pptx files, in addition to the pre-existing .pdf and text-content-as-string tests.
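For reference, a sketch of how that file-type coverage might be parametrized in pytest; upload_and_search() is a hypothetical stand-in, not the actual helpers in tests/verifications/openai_api/test_responses.py:

```python
import pytest

def upload_and_search(filename: str, query: str) -> list[str]:
    # Hypothetical stand-in: upload `filename` to a vector store, run a
    # file_search query against it, and return the matching chunks.
    return [f"stub hit for {filename}"]

@pytest.mark.parametrize(
    "filename", ["doc.pdf", "doc.md", "doc.txt", "doc.docx", "doc.pptx"]
)
def test_file_search_file_types(filename):
    results = upload_and_search(filename, query="what does the doc say?")
    assert results, f"expected file_search results for {filename}"
```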