Open
Labels
enhancement, good first issue
Description
Title: Migrate PDF Processing Core from Marker to Docling
Description:
We currently use Marker as the core PDF parsing module in our system. This issue aims to **replace Marker with Docling** as the main PDF parsing engine.
The updated implementation must retain the same functionality and output structure as the current Marker-based version to ensure backward compatibility with downstream processing components in Omniparse.
🧪 Requirements
- Replace Marker with Docling in the PDF parsing core.
- Ensure the output format is identical to what Marker currently produces (or provide a compatibility adapter).
- All existing test cases for Marker must pass with Docling.
- Provide a Google Colab notebook demonstrating the updated implementation and validating its output with test PDFs.
- Ensure performance is comparable to or better than Marker's in terms of speed and memory usage.
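One way to check the speed and memory requirement is a small harness that times both parsers on the same PDF. The sketch below is only illustrative: `benchmark`, `is_comparable`, and the 10% tolerance are hypothetical names and values, not part of Omniparse, Marker, or Docling. Note that `tracemalloc` only sees Python-level allocations, so memory held by native libraries or on the GPU would need a separate measurement.

```python
import time
import tracemalloc
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchResult:
    name: str
    seconds: float   # best wall-clock time over all runs
    peak_bytes: int  # highest Python-level peak seen over all runs


def benchmark(name: str, parse: Callable[[str], object],
              pdf_path: str, runs: int = 3) -> BenchResult:
    """Run `parse(pdf_path)` several times and record time and peak memory."""
    best = float("inf")
    peak = 0
    for _ in range(runs):
        tracemalloc.start()
        start = time.perf_counter()
        parse(pdf_path)
        elapsed = time.perf_counter() - start
        _, run_peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        best = min(best, elapsed)
        peak = max(peak, run_peak)
    return BenchResult(name, best, peak)


def is_comparable(candidate: BenchResult, baseline: BenchResult,
                  tolerance: float = 1.10) -> bool:
    """True if the candidate is within `tolerance` of the baseline on both axes."""
    return (candidate.seconds <= baseline.seconds * tolerance
            and candidate.peak_bytes <= baseline.peak_bytes * tolerance)
```

In practice `parse` would be a thin wrapper around the existing Marker call and the new Docling call, each taking a file path.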
🛠️ Tips
- Check out Docling's segment and node extraction tools; they should map closely to Marker's annotation and token-level representations.
- You may need to write a thin compatibility layer to normalize Docling outputs to Marker-style structures.
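A thin compatibility layer along these lines might look like the sketch below. It assumes downstream code consumes a Marker-style `(text, images, metadata)` triple, which should be confirmed against the current Omniparse code. `MarkerStyleOutput` and `to_marker_style` are hypothetical names, and the adapter only relies on the document object exposing `export_to_markdown()`, so it can be exercised without Docling installed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Protocol, Tuple


class MarkdownExportable(Protocol):
    """Anything that can render itself to Markdown (duck-typed)."""
    def export_to_markdown(self) -> str: ...


@dataclass
class MarkerStyleOutput:
    # Assumed Marker-style shape: markdown text, images by name, metadata dict.
    text: str
    images: Dict[str, Any] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)

    def as_tuple(self) -> Tuple[str, Dict[str, Any], Dict[str, Any]]:
        return self.text, self.images, self.metadata


def to_marker_style(document: MarkdownExportable,
                    source_name: str = "") -> MarkerStyleOutput:
    """Normalize a parsed document into the assumed Marker-style structure."""
    # Trim surrounding whitespace and end with exactly one newline.
    markdown = document.export_to_markdown().strip() + "\n"
    metadata = {"source": source_name, "parser": "docling"}
    return MarkerStyleOutput(text=markdown, metadata=metadata)
```

With Docling installed, the `document` attribute of the result returned by `DocumentConverter().convert("sample.pdf")` could be passed as `document`. Image extraction is left out here and would need its own mapping.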
✅ Acceptance Criteria
- Functional parity with Marker: same sections, headers, paragraphs, and tokens.
- Tests green ✅ in CI.
- Colab notebook demo included and reproducible.
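The parity criterion could be checked mechanically by comparing the structure of the two Markdown outputs. The following is a rough sketch, not an existing test in the repository: `structure_of` and `parity_report` are hypothetical helpers, whitespace-split words stand in for "tokens", and blocks are assumed to be separated by blank lines.

```python
import re
from typing import Dict, List


def structure_of(markdown: str) -> Dict[str, List[str]]:
    """Split Markdown into headers, paragraphs, and whitespace tokens."""
    headers: List[str] = []
    paragraphs: List[str] = []
    for block in re.split(r"\n\s*\n", markdown.strip()):
        block = " ".join(block.split())  # collapse internal whitespace
        if not block:
            continue
        match = re.match(r"(#{1,6})\s+(.*)", block)
        if match:
            # Record heading level and text, e.g. "h2:Methods".
            headers.append(f"h{len(match.group(1))}:{match.group(2)}")
        else:
            paragraphs.append(block)
    tokens = [tok for p in paragraphs for tok in p.split()]
    return {"headers": headers, "paragraphs": paragraphs, "tokens": tokens}


def parity_report(marker_md: str, docling_md: str) -> Dict[str, bool]:
    """Report, per category, whether both outputs have identical structure."""
    old, new = structure_of(marker_md), structure_of(docling_md)
    return {key: old[key] == new[key] for key in old}
```

A CI test could then assert that every value in `parity_report(...)` is `True` for each test PDF, or print the differing category to guide the adapter work.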
emocat17