Migrate PDF Processing Core from Marker to Docling #109

@adithya-s-k

Title: Migrate PDF Processing Core from Marker to Docling

Description:

We currently use Marker as the core PDF parsing module in our system. This issue aims to **replace Marker with Docling** as the main PDF parsing engine.

The updated implementation must retain the same functionality and output structure as the current Marker-based version to ensure backward compatibility with downstream processing components in Omniparse.

🧪 Requirements

  • Replace Marker with Docling in the PDF parsing core.
  • Ensure the output format is identical to what Marker currently produces (or provide a compatibility adapter).
  • All existing test cases for Marker must pass with Docling.
  • Provide a Google Colab notebook demonstrating the updated implementation and validating its output with test PDFs.
  • Ensure performance is comparable to or better than Marker's in terms of speed and memory usage.
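
One way to check the speed and memory requirement is a small harness that times each parser on the same PDF. The sketch below is only an illustration: `benchmark` and its `parse` argument are hypothetical names, and `parse` stands for whatever callable wraps the Marker or Docling entry point. Note that `tracemalloc` only sees Python-level allocations, so memory used by native libraries or on the GPU would need a separate measurement.

```python
import time
import tracemalloc
from typing import Callable, Dict


def benchmark(parse: Callable[[str], object], pdf_path: str, repeats: int = 3) -> Dict[str, float]:
    """Time a parser callable and record its peak Python-level memory.

    `parse` is a hypothetical wrapper around either parser's entry point.
    """
    timings = []
    peak_bytes = 0
    for _ in range(repeats):
        tracemalloc.start()
        start = time.perf_counter()
        parse(pdf_path)
        timings.append(time.perf_counter() - start)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        peak_bytes = max(peak_bytes, peak)
    return {"best_seconds": min(timings), "peak_mib": peak_bytes / 2**20}
```

Running the same harness over both parsers with an identical set of test PDFs gives directly comparable numbers; taking the best of several runs reduces noise from model loading and caching.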

🛠️ Tips

  • Check out Docling's segment and node extraction tools; they should map closely to Marker's annotation and token-level representations.
  • You may need to write a thin compatibility layer to normalize Docling outputs to Marker-style structures.
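
A thin compatibility layer along these lines might look like the sketch below. It assumes the downstream code expects a Marker-style triple of text, images, and metadata; that shape, the `MarkerStyleResult` class, and the `adapt_document` and `parse_pdf_with_docling` helpers are illustrative assumptions, not the project's actual interface. The Docling calls used (`DocumentConverter().convert(...)` and `document.export_to_markdown()`) come from Docling's documented quickstart.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Protocol, Tuple


class MarkdownExportable(Protocol):
    """Anything that can export itself as Markdown (e.g. a Docling document)."""

    def export_to_markdown(self) -> str: ...


@dataclass
class MarkerStyleResult:
    """Hypothetical Marker-style result: text, images, metadata."""

    text: str
    images: Dict[str, Any] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)

    def as_tuple(self) -> Tuple[str, Dict[str, Any], Dict[str, Any]]:
        return self.text, self.images, self.metadata


def adapt_document(doc: MarkdownExportable, source: str = "") -> MarkerStyleResult:
    """Normalize a Docling-like document into the assumed Marker-style shape."""
    text = doc.export_to_markdown()
    # Metadata keys here are placeholders; mirror whatever Marker emitted.
    metadata = {"source": source, "parser": "docling"}
    return MarkerStyleResult(text=text, metadata=metadata)


def parse_pdf_with_docling(path: str) -> MarkerStyleResult:
    """Convert a PDF with Docling and adapt the result."""
    # Imported lazily so the adapter can be tested without Docling installed.
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(path)
    return adapt_document(result.document, source=path)
```

Keeping the adapter separate from the Docling call means existing tests can exercise the normalization logic with a stub document, and image extraction can be added later without changing callers.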

✅ Acceptance Criteria

  • Functionality parity with Marker: same sections, headers, paragraphs, tokens.
  • Tests green ✅ in CI.
  • Colab notebook demo included and reproducible.
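
As a rough illustration of how parity on headers and paragraphs could be checked, the sketch below compares two Markdown outputs structurally. It assumes both parsers can emit Markdown; the helper names (`outline`, `paragraphs`, `parity_report`) are hypothetical, and token-level parity is not covered here.

```python
import re
from typing import Dict, List, Tuple

# ATX-style Markdown heading: one to six '#' characters, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")


def outline(markdown: str) -> List[Tuple[int, str]]:
    """Return (level, title) for every heading line, in document order."""
    items = []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            items.append((len(match.group(1)), match.group(2)))
    return items


def paragraphs(markdown: str) -> List[str]:
    """Return non-heading blocks with internal whitespace collapsed."""
    blocks = re.split(r"\n\s*\n", markdown.strip())
    return [
        " ".join(block.split())
        for block in blocks
        if block.strip() and not block.lstrip().startswith("#")
    ]


def parity_report(marker_md: str, docling_md: str) -> Dict[str, bool]:
    """Compare heading outlines and normalized paragraphs of two outputs."""
    return {
        "headings_match": outline(marker_md) == outline(docling_md),
        "paragraphs_match": paragraphs(marker_md) == paragraphs(docling_md),
    }
```

Exact equality is a strict bar; if the two parsers differ only in whitespace or minor punctuation, a similarity threshold may be a more practical acceptance check.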
