jamesbraza
Collaborator

This PR cleans up some tech debt in our parsing/chunking schemes:

  • Included all parsing and chunking options in the hash-like Metadata.summary field
  • Removed dead code for ParsingOptions and deprecated chunking_algorithm
  • Added multimodal to the autogenerated index name, with a test
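
To illustrate the first bullet, here is a minimal sketch of what a hash-like summary could look like. It is not the code from this PR: `ParseOptions`, `make_summary`, the field names, and the default values are all hypothetical stand-ins. The idea is only that every option is folded into one deterministic string, so two summaries compare equal exactly when the options match.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ParseOptions:
    """Hypothetical stand-in for parsing settings (not the real paperqa type)."""

    multimodal: bool = True
    dpi: int = 150  # illustrative value only


def make_summary(parser_name: str, opts: ParseOptions) -> str:
    """Fold the parser name and every option into one deterministic string."""
    # Sorting by field name keeps the output stable regardless of field order.
    items = sorted(asdict(opts).items())
    return "|".join([parser_name, *(f"{key}={value}" for key, value in items)])


default_summary = make_summary("pymupdf", ParseOptions())
text_only_summary = make_summary("pymupdf", ParseOptions(multimodal=False))
```

Because the summary is built from all fields, adding a new option to the options object changes the summary without touching `make_summary`.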

@jamesbraza jamesbraza self-assigned this Oct 10, 2025
@jamesbraza jamesbraza added the bug Something isn't working label Oct 10, 2025
@Copilot Copilot AI review requested due to automatic review settings October 10, 2025 23:28
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Oct 10, 2025
dosubot bot commented Oct 10, 2025

Related Documentation

Checked 1 published document(s). No updates required.

Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR refactors the metadata structure for parsing and chunking operations to consolidate configuration options into hash-like summary fields instead of separate type fields.

  • Replaced parse_type and chunk_type fields with comprehensive summary fields that include all relevant options
  • Removed deprecated ParsingOptions enum and marked chunking_algorithm as deprecated
  • Added multimodal parameter to autogenerated index names to ensure unique indexes for different parsing configurations
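
As a rough illustration of the last point, the sketch below derives an index name from a digest of the settings that affect index contents. The function name, the `pqa_index_` prefix, and the chosen inputs are assumptions made for this example, not the project's actual implementation. The point is that once `multimodal` is part of the hashed payload, toggling it yields a different index name.

```python
import hashlib
import json


def make_index_name(paper_directory: str, chunk_size: int, multimodal: bool) -> str:
    """Derive a stable index name from settings that affect index contents.

    Hypothetical helper: the real settings object may hash different fields.
    """
    payload = json.dumps(
        {"dir": paper_directory, "chunk_size": chunk_size, "multimodal": multimodal},
        sort_keys=True,  # stable key order gives a stable digest
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"pqa_index_{digest}"


with_media = make_index_name("papers/", 5000, multimodal=True)
text_only = make_index_name("papers/", 5000, multimodal=False)
```

Without `multimodal` in the payload, both calls would map to the same name, and an index built with one parsing configuration could be silently reused for the other.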

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/test_paperqa.py | Updates test assertions to use new summary field format and adds multimodal parameter test |
| src/paperqa/types.py | Refactors ChunkMetadata and ParsedMetadata to replace type fields with summary fields |
| src/paperqa/settings.py | Removes deprecated ParsingOptions, marks chunking_algorithm as deprecated, adds multimodal to index naming |
| src/paperqa/readers.py | Updates parsing and chunking logic to generate summary strings instead of type identifiers |
| src/paperqa/docs.py | Updates condition to check summary field instead of parse_type |
| packages/paper-qa-pypdf/tests/test_paperqa_pypdf.py | Updates test to check summary field format |
| packages/paper-qa-pypdf/src/paperqa_pypdf/reader.py | Generates summary string with multimodal information |
| packages/paper-qa-pymupdf/tests/test_paperqa_pymupdf.py | Updates test to check summary field format |
| packages/paper-qa-pymupdf/src/paperqa_pymupdf/reader.py | Generates detailed summary string with parsing parameters |
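
For readers unfamiliar with how a setting such as chunking_algorithm can be marked deprecated while remaining accepted, one common pattern is to emit a `DeprecationWarning` when the option is explicitly passed. The sketch below assumes a plain hypothetical `ChunkingSettings` class with made-up defaults. It is not the code in src/paperqa/settings.py, which may use a different mechanism.

```python
import warnings
from typing import Optional


class ChunkingSettings:
    """Hypothetical settings object used only to illustrate deprecation."""

    def __init__(
        self, chunk_size: int = 5000, chunking_algorithm: Optional[str] = None
    ) -> None:
        if chunking_algorithm is not None:
            # Still accept the value, but tell callers it is going away.
            warnings.warn(
                "chunking_algorithm is deprecated and will be removed "
                "in a future release.",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller's line
            )
        self.chunk_size = chunk_size
        self.chunking_algorithm = chunking_algorithm
```

Callers that never pass the deprecated option see no warning, so existing code keeps working while new code is nudged away from the field.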

return self


class ParsingOptions(StrEnum):
Collaborator

The premise behind this has been deprecated, since it's now captured in the input settings.

In the future, if there's some compatibility relationship between parsing and chunking algorithms, we'll need to add this code back in. FYI, that's why it was originally left here.

Collaborator Author

Yeah, I know what you mean. We can restore it in the future when needed.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Oct 14, 2025
@jamesbraza jamesbraza merged commit c4d0323 into main Oct 14, 2025
5 checks passed
@jamesbraza jamesbraza deleted the better-parsing-md branch October 14, 2025 22:06
dosubot bot commented Oct 14, 2025

Documentation Updates

Checked 1 published document(s). No updates required.

Labels

bug Something isn't working lgtm This PR has been approved by a maintainer size:L This PR changes 100-499 lines, ignoring generated files.
