[textractprettyprinter] List contents are duplicated when generating text output using get_text_from_layout_json #391

@adityachandak287

Current Behavior

While trying to create markdown or text files from AWS Textract JSON output using the get_text_from_layout_json function, the contents of ALL the list items are duplicated in the output.

Expected Behavior

Each list item's contents should be included in the output only once.

Related Issues

Possible Solution

The AWS docs on Textract Layout Response Objects mention that in the case of LAYOUT_LIST elements, their children can point to LAYOUT_TEXT elements, which is the case here.

"Layout elements can also point to different objects, such as TABLE objects, Key-Value pairs, or LAYOUT_TEXT elements in the case of LAYOUT_LIST."

Because of this, when all layouts are collected from the Textract JSON output (LinearizeLayout._get_layout_blocks), both the LAYOUT_LIST element and its child LAYOUT_TEXT elements are included, so the text of each list item is emitted twice.

The get_text_from_layout_json function is a wrapper over the LinearizeLayout.get_text function, which loops over all layouts (blocks with a LAYOUT_* type) from the Textract JSON output and collects the text contents from their child blocks.
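To make the duplication concrete, here is a toy model of that loop. It is only a sketch: the blocks are hand-written dicts shaped like a Textract response, and `naive_layout_text`, `collect_lines` and `child_ids` are hypothetical helpers, not the library's actual code.

```python
# Hand-written sample blocks (hypothetical data, shaped like Textract output).
blocks = [
    {"Id": "list-1", "BlockType": "LAYOUT_LIST",
     "Relationships": [{"Type": "CHILD", "Ids": ["text-1", "text-2"]}]},
    {"Id": "text-1", "BlockType": "LAYOUT_TEXT",
     "Relationships": [{"Type": "CHILD", "Ids": ["line-1"]}]},
    {"Id": "text-2", "BlockType": "LAYOUT_TEXT",
     "Relationships": [{"Type": "CHILD", "Ids": ["line-2"]}]},
    {"Id": "line-1", "BlockType": "LINE", "Text": "- first item"},
    {"Id": "line-2", "BlockType": "LINE", "Text": "- second item"},
]
by_id = {b["Id"]: b for b in blocks}


def child_ids(block):
    """Return the Ids listed in a block's CHILD relationships."""
    return [
        cid
        for rel in block.get("Relationships") or []
        if rel.get("Type") == "CHILD"
        for cid in rel.get("Ids", [])
    ]


def collect_lines(block):
    """Recursively gather the text of every LINE beneath a block."""
    if block["BlockType"] == "LINE":
        return [block["Text"]]
    lines = []
    for cid in child_ids(block):
        lines.extend(collect_lines(by_id[cid]))
    return lines


def naive_layout_text(all_blocks):
    """Visit every LAYOUT_* block, as the issue describes, and collect its text."""
    out = []
    for block in all_blocks:
        if block["BlockType"].startswith("LAYOUT_"):
            out.extend(collect_lines(block))
    return out


naive_output = naive_layout_text(blocks)
```

In this model each list item is reached once through the LAYOUT_LIST block and once more through its own LAYOUT_TEXT block, which matches the behavior reported above.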

The fix lies in the LinearizeLayout._get_layout_blocks function, where we can exclude the LAYOUT_TEXT elements that are children of LAYOUT_LIST elements.
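A minimal sketch of that filtering step, assuming the layout blocks are plain dicts as they appear in the Textract JSON (`Id`, `BlockType`, `Relationships`). The helper name `exclude_list_children` is hypothetical and this is not the library's actual implementation; it only illustrates the idea.

```python
def exclude_list_children(layout_blocks):
    """Drop LAYOUT_TEXT blocks that are children of a LAYOUT_LIST block."""
    # First pass: collect the Ids referenced as children of any LAYOUT_LIST.
    list_child_ids = set()
    for block in layout_blocks:
        if block.get("BlockType") != "LAYOUT_LIST":
            continue
        # "Relationships" can be absent or null, hence the `or []`.
        for rel in block.get("Relationships") or []:
            if rel.get("Type") == "CHILD":
                list_child_ids.update(rel.get("Ids", []))

    # Second pass: keep everything except LAYOUT_TEXT blocks owned by a list.
    return [
        block
        for block in layout_blocks
        if not (
            block.get("BlockType") == "LAYOUT_TEXT"
            and block.get("Id") in list_child_ids
        )
    ]


# Hand-written sample layouts (hypothetical data, not real Textract output).
sample_layouts = [
    {"Id": "title-1", "BlockType": "LAYOUT_TITLE", "Relationships": None},
    {"Id": "list-1", "BlockType": "LAYOUT_LIST",
     "Relationships": [{"Type": "CHILD", "Ids": ["text-1", "text-2"]}]},
    {"Id": "text-1", "BlockType": "LAYOUT_TEXT"},
    {"Id": "text-2", "BlockType": "LAYOUT_TEXT"},
    {"Id": "text-3", "BlockType": "LAYOUT_TEXT"},  # standalone paragraph
]

filtered_ids = [b["Id"] for b in exclude_list_children(sample_layouts)]
```

The standalone LAYOUT_TEXT block is kept because no LAYOUT_LIST references it, so only the list-owned text blocks are removed and each list item is then reached only through its parent list.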

Steps to Reproduce

Minimal reproduction repo: adityachandak287/textractprettyprinter-list-duplication-bug-repro

The repository contains the following for reference:

Environment
amazon-textract-caller==0.2.4
amazon-textract-prettyprinter==0.1.10
amazon-textract-response-parser==0.1.48
boto3==1.35.6
botocore==1.35.6

Edit: Added related issues section.