
Landscape tables result in jumbled text when using extractor.start_document_analysis with TextractFeatures.TABLES #420

@gertct

❌ Tested on both v1.8.5 and v1.9.0; both fail.

Example Document:
Page 8 & 9 of this document (07432326.pdf) have tables in landscape

Expected:

Broxtowe Borough Council Foster Avenue, Beeston, Nottingham, NG9 1AB

Actual:

Council Borough Broxtowe 1AB NG9 Nottingham, Beeston, Avenue, Foster
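The actual output is not a full reversal of the expected string. It looks as if the words within each line of the cell are reversed while the lines themselves stay in order. A minimal sketch to check this, assuming the cell holds two lines split after "Council" (the line break is an assumption; the issue does not show it) and using a hypothetical helper `reverse_words_per_line`:

```python
def reverse_words_per_line(lines):
    """Reverse the word order inside each line, keeping the line order."""
    return " ".join(" ".join(reversed(line.split())) for line in lines)


# Assumed line split of the table cell (not stated in the issue).
cell_lines = [
    "Broxtowe Borough Council",
    "Foster Avenue, Beeston, Nottingham, NG9 1AB",
]

# The "Actual" string reported above.
actual = "Council Borough Broxtowe 1AB NG9 Nottingham, Beeston, Avenue, Foster"

print(reverse_words_per_line(cell_lines) == actual)
```

If the comparison holds, it would suggest that words are being sorted along the wrong direction for rotated text rather than the lines being misordered.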

Note

This issue does not exist on portrait tables

Full extraction:
07432326_ocr.txt

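from textractor import Textractor
from textractor.data.constants import TextractFeatures

extractor = Textractor()

# Asynchronous analysis with only table detection enabled
document = extractor.start_document_analysis(
    file_source=xxxx,
    save_image=False,
    features=[TextractFeatures.TABLES],
    s3_upload_path=xxxx,
)

# This snippet runs inside a function that returns the raw response
return document.response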
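To narrow down whether the word order is already wrong in the raw Textract response or only after parsing, it may help to print the LINE blocks per page from `document.response`. The sketch below assumes the response is a dict with a top-level `Blocks` list, as in the Textract JSON format; `lines_by_page` and `sample_response` are hypothetical names used for illustration, and the sample data is made up to mirror the strings in this issue.

```python
def lines_by_page(response):
    """Group the text of LINE blocks by page number, in response order."""
    pages = {}
    for block in response.get("Blocks", []):
        if block.get("BlockType") == "LINE":
            # "Page" can be absent for single-page responses; default to 1.
            pages.setdefault(block.get("Page", 1), []).append(block.get("Text", ""))
    return pages


# Hand-written stand-in for document.response (illustrative only).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Page": 8},
        {"BlockType": "LINE", "Page": 8, "Text": "Broxtowe Borough Council"},
        {"BlockType": "WORD", "Page": 8, "Text": "Broxtowe"},
        {"BlockType": "LINE", "Page": 8,
         "Text": "Foster Avenue, Beeston, Nottingham, NG9 1AB"},
    ]
}

for page, lines in lines_by_page(sample_response).items():
    print(page, lines)
```

If the LINE text on pages 8 and 9 reads correctly here, the reversal is more likely introduced when the table cells are assembled than by the OCR itself.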

Important

✅ This used to work on v1.4.5 - here's the same document on that version

Example extraction:
v1.4.5.txt

Expected:

Broxtowe Borough Council Foster Avenue, Beeston, Nottingham, NG9 1AB

Actual (v1.4.5):

Broxtowe Borough Council Foster Avenue, Beeston, Nottingham, NG9 1AB

Here's a diff between textractor.py v1.4.5 (left) and v1.9.0 (right): https://www.diffchecker.com/4JsE2FLv/

Labels: bug
