Table cell word order is incorrect in multi-line cells #136

@Hoho5000

Using this library to extract table data can lead to incorrect cell text when multiple lines of text are involved.

For example, the following table headers:

 ___________      _______________
| RETURN OF |    |     TAX       |
|  CAPITAL  |    | (WITHHOLDING) |
 ‾‾‾‾‾‾‾‾‾‾‾     |    REFUND     |
                  ‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾

becomes "RETURN CAPITAL OF" and "(WITHHOLDING) TAX REFUND" when creating the table cell object.

This is caused by the sorting in table_cell.py (line 177).

If I'm not mistaken, wherever this is called internally, the word IDs are already sorted in the response.

Removing the sorting and setting the words directly fixes the issue for me.
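
Here is a minimal sketch of what I believe is happening. It assumes the sort key is the sum of each word's bounding box x and y, which I have not confirmed is exactly what line 177 does. The `Word` class and the coordinates are made up for illustration and are not the library's actual objects:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    # Hypothetical stand-in for the library's word object.
    text: str
    x: float  # left edge of the bounding box (normalized)
    y: float  # top edge of the bounding box (normalized)


# Words in the order they arrive in the response.
# Coordinates are invented to resemble the centered "RETURN OF / CAPITAL" header.
words_in_response_order: List[Word] = [
    Word("RETURN", 0.10, 0.10),
    Word("OF", 0.20, 0.10),
    Word("CAPITAL", 0.12, 0.12),
]

# Assumed sort key: x + y. "CAPITAL" on the second line has a smaller
# sum than "OF" on the first line, so it jumps ahead of it.
sorted_words = sorted(words_in_response_order, key=lambda w: w.x + w.y)

sorted_text = " ".join(w.text for w in sorted_words)
unsorted_text = " ".join(w.text for w in words_in_response_order)

print(sorted_text)    # order after the bounding-box sort
print(unsorted_text)  # order as received in the response
```

With these made-up coordinates, the sorted text reproduces the scrambled header, while keeping the response order gives the expected text.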

Is there a particular reason that words added to the table cell need to be "sorted" based on bounding boxes, or can that sorting be removed?
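
If sorting does need to stay for some callers, one possible alternative would be a line-aware ordering: group words into lines by their y coordinate first, then order each line by x. This is only a sketch under my own assumptions. The `(text, x, y)` tuples, the `line_aware_order` helper, and the `y_tolerance` value are all hypothetical and not part of the library:

```python
from typing import List, Tuple

# Hypothetical word representation: (text, x, y), where x and y are the
# top-left corner of the bounding box in normalized coordinates.
WordTuple = Tuple[str, float, float]


def line_aware_order(words: List[WordTuple], y_tolerance: float = 0.01) -> List[WordTuple]:
    """Order words top-to-bottom by line, then left-to-right within a line."""
    # Sort by y first so words on the same line end up adjacent.
    by_y = sorted(words, key=lambda w: (w[2], w[1]))

    lines: List[List[WordTuple]] = []
    for word in by_y:
        # Start a new line when the word sits noticeably lower than the
        # first word of the current line.
        if lines and abs(word[2] - lines[-1][0][2]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])

    # Within each line, order by x.
    return [w for line in lines for w in sorted(line, key=lambda w: w[1])]


# Invented coordinates resembling the centered three-line header.
header = [
    ("(WITHHOLDING)", 0.40, 0.12),
    ("REFUND", 0.43, 0.14),
    ("TAX", 0.45, 0.10),
]
ordered_text = " ".join(text for text, _, _ in line_aware_order(header))
print(ordered_text)
```

The tolerance would need tuning for real documents (skewed scans, mixed font sizes), so simply keeping the response order may still be the safer fix.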

Labels: bug (Something isn't working)
