Get rid of -DOCSTART- tokens in CoNLL output #80

@frreiss

Description

conll_2003_to_dataframes() currently passes through the special -DOCSTART- token when importing files in the CoNLL-2003 format. The import code should instead drop this token and the sentence boundary that follows it, so that neither appears in the reconstructed document.

Major subtasks

  • Modify conll_2003_to_dataframes() so that it drops the -DOCSTART- token and the blank line after it when importing a data set in CoNLL-2003 format.
  • Modify conll_2003_output_to_dataframes() so that it also drops the first two lines of each document (the -DOCSTART- line and the blank line after it) when importing model outputs.
  • Update examples and tutorials to reflect this change. Where needed, subtract 11 from the offsets of any spans we computed with the previous version of conll_2003_to_dataframes()
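
To make the intended behavior concrete, here is a minimal sketch in plain Python. It is not the library's implementation: `drop_docstart` and `shift_span` are hypothetical helper names, and the sketch assumes that a document marker is any line whose first whitespace-delimited field is -DOCSTART-. The default offset of 11 comes from the subtask above. Reading it as the 10 characters of the token plus one separator character is an assumption.

```python
from typing import Iterable, List, Tuple

DOCSTART = "-DOCSTART-"


def drop_docstart(lines: Iterable[str]) -> List[str]:
    """Remove each -DOCSTART- line and the blank line directly after it.

    Hypothetical helper for illustration only. Other blank lines (ordinary
    sentence boundaries) are kept.
    """
    kept: List[str] = []
    skip_next_blank = False
    for line in lines:
        fields = line.split()
        if fields and fields[0] == DOCSTART:
            # Drop the marker line and remember to drop the blank line after it.
            skip_next_blank = True
            continue
        if skip_next_blank and not fields:
            # This is the sentence boundary that follows the marker.
            skip_next_blank = False
            continue
        skip_next_blank = False
        kept.append(line)
    return kept


def shift_span(begin: int, end: int, delta: int = 11) -> Tuple[int, int]:
    """Move a (begin, end) character span left by `delta` characters.

    Hypothetical helper for spans computed against text that still
    contained the leading -DOCSTART- token.
    """
    if begin < delta:
        raise ValueError("span overlaps the removed -DOCSTART- prefix")
    return begin - delta, end - delta


sample = [
    "-DOCSTART- -X- -X- O",
    "",
    "EU NNP B-NP B-ORG",
    "rejects VBZ B-VP O",
    "",
    "Peter NNP B-NP B-PER",
]
cleaned = drop_docstart(sample)
# cleaned == ["EU NNP B-NP B-ORG", "rejects VBZ B-VP O", "", "Peter NNP B-NP B-PER"]

new_span = shift_span(11, 13)
# new_span == (0, 2)
```

The blank line between "rejects" and "Peter" survives because only the blank line immediately after a marker is dropped. `shift_span` raises an error rather than returning a negative offset, since a span that begins inside the removed prefix has no valid position in the new text.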

Metadata

Labels: enhancement (New feature or request)