French words that contain a single quote get broken down #6

@joshweir

Tokenizer::WhitespaceTokenizer.new.tokenize "et souligne l'interrelation étroite de l'imagerie avec le comportement" 
=> ["et", "souligne", "l", "'", "i", "n", "t", "e", "r", "r", "e", "l", "a", "t", "i", "o", "n", "étroite", "de", "l", "'", "i", "m", "a", "g", "e", "r", "i", "e", "avec", "le", "comportement"]

Looking at tokenizer.rb, this happens because PRE_N_POST = ['"', "'"]: the single quote is treated as a pre/post splitter, so the tokenizer assumes that any characters after it are separate tokens. I'll look at tackling this. The only splittables that look problematic are ' and ., which can appear within a token: the single quote in French words and the period in tokens like email addresses. One approach would be to treat ' or . as a splittable only when it is at the beginning or end of a token, not within it.
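
A rough sketch of that edge-only idea, written as standalone Ruby rather than as a patch to tokenizer.rb. The names EDGE_ONLY, split_edges and edge_tokenize are hypothetical and do not exist in the gem; the sketch also assumes plain whitespace splitting as the first step, which may not match what the gem does internally.

```ruby
# Hypothetical helpers for illustration only, not part of the tokenizer gem.
# Characters that are split off only at the edges of a chunk, never inside it.
EDGE_ONLY = ["'", "."].freeze

# Peel edge-only characters off the start and end of one whitespace-delimited
# chunk, leaving any occurrences inside the chunk untouched.
def split_edges(chunk, splittables = EDGE_ONLY)
  leading = []
  trailing = []

  # Strip splittables from the front, one character at a time.
  while chunk.length > 1 && splittables.include?(chunk[0])
    leading << chunk[0]
    chunk = chunk[1..-1]
  end

  # Strip splittables from the back, preserving their original order.
  while chunk.length > 1 && splittables.include?(chunk[-1])
    trailing.unshift(chunk[-1])
    chunk = chunk[0..-2]
  end

  leading + [chunk] + trailing
end

# Assumption: split on whitespace first, then handle the edges of each chunk.
def edge_tokenize(text)
  text.split.flat_map { |chunk| split_edges(chunk) }
end
```

With this sketch, edge_tokenize("et souligne l'interrelation étroite") keeps "l'interrelation" as a single token, "'hello' world." becomes ["'", "hello", "'", "world", "."], and an email address such as "joe@example.com" is left intact.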
