Open
Description
Tokenizer::WhitespaceTokenizer.new.tokenize "et souligne l'interrelation étroite de l'imagerie avec le comportement" 
=> ["et", "souligne", "l", "'", "i", "n", "t", "e", "r", "r", "e", "l", "a", "t", "i", "o", "n", "étroite", "de", "l", "'", "i", "m", "a", "g", "e", "r", "i", "e", "avec", "le", "comportement"]
Looking at tokenizer.rb, this happens because of PRE_N_POST = ['"', "'"]: the single quote is treated as a pre/post splitter, so the tokenizer assumes that each character after it is a separate token. I'll look at tackling this. The only splittables that look problematic are ' and ., both of which can appear within a token: the single quote in French words, and the period in tokens such as email addresses. One approach would be to treat ' or . as a splittable only when it is at the beginning or end of a token, not within it.
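To make the proposal concrete, here is a minimal sketch of the "edges only" idea. It is not the code in tokenizer.rb: the names EDGE_ONLY, split_edges and tokenize_edges_only are made up for illustration, and it assumes tokens are first separated on whitespace.

```ruby
# Hypothetical sketch, not the gem's implementation.
# EDGE_ONLY lists characters that are split off only when they sit
# at the start or end of a whitespace-delimited token.
EDGE_ONLY = ["'", '"', '.'].freeze

# Peel EDGE_ONLY characters off both ends of a token and leave any
# occurrence inside the token untouched.
def split_edges(token)
  leading = []
  trailing = []
  core = token.dup

  # Strip splittables from the front, one character at a time.
  while !core.empty? && EDGE_ONLY.include?(core[0])
    leading << core[0]
    core = core[1..-1]
  end

  # Strip splittables from the back, one character at a time.
  while !core.empty? && EDGE_ONLY.include?(core[-1])
    trailing.unshift(core[-1])
    core = core[0...-1]
  end

  result = leading
  result << core unless core.empty?
  result + trailing
end

# Split on whitespace first, then apply the edge rule to each piece.
def tokenize_edges_only(text)
  text.split(/\s+/).reject(&:empty?).flat_map { |t| split_edges(t) }
end
```

With this rule, "l'interrelation" and "user@example.com" each stay a single token, while a quoted word such as 'imagerie' still has its quotes split off. Whether an elided French article like l' should instead become a token of its own is a separate design question that this sketch does not address.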