Commit bacac27
LLM tokenizer pretty decoding fix for emojis/unicode (#1360)
Summary:
Pull Request resolved: #1360
Emojis are not well handled by the current decoding logic (see the example in the test plan). What happens is that emojis/unicode characters are tokenized as two symbols: I believe one indicates extended unicode (or perhaps a class of unicode, e.g., emoji), and the second is the specific character (e.g., smiley face, omega, ...). This fix assumes that such tokens always come in pairs and that the tokenizer returns the replacement symbol "�" when a token cannot be decoded on its own (we verify that "�" was not the intended symbol by running it back through the tokenizer). The logic will break down if a symbol is split across three or more tokens.
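For illustration only (not code from this diff): the "�" behaviour comes from decoding only part of a multi-byte UTF-8 sequence, which a plain Python snippet can reproduce without any tokenizer. The split point below is illustrative; a real tokenizer may cut the bytes differently.

```python
# '😂' is four UTF-8 bytes; a token covering only some of them cannot be
# decoded into a complete character, so a lossy decode yields '�' (U+FFFD).
raw = "😂".encode("utf-8")                                # b'\xf0\x9f\x98\x82'
first_half = raw[:2].decode("utf-8", errors="replace")   # replacement char(s)
second_half = raw[2:].decode("utf-8", errors="replace")  # replacement char(s)
whole = raw.decode("utf-8")                               # '😂'
print(first_half, second_half, whole)
```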
Example:
Input String: 😂
Output Token IDs: list of length 2
Pretty Decoded Tokens: ['😂[1/2]', '😂[2/2]']
Note that we cannot simply output a single token here, since we will be providing attributions for each of the token IDs. For attribution, all such multi-token symbols should really be grouped together so that inputs are valid and the attributions make sense.
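A minimal sketch of the pairing heuristic described above, assuming a HuggingFace-style tokenizer with `decode`/`encode` methods; the function name `pretty_decode` and the grouping details are illustrative, not the code in this diff:

```python
from typing import List


def pretty_decode(tokenizer, token_ids: List[int]) -> List[str]:
    """Illustrative sketch (not the actual diff): label multi-token unicode
    symbols as 'X[i/n]', assuming a HuggingFace-style tokenizer."""
    pretty: List[str] = []
    i = 0
    while i < len(token_ids):
        piece = tokenizer.decode([token_ids[i]])
        # '�' may be a genuine "unknown" token; only treat it as a partial
        # character if re-encoding the piece does not reproduce this id.
        if "\ufffd" in piece and tokenizer.encode(
            piece, add_special_tokens=False
        ) != [token_ids[i]]:
            # Assume the symbol was split into exactly two tokens (as in the
            # example above): decode the pair together and label each position.
            pair = token_ids[i : i + 2]
            symbol = tokenizer.decode(pair)
            pretty.extend(
                f"{symbol}[{k + 1}/{len(pair)}]" for k in range(len(pair))
            )
            i += len(pair)
        else:
            pretty.append(piece)
            i += 1
    return pretty
```

With the example above, this would return `['😂[1/2]', '😂[2/2]']`; as noted, the heuristic breaks down if a symbol spans three or more tokens.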
Reviewed By: cyrjano
Differential Revision: D63435671
fbshipit-source-id: c029ab17b7c7e6ef1a3fff429da2ecb902d425951
parent 15738d0
1 file changed: +26 −1 lines changed
[Diff table not captured: the change removes original line 233 and adds new lines 233–258 (+26 −1); context lines 230–232 and 234–236 are unchanged.]