Description
Currently, the SQLite driver uses `utf8mb4` as the database charset, and it saves the table and column charsets and collations to the information schema, but it doesn't verify or consider them in any way.
The general idea is to check which charset and collation were specified and verify their compatibility with Unicode (SQLite). If we find a charset or collation that would lead to incorrect application behavior, we should throw an error.
For collations, we should also try to match them to SQLite-supported collations and apply them correctly. The current driver simply adds `COLLATE NOCASE` to all textual columns, because it's the default MySQL behavior, but we need to reflect what was defined in the table/column definition.
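As a rough illustration of what matching collations could look like, the sketch below maps a MySQL collation name to one of SQLite's built-in collations (`BINARY`, `NOCASE`) and rejects everything else. It is written in Python against the standard `sqlite3` module purely for illustration; `sqlite_collation_for` and `matches` are hypothetical names, not part of the driver, and the mapping itself is an assumption. Note that SQLite's built-in `NOCASE` only folds ASCII letters, so it approximates MySQL's `_ci` collations rather than reproducing them.

```python
import sqlite3


def sqlite_collation_for(mysql_collation: str) -> str:
    """Hypothetical mapping from a MySQL collation to a built-in SQLite one."""
    name = mysql_collation.lower()
    if name.endswith("_bin"):
        return "BINARY"
    if name.endswith("_ci"):
        return "NOCASE"  # approximation: folds ASCII letters only
    # When in doubt, fail loudly instead of silently picking a collation.
    raise ValueError(f"No SQLite equivalent for collation: {mysql_collation}")


def matches(mysql_collation: str, stored: str, probe: str) -> bool:
    """Store one value under the mapped collation and test equality against probe."""
    collation = sqlite_collation_for(mysql_collation)
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(f"CREATE TABLE t (name TEXT COLLATE {collation})")
        conn.execute("INSERT INTO t (name) VALUES (?)", (stored,))
        row = conn.execute(
            "SELECT COUNT(*) FROM t WHERE name = ?", (probe,)
        ).fetchone()
        return row[0] == 1
    finally:
        conn.close()


if __name__ == "__main__":
    print(matches("utf8mb4_unicode_ci", "Hello", "hello"))
    print(matches("utf8mb4_bin", "Hello", "hello"))
```

With this mapping, a column declared with a `_bin` collation compares case-sensitively, while a `_ci` column keeps the current `NOCASE` behavior.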
More particularly, @adamziel suggested:
- Throwing an exception if we see a collation referring to a different character set or comparison rules
- Adding an option like `enforce_utf8_charset` that defaults to `false` for all the other cases. When we see a query with a mismatched but "salvageable" encoding or collation definition and the option is `true`, we'd log a warning and use a sensible default encoding. When the option is `false`, we'd throw a fatal error explaining what happened and that there's an option you can use. It could also be called `strict_mode` or so.
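The control flow of the two suggestions above can be sketched as follows. This is a Python illustration only, not the driver's code: `resolve_charset`, `CharsetError`, and the `SALVAGEABLE` set are hypothetical, and treating exactly `utf8` and `utf8mb3` as salvageable is an assumption made for the example.

```python
import logging

logger = logging.getLogger("sqlite_driver_sketch")

# Assumption for this sketch: only these legacy names are "salvageable".
SALVAGEABLE = {"utf8", "utf8mb3"}
TARGET_CHARSET = "utf8mb4"


class CharsetError(Exception):
    """Raised when a declared charset cannot be used safely."""


def resolve_charset(declared: str, enforce_utf8_charset: bool = False) -> str:
    name = declared.lower()
    if name == TARGET_CHARSET:
        return TARGET_CHARSET
    if name not in SALVAGEABLE:
        # Incompatible encodings are rejected regardless of the option.
        raise CharsetError(f"Charset '{declared}' is not compatible with UTF-8.")
    if enforce_utf8_charset:
        logger.warning("Charset '%s' treated as '%s'.", declared, TARGET_CHARSET)
        return TARGET_CHARSET
    raise CharsetError(
        f"Charset '{declared}' differs from '{TARGET_CHARSET}'. "
        "Set enforce_utf8_charset to true to convert it automatically."
    )
```

For example, `resolve_charset("utf8mb3", enforce_utf8_charset=True)` logs a warning and returns `utf8mb4`, while the same call with the default option raises an error that names the option.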
What's "salvageable" is quite arbitrary and it's easier to say what isn't.
For example, an incompatible set of collation rules would lead to very different application behavior, and I'd just throw an error right away. The MySQL collation documentation explains the different parts of the collation suffix:
Suffix | Meaning
-- | --
_ai | Accent-insensitive
_as | Accent-sensitive
_ci | Case-insensitive
_cs | Case-sensitive
_ks | Kana-sensitive
_bin | Binary
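To act on these suffixes, the driver would first need to extract them from a collation name. A minimal Python sketch follows; the function name `trailing_attributes` is hypothetical. It reads suffix tokens from the end of the name only, on the assumption that earlier tokens may be language codes that collide with a suffix (for instance, `cs` is also the language code in `utf8mb4_cs_0900_ai_ci`).

```python
SUFFIX_MEANINGS = {
    "ai": "accent-insensitive",
    "as": "accent-sensitive",
    "ci": "case-insensitive",
    "cs": "case-sensitive",
    "ks": "kana-sensitive",
    "bin": "binary",
}


def trailing_attributes(collation: str) -> list[str]:
    """Return the sensitivity suffixes found at the end of a collation name."""
    tokens = collation.lower().split("_")
    found = []
    # Skip the first token (the charset) and walk backwards until a
    # token that is not a known suffix appears.
    for token in reversed(tokens[1:]):
        if token not in SUFFIX_MEANINGS:
            break
        found.append(token)
    found.reverse()
    return found


if __name__ == "__main__":
    for name in ("utf8mb4_bin", "utf8mb4_general_ci", "utf8mb4_cs_0900_ai_ci"):
        attrs = trailing_attributes(name)
        print(name, [SUFFIX_MEANINGS[a] for a in attrs])
```

A collation whose trailing attributes cannot be honored by SQLite would then be a candidate for an immediate error.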
When a table declares `latin1`, `utf16`, or anything that isn't `utf8`, we'd likely break the app by quietly using UTF-8.
On the flip side, I think we're good to rewrite `utf8`, `utf8mb3`, and other similar variations as `utf8mb4` (when the option is set). The Unicode character sets page lists deprecated character sets and recommends using `utf8mb4` instead. That would still change the application behavior, but I can't imagine a plugin that relies on collating only the first 3 bytes of every UTF-8 character and not the fourth byte. Similarly, `utf8mb4_general_ci` is deprecated in favor of `utf8mb4_unicode_ci`, and I think we could treat them as the same collation.
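The rewriting described here could look roughly like the Python sketch below. It is an illustration under stated assumptions: `normalize_collation` is a hypothetical helper, the list of legacy prefixes is limited to the two names mentioned above, and folding `utf8mb4_general_ci` into `utf8mb4_unicode_ci` simply encodes the suggestion in this issue rather than any documented equivalence.

```python
LEGACY_PREFIXES = ("utf8mb3_", "utf8_")


def normalize_collation(collation: str) -> str:
    """Rewrite legacy UTF-8 collation names to their utf8mb4 form."""
    name = collation.lower()
    for prefix in LEGACY_PREFIXES:
        if name.startswith(prefix):
            name = "utf8mb4_" + name[len(prefix):]
            break
    if not name.startswith("utf8mb4_"):
        # Anything outside the UTF-8 family is not salvageable.
        raise ValueError(f"Unsupported collation: {collation}")
    # Assumption from this issue: treat general_ci as unicode_ci.
    if name == "utf8mb4_general_ci":
        name = "utf8mb4_unicode_ci"
    return name
```

For instance, `normalize_collation("utf8_general_ci")` yields `utf8mb4_unicode_ci`, while `normalize_collation("latin1_swedish_ci")` raises an error.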
The Unicode character sets page also discusses other variations, such as `general_`, language-specific character sets, etc. It's important that we're aware of this general problem space and, when in doubt, default to throwing an error instead of continuing silently.
See also Automattic/sqlite-database-integration#21 (comment).