Charset and collation support #192

Currently, the SQLite driver uses `utf8mb4` as the database charset. It saves the table and column charsets and collations to the information schema, but it doesn't verify or consider them in any way.

The general idea is to check which charset and collation were specified and verify their compatibility with Unicode (SQLite). If we find a charset or collation that would lead to incorrect application behavior, we should throw an error.

For collations, we should also try to match them to SQLite-supported collations and apply them correctly. The current driver simply adds `COLLATE NOCASE` to all textual columns, because case-insensitive comparison is the default MySQL behavior, but we need to reflect what was defined in the table/column definition.
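As a rough illustration of the matching step, here is a minimal sketch in Python (the driver itself is not written in Python). The function name `sqlite_collation_for` is hypothetical, and the mapping is an assumption rather than a decided design: it only uses SQLite's built-in `BINARY` and `NOCASE` collations, and `NOCASE` folds ASCII letters only, so it approximates MySQL's `_ci` collations at best.

```python
def sqlite_collation_for(mysql_collation: str) -> str:
    """Pick a built-in SQLite collation for a MySQL collation name.

    Hypothetical helper: the mapping below is an assumption, not the
    driver's actual behavior. Accent sensitivity is not represented.
    """
    name = mysql_collation.lower()

    # Binary and case-sensitive collations compare bytes or exact characters;
    # SQLite's BINARY is the closest built-in match.
    if name == "binary" or name.endswith("_bin") or name.endswith("_cs"):
        return "BINARY"

    # Case-insensitive collations map to NOCASE (ASCII-only case folding).
    if name.endswith("_ci"):
        return "NOCASE"

    # When in doubt, fail loudly instead of guessing.
    raise ValueError(f"No SQLite collation mapping for '{mysql_collation}'")


if __name__ == "__main__":
    for collation in ("utf8mb4_unicode_ci", "utf8mb4_bin", "utf8mb4_0900_as_cs"):
        print(collation, "->", f"COLLATE {sqlite_collation_for(collation)}")
```

Anything the sketch cannot classify raises an error, which follows the "throw when unsure" direction described above.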

---

More specifically, @adamziel suggested:

- Throwing an exception if we see a collation referring to a different character set or comparison rules
- Adding an option like `enforce_utf8_charset` that defaults to `false` and covers all the other cases. When we see a query with a mismatched but "salvageable" encoding or collation definition and the option is `true`, we'd log a warning and use a sensible default encoding. When the option is `false`, we'd throw a fatal error explaining what happened and that there's an option you can use. It could also be called `strict_mode` or similar.
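
The two suggestions could combine into a decision flow like the following Python sketch. Only the option name `enforce_utf8_charset` comes from the proposal; `resolve_mismatch`, `CharsetMismatchError`, and the `salvageable` flag are hypothetical names used for illustration, and how "salvageable" gets decided is left open here.

```python
import logging

logger = logging.getLogger("sqlite_driver_sketch")  # hypothetical logger name

DEFAULT_CHARSET = "utf8mb4"


class CharsetMismatchError(Exception):
    """Raised when a declared charset or collation cannot be honored."""


def resolve_mismatch(declared: str, salvageable: bool,
                     enforce_utf8_charset: bool = False) -> str:
    """Decide what to do with a declaration that does not match utf8mb4."""
    # Incompatible declarations always fail, regardless of the option.
    if not salvageable:
        raise CharsetMismatchError(
            f"'{declared}' is incompatible with UTF-8 storage."
        )

    # Salvageable, but the option is off: fail and point at the option.
    if not enforce_utf8_charset:
        raise CharsetMismatchError(
            f"'{declared}' does not match {DEFAULT_CHARSET}. "
            "Set enforce_utf8_charset to true to rewrite it automatically."
        )

    # Salvageable and the option is on: warn and fall back to the default.
    logger.warning("Rewriting '%s' as '%s'", declared, DEFAULT_CHARSET)
    return DEFAULT_CHARSET
```

With this shape, `resolve_mismatch("utf8mb3", salvageable=True, enforce_utf8_charset=True)` returns `"utf8mb4"`, while the same call with the option left at its default raises an error that names the option.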

What's "salvageable" is quite arbitrary and it's easier to say what isn't.

For example, an incompatible set of collation rules would lead to very different application behavior, and I'd just throw an error right away. The [MySQL collation naming conventions](https://dev.mysql.com/doc/refman/8.0/en/charset-collation-names.html) doc page explains the different parts of the collation suffix:

| Suffix | Meaning |
| -- | -- |
| `_ai` | Accent-insensitive |
| `_as` | Accent-sensitive |
| `_ci` | Case-insensitive |
| `_cs` | Case-sensitive |
| `_ks` | Kana-sensitive |
| `_bin` | Binary |
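
To compare the rules of two collations, the suffixes listed above first have to be extracted from the collation name. The Python sketch below is one way to do that; `collation_suffixes` is a hypothetical helper. It reads only the trailing parts of the name, on the assumption that sensitivity suffixes always come last, so a locale code such as the `cs` in `utf8mb4_cs_0900_ai_ci` is not mistaken for "case-sensitive".

```python
SUFFIX_MEANINGS = {
    "ai": "accent-insensitive",
    "as": "accent-sensitive",
    "ci": "case-insensitive",
    "cs": "case-sensitive",
    "ks": "kana-sensitive",
    "bin": "binary",
}


def collation_suffixes(collation: str) -> list[str]:
    """Return the trailing sensitivity suffixes of a MySQL collation name."""
    parts = collation.lower().split("_")
    found = []
    # Skip parts[0] (the charset) and walk backwards from the end,
    # stopping at the first part that is not a known suffix.
    for part in reversed(parts[1:]):
        if part not in SUFFIX_MEANINGS:
            break
        found.append(part)
    found.reverse()
    return found


if __name__ == "__main__":
    for name in ("utf8mb4_0900_as_cs", "utf8mb4_cs_0900_ai_ci", "latin1_bin"):
        print(name, [SUFFIX_MEANINGS[s] for s in collation_suffixes(name)])
```

Two collations whose suffix lists differ would then count as having different comparison rules.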

When a table declares `latin1`, `utf16`, or anything that isn't a `utf8` variant, we'd likely break the app by quietly using UTF-8.

On the flip side, I think we're good to rewrite `utf8`, `utf8mb3`, and other similar variations as `utf8mb4` (when the option is set). The [Unicode character sets](https://dev.mysql.com/doc/refman/8.4/en/charset-unicode-sets.html) page lists deprecated character sets and recommends using `utf8mb4` instead. That would still change the application behavior, but I can't imagine a plugin that relies on collating only up to 3 bytes of every UTF-8 character and not the fourth byte. Similarly, `utf8mb4_general_ci` is deprecated in favor of `utf8mb4_unicode_ci` and I think we could treat them as the same collation.
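
A minimal Python sketch of that rewrite follows. The helper names and the `UTF8_ALIASES` set are hypothetical, and the set only contains the two names mentioned above. The collation rewrite assumes that a same-named collation exists under `utf8mb4`, which the sketch does not check.

```python
TARGET_CHARSET = "utf8mb4"

# Charsets assumed safe to rewrite; extending this set is a design decision.
UTF8_ALIASES = {"utf8", "utf8mb3"}


def normalize_charset(charset: str) -> str:
    """Rewrite UTF-8 aliases as utf8mb4 and reject everything else."""
    name = charset.lower()
    if name == TARGET_CHARSET or name in UTF8_ALIASES:
        return TARGET_CHARSET
    raise ValueError(f"Charset '{charset}' cannot be stored as UTF-8 safely")


def normalize_collation(collation: str) -> str:
    """Rewrite the charset prefix of a collation name, keeping the rest."""
    charset, sep, rest = collation.lower().partition("_")
    return f"{normalize_charset(charset)}{sep}{rest}"


if __name__ == "__main__":
    print(normalize_charset("utf8"))             # utf8mb4
    print(normalize_collation("utf8mb3_bin"))    # utf8mb4_bin
```

Declarations such as `latin1` or `utf16_general_ci` raise here instead of being rewritten, which keeps the quiet-conversion problem described earlier out of this path.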

The [Unicode character sets](https://dev.mysql.com/doc/refman/8.4/en/charset-unicode-sets.html) page also discusses other variations, such as `_general_` collations, language-specific collations, etc. It's important that we're aware of this general problem space and, when in doubt, default to throwing an error instead of continuing silently.



See also https://github.com/Automattic/sqlite-database-integration/pull/21#discussion_r1980059519.
