Commit ed05bf6

miss-islington, erlend-aasland, AlexWaygood, CAM-Gerlach, and CorvinM authored
[3.12] gh-108590: Improve sqlite3 docs on encoding issues and how to handle those (GH-108699) (#111324)
Add a guide for how to handle non-UTF-8 text encodings.
Link to that guide from the 'text_factory' docs.

(cherry picked from commit 1262e41)

Co-authored-by: Erlend E. Aasland
Co-authored-by: Alex Waygood
Co-authored-by: C.A.M. Gerlach
Co-authored-by: Corvin
Co-authored-by: Ezio Melotti
Co-authored-by: Serhiy Storchaka
1 parent 3d67b69 commit ed05bf6

1 file changed

+50
-33
lines changed

Doc/library/sqlite3.rst

@@ -1123,6 +1123,10 @@ Connection objects
 f.write('%s\n' % line)
 con.close()
 
+.. seealso::
+
+:ref:`sqlite3-howto-encoding`
+
 
 .. method:: backup(target, *, pages=-1, progress=None, name="main", sleep=0.250)
 
@@ -1189,6 +1193,10 @@ Connection objects
 
 .. versionadded:: 3.7
 
+.. seealso::
+
+:ref:`sqlite3-howto-encoding`
+
 .. method:: getlimit(category, /)
 
 Get a connection runtime limit.
@@ -1410,39 +1418,8 @@ Connection objects
 and returns a text representation of it.
 The callable is invoked for SQLite values with the ``TEXT`` data type.
 By default, this attribute is set to :class:`str`.
-If you want to return ``bytes`` instead, set *text_factory* to ``bytes``.
 
-Example:
-
-.. testcode::
-
-con = sqlite3.connect(":memory:")
-cur = con.cursor()
-
-AUSTRIA = "Österreich"
-
-# by default, rows are returned as str
-cur.execute("SELECT ?", (AUSTRIA,))
-row = cur.fetchone()
-assert row[0] == AUSTRIA
-
-# but we can make sqlite3 always return bytestrings ...
-con.text_factory = bytes
-cur.execute("SELECT ?", (AUSTRIA,))
-row = cur.fetchone()
-assert type(row[0]) is bytes
-# the bytestrings will be encoded in UTF-8, unless you stored garbage in the
-# database ...
-assert row[0] == AUSTRIA.encode("utf-8")
-
-# we can also implement a custom text_factory ...
-# here we implement one that appends "foo" to all strings
-con.text_factory = lambda x: x.decode("utf-8") + "foo"
-cur.execute("SELECT ?", ("bar",))
-row = cur.fetchone()
-assert row[0] == "barfoo"
-
-con.close()
+See :ref:`sqlite3-howto-encoding` for more details.
 
 .. attribute:: total_changes
 
@@ -1601,7 +1578,6 @@ Cursor objects
 COMMIT;
 """)
 
-
 .. method:: fetchone()
 
 If :attr:`~Cursor.row_factory` is ``None``,
@@ -2580,6 +2556,47 @@ With some adjustments, the above recipe can be adapted to use a
 instead of a :class:`~collections.namedtuple`.
 
 
+.. _sqlite3-howto-encoding:
+
+How to handle non-UTF-8 text encodings
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+By default, :mod:`!sqlite3` uses :class:`str` to adapt SQLite values
+with the ``TEXT`` data type.
+This works well for UTF-8 encoded text, but it might fail for other encodings
+and invalid UTF-8.
+You can use a custom :attr:`~Connection.text_factory` to handle such cases.
+
+Because of SQLite's `flexible typing`_, it is not uncommon to encounter table
+columns with the ``TEXT`` data type containing non-UTF-8 encodings,
+or even arbitrary data.
+To demonstrate, let's assume we have a database with ISO-8859-2 (Latin-2)
+encoded text, for example a table of Czech-English dictionary entries.
+Assuming we now have a :class:`Connection` instance :py:data:`!con`
+connected to this database,
+we can decode the Latin-2 encoded text using this :attr:`~Connection.text_factory`:
+
+.. testcode::
+
+con.text_factory = lambda data: str(data, encoding="latin2")
+
+For invalid UTF-8 or arbitrary data stored in ``TEXT`` table columns,
+you can use the following technique, borrowed from the :ref:`unicode-howto`:
+
+.. testcode::
+
+con.text_factory = lambda data: str(data, errors="surrogateescape")
+
+.. note::
+
+The :mod:`!sqlite3` module API does not support strings
+containing surrogates.
+
+.. seealso::
+
+:ref:`unicode-howto`
+
+
 .. _sqlite3-explanation:
 
 Explanation
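
The Latin-2 ``text_factory`` from the how-to added in this commit can be tried out end to end with a small script. This is only a sketch and not part of the commit: the in-memory database, the ``dictionary`` table, and the sample word are assumptions made for illustration, and ``CAST(? AS TEXT)`` is used here merely as one way to get non-UTF-8 bytes into a ``TEXT`` value.

```python
import sqlite3

# Hypothetical stand-in for the Czech-English dictionary database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dictionary(czech TEXT, english TEXT)")

# Simulate legacy data: casting a BLOB to TEXT keeps the bytes unchanged,
# so the column ends up holding Latin-2 bytes typed as TEXT.
czech_word = "žlutý"
con.execute(
    "INSERT INTO dictionary VALUES (CAST(? AS TEXT), ?)",
    (czech_word.encode("latin2"), "yellow"),
)

# Decode every TEXT value as Latin-2 instead of the default UTF-8.
con.text_factory = lambda data: str(data, encoding="latin2")

row = con.execute("SELECT czech, english FROM dictionary").fetchone()
print(row)
con.close()
```

The factory applies to every ``TEXT`` value read through the connection, so it suits databases that use one legacy encoding throughout.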

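
The ``surrogateescape`` variant can be sketched the same way. The byte string below is made-up sample data, and the round trip back to ``bytes`` is shown because, as the note in the how-to says, the module API does not accept strings containing surrogates; re-encoding first is one possible way to work with such values, not something the commit prescribes.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Made-up sample: bytes that are not valid UTF-8.
raw = b"caf\xe9 \xff\xfe"

# Undecodable bytes become lone surrogates instead of raising an error.
con.text_factory = lambda data: str(data, errors="surrogateescape")

# CAST(? AS TEXT) keeps the bytes as-is but types the value as TEXT.
(value,) = con.execute("SELECT CAST(? AS TEXT)", (raw,)).fetchone()

# The original bytes can be recovered losslessly with the same error handler.
recovered = value.encode("utf-8", errors="surrogateescape")
con.close()
```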