Changing pybind11::str to exclusively hold PyUnicodeObject #2409

rwgk · 2020-08-19T08:32:16Z

Before this PR, pybind11::str can hold PyUnicodeObject or PyBytesObject, which is probably surprising and was never documented. As a side-effect, pybind11::isinstance<str>() is true for both pybind11::str and pybind11::bytes. This PR changes the pybind11::str implementation to be in line with the documented behavior, but provides an escape hatch to go back to the legacy behavior, via the PYBIND11_STR_LEGACY_PERMISSIVE macro. This macro will be removed in future releases.

This PR changes pybind11::str so that it can only hold PyUnicodeObject, and pybind11::isinstance<str>() is true only for pybind11::str, but false for pybind11::bytes. However, for Python 2 only (!), the pybind11::str caster is modified to implicitly decode bytes to PyUnicodeObject. Without this implicit conversion, Python code currently used with Python 2 & 3 would need to be cluttered with six.ensure_text() or similar, only to be un-cluttered later after Python 2 support is dropped.

This PR was exhaustively tested in the Google environment (hundreds of thousands of indirect dependencies). A one-page summary of user code fixes needed is here, along with fixes needed for other PRs. The number of fixes needed in connection with this PR was similar to that for other PRs. Two types of required fixes are expected to be common:

* Accidental use of pybind11::str instead of pybind11::bytes, masked by the legacy permissive behavior. These are probably very easy to fix.
* Reliance on pybind11::isinstance<str>(obj) being true for bytes. This is likely to be easy to fix in most cases by adding || pybind11::isinstance<bytes>(obj), but a fix may be more involved, e.g. if pybind11::isinstance<T> appears in a template (we found one such case in the Google environment).
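The second kind of fix can be pictured with a small pure-Python model. This is only a sketch of the semantics described above, not pybind11 code: `legacy_isinstance_str`, `strict_isinstance_str` and `accepts_text_or_bytes` are hypothetical names standing in for `pybind11::isinstance<str>()` before and after this PR and for user code that needs the fix. Python 3 types are assumed.

```python
def legacy_isinstance_str(obj):
    """Hypothetical model of the legacy permissive check: true for str and bytes."""
    return isinstance(obj, (str, bytes))


def strict_isinstance_str(obj):
    """Hypothetical model of the check after this PR: true for str only."""
    return isinstance(obj, str)


def accepts_text_or_bytes(obj):
    """User code that relied on the legacy check, with the fix applied.

    Mirrors adding `|| pybind11::isinstance<bytes>(obj)` on the C++ side.
    """
    return strict_isinstance_str(obj) or isinstance(obj, bytes)
```

With the explicit `bytes` check added, the user code accepts the same inputs as before, while the narrower `str` check no longer hides an accidental `bytes` argument.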

Backward compatible change preparing for pybind11 update.

Hidden (and luckily inconsequential) bugs discovered while testing with the current pybind11 github master branch, and current pybind/pybind11#2409 applied locally.

The code changed in this CL depends on a pybind11 mis-feature: Current `stable` `pybind11::str` can hold either `PyUnicodeObject` (as documented) or `PyBytesObject` (undocumented and probably very surprising), even under Python 3. pybind PR #2409 changes `pybind11::str` so that it can only hold `PyUnicodeObject`.

PiperOrigin-RevId: 327849650
Change-Id: I2a119479a6af8ab8ec5315a1b8565e96952b84c1

Adding missing `bytes` type to `test_constructors()`, to exercise the code change.

The changes in the PR were cherry-picked from PR pybind#2409 (with a very minor modification in test_pytypes.py related to flake8). Via PR pybind#2409, these changes were extensively tested in the Google environment, as summarized here: https://docs.google.com/document/d/1TPL-J__mph_yHa1quDvsO12E_F5OZnvBaZlW9IIrz8M/ The changes in this PR did not cause any issues at all.

Note that `test_constructors()` before this PR passes for Python 2 only because `pybind11::str` can hold `PyUnicodeObject` or `PyBytesObject`. As a side-effect of this PR, `test_constructors()` no longer relies on this permissive `pybind11::str` behavior. However, the permissive behavior is still exercised/exposed via the existing `test_pybind11_str_raw_str()`. The test code change is designed to enable easy removal later, when Python 2 support is dropped.

For completeness: confusingly, the non-test code changes travelled through PR

Example `ambiguous conversion` error fixed by this PR:

```
pybind11/tests/test_pytypes.cpp:214:23: error: ambiguous conversion for functional-style cast from 'pybind11::detail::item_accessor' (aka 'accessor<accessor_policies::generic_item>') to 'py::bytes'
            "bytes"_a=py::bytes(d["bytes"]),
                      ^~~~~~~~~~~~~~~~~~~~
pybind11/include/pybind11/detail/../pytypes.h:957:21: note: candidate constructor
    PYBIND11_OBJECT(bytes, object, PYBIND11_BYTES_CHECK)
                    ^
pybind11/include/pybind11/detail/../pytypes.h:957:21: note: candidate constructor
pybind11/include/pybind11/detail/../pytypes.h:987:15: note: candidate constructor
inline bytes::bytes(const pybind11::str &s) {
              ^
1 error generated.
```

… py::error_already_set if not.

Similar to pybind#2392, but does not depend on pybind#2409. Splitting out this PR from pybind#2409 to make that PR as simple as possible.

Net effects of this PR:

* Adds missing test coverage.
* Changes TypeError to UnicodeDecodeError for Python 2.

This PR has two commits. Please do not squash, to make the behavior change obvious in the commit history.

wjakob · 2020-11-16T10:14:17Z

Those changes look good to me. Putting string-specific code into the py::object caster is a no-go in principle, but I see that it is masked via a condition that can be tested at compile time, and it's for Python 2.7 only (i.e. will eventually be removed). Can you document this new flag in the documentation (e.g. in the porting guide?)

rwgk · 2020-11-16T20:55:36Z

> Can you document this new flag in the documentation (e.g. in the porting guide?)

Done. Adding to docs/upgrade.rst is the only change I made. Thanks Wenzel!

YannickJadoul

A few minor details, but this looks good to me! As we've discussed in plenty of other issues/PRs before, I would argue that most or almost all use cases that will be affected can be considered bugs.
pybind11 is already pretty clear about the names str and bytes having "Python 3 semantics", I would say.

The one thing is this special case for Python 2 (and the changes necessary to pyobject_caster, as pointed out by @wjakob). The good news is Python 2 already does something similar here:

Python 2.7.17 (default, Sep 30 2020, 13:38:04) 
[GCC 7.5.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'abc'
'abc'
>>> unicode('abc')
u'abc'
>>> print('\xc3\xb1'.decode('utf-8'))
ñ
>>> '\xc3\xb1'
'\xc3\xb1'
>>> unicode('\xc3\xb1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>>

So the current implementation matches behavior, except that it's using codec "ascii" instead of "utf-8". We could match this, but I'm also fine keeping UTF-8, since pybind11 is again very clear on following UTF-8 everywhere.
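The codec difference can be checked with a short sketch. It assumes Python 3, where `bytes.decode()` plays the role that `unicode()` and `str.decode()` play in the Python 2 session above; the variable names are illustrative only.

```python
raw = b"\xc3\xb1"  # UTF-8 encoding of U+00F1, the character printed in the session above

# UTF-8 is the codec named in this PR's caster change.
decoded_utf8 = raw.decode("utf-8")

# ASCII is the default codec of Python 2's unicode(); it rejects byte 0xc3.
try:
    raw.decode("ascii")
    ascii_succeeded = True
except UnicodeDecodeError:
    ascii_succeeded = False

# For pure-ASCII input the two codecs give the same result.
same_for_ascii_input = b"abc".decode("ascii") == b"abc".decode("utf-8")
```

So the two choices only differ for non-ASCII bytes: ASCII raises `UnicodeDecodeError`, while UTF-8 decodes them when they form valid UTF-8.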

I also want to quickly remind ourselves of why this detail::PyUnicode_Check_Permissive was implemented in the first place. I kind of remember we already dug back in history, at some point, but I forgot the conclusion. I'll see if I can find it and add a cross-reference from this PR.

YannickJadoul · 2020-11-16T21:53:02Z

docs/changelog.rst

+v2.7.0 (WIP)
+------------------------------
+
+* ``py::str`` changed to exclusively hold `PyUnicodeObject`. Previously
+  ``py::str`` could also hold `bytes`, which is probably surprising, was
+  never documented, and can mask bugs (e.g. accidental use of ``py::str``
+  instead of ``py::bytes``).
+  `#2409 <https://github.com/pybind/pybind11/pull/2409>`_
+
+


Checking with @henryiii: I thought we collected the changelog entries in the PR (indicated in the PR template), to avoid conflicts?

Also, what's the plan for future releases? (i.e., 2.6.2 is still indicated as "TBA")
Also (very minor), this says "WIP", while v2.6.2 says "(TBA, not yet release)". We should probably be consistent? Again, @henryiii, did you have a plan, here?

(Just realizing: maybe this is still a remnant from a long, long time ago, before templates and the new changelog system?)

YannickJadoul · 2020-11-16T21:57:25Z

docs/upgrade.rst

@@ -10,6 +10,31 @@ modernization and other useful information.

 .. _upgrade-guide-2.6:

+v2.7


Same here: how do we want to approach this, w.r.t. merging the current version into master? How confusing will it be if master already contains parts of a v2.7 upgrade guide?

Not really related to this PR though; @rwgk is just a bit unlucky that this might be the first PR for v2.7 with bigger changes that require notes in the upgrade guide.

YannickJadoul · 2020-11-16T21:57:55Z

docs/upgrade.rst

+``py::bytes``. Starting with v2.7, ``py::str`` exclusively holds
+``PyUnicodeObject`` (`#2409 <https://github.com/pybind/pybind11/pull/2409>`_),
+and ``py::isinstance<str>()`` is ``true`` only for ``py::str``. To help in
+the transition of client code, the ``PYBIND11_STR_LEGACY_PERMISSIVE`` macro


Let's document that users should nót rely on this, and that we're planning/aiming to get rid of this in v2.8. I know this is part of the "transition of client code", but let's make it explicit.

Also, another minor detail, now that I read this again: this is user-facing documentation, so "client code" is unnecessary, perhaps?

Oh, never mind, it is actually 2 lines further down; I guess I was just expecting/scanning for "2.8"...

YannickJadoul · 2020-11-16T22:01:41Z

docs/upgrade.rst

+and ``py::isinstance<str>()`` is ``true`` only for ``py::str``. To help in
+the transition of client code, the ``PYBIND11_STR_LEGACY_PERMISSIVE`` macro
+is provided as an escape hatch to go back to the legacy behavior. This macro
+will be removed in future releases. Two types of required client-code fixes


Again, all code is "client-code" in this context, to me.

YannickJadoul · 2020-11-16T22:02:09Z

docs/upgrade.rst

+are expected to be common:
+
+* Accidental use of ``py::str`` instead of ``py::bytes``, masked by the legacy
+  behavior. These are probably very easy to fix, by changing from


You might mention this is probably a bug?

YannickJadoul · 2020-11-16T22:05:02Z

include/pybind11/cast.h

+#if PY_MAJOR_VERSION < 3 && !defined(PYBIND11_STR_LEGACY_PERMISSIVE)
+        // For Python 2, without this implicit conversion, Python code would
+        // need to be cluttered with six.ensure_text() or similar, only to be
+        // un-cluttered later after Python 2 support is dropped.
+        if (std::is_same<T, str>::value && isinstance<bytes>(src)) {
+            PyObject *str_from_bytes = PyUnicode_FromEncodedObject(src.ptr(), "utf-8", nullptr);
+            if (!str_from_bytes) throw error_already_set();
+            value = reinterpret_steal<type>(str_from_bytes);
+            return true;
+        }
+#endif


Alternatively, if we don't want this here (cfr. @wjakob's remark), we could specialize pyobject_caster<str> or type_caster<str>?
But maybe that's too much for code that will disappear in a hopefully not all too distant future.
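As a reading aid, the control flow of the added branch can be modelled in a few lines of Python. This is a hedged sketch, not pybind11 code: `load_str` is a hypothetical stand-in for the caster's `load()` when `T` is `str`, it returns a `(success, value)` pair instead of setting `value`, and it uses Python 3 types to stand in for the Python 2 `bytes`/`unicode` pair.

```python
def load_str(src):
    """Hypothetical model of the str caster's load() with the new branch."""
    if isinstance(src, bytes):
        # Mirrors PyUnicode_FromEncodedObject(src.ptr(), "utf-8", nullptr):
        # strict UTF-8 decoding; a failure propagates as UnicodeDecodeError,
        # like the `throw error_already_set()` in the diff.
        return True, src.decode("utf-8")
    if isinstance(src, str):
        # Already text: accepted unchanged.
        return True, src
    # Anything else is rejected by the caster.
    return False, None
```

Note that invalid UTF-8 raises rather than returning failure, which matches the `TypeError` to `UnicodeDecodeError` change mentioned in the PR description.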

YannickJadoul · 2020-11-16T22:12:18Z

include/pybind11/stl.h

@@ -144,7 +144,7 @@ template <typename Type, typename Value> struct list_caster {
    using value_conv = make_caster<Value>;

    bool load(handle src, bool convert) {
-        if (!isinstance<sequence>(src) || isinstance<str>(src))
+        if (!isinstance<sequence>(src) || isinstance<bytes>(src) || isinstance<str>(src))


I think we actually don't want this change. Doing so (or well, not doing so) would fix #1807.

I'm OK with making and discussing this change (or undoing your change) in a separate PR, though.

See also #2198.
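The reason both checks appear at all is that `str` and `bytes` are themselves sequences. A rough Python model of the condition in the diff is sketched below; `list_caster_accepts` is a hypothetical name, and `collections.abc.Sequence` is only an approximation of pybind11's `isinstance<sequence>`.

```python
from collections.abc import Sequence


def list_caster_accepts(src):
    """Hypothetical model of the check at the top of list_caster::load()."""
    # Both str and bytes are sequences, so they must be excluded explicitly
    # if they should not be converted element by element.
    if not isinstance(src, Sequence) or isinstance(src, (bytes, str)):
        return False
    return True


# Iterating bytes yields small integers, which is what a conversion from
# bytes to a vector of 8-bit integers (the case in #1807) would see.
elements_of_bytes = list(b"ab")
```

Dropping the `bytes` exclusion would let a `bytes` object through this check, which is the behavior discussed in #1807 and #2198.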

YannickJadoul · 2020-11-16T23:08:45Z

> I'll see if I can find it and add a cross-reference from this PR.

OK, this is the best there is to be found, it seems: 5612a0c
There's not a lot I can distill from that, except that pybind11 has changed a lot since then, and that the original author from back then (@wjakob) said he's OK with the current changes, so ... not too worried about this :-)

YannickJadoul · 2020-11-16T23:23:25Z

Diving further with @bstaletic into this commit 5612a0c, there's a piece of dead code, now:

pybind11/include/pybind11/pytypes.h

Lines 960 to 972 in 17c22b9

    
    operator std::string() const {
        object temp = *this;
        if (PyUnicode_Check(m_ptr)) {
            temp = reinterpret_steal<object>(PyUnicode_AsUTF8String(m_ptr));
            if (!temp)
                pybind11_fail("Unable to extract string contents! (encoding issue)");
        }
        char *buffer;
        ssize_t length;
        if (PYBIND11_BYTES_AS_STRING_AND_SIZE(temp.ptr(), &buffer, &length))
            pybind11_fail("Unable to extract string contents! (invalid type)");
        return std::string(buffer, (size_t) length);
    }

This was introduced in 5612a0c, but the second part is now dead code. Is this something to check with @wjakob, and maybe fix in a follow-up PR?

rwgk · 2020-12-01T04:41:35Z

Thanks @YannickJadoul, sorry for the late reply. Just a quick note: I'll wait until after the 2.6.2 release before putting on a few finishing touches here (e.g. I'll try get rid of the dead code you pointed out).

YannickJadoul · 2020-12-01T18:01:27Z

> Thanks @YannickJadoul, sorry for the late reply. Just a quick note: I'll wait until after the 2.6.2 release before putting on a few finishing touches here (e.g. I'll try get rid of the dead code you pointed out).

No worries. Sounds good to me. (Though, is there a 2.6.2 planned; I'm not sure we currently have a lot of bugfixes waiting for release?)
Should we go ahead and merge #2198 as a way to fix #1807 already in 2.6.2?

rwgk · 2020-12-01T18:07:27Z

> Should we go ahead and merge #2198 as a way to fix #1807 already in 2.6.2?

We had a lot of back and forth about this particular change (me, you, @bstaletic taking different positions), which is why I'm preserving current behavior in this PR.

I'll comment on #2198 to log what I think is the best change.

YannickJadoul · 2020-12-01T18:15:46Z

Well, yeah, I also noted it here: #2409 (comment)

I'm not sure I see anything wrong with the change in #2198. I just thought it had stalled because we figured out there was a much bigger issue.

rwgk · 2021-01-29T15:47:14Z

@YannickJadoul, I changed "client code" to "user code" in the first mention, where I want to be unambiguous about what "transition" refers to, and deleted the second mention of "client code" a couple of lines down. You're right, the context is clear enough there.

A follow-up PR for the dead code you discovered would be nice. I'm leaving the code in the PR as it was for the past ~2.5 months. It has been in use internally for the entire time (and an earlier version even since August 2020).

rwgk marked this pull request as draft August 19, 2020 08:34

rwgk force-pushed the pybind11_next branch from 283302f to 24d8021 Compare August 19, 2020 21:12

rwgk mentioned this pull request Aug 28, 2020

Fixing pybind11::bytes() ambiguous conversion issue. #2442

Merged

rwgk force-pushed the pybind11_next branch from 24d8021 to 2ac4dbb Compare August 28, 2020 20:50

rwgk mentioned this pull request Sep 8, 2020

Adds check if str(handle) correctly converted the object, and throw py::error_already_set if not. #2473

Closed

rwgk force-pushed the pybind11_next branch from 2ac4dbb to 6f270a3 Compare September 10, 2020 02:36

rwgk changed the title ~~Meta PR for Google Patches~~ Changing pybind11::str to exclusively hold PyUnicodeObject Sep 10, 2020

rwgk force-pushed the pybind11_next branch 2 times, most recently from bd1d77b to 21376dd Compare September 16, 2020 23:02

rwgk force-pushed the pybind11_next branch from 5720383 to 101b888 Compare September 27, 2020 06:00

rwgk force-pushed the pybind11_next branch 2 times, most recently from 0a4d35e to 1369cc2 Compare October 18, 2020 07:18

rwgk force-pushed the pybind11_next branch from 1369cc2 to 356c693 Compare October 29, 2020 15:03

rwgk force-pushed the pybind11_next branch 4 times, most recently from 8398ef9 to 9258f93 Compare November 15, 2020 16:53

rwgk marked this pull request as ready for review November 15, 2020 17:22

rwgk requested review from YannickJadoul, henryiii and wjakob November 15, 2020 17:22

henryiii added this to the v2.7 milestone Nov 15, 2020

rwgk force-pushed the pybind11_next branch from 9258f93 to 203492a Compare November 16, 2020 20:48

YannickJadoul approved these changes Nov 16, 2020

rwgk mentioned this pull request Dec 1, 2020

Fixes #1807: 2.3.0 regression: <class 'bytes'> -> std::vector<uint8_t> #2198

Open

rwgk force-pushed the pybind11_next branch from 203492a to 947db47 Compare December 3, 2020 23:23

rwgk force-pushed the pybind11_next branch from 947db47 to 2039372 Compare December 28, 2020 23:13

rwgk mentioned this pull request Dec 29, 2020

[BUG] TSAN error logs #2754

Closed

rwgk mentioned this pull request Jan 13, 2021

Plug leaking function_records in cpp_function initialization in case of exceptions (found by Valgrind in #2746) #2756

Merged

rwgk added 3 commits January 28, 2021 10:29

Changing pybind11::str to exclusively hold PyUnicodeObject

2db6cf9

Reducing from two to only one macro: PYBIND11_STR_LEGACY_PERMISSIVE

37234f2

Minor updates to address comments by @YannickJadoul from Dec 1, 2020.

51894fa

rwgk force-pushed the pybind11_next branch from 2039372 to 51894fa Compare January 29, 2021 15:36

rwgk merged commit 0432ae7 into pybind:master Jan 29, 2021

rwgk deleted the pybind11_next branch January 29, 2021 22:10

rwgk mentioned this pull request Feb 10, 2023

FWD pybind11 google/pybind11clif#2409

Closed
