Changing pybind11::str to exclusively hold PyUnicodeObject #2409
Backward compatible change preparing for pybind11 update.

Hidden (and luckily inconsequential) bugs discovered while testing with the current pybind11 github master branch, and current pybind/pybind11#2409 applied locally.

The code changed in this CL depends on a pybind11 mis-feature: Current `stable` `pybind11::str` can hold either `PyUnicodeObject` (as documented) or `PyBytesObject` (undocumented and probably very surprising), even under Python 3. pybind PR #2409 changes `pybind11::str` so that it can only hold `PyUnicodeObject`.

PiperOrigin-RevId: 327849650
Change-Id: I2a119479a6af8ab8ec5315a1b8565e96952b84c1
Adding missing `bytes` type to `test_constructors()`, to exercise the code change.

The changes in the PR were cherry-picked from PR pybind#2409 (with a very minor modification in test_pytypes.py related to flake8). Via PR pybind#2409, these changes were extensively tested in the Google environment, as summarized here: https://docs.google.com/document/d/1TPL-J__mph_yHa1quDvsO12E_F5OZnvBaZlW9IIrz8M/ The changes in this PR did not cause any issues at all.

Note that `test_constructors()` before this PR passes for Python 2 only because `pybind11::str` can hold `PyUnicodeObject` or `PyBytesObject`. As a side-effect of this PR, `test_constructors()` no longer relies on this permissive `pybind11::str` behavior. However, the permissive behavior is still exercised/exposed via the existing `test_pybind11_str_raw_str()`. The test code change is designed to enable easy removal later, when Python 2 support is dropped.

For completeness: confusingly, the non-test code changes travelled through PR

Example `ambiguous conversion` error fixed by this PR:

```
pybind11/tests/test_pytypes.cpp:214:23: error: ambiguous conversion for functional-style cast from 'pybind11::detail::item_accessor' (aka 'accessor<accessor_policies::generic_item>') to 'py::bytes'
            "bytes"_a=py::bytes(d["bytes"]),
                      ^~~~~~~~~~~~~~~~~~~~
pybind11/include/pybind11/detail/../pytypes.h:957:21: note: candidate constructor
    PYBIND11_OBJECT(bytes, object, PYBIND11_BYTES_CHECK)
                    ^
pybind11/include/pybind11/detail/../pytypes.h:957:21: note: candidate constructor
pybind11/include/pybind11/detail/../pytypes.h:987:15: note: candidate constructor
inline bytes::bytes(const pybind11::str &s) {
              ^
1 error generated.
```
… py::error_already_set if not.

Similar to pybind#2392, but does not depend on pybind#2409. Splitting out this PR from pybind#2409 to make that PR as simple as possible.

Net effects of this PR:

* Adds missing test coverage.
* Changes TypeError to UnicodeDecodeError for Python 2.

This PR has two commits. Please do not squash, to make the behavior change obvious in the commit history.
bd1d77b to 21376dd
0a4d35e to 1369cc2
8398ef9 to 9258f93
Those changes look good to me. Putting string-specific code into the |
Done. Adding to |
A few minor details, but this looks good to me! As we've discussed in plenty of other issues/PRs before, I would argue that most or almost all use cases that will be affected can be considered bugs.
pybind11 is already pretty clear about the names `str` and `bytes` having "Python 3 semantics", I would say.
The one thing is this special case for Python 2 (and the changes necessary to `pyobject_caster`, as pointed out by @wjakob). The good news is Python 2 already does something similar here:
Python 2.7.17 (default, Sep 30 2020, 13:38:04)
[GCC 7.5.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'abc'
'abc'
>>> unicode('abc')
u'abc'
>>> print('\xc3\xb1'.decode('utf-8'))
ñ
>>> '\xc3\xb1'
'\xc3\xb1'
>>> unicode('\xc3\xb1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>>
So the current implementation matches Python 2's behavior, except that Python 2 uses the "ascii" codec where this PR uses "utf-8". We could match this, but I'm also fine keeping UTF-8, since pybind11 is again very clear on following UTF-8 everywhere.
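To make the codec difference concrete, here is a minimal Python 3 sketch (not pybind11 code) that decodes the same two bytes from the session above with both codecs. The variable names are only for illustration.

```python
# Python 3 sketch: decode the same bytes with the two codecs under discussion.
raw = b"\xc3\xb1"  # UTF-8 encoding of the single character U+00F1

# UTF-8 decodes the two bytes into one character.
decoded_utf8 = raw.decode("utf-8")

# ASCII rejects any byte >= 0x80, which is the failure seen in the
# Python 2 unicode() call above.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(len(decoded_utf8))
print(type(ascii_error).__name__)
```

Either choice raises `UnicodeDecodeError` on undecodable input; the codecs differ only in which byte sequences count as decodable.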
I also want to quickly remind ourselves of why this `detail::PyUnicode_Check_Permissive` was implemented in the first place. I kind of remember we already dug back in history, at some point, but I forgot the conclusion. I'll see if I can find it and add a cross-reference from this PR.
docs/changelog.rst
Outdated
v2.7.0 (WIP)
------------------------------

* ``py::str`` changed to exclusively hold `PyUnicodeObject`. Previously
  ``py::str`` could also hold `bytes`, which is probably surprising, was
  never documented, and can mask bugs (e.g. accidental use of ``py::str``
  instead of ``py::bytes``).
  `#2409 <https://github.com/pybind/pybind11/pull/2409>`_

Checking with @henryiii: I thought we collected the changelog entries in the PR (indicated in the PR template), to avoid conflicts?
Also, what's the plan for future releases? (i.e., 2.6.2 is still indicated as "TBA")
Also (very minor), this says "WIP", while v2.6.2 says "(TBA, not yet release)". We should probably be consistent? Again, @henryiii, did you have a plan, here?
(Just realizing: maybe this is still a remainder from long long time ago, before templates and the new changelog system?)
@@ -10,6 +10,31 @@ modernization and other useful information.

.. _upgrade-guide-2.6:

v2.7
Same here: how do we want to approach this, w.r.t. merging the current version into `master`? How confusing will it be if `master` already contains parts of a v2.7 upgrade guide?
Not really related to this PR though; @rwgk is just a bit unlucky that this might be the first PR for v2.7 with bigger changes that require notes in the upgrade guide.
docs/upgrade.rst
Outdated
``py::bytes``. Starting with v2.7, ``py::str`` exclusively holds
``PyUnicodeObject`` (`#2409 <https://github.com/pybind/pybind11/pull/2409>`_),
and ``py::isinstance<str>()`` is ``true`` only for ``py::str``. To help in
the transition of client code, the ``PYBIND11_STR_LEGACY_PERMISSIVE`` macro
Let's document that users should nót rely on this, and that we're planning/aiming to get rid of this in v2.8. I know this is part of the "transition of client code", but let's make it explicit.
Also, another minor detail, now that I read this again: this is user-facing documentation, so "client code" is unnecessary, perhaps?
Oh, nevermind, it is actually 2 lines further down; I guess I was just expecting/scanning for "2.8"...
docs/upgrade.rst
Outdated
and ``py::isinstance<str>()`` is ``true`` only for ``py::str``. To help in
the transition of client code, the ``PYBIND11_STR_LEGACY_PERMISSIVE`` macro
is provided as an escape hatch to go back to the legacy behavior. This macro
will be removed in future releases. Two types of required client-code fixes
Again, all code is "client-code" in this context, to me.
are expected to be common:

* Accidental use of ``py::str`` instead of ``py::bytes``, masked by the legacy
  behavior. These are probably very easy to fix, by changing from
You might mention this is probably a bug?
#if PY_MAJOR_VERSION < 3 && !defined(PYBIND11_STR_LEGACY_PERMISSIVE)
        // For Python 2, without this implicit conversion, Python code would
        // need to be cluttered with six.ensure_text() or similar, only to be
        // un-cluttered later after Python 2 support is dropped.
        if (std::is_same<T, str>::value && isinstance<bytes>(src)) {
            PyObject *str_from_bytes = PyUnicode_FromEncodedObject(src.ptr(), "utf-8", nullptr);
            if (!str_from_bytes) throw error_already_set();
            value = reinterpret_steal<type>(str_from_bytes);
            return true;
        }
#endif
Alternatively, if we don't want this here (cfr. @wjakob's remark), we could specialize `pyobject_caster<str>` or `type_caster<str>`?
But maybe that's too much for code that will disappear in a hopefully not all too distant future.
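For readers less familiar with the caster internals, the Python 2 branch under review can be approximated in pure Python. This is only a sketch under assumptions: `load_as_text` is a hypothetical helper, not a pybind11 function, and the final `TypeError` stands in for the caster simply failing to load.

```python
def load_as_text(src):
    """Hypothetical pure-Python analogue of the Python 2 branch in the diff."""
    if isinstance(src, bytes):
        # Analogue of PyUnicode_FromEncodedObject(..., "utf-8", nullptr):
        # invalid UTF-8 raises UnicodeDecodeError instead of passing through.
        return src.decode("utf-8")
    if isinstance(src, str):
        # Already text: accepted unchanged.
        return src
    # Assumption: any other type is rejected, as a failed caster load would be.
    raise TypeError("expected str or bytes, got " + type(src).__name__)


print(load_as_text(b"abc"))
print(load_as_text("abc"))
```

The sketch also shows why the error type changes for undecodable input: the failure now comes from the decode step rather than from a type check.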
@@ -144,7 +144,7 @@ template <typename Type, typename Value> struct list_caster {
     using value_conv = make_caster<Value>;

     bool load(handle src, bool convert) {
-        if (!isinstance<sequence>(src) || isinstance<str>(src))
+        if (!isinstance<sequence>(src) || isinstance<bytes>(src) || isinstance<str>(src))
I think we actually don't want this change. Doing so (or well, not doing so) would fix #1807.
I'm OK with making and discussing this change (or undoing your change) in a separate PR, though.
See also #2198.
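As background for the `list_caster` guard discussed in this thread, a rough Python 3 analogue is sketched below. `accepts_as_list` is a hypothetical name, and `collections.abc.Sequence` only approximates what `isinstance<sequence>` checks. Once `isinstance<str>` stops matching `bytes`, the extra `bytes` test is what keeps `bytes` objects rejected, i.e. it preserves the behavior from before the PR.

```python
from collections.abc import Sequence


def accepts_as_list(src):
    """Hypothetical analogue of the guard at the top of list_caster::load."""
    # str and bytes are both sequences in Python 3, so each must be excluded
    # explicitly if neither should be converted element by element.
    if not isinstance(src, Sequence) or isinstance(src, (bytes, str)):
        return False
    return True


print(accepts_as_list([1, 2, 3]))
print(accepts_as_list(b"abc"))
```

Dropping `bytes` from the exclusion tuple would make `accepts_as_list(b"abc")` return `True`, which is the alternative behavior being debated here.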
OK, this is the best there is to be found, it seems: 5612a0c
Diving further with @bstaletic into this commit 5612a0c, there's a piece of dead code, now: pybind11/include/pybind11/pytypes.h Lines 960 to 972 in 17c22b9
This was introduced in 5612a0c, but the second part is now dead code. Is this something to check with @wjakob, and maybe fix in a follow-up PR?
Thanks @YannickJadoul, sorry for the late reply. Just a quick note: I'll wait until after the 2.6.2 release before putting on a few finishing touches here (e.g. I'll try to get rid of the dead code you pointed out).
No worries. Sounds good to me. (Though, is there a 2.6.2 planned; I'm not sure we currently have a lot of bugfixes waiting for release?)
We had a lot of back and forth about this particular change (me, you, @bstaletic taking different positions), which is why I'm preserving current behavior in this PR. I'll comment on #2198 to log what I think is the best change.
Well, yeah, I also noted it here: #2409 (comment) I'm not sure I see anything wrong with the change in #2198. I just thought it had stalled because we figured out there was a much bigger issue.
@YannickJadoul, I changed "client code" to "user code" in the first mention, where I want to be unambiguous about what "transition" refers to, and deleted the second mention of "client code" a couple of lines down. You're right, the context is clear enough there. A follow-up PR for the dead code you discovered would be nice. I'm leaving the code in the PR as it was for the past ~2.5 months. It has been in use internally for the entire time (and an earlier version even since August 2020).
Before this PR, `pybind11::str` can hold `PyUnicodeObject` or `PyBytesObject`, which is probably surprising and was never documented. As a side-effect, `pybind11::isinstance<str>()` is `true` for both `pybind11::str` and `pybind11::bytes`. This PR changes the `pybind11::str` implementation to be in line with the documented behavior, but provides an escape hatch to go back to the legacy behavior, via the `PYBIND11_STR_LEGACY_PERMISSIVE` macro. This macro will be removed in future releases.

This PR changes `pybind11::str` so that it can only hold `PyUnicodeObject`, and `pybind11::isinstance<str>()` is true only for `pybind11::str`, but false for `pybind11::bytes`. However, for Python 2 only (!), the `pybind11::str` caster is modified to implicitly decode `bytes` to `PyUnicodeObject`. Without this implicit conversion, Python code currently used with Python 2 & 3 would need to be cluttered with `six.ensure_text()` or similar, only to be un-cluttered later after Python 2 support is dropped.

This PR was exhaustively tested in the Google environment (hundreds of thousands of indirect dependencies). A one-page summary of user code fixes needed is here, along with fixes needed for other PRs. The number of fixes needed in connection with this PR was similar to that for other PRs. Two types of required fixes are expected to be common:

* Accidental use of `pybind11::str` instead of `pybind11::bytes`, masked by the legacy permissive behavior. These are probably very easy to fix.
* Reliance on `pybind11::isinstance<str>(obj)` being `true` for `bytes`. This is likely to be easy to fix in most cases by adding `|| pybind11::isinstance<bytes>(obj)`, but a fix may be more involved, e.g. if `pybind11::isinstance<T>` appears in a `template` (we found one such case in the Google environment).