Skip to content

Changing pybind11::str to exclusively hold PyUnicodeObject #2409

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jan 29, 2021

Conversation

rwgk
Copy link
Collaborator

@rwgk rwgk commented Aug 19, 2020

Before this PR, pybind11::str can hold PyUnicodeObject or PyBytesObject, which is probably surprising and was never documented. As a side-effect, pybind11::isinstance<str>() is true for both pybind11::str and pybind11::bytes. This PR changes the pybind11::str implementation to be in line with the documented behavior, but provides an escape hatch to go back to the legacy behavior, via the PYBIND11_STR_LEGACY_PERMISSIVE macro. This macro will be removed in future releases.

This PR changes pybind11::str so that it can only hold PyUnicodeObject, and pybind11::isinstance<str>() is true only for pybind11::str, but false for pybind11::bytes. However, for Python 2 only (!), the pybind11::str caster is modified to implicitly decode bytes to PyUnicodeObject. Without this implicit conversion, Python code currently used with Python 2 & 3 would need to be cluttered with six.ensure_text() or similar, only to be un-cluttered later after Python 2 support is dropped.

This PR was exhaustively tested in the Google environment (hundreds of thousands of indirect dependencies). A one-page summary of user code fixes needed is here, along with fixes needed for other PRs. The number of fixes needed in connection with this PR was similar to that for other PRs. Two types of required fixes are expected to be common:

  • Accidental use of pybind11::str instead of pybind11::bytes, masked by the legacy permissive behavior. These are probably very easy to fix.

  • Reliance on pybind11::isinstance<str>(obj) being true for bytes. This is likely to be easy to fix in most cases by adding || pybind11::isinstance<bytes>(obj), but a fix may be more involved, e.g. if pybind11::isinstance<T> appears in a template (we found one such case in the Google environment).

@rwgk rwgk marked this pull request as draft August 19, 2020 08:34
tensorflow-copybara pushed a commit to tensorflow/tensorflow that referenced this pull request Aug 21, 2020
Backward compatible change preparing for pybind11 update.

Hidden (and luckily inconsequential) bugs discovered while testing
with the current pybind11 github master branch, and current
pybind/pybind11#2409 applied locally.

The code changed in this CL depends on a pybind11 mis-feature: Current `stable`
`pybind11::str` can hold either `PyUnicodeObject` (as documented) or
`PyBytesObject` (undocumented and probably very surprising), even under
Python 3. pybind PR #2409 changes `pybind11::str` so that it can only hold
`PyUnicodeObject`.

PiperOrigin-RevId: 327849650
Change-Id: I2a119479a6af8ab8ec5315a1b8565e96952b84c1
rwgk added a commit to rwgk/pybind11 that referenced this pull request Aug 28, 2020
Adding missing `bytes` type to `test_constructors()`, to exercise the code change.

The changes in the PR were cherry-picked from PR pybind#2409 (with a very minor
modification in test_pytypes.py related to flake8). Via PR pybind#2409, these
changes were extensively tested in the Google environment, as summarized here:
https://docs.google.com/document/d/1TPL-J__mph_yHa1quDvsO12E_F5OZnvBaZlW9IIrz8M/
The changes in this PR did not cause an issues at all.

Note that `test_constructors()` before this PR passes for Python 2 only
because `pybind11::str` can hold `PyUnicodeObject` or `PyBytesObject`. As a
side-effect of this PR, `test_constructors()` no longer relies on this
permissive `pybind11::str` behavior. However, the permissive behavior is still
exercised/exposed via the existing `test_pybind11_str_raw_str()`.

The test code change is designed to enable easy removal later, when Python 2
support is dropped.

For completeness: confusingly, the non-test code changes travelled through PR

Example `ambiguous conversion` error fixed by this PR:
```
pybind11/tests/test_pytypes.cpp:214:23: error: ambiguous conversion for functional-style cast from 'pybind11::detail::item_accessor' (aka 'accessor<accessor_policies::generic_item>') to 'py::bytes'
            "bytes"_a=py::bytes(d["bytes"]),
                      ^~~~~~~~~~~~~~~~~~~~
pybind11/include/pybind11/detail/../pytypes.h:957:21: note: candidate constructor
    PYBIND11_OBJECT(bytes, object, PYBIND11_BYTES_CHECK)
                    ^
pybind11/include/pybind11/detail/../pytypes.h:957:21: note: candidate constructor
pybind11/include/pybind11/detail/../pytypes.h:987:15: note: candidate constructor
inline bytes::bytes(const pybind11::str &s) {
              ^
1 error generated.
```
rwgk added a commit that referenced this pull request Aug 28, 2020
Adding missing `bytes` type to `test_constructors()`, to exercise the code change.

The changes in the PR were cherry-picked from PR #2409 (with a very minor
modification in test_pytypes.py related to flake8). Via PR #2409, these
changes were extensively tested in the Google environment, as summarized here:
https://docs.google.com/document/d/1TPL-J__mph_yHa1quDvsO12E_F5OZnvBaZlW9IIrz8M/
The changes in this PR did not cause an issues at all.

Note that `test_constructors()` before this PR passes for Python 2 only
because `pybind11::str` can hold `PyUnicodeObject` or `PyBytesObject`. As a
side-effect of this PR, `test_constructors()` no longer relies on this
permissive `pybind11::str` behavior. However, the permissive behavior is still
exercised/exposed via the existing `test_pybind11_str_raw_str()`.

The test code change is designed to enable easy removal later, when Python 2
support is dropped.

For completeness: confusingly, the non-test code changes travelled through PR

Example `ambiguous conversion` error fixed by this PR:
```
pybind11/tests/test_pytypes.cpp:214:23: error: ambiguous conversion for functional-style cast from 'pybind11::detail::item_accessor' (aka 'accessor<accessor_policies::generic_item>') to 'py::bytes'
            "bytes"_a=py::bytes(d["bytes"]),
                      ^~~~~~~~~~~~~~~~~~~~
pybind11/include/pybind11/detail/../pytypes.h:957:21: note: candidate constructor
    PYBIND11_OBJECT(bytes, object, PYBIND11_BYTES_CHECK)
                    ^
pybind11/include/pybind11/detail/../pytypes.h:957:21: note: candidate constructor
pybind11/include/pybind11/detail/../pytypes.h:987:15: note: candidate constructor
inline bytes::bytes(const pybind11::str &s) {
              ^
1 error generated.
```
rwgk added a commit to rwgk/pybind11 that referenced this pull request Sep 8, 2020
… py::error_already_set if not.

Similar to pybind#2392, but does not depend on pybind#2409.

Splitting out this PR from pybind#2409 to make that PR as simple as possible.

Net effects of this PR:
* Adds missing test coverage.
* Changes TypeError to UnicodeDecodeError for Python 2.

This PR has two commits. Please do not squash, to make the behavior change obvious in the commit history.
@rwgk rwgk changed the title Meta PR for Google Patches Changing pybind11::str to exclusively hold PyUnicodeObject Sep 10, 2020
@rwgk rwgk force-pushed the pybind11_next branch 2 times, most recently from bd1d77b to 21376dd Compare September 16, 2020 23:02
@rwgk rwgk force-pushed the pybind11_next branch 2 times, most recently from 0a4d35e to 1369cc2 Compare October 18, 2020 07:18
@rwgk rwgk force-pushed the pybind11_next branch 4 times, most recently from 8398ef9 to 9258f93 Compare November 15, 2020 16:53
@rwgk rwgk marked this pull request as ready for review November 15, 2020 17:22
@henryiii henryiii added this to the v2.7 milestone Nov 15, 2020
@wjakob
Copy link
Member

wjakob commented Nov 16, 2020

Those changes look good to me. Putting string-specific code into the py::object caster is a no-go in principle, but I see that it is masked via a condition that can be tested at compile time, and it's for Python 2.7 only (i.e. will eventually be removed). Can you document this new flag in the documentation (e.g. in the porting guide?)

@rwgk
Copy link
Collaborator Author

rwgk commented Nov 16, 2020

Can you document this new flag in the documentation (e.g. in the porting guide?)

Done. Adding to docs/upgrade.rst is the only change I made. Thanks Wenzel!

Copy link
Collaborator

@YannickJadoul YannickJadoul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor details, but this looks good to me! As we've discussed in plenty of other issues/PRs before, I would argue that most or almost all use cases that will be affected can be considered bugs.
pybind11 is already pretty clear about the names str and bytes having "Python 3 semantics", I would say.

The one thing is this special case for Python 2 (and the changes necessary to pyobject_caster, as pointed out by @wjakob). The good news is Python 2 already does something similar here:

Python 2.7.17 (default, Sep 30 2020, 13:38:04) 
[GCC 7.5.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 'abc'
'abc'
>>> unicode('abc')
u'abc'
>>> print('\xc3\xb1'.decode('utf-8'))
ñ
>>> '\xc3\xb1'
'\xc3\xb1'
>>> unicode('\xc3\xb1')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
>>> 

So the current implementation matches behavior, except that it's using codec "ascii" instead of "utf-8". We could match this, but I'm also fine keeping UTF-8, since pybind11 is again very clear on following UTF-8 everywhere.

I also want to quickly remind ourselves of why this detail::PyUnicode_Check_Permissive was implemented in the first place. I kind of remember we already dug back in history, at some point, but I forgot the conclusion. I'll see if I can find it and add a cross-reference from this PR.

Comment on lines 9 to 18
v2.7.0 (WIP)
------------------------------

* ``py::str`` changed to exclusively hold `PyUnicodeObject`. Previously
``py::str`` could also hold `bytes`, which is probably surprising, was
never documented, and can mask bugs (e.g. accidental use of ``py::str``
instead of ``py::bytes``).
`#2409 <https://github.com/pybind/pybind11/pull/2409>`_


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checking with @henryiii: I thought we collected the changelog entries in the PR (indicated in the PR template), to avoid conflicts?

Also, what's the plan for future future release? (i.e., 2.6.2 is still indicated as "TBA")
Also (very minor), this says "WIP", while v2.6.2 says "(TBA, not yet release)". We should probably be consistent? Again, @henryiii, did you have a plan, here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Just realizing: maybe this is still a remainder from long long time ago, before templates and the new changelog system?)

@@ -10,6 +10,31 @@ modernization and other useful information.

.. _upgrade-guide-2.6:

v2.7
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here: how do we want to approach this, w.r.t. merging the current version into master? How confusing will it be if master already contains parts of an v2.7 upgrade guide?

Not really related to this PR though; @rwgk is just a bit unlucky that this might be the first PR for v2.7 with bigger changes that require notes in the upgrade guide.

docs/upgrade.rst Outdated
``py::bytes``. Starting with v2.7, ``py::str`` exclusively holds
``PyUnicodeObject`` (`#2409 <https://github.com/pybind/pybind11/pull/2409>`_),
and ``py::isinstance<str>()`` is ``true`` only for ``py::str``. To help in
the transition of client code, the ``PYBIND11_STR_LEGACY_PERMISSIVE`` macro
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's document that users should nót rely on this, and that we're planning/aiming to get rid of this in v2.8. I know this is part of the "transition of client code", but let's make it explicit.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, another minor detail, now that I read this again: this is user-facing documentation, so "client code" is unnecessary, perhaps?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, nevermind, it is actually 2 lines further down; I guess I was just expecting/scanning for "2.8"...

docs/upgrade.rst Outdated
and ``py::isinstance<str>()`` is ``true`` only for ``py::str``. To help in
the transition of client code, the ``PYBIND11_STR_LEGACY_PERMISSIVE`` macro
is provided as an escape hatch to go back to the legacy behavior. This macro
will be removed in future releases. Two types of required client-code fixes
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, all code is "client-code" in this context, to me.

are expected to be common:

* Accidental use of ``py::str`` instead of ``py::bytes``, masked by the legacy
behavior. These are probably very easy to fix, by changing from
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might mention this is probably a bug?

Comment on lines +1626 to +1669
#if PY_MAJOR_VERSION < 3 && !defined(PYBIND11_STR_LEGACY_PERMISSIVE)
// For Python 2, without this implicit conversion, Python code would
// need to be cluttered with six.ensure_text() or similar, only to be
// un-cluttered later after Python 2 support is dropped.
if (std::is_same<T, str>::value && isinstance<bytes>(src)) {
PyObject *str_from_bytes = PyUnicode_FromEncodedObject(src.ptr(), "utf-8", nullptr);
if (!str_from_bytes) throw error_already_set();
value = reinterpret_steal<type>(str_from_bytes);
return true;
}
#endif
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternatively, if we don't want this here (cfr. @wjakob's remark), we could specialize pyobject_caster<str> or type_caster<str>?
But maybe that's too much for code that will disappear in a hopefully not all too distant future.

@@ -144,7 +144,7 @@ template <typename Type, typename Value> struct list_caster {
using value_conv = make_caster<Value>;

bool load(handle src, bool convert) {
if (!isinstance<sequence>(src) || isinstance<str>(src))
if (!isinstance<sequence>(src) || isinstance<bytes>(src) || isinstance<str>(src))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we actually don't want this change. Doing so (or well, not doing so) would fix #1807.

I'm OK with making and discussing this change (or undoing your change) in a separate PR, though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also #2198.

@YannickJadoul
Copy link
Collaborator

I'll see if I can find it and add a cross-reference from this PR.

OK, this is the best there is to be found, it seems: 5612a0c
There's not a lot I can distill from that, except that pybind11 has changed a lot since then, and that the original author from back then (@wjakob) said he's OK with the current changes, so ... not too worried about this :-)

@YannickJadoul
Copy link
Collaborator

YannickJadoul commented Nov 16, 2020

Diving further with @bstaletic into this commit 5612a0c, there's a piece of dead code, now:

operator std::string() const {
object temp = *this;
if (PyUnicode_Check(m_ptr)) {
temp = reinterpret_steal<object>(PyUnicode_AsUTF8String(m_ptr));
if (!temp)
pybind11_fail("Unable to extract string contents! (encoding issue)");
}
char *buffer;
ssize_t length;
if (PYBIND11_BYTES_AS_STRING_AND_SIZE(temp.ptr(), &buffer, &length))
pybind11_fail("Unable to extract string contents! (invalid type)");
return std::string(buffer, (size_t) length);
}

This was introduced in 5612a0c, but the second part is now dead code. Is this something to check with @wjakob, and maybe fix in a follow-up PR?

@rwgk
Copy link
Collaborator Author

rwgk commented Dec 1, 2020

Thanks @YannickJadoul, sorry for the late reply. Just a quick note: I'll wait until after the 2.6.2 release before putting on a few finishing touches here (e.g. I'll try get rid of the dead code you pointed out).

@YannickJadoul
Copy link
Collaborator

Thanks @YannickJadoul, sorry for the late reply. Just a quick note: I'll wait until after the 2.6.2 release before putting on a few finishing touches here (e.g. I'll try get rid of the dead code you pointed out).

No worries. Sounds good to me. (Though, is there a 2.6.2 planned; I'm not sure we currently have a lot of bugfixes waiting for release?)
Should we go ahead and merge #2198 as a way to fix #1807 already in 2.6.2?

@rwgk
Copy link
Collaborator Author

rwgk commented Dec 1, 2020

Should we go ahead and merge #2198 as a way to fix #1807 already in 2.6.2?

We had a lot of back and forth about this particular change (me, you, @bstaletic taking different positions), which is why I'm preserving current behavior in this PR.

I'll comment on #2198 to log what I think is the best change.

@YannickJadoul
Copy link
Collaborator

Well, yeah, I also noted it here: #2409 (comment)

I'm not sure I see anything wrong with the change in #2198. I just though it had stalled because we figured out there was a much bigger issue.

@rwgk
Copy link
Collaborator Author

rwgk commented Jan 29, 2021

@YannickJadoul, I changed "client code" to "user code" the first time I want to be unambiguous about what "transition" refers to, and delete the second mention of "client code" a couple lines down. You're right, the context is clear enough there.

A follow-up PR for the dead code you discovered would be nice. I'm leaving the code in the PR as it was for the past ~2.5 months. It has been in use internally for the entire time (and an earlier version even since August 2020).

@rwgk rwgk merged commit 0432ae7 into pybind:master Jan 29, 2021
@rwgk rwgk deleted the pybind11_next branch January 29, 2021 22:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants