Cache return types of called functions #390


Closed
wants to merge 1 commit into from

Conversation

jagerman
Member

@jagerman jagerman commented Sep 6, 2016

Each time a function is called, we currently do a type lookup in registered_types_cpp to figure out the Python type info associated with the return value. This hash lookup adds a small but noticeable overhead for functions called in a tight Python loop (one of the two main issues reported in #376). It can, however, be easily cached, since the lookup returns an identical value each time (pybind11 doesn't support de-registering types).

This commit adds an (optional) cache parameter to the lookup for generic types (i.e. those inheriting from type_caster_generic) so that these lookups can be cached.

Of course, adding a cache variable isn't free either: there is a small (0.93%) increase in the test .so size (on linux/g++6), but it seems worthwhile for a noticeable overhead reduction.

@wjakob
Member

wjakob commented Sep 11, 2016

The alternative to how this commit addresses the issue (storing a pointer per function call) would be to do that per type via a static member variable of a template class. Any thoughts on that approach?

jagerman added a commit to jagerman/pybind11 that referenced this pull request Sep 11, 2016
The current inheritance testing isn't sufficient to detect a cache
failure; the test added here breaks PR pybind#390, which caches the
run-time-determined return type the first time a function is called,
then reuses that cached type even though the run-time type could be
different for a future call.
@jagerman
Member Author

That won't work, and actually, neither will this PR as currently written. The issue is that the actual return type is not always the same for a given function: because we call typeid on the pointer to be cast, we actually do a run-time polymorphic type lookup to resolve the type. Without this, the return_class_1 and return_class_2 functions in test_inheritance wouldn't work:

    m.def("return_class_1", []() -> BaseClass* { return new DerivedClass1(); });
    m.def("return_class_2", []() -> BaseClass* { return new DerivedClass2(); });

We pick those up correctly as new instances of DerivedClass1/DerivedClass2, not BaseClass*, but at compile time there is no difference between those functions.

This PR is broken because it caches per-function, which means the problem doesn't show up in the above, but would show up in:

    m.def("return_class_n", [](bool one) -> BaseClass* { if (one) return new DerivedClass1(); else return new DerivedClass2(); });

i.e. if a function returns different types, we fail to notice (because the cache will contain whatever the first run-time type was). This is obviously a testing defect that I've submitted PR #409 to detect (separate from this PR because it's a useful test whether or not this PR proceeds).

@jagerman
Member Author

I think, however, that there's an easy enough solution for this PR: instead of just using the cache when we find it, we can use the cached value if we find it and its type agrees with the current value's typeid; if it doesn't, we do a lookup and replace the cached value.

In practice, this should work almost as well when something is called in a tight loop. It would take a function that continually oscillates in the actual type returned to defeat the cache, so that, for example, the following would never benefit from the cache:

    static int counter = 0;
    m.def("f1", []() -> BaseClass* { if (counter++ % 2 == 0) return new Derived1(); else return new Derived2(); });

Changing % 2 to % 3 would result in a cache hit 1/3 of the time, etc.

This approach can be combined with the templated-static variable approach; it just reduces the granularity of the cache to any functions with the same return type, which for most cases is probably fine. It means that the following code will additionally be cache-defeating:

    m.def("f1", []() -> BaseClass* { return new Derived1(); });
    m.def("f2", []() -> BaseClass* { return new Derived2(); });

when combined with alternating calls from Python to f1 and f2. That's probably still a pretty low-likelihood situation, so I'll go ahead with that approach.

Each time a function is called, we currently do a type lookup in
registered_types_cpp to figure out the python type info associated with
the return value.  This lookup isn't free, particularly for often-called
functions, but will usually yield an identical result each time the
function is called.

This commit adds a per-return-type cache into the return type lookup for
generic types (i.e. those inheriting from type_caster_generic) to avoid
the hash lookup.  The cache is invalidated whenever the runtime type
changes (to avoid the problem PR pybind#409 tests for).
@jagerman
Member Author

Updated to use local static, return-type-level caching. The .so size premium on the tests with this version is down to 0.4% (g++ 6). The overhead reduction looks like this (using the test_cast code from bug #376 for a tight-loop call test):

upstream, -Os:
py_ref:         calls: 50000000   elapsed: 9.121s       ns/call: 182.4ns
C++ w/ lambda:  calls: 50000000   elapsed: 10.243s       ns/call: 204.9ns
mk_copy:        calls: 50000000   elapsed: 13.846s       ns/call: 276.9ns
return-type-caching, -Os:
py_ref:         calls: 50000000   elapsed: 7.943s       ns/call: 158.9ns
C++ w/ lambda:  calls: 50000000   elapsed: 8.674s       ns/call: 173.5ns
mk_copy:        calls: 50000000   elapsed: 12.717s       ns/call: 254.3ns

upstream, -O2:
py_ref:         calls: 50000000   elapsed: 8.071s       ns/call: 161.4ns
C++ w/ lambda:  calls: 50000000   elapsed: 9.032s       ns/call: 180.6ns
mk_copy:        calls: 50000000   elapsed: 12.732s       ns/call: 254.6ns
return-type-caching, -O2:
py_ref:         calls: 50000000   elapsed: 7.083s       ns/call: 141.7ns
C++ w/ lambda:  calls: 50000000   elapsed: 8.239s       ns/call: 164.8ns
mk_copy:        calls: 50000000   elapsed: 11.956s       ns/call: 239.1ns

upstream, -O3:
py_ref:         calls: 50000000   elapsed: 7.809s       ns/call: 156.2ns
C++ w/ lambda:  calls: 50000000   elapsed: 8.421s       ns/call: 168.4ns
mk_copy:        calls: 50000000   elapsed: 12.066s       ns/call: 241.3ns
return-type-caching, -O3:
py_ref:         calls: 50000000   elapsed: 6.812s       ns/call: 136.2ns
C++ w/ lambda:  calls: 50000000   elapsed: 8.444s       ns/call: 168.9ns
mk_copy:        calls: 50000000   elapsed: 11.896s       ns/call: 237.9ns

@wjakob
Member

wjakob commented Sep 18, 2016

Can you explain the rows in

upstream, -O3:
py_ref:         calls: 50000000   elapsed: 7.809s       ns/call: 156.2ns
C++ w/ lambda:  calls: 50000000   elapsed: 8.421s       ns/call: 168.4ns
mk_copy:        calls: 50000000   elapsed: 12.066s       ns/call: 241.3ns
return-type-caching, -O3:
py_ref:         calls: 50000000   elapsed: 6.812s       ns/call: 136.2ns
C++ w/ lambda:  calls: 50000000   elapsed: 8.444s       ns/call: 168.9ns
mk_copy:        calls: 50000000   elapsed: 11.896s       ns/call: 237.9ns

i.e. what are mk_copy, py_ref, etc.

For this PR, I think I'd like to wait to see how it plays out in conjunction with the other planned optimizations before merging as the performance improvements are fairly small and probably just noticeable in very simple getter-style functions.

@jagerman
Member Author

Indeed, the gains are not likely to be noticeable outside tight-loop code with simple returns. That said, they are still there; a 20ns gain per function call just isn't significant unless functions are invoked millions of times.

The code actually invoked is the latest version of test_cast.py/test_cast.cpp, posted in #376, but with the C++/Python/mk_untracked cases commented out. Basically what each does is:

  • py_ref is code that does a tight loop in python, each loop calling a bound function that updates and returns the same instance of a simple pybind-registered type using return_value_policy::reference.
  • C++ w/ lambda passes a Python lambda to a C++ function (as a pybind11::object), then does the looping and object creation in C++, with a call to the py::object's call operator inside the loop.
  • mk_copy is similar to py_ref, but it makes a copy of the returned instance via return_value_policy::copy.

@jagerman
Member Author

jagerman commented Oct 2, 2016

Shall I close this for now? It probably belongs as part of the other iterator optimization changes in #376.

@wjakob
Member

wjakob commented Oct 2, 2016

Ok, let's do that. @aldanor can still get the diffs from here when he has some time to work on the ticket.

@wjakob wjakob closed this Oct 2, 2016