Cache return types of called functions #390


Closed
wants to merge 1 commit into from

Conversation

jagerman
Member

@jagerman jagerman commented Sep 6, 2016

Each time a function is called, we currently do a type lookup in registered_types_cpp to figure out the Python type info associated with the return value. This hash lookup adds a small but noticeable overhead for functions called in a tight Python loop (one of the two main issues reported in #376). It can, however, be easily cached, since the lookup returns an identical value each time (pybind11 doesn't support de-registering types).

This commit adds an (optional) cache parameter to the lookup for generic types (i.e. those inheriting from type_caster_generic) so that these lookups can be cached.

Of course, adding a cache variable isn't free either: there is a small (0.93%) increase in the test .so size (on linux/g++6), but it seems worthwhile for a noticeable overhead reduction.

@wjakob
Member

wjakob commented Sep 11, 2016

The alternative to how this commit addresses the issue (storing a pointer per function call) would be to do that per type via a static member variable of a template class. Any thoughts on that approach?

jagerman added a commit to jagerman/pybind11 that referenced this pull request Sep 11, 2016
The current inheritance testing isn't sufficient to detect a cache
failure; the test added here breaks PR pybind#390, which caches the
run-time-determined return type the first time a function is called,
then reuses that cached type even though the run-time type could be
different for a future call.
@jagerman
Member Author

That won't work, and actually, neither will this PR as currently written. The issue is that the actual return type is not always the same for a given function: because we call typeid on the pointer to be cast, we actually do a run-time polymorphic type lookup to resolve the type. Without this, the return_class_1 and return_class_2 functions in test_inheritance wouldn't work:

    m.def("return_class_1", []() -> BaseClass* { return new DerivedClass1(); });
    m.def("return_class_2", []() -> BaseClass* { return new DerivedClass2(); });

We pick those up correctly as new instances of DerivedClass1/DerivedClass2, not BaseClass*, but at compile time there is no difference between those functions.

This PR is broken because it caches per-function, which means the problem doesn't show up in the above, but would show up in:

    m.def("return_class_n", [](bool one) -> BaseClass* { if (one) return new DerivedClass1(); else return new DerivedClass2(); });

i.e. if a function returns different types, we fail to notice (because the cache will contain whatever the first run-time type was). This is obviously a testing defect that I've submitted PR #409 to detect (separate from this PR because it's a useful test whether or not this PR proceeds).

@jagerman
Member Author

I think, however, that there's an easy enough solution for this PR: instead of just using the cache when we find it, we can use the cached value if we find it and its type agrees with the current value's typeid; if it doesn't, we do a lookup and replace the cached value.

In practice, this should work almost as well when something is called in a tight loop. It would take a function that continually oscillates in the actual type returned to defeat the cache, so that, for example, the following would never benefit from the cache:

    static int counter = 0;
    m.def("f1", []() -> BaseClass* { if (counter++ % 2 == 0) return new Derived1(); else return new Derived2(); });

Changing % 2 to % 3 would result in a cache hit 1/3 of the time, etc.

This approach can be combined with the templated-static variable approach; it just reduces the granularity of the cache to any functions with the same return type, which for most cases is probably fine. It means that the following code will additionally be cache-defeating:

    m.def("f1", []() -> BaseClass* { return new Derived1(); });
    m.def("f2", []() -> BaseClass* { return new Derived2(); });

when combined with alternating calls from Python to f1 and f2. That's probably still a pretty low-likelihood situation, so I'll go ahead with that approach.

Each time a function is called, we currently do a type lookup in
registered_types_cpp to figure out the python type info associated with
the return value.  This lookup isn't free, particularly for often-called
functions, but will usually yield an identical result each time the
function is called.

This commit adds a per-return-type cache into the return type lookup for
generic types (i.e. those inheriting from type_caster_generic) to avoid
the hash lookup.  The cache is invalidated whenever the runtime type
changes (to avoid the problem PR pybind#409 tests for).
@jagerman
Member Author

Updated to use local static, return-type-level caching. The .so size premium on the tests with this version is down to 0.4% (g++ 6). The overhead reduction looks like this (using the test_cast code from bug #376 for a tight-loop call test):

upstream, -Os:
py_ref:         calls: 50000000   elapsed: 9.121s       ns/call: 182.4ns
C++ w/ lambda:  calls: 50000000   elapsed: 10.243s       ns/call: 204.9ns
mk_copy:        calls: 50000000   elapsed: 13.846s       ns/call: 276.9ns
return-type-caching, -Os:
py_ref:         calls: 50000000   elapsed: 7.943s       ns/call: 158.9ns
C++ w/ lambda:  calls: 50000000   elapsed: 8.674s       ns/call: 173.5ns
mk_copy:        calls: 50000000   elapsed: 12.717s       ns/call: 254.3ns

upstream, -O2:
py_ref:         calls: 50000000   elapsed: 8.071s       ns/call: 161.4ns
C++ w/ lambda:  calls: 50000000   elapsed: 9.032s       ns/call: 180.6ns
mk_copy:        calls: 50000000   elapsed: 12.732s       ns/call: 254.6ns
return-type-caching, -O2:
py_ref:         calls: 50000000   elapsed: 7.083s       ns/call: 141.7ns
C++ w/ lambda:  calls: 50000000   elapsed: 8.239s       ns/call: 164.8ns
mk_copy:        calls: 50000000   elapsed: 11.956s       ns/call: 239.1ns

upstream, -O3:
py_ref:         calls: 50000000   elapsed: 7.809s       ns/call: 156.2ns
C++ w/ lambda:  calls: 50000000   elapsed: 8.421s       ns/call: 168.4ns
mk_copy:        calls: 50000000   elapsed: 12.066s       ns/call: 241.3ns
return-type-caching, -O3:
py_ref:         calls: 50000000   elapsed: 6.812s       ns/call: 136.2ns
C++ w/ lambda:  calls: 50000000   elapsed: 8.444s       ns/call: 168.9ns
mk_copy:        calls: 50000000   elapsed: 11.896s       ns/call: 237.9ns

@wjakob
Member

wjakob commented Sep 18, 2016

Can you explain the rows in

upstream, -O3:
py_ref:         calls: 50000000   elapsed: 7.809s       ns/call: 156.2ns
C++ w/ lambda:  calls: 50000000   elapsed: 8.421s       ns/call: 168.4ns
mk_copy:        calls: 50000000   elapsed: 12.066s       ns/call: 241.3ns
return-type-caching, -O3:
py_ref:         calls: 50000000   elapsed: 6.812s       ns/call: 136.2ns
C++ w/ lambda:  calls: 50000000   elapsed: 8.444s       ns/call: 168.9ns
mk_copy:        calls: 50000000   elapsed: 11.896s       ns/call: 237.9ns

i.e. what are mk_copy, py_ref, etc.

For this PR, I think I'd like to wait to see how it plays out in conjunction with the other planned optimizations before merging as the performance improvements are fairly small and probably just noticeable in very simple getter-style functions.

@jagerman
Member Author

Indeed, the gains are not likely to be noticeable outside tight-loop code with simple returns. That said, they are still there; a 20ns gain per function call just isn't significant unless functions are invoked millions of times.

The code actually invoked is the latest version of test_cast.py/test_cast.cpp, posted in #376, but with the C++/Python/mk_untracked cases commented out. Basically what each does is:

  • py_ref is code that does a tight loop in python, each loop calling a bound function that updates and returns the same instance of a simple pybind-registered type using return_value_policy::reference.
  • C++ w/ lambda passes a Python lambda to a C++ function (as a pybind11::object), then does the looping and object creation in C++, with a call to the py::object's call operator inside the loop.
  • mk_copy is similar to py_ref, but it makes a copy of the returned instance via return_value_policy::copy.

@jagerman
Member Author

jagerman commented Oct 2, 2016

Shall I close this for now? It probably belongs as part of the other iterator optimization changes in #376.

@wjakob
Member

wjakob commented Oct 2, 2016

Ok, let's do that. @aldanor can still get the diffs from here when he has some time to work on the ticket.

@wjakob wjakob closed this Oct 2, 2016