
1.x: optimize merge/flatMap for empty sources #3761


Merged (1 commit into ReactiveX:1.x, May 2, 2016)

Conversation

akarnokd
Member

This PR reduces the overhead when one merges/flatMaps empty() sequences.

Benchmark results (i7 4770K, Windows 7 x64, Java 8u72):

(benchmark results image)

For rare empty(), the overhead seems to be around the noise level.
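
For reference, this is the shape of usage the change targets (a sketch, not the actual benchmark code; the element count and the 1-in-1000 ratio are arbitrary):

```java
import rx.Observable;

public class FlatMapEmptyShape {
    public static void main(String[] args) {
        // All inner sources are empty: the merge/flatMap machinery is pure overhead.
        Observable.range(1, 1_000_000)
                .flatMap(v -> Observable.<Integer>empty())
                .subscribe();

        // "Rare empty()": only an occasional inner source is empty.
        Observable.range(1, 1_000_000)
                .flatMap(v -> v % 1000 == 0
                        ? Observable.<Integer>empty()
                        : Observable.just(v))
                .subscribe();
    }
}
```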

@akarnokd
Member Author

This is the comparison with Rsc:

(comparison chart image)

@stealthcode

Just a general question about perf testing... in the development of SyncOnSubscribe we wrote a perf test that used the Blackhole.consumeCPU(long) method (see perf test) because it simulates the execution of some business logic, causing registers and caches to be flushed. In a quick source-code scan I didn't find where FlatMapAsFilterPerf does this. I can see that it uses the Blackhole to consume the data (clearly this is necessary). Do you think it would be valuable to add some simulated business logic to each flatMap Func1 definition?
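
For illustration, a rough JMH sketch of what adding such simulated work might look like (class name, parameters, and token count are invented; this is not the existing FlatMapAsFilterPerf code):

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;
import rx.Observable;

@State(Scope.Thread)
public class FlatMapWithBusinessLogicPerf {

    @Param({ "1000", "1000000" })
    public int count;

    Observable<Integer> source;

    @Setup
    public void setup() {
        source = Observable.range(1, count);
    }

    @Benchmark
    public void flatMapAsFilterWithWork(Blackhole bh) {
        source.flatMap(v -> {
            // Simulated business logic: burn a few CPU "tokens" so the Func1
            // body does real work between emissions instead of returning immediately.
            Blackhole.consumeCPU(64);
            return (v & 1) == 0 ? Observable.just(v) : Observable.<Integer>empty();
        }).subscribe(v -> bh.consume(v));
    }
}
```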

@akarnokd
Member Author

My perfs measure the overhead of the infrastructure when the subscriber does nothing else. This is like an upper bound for the throughput you can achieve. Clearly, if you have sleep(100) in the consumer, almost none of the optimization will show up as a gain. The same goes for consumeCPU, just on a nanosecond scale. Therefore, I don't see the value, but you can always experiment.

@stealthcode

I hear what you are saying; however, sleep is very different from consuming CPU cycles. I completely agree that testing the lower bounds of performance is valuable. Right now we are testing very common use cases. However, another common use case is one where additional business-logic work is done. Using the Blackhole.consumeCPU() API in some tests could level the playing field between two implementations when one implementation disproportionately favors cache locality.

@stealthcode

Also, there is the matter of the JIT. I am not entirely sure, but wouldn't this prevent inlining the Func1? That is surely a common use case we are missing in these perf tests.

@akarnokd
Member Author

Our infrastructure is full of atomic operations that take 21-45 cycles on a good day and cause write-buffer flushes even with synchronous code. I think consumeCPU comes in handy when one benchmarks queues concurrently, as it can help offset the two sides just enough that they don't step on each other.

Primarily, call depth/stack depth is the limiting factor for JIT: the fewer layers there are and the smaller the methods are, the more the JIT can do. This is why I advocate flatMap() over merge(): merge(map()) allocates more and pushes values through more layers, whereas flatMap() has the function call and the use of its result right next to each other. The JIT inlines such a Func1 quite nicely, and with such barebones perfs, inlining failures also show up as a throughput loss.
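
To make the comparison concrete, a minimal sketch (the mapping function is just a placeholder):

```java
import rx.Observable;
import rx.functions.Func1;

public class FlatMapVsMergeMap {
    public static void main(String[] args) {
        Observable<Integer> source = Observable.range(1, 10);
        Func1<Integer, Observable<Integer>> f = v -> Observable.just(v * 2);

        // merge(map(...)): an extra Observable is assembled and every value is
        // pushed through the map layer before merge gets to consume it.
        Observable<Integer> viaMergeMap = Observable.merge(source.map(f));

        // flatMap(...): the Func1 call and the consumption of its result happen
        // in the same operator, keeping the call stack shallow for the JIT.
        Observable<Integer> viaFlatMap = source.flatMap(f);

        viaMergeMap.subscribe(System.out::println);
        viaFlatMap.subscribe(System.out::println);
    }
}
```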

However, just by looking at the code, only JIT experts can tell what happens. There is the JITWatch tool, which does a better job, but it requires some nasty DLLs to be built for Windows, so I don't use it.

@stevegury
Member

I think that what @stealthcode is referring to is the fact that most microbenchmarks test a tiny piece of code in a contended way. AFAIK consumeCPU can help remove the contention without impacting the measurement.

Regarding the JIT, as you mentioned call depth is a limiting factor, but AFAIK the main one is the byte-code size of the method. Thus, a big method is less likely to be inlined, and then it's less likely that beneficial optimizations will take place (dead-code elimination, escape-analysis, ...).
By optimizing a piece of code by adding a special case, you're always at risk of making it big enough to prevent inlining. My rule of thumb is to check whether the special case is actually seen in a production system (vs. a microbenchmark).

That being said, the modification you proposed is relatively minimal (1 test, 1 method call), and the impact on the byte-code size is small. So 👍 for this change.
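
For what it's worth, empty() appears to hand out a shared instance in 1.x, which is what would let such a fast path boil down to a single reference test. A quick sanity check (a sketch, not part of the PR):

```java
import rx.Observable;

public class EmptySingletonCheck {
    public static void main(String[] args) {
        // If empty() always returns the same cached Observable, an operator can
        // detect it with one reference comparison instead of subscribing to it.
        Observable<Object> first = Observable.empty();
        Observable<Object> second = Observable.empty();
        System.out.println(first == second); // expected: true on recent 1.x builds
    }
}
```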

PS: JITWatch is a very good tool, especially when you want to learn what the JVM is doing.

@artem-zinnatullin
Contributor

👍 // comparison looks fantastic

@stevegury
Member

Just to be clear, my previous comment was a 👍

@akarnokd akarnokd merged commit 3721666 into ReactiveX:1.x May 2, 2016
@akarnokd akarnokd deleted the FlatMapEmpty1x branch May 2, 2016 21:04