
1.x: optimize merge/flatMap for empty sources #3761


Merged (1 commit into ReactiveX:1.x, May 2, 2016)

Conversation

akarnokd
Member

This PR reduces the overhead when one merges/flatMaps empty() sequences.

Benchmark results (i7 4770K, Windows 7 x64, Java 8u72):

(benchmark results image)

For rare empty(), the overhead seems to be around the noise level.
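
For reference, this is the shape of usage the change targets (a sketch, not the actual benchmark code; the element count and the 1-in-1000 ratio are arbitrary):

```java
import rx.Observable;

public class FlatMapEmptyShape {
    public static void main(String[] args) {
        // All inner sources are empty: the merge/flatMap machinery is pure overhead.
        Observable.range(1, 1_000_000)
                .flatMap(v -> Observable.<Integer>empty())
                .subscribe();

        // "Rare empty()": only an occasional inner source is empty.
        Observable.range(1, 1_000_000)
                .flatMap(v -> v % 1000 == 0
                        ? Observable.<Integer>empty()
                        : Observable.just(v))
                .subscribe();
    }
}
```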

@akarnokd
Member Author

This is the comparison with Rsc:

(comparison chart image)

@stealthcode

Just a general question about perf testing... in the development of SyncOnSubscribe we wrote a perf test that used the Blackhole.consumeCPU(long) method (see perf test) because it simulates the execution of some business logic, causing registers and caches to be flushed. In a quick source-code scan I didn't find where FlatMapAsFilterPerf does this. I can see that it uses the Blackhole to consume the data (clearly this is necessary). Do you think it would be valuable to add some simulated business logic to each flatMap Func1 definition?
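
For illustration, a rough JMH sketch of what adding such simulated work might look like (class name, parameters, and token count are invented; this is not the existing FlatMapAsFilterPerf code):

```java
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;
import rx.Observable;

@State(Scope.Thread)
public class FlatMapWithBusinessLogicPerf {

    @Param({ "1000", "1000000" })
    public int count;

    Observable<Integer> source;

    @Setup
    public void setup() {
        source = Observable.range(1, count);
    }

    @Benchmark
    public void flatMapAsFilterWithWork(Blackhole bh) {
        source.flatMap(v -> {
            // Simulated business logic: burn a few CPU "tokens" so the Func1
            // body does real work between emissions instead of returning immediately.
            Blackhole.consumeCPU(64);
            return (v & 1) == 0 ? Observable.just(v) : Observable.<Integer>empty();
        }).subscribe(v -> bh.consume(v));
    }
}
```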

@akarnokd
Member Author

My perfs measure the overhead of the infrastructure when the subscriber does nothing else. This is like an upper bound for the throughput you can achieve. Clearly, if you have sleep(100) in the consumer, almost none of the optimization will show up as a gain. The same goes for consumeCPU, just on a nanosecond scale. Therefore, I don't see the value, but you can always experiment.

@stealthcode

I hear what you are saying; however, sleep is very different from consuming CPU cycles. I completely agree that testing the lower bounds of performance is valuable. Right now we are testing very common use cases. However, another common use case is one where additional business-logic work is done. Using the Blackhole.consumeCPU() API in some tests could level the playing field between two implementations when one implementation disproportionately favors cache locality.

@stealthcode

Also, there is the matter of the JIT. I am not entirely sure, but wouldn't this prevent inlining the Func1? That is surely a common use case we are missing in these perf tests.

@akarnokd
Member Author

Our infrastructure is full of atomic operations that take 21-45 cycles on a good day and cause write-buffer flushes even with synchronous code. I think consumeCPU comes in handy when one benchmarks queues concurrently, as it can help offset the two sides just enough that they don't step on each other.

Primarily, call depth/stack depth is the limiting factor for JIT: the fewer layers there are and the smaller the methods are, the more the JIT can do. This is why I advocate flatMap() over merge(): merge(map()) allocates more and pushes values through more layers, whereas flatMap() has the function call and the use of its result right next to each other. The JIT inlines such a Func1 quite nicely, and with such barebones perfs, inlining failures also show up as a throughput loss.
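
To make the comparison concrete, a minimal sketch (the mapping function is just a placeholder):

```java
import rx.Observable;
import rx.functions.Func1;

public class FlatMapVsMergeMap {
    public static void main(String[] args) {
        Observable<Integer> source = Observable.range(1, 10);
        Func1<Integer, Observable<Integer>> f = v -> Observable.just(v * 2);

        // merge(map(...)): an extra Observable is assembled and every value is
        // pushed through the map layer before merge gets to consume it.
        Observable<Integer> viaMergeMap = Observable.merge(source.map(f));

        // flatMap(...): the Func1 call and the consumption of its result happen
        // in the same operator, keeping the call stack shallow for the JIT.
        Observable<Integer> viaFlatMap = source.flatMap(f);

        viaMergeMap.subscribe(System.out::println);
        viaFlatMap.subscribe(System.out::println);
    }
}
```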

However, just by looking at the code, only JIT experts can tell what happens. There is the JITWatch tool, which does a better job, but it requires some nasty DLLs to be built for Windows, so I don't use it.

@stevegury
Member

I think that what @stealthcode is referring to is the fact that most microbenchmarks test a tiny piece of code in a contended way. AFAIK consumeCPU can help remove the contention without impacting the measurement.

Regarding the JIT, as you mentioned call depth is a limiting factor, but AFAIK the main one is the byte-code size of the method. Thus, a big method is less likely to be inlined, and then it's less likely that beneficial optimizations will take place (dead-code elimination, escape-analysis, ...).
By optimizing a piece of code by adding a special case, you're always at risk of making it big enough to prevent inlining. My rule of thumb is to check whether the special case is actually seen in a production system (vs. a microbenchmark).

That being said, the modification you proposed is relatively minimal (1 test, 1 method call), and the impact on the byte-code size is small. So 👍 for this change.
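
For what it's worth, empty() appears to hand out a shared instance in 1.x, which is what would let such a fast path boil down to a single reference test. A quick sanity check (a sketch, not part of the PR):

```java
import rx.Observable;

public class EmptySingletonCheck {
    public static void main(String[] args) {
        // If empty() always returns the same cached Observable, an operator can
        // detect it with one reference comparison instead of subscribing to it.
        Observable<Object> first = Observable.empty();
        Observable<Object> second = Observable.empty();
        System.out.println(first == second); // expected: true on recent 1.x builds
    }
}
```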

PS: JITWatch is a very good tool, especially when you want to learn what the JVM is doing.

@artem-zinnatullin
Contributor

👍 // comparison looks fantastic

@stevegury
Member

Just to be clear, my previous comment was a 👍

@akarnokd akarnokd merged commit 3721666 into ReactiveX:1.x May 2, 2016
@akarnokd akarnokd deleted the FlatMapEmpty1x branch May 2, 2016 21:04