Add arrow-avro Decoder Benchmarks #8025
Conversation
Force-pushed from 6bbd823 to 112ffb7
    Introduce benchmarks for the `arrow-avro` decoder to evaluate performance under various scenarios, such as different data types and schemas. Additionally, update the `criterion` and other dependencies, and make the `schema` module public to enable broader usage.
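For orientation, here is a minimal sketch of the benchmark shape being introduced, using criterion's `iter_batched_ref` so decoder construction stays out of the measured loop. The stub `Decoder`, the `new_decoder` signature, the schema string, and the row counts are placeholders for illustration, not the PR's actual code (which lives in `arrow-avro/benches/decoder.rs`):

use std::hint::black_box;
use criterion::{criterion_group, criterion_main, BatchSize, BenchmarkId, Criterion, Throughput};

// Stand-in decoder so the sketch compiles on its own; the real benchmark
// drives the arrow-avro decoder instead.
struct Decoder;
impl Decoder {
    fn decode(&mut self, data: &[u8]) -> Result<usize, ()> {
        Ok(data.len())
    }
}

// Placeholder for the PR's decoder-construction helper.
fn new_decoder(_schema_json: &str, _batch_size: usize) -> Decoder {
    Decoder
}

fn bench_decoder(c: &mut Criterion) {
    let datum: Vec<u8> = vec![0u8; 1024]; // pre-encoded Avro rows in the real bench
    let mut group = c.benchmark_group("Int32");
    for rows in [100usize, 10_000] {
        group.throughput(Throughput::Bytes(datum.len() as u64));
        group.bench_function(BenchmarkId::from_parameter(rows), |b| {
            b.iter_batched_ref(
                // Setup runs outside the timed region.
                || new_decoder(r#"{"type":"int"}"#, 1024),
                // Only this routine is measured.
                |decoder| {
                    black_box(decoder.decode(&datum).unwrap());
                },
                BatchSize::SmallInput,
            )
        });
    }
    group.finish();
}

criterion_group!(benches, bench_decoder);
criterion_main!(benches);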
Force-pushed from 112ffb7 to 3b39d66
Thanks for the contribution. I executed the benchmark with the command `cargo bench --features=arrow,async,test_common,experimental --bench decoder`; it outputs time and throughput like below:
..... Only part of the result data is retained ....
Nested(Struct)/100      time:   [6.6012 µs 6.6925 µs 6.7970 µs]
                        thrpt:  [145.36 MiB/s 147.63 MiB/s 149.67 MiB/s]
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
Nested(Struct)/10000    time:   [467.44 µs 470.22 µs 473.82 µs]
                        thrpt:  [224.91 MiB/s 226.63 MiB/s 227.98 MiB/s]
Nested(Struct)/1000000  time:   [3.2511 ms 3.3813 ms 3.5785 ms]
                        thrpt:  [3.1209 GiB/s 3.3030 GiB/s 3.4352 GiB/s]
| "thread_rng", | ||
| ] } | ||
| criterion = { version = "0.6.0", default-features = false } | ||
| criterion = { version = "0.7.0", default-features = false } | 
Are there any features we need in 0.7.0?
I just figured it was a good idea to bump it up. I noticed dependabot was also trying to update it.
    };
}

dataset!(INT_DATA, INT_SCHEMA, gen_int);
Nice generation logic
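The macro itself isn't shown in this thread, but the pattern it suggests is worth sketching: generate each dataset once, lazily, and reuse it across benchmark iterations so generation cost never lands in the measured loop. Everything below (the generator, schema string, row counts, and the macro body itself) is a hypothetical reconstruction, not the PR's code:

use std::sync::OnceLock;

// Row counts assumed from the benchmark output above.
const ROW_COUNTS: [usize; 3] = [100, 10_000, 1_000_000];

// Hypothetical generator: encodes `rows` records into one byte buffer.
fn gen_int(rows: usize) -> Vec<u8> {
    (0..rows as i32).flat_map(|v| v.to_le_bytes()).collect()
}

// Sketch of a `dataset!`-style macro: each dataset is built on first use and
// cached for every subsequent benchmark iteration.
macro_rules! dataset {
    ($name:ident, $schema:ident, $gen:ident) => {
        fn $name() -> (&'static str, &'static [Vec<u8>]) {
            static CELL: OnceLock<Vec<Vec<u8>>> = OnceLock::new();
            let data = CELL
                .get_or_init(|| ROW_COUNTS.iter().map(|&n| $gen(n)).collect())
                .as_slice();
            ($schema, data)
        }
    };
}

const INT_SCHEMA: &str = r#"{"type":"int"}"#;
dataset!(int_data, INT_SCHEMA, gen_int);

fn main() {
    let (schema, data) = int_data();
    assert_eq!(data.len(), ROW_COUNTS.len());
    assert!(!schema.is_empty());
}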
        
          
arrow-avro/benches/decoder.rs (outdated):
b.iter_batched_ref(
    || new_decoder(schema_json, DEFAULT_BATCH, utf8view),
    |decoder| {
        decoder.decode(black_box(datum)).unwrap();
Do we need the `black_box` here? From the docs, `black_box` is used to prevent the compiler from making optimizations, but it seems that `datum` is an array precomputed above.
That is true. I included `black_box` because I was having difficulty getting consistent benchmarks. I can experiment with removing it.
I ended up putting the `black_box` around the entire `decoder.decode` method call. This was a good catch!
        
          
arrow-avro/benches/decoder.rs (outdated):
|decoder| {
    decoder.decode(black_box(datum)).unwrap();
    let batch = decoder.flush().unwrap().unwrap();
    black_box(batch.get_array_memory_size());
Is the `black_box` here to make sure that `batch.get_array_memory_size()` is executed every time? Is there any reason we want to include `get_array_memory_size` in the benchmark?
That's a good call-out; I'll remove it. We don't need that there.
I went ahead and removed the `get_array_memory_size` call in my latest push.
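Taken together, the two fixes above suggest a measured closure of roughly this shape (reusing `new_decoder`, `schema_json`, `DEFAULT_BATCH`, `utf8view`, and `datum` from the diffs; a sketch, not the exact final code):

b.iter_batched_ref(
    || new_decoder(schema_json, DEFAULT_BATCH, utf8view),
    |decoder| {
        // Wrap the whole call so the decode can't be optimized away.
        black_box(decoder.decode(datum)).unwrap();
        // Return the flushed batch as the routine's output; no extra
        // get_array_memory_size call is included in the measurement.
        decoder.flush().unwrap().unwrap()
    },
    BatchSize::SmallInput,
);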
        
          
arrow-avro/benches/decoder.rs (outdated):
}
group.bench_function(BenchmarkId::from_parameter(rows), |b| {
    b.iter_batched_ref(
        || new_decoder(schema_json, DEFAULT_BATCH, utf8view),
Not sure if we need to benchmark some other batch sizes or not.
That's a good idea! I can add different batch sizes to this benchmark.
I added benchmarking on different batch sizes in my latest push as well.
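A sweep over batch sizes could look roughly like the following; the specific sizes and the `BenchmarkId::new` labeling are illustrative, with the other names reused from the diffs above:

for &batch_size in &[128usize, 1024, 8192] {
    group.bench_function(
        BenchmarkId::new("decode", format!("{rows} rows, batch {batch_size}")),
        |b| {
            b.iter_batched_ref(
                // Rebuild the decoder per iteration with the swept batch size.
                || new_decoder(schema_json, batch_size, utf8view),
                |decoder| {
                    black_box(decoder.decode(datum)).unwrap();
                    decoder.flush().unwrap().unwrap()
                },
                BatchSize::SmallInput,
            )
        },
    );
}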
.with_schema(schema)
.with_batch_size(batch_size)
.with_utf8_view(utf8view)
.build_decoder(io::empty())
Is it that we use `io::empty()` here because we always set the schema above?
That's about to be removed, as `.build_decoder` won't need a `Reader` once #8006 is finalized and merged in.
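For context, the helper these calls belong to presumably looks something like this sketch. The four builder calls are the ones shown in the diff; the imports, the `Decoder` return type, `ReaderBuilder::new()`, and the `parse_schema` helper are assumptions:

use std::io;
use arrow_avro::reader::{Decoder, ReaderBuilder};

fn new_decoder(schema_json: &str, batch_size: usize, utf8view: bool) -> Decoder {
    // `parse_schema` is a hypothetical stand-in; how the schema is built
    // from `schema_json` is not shown in the diff.
    let schema = parse_schema(schema_json);
    ReaderBuilder::new()
        .with_schema(schema)
        .with_batch_size(batch_size)
        .with_utf8_view(utf8view)
        // io::empty() satisfies the reader argument because the schema is set
        // explicitly and the benchmark feeds bytes via Decoder::decode; per
        // the reply above, this argument goes away once #8006 lands.
        .build_decoder(io::empty())
        .unwrap()
}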
Thanks for the review @klion26 -- @jecsand838 maybe you can address the comments and then I'll take a final review over this PR
@jecsand838 thanks for the update, LGTM, cc @alamb
    }
    _ => {}
}
group.bench_function(BenchmarkId::from_parameter(rows), |b| {
Do we need to add the batch size here?
Thanks again @jecsand838 and @klion26 for the review. I merged the PR figuring we should continue to move things forward -- let's do any additional things in a follow-on PR.
Which issue does this PR close?
Rationale for this change
This change introduces a comprehensive benchmark suite for the `arrow-avro` decoder. Having robust benchmarks is crucial for several reasons.

What changes are included in this PR?

This PR adds a new benchmark file, `arrow-avro/benches/decoder.rs`. The key components of this new file are benchmark scenarios covering:

- Primitive types (`Int32`, `Int64`, `Float32`, `Float64`, `Boolean`)
- Binary and string types (`Binary` (Bytes), `String`, `StringView`)
- Temporal and logical types (`Date32`, `TimeMillis`, `TimeMicros`, `TimestampMillis`, `TimestampMicros`, `Decimal128`, `UUID`, `Interval`, `Enum`)
- Nested types (`Map`, `Array`, `Nested(Struct)`)
- `FixedSizeBinary`
- A `Mixed` schema with multiple fields
- Making `mod schema` public
These changes are covered by the benchmark tests themselves.
Are there any user-facing changes?
N/A