RFC: Local Benchmarking and Evaluation Tooling #11214
GregoryComer started this conversation in Ideas
Note: This initial draft is being actively iterated on. An updated draft will be posted within the next few days.
The purpose of this RFC is to solicit feedback from ExecuTorch users and contributors on on-device benchmarking and accuracy evaluation tooling. This document focuses on the high-level problem and goals, with some consideration given to the interface and high-level design.
Questions for Reviewers
Problem
It is difficult to understand the performance and accuracy implications of backend delegation and quantization. When deploying a model on-device, users need to know how lowering decisions affect both model accuracy and inference time. Modeling experts typically do not want to write app code or fiddle with complex builds to get these answers.
Backends can and do perform arbitrary transformations of the graph. They might, for example, convert delegated portions of the graph to run in 16-bit floating point or as integer-only arithmetic.
For accuracy evaluation, we provide Python bindings for some backends. These may be a solution, but they are not currently available for all backends, and we do not provide any guarantee of numerical parity with on-device execution.
Quantized models can also be run in normal PyTorch eager mode before lowering to ExecuTorch. However, it is widely reported that actual on-device kernels using true quantized arithmetic (often with integer accumulation in the inner loop) can differ by a material amount from eager-mode fake quantization.
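As a backend-agnostic illustration of one source of this divergence, the sketch below compares fake-quantized fp32 arithmetic against emulated integer arithmetic with a single requantization step. The scales, shapes, and symmetric int8 scheme are arbitrary examples, not values from any particular backend.

```python
# Backend-agnostic sketch of why eager fake quant and true quantized kernels
# can disagree: fake quant accumulates in fp32 and rounds the fp32 result,
# while a quantized kernel accumulates integers exactly and requantizes once.
import torch

torch.manual_seed(0)
x = torch.randn(128, 512)
w = torch.randn(512, 512)
x_scale, w_scale, y_scale = 0.05, 0.02, 0.1   # illustrative scales only

def quantize(t, scale):
    return torch.clamp((t / scale).round(), -127, 127)

# Path A: fake quant. Dequantize, matmul in fp32, fake-quantize the output.
y_fp32 = (quantize(x, x_scale) * x_scale) @ (quantize(w, w_scale) * w_scale)
y_fake = quantize(y_fp32, y_scale) * y_scale

# Path B: integer arithmetic. Exact integer accumulation (emulated in float64,
# which is exact at these magnitudes), then a single requantization step.
acc = quantize(x, x_scale).double() @ quantize(w, w_scale).double()
y_real = torch.clamp((acc * (x_scale * w_scale / y_scale)).round(), -127, 127) * y_scale

diff = (y_fake - y_real).abs()
print(f"max abs diff: {diff.max().item():.4f}, mismatched outputs: {(diff > 0).sum().item()}")
# Real backends add further divergence sources this toy example does not model:
# fp16 intermediates, fixed-point requantization multipliers, rounding modes,
# and fused activations.
```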
Goals
Note to reviewers: Accuracy and inference time may be captured by a single tool or by separate tools. I’ve included them in a single RFC because they are commonly evaluated together and might be addressed by the same tool. See the design sections below.
Non-Goals
Design (Accuracy Evaluation)
To meet the accuracy evaluation goals described above, there are a few options: we can leverage Python bindings to run inference from a Python environment, or we can provide a tool that runs on a locally connected device.
Option 1 - Python Runtime Bindings
We can recommend Python runtime bindings (pybindings) as the preferred approach to evaluate delegated model accuracy. This would require ensuring that Python bindings are available on the host for all P0 backends and updating documentation to recommend this approach for evaluation.
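For illustration, a minimal sketch of this workflow is below. It assumes the pybindings are built with the relevant backend registered; the module path and `forward()` signature have shifted between releases, and `model.pte` is a hypothetical path to the lowered version of the same model.

```python
import torch

# NOTE: the pybindings module path and forward() signature vary across
# ExecuTorch releases; treat this as an outline rather than a drop-in script.
from executorch.extension.pybindings.portable_lib import _load_for_executorch

# Reference eager model (stand-in for the user's real model) and a sample input.
eager_model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
sample_input = torch.randn(1, 16)

# "model.pte" is a hypothetical artifact produced by quantizing and/or
# delegating the same model through the normal export/lowering flow.
et_module = _load_for_executorch("model.pte")
et_out = et_module.forward((sample_input,))[0]
ref_out = eager_model(sample_input)

print("max abs error vs eager:", (ref_out - et_out).abs().max().item())
```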
Pros
Cons
Option 2 - Match Eager to Backend/Quantized Numerics
If we can better match on-device numerics in eager mode, users can simply use their quantized and/or delegated Python model for evaluation. For quantization, this might take the form of providing a way to run pt2e quantized operators using “real” quantized kernels.
For backends, this could involve updating backend logic to ensure that eager execution of lowered partitions (running the graph after to_edge_transform_and_lower) reflects major transformations, such as running in fp16 or integer-only arithmetic.
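For context, here is a sketch of today's pt2e eager evaluation path, which Option 2 would extend so that the converted model's eager execution uses real quantized kernels and/or reflects backend transformations. The quantizer import path and export entry point have moved between releases, so treat this as an outline.

```python
import torch
from torch.export import export_for_training
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
# The quantizer import path has moved between torch.ao and executorch releases.
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 16),)

quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())

exported = export_for_training(model, example_inputs).module()
prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)                # calibration pass(es)
quantized = convert_pt2e(prepared)

# Today this runs quantize/dequantize (simulated) ops in fp32 eager mode; the
# proposal is for this call to optionally dispatch to true integer kernels so
# the result tracks on-device numerics more closely.
eager_out = quantized(*example_inputs)
```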
Pros
Cons
Option 3 - Run On-Device
We can build an evaluation flow that runs on a local Android or iOS device. The user would bundle input tensors, trigger a run from the host machine, and get output tensors and/or accuracy statistics back.
The tool would ideally be triggered from either the CLI or a Python environment and would bring output tensors back to the host as torch tensors for evaluation.
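To make the intended workflow concrete, here is a purely hypothetical host-side interface sketch. The `run_on_device` helper does not exist today; it only stands in for whatever the tool would expose (CLI wrapper, Python module, or both).

```python
from typing import List, Sequence
import torch

def run_on_device(
    pte_path: str,
    inputs: Sequence[Sequence[torch.Tensor]],
    device: str = "android",   # or "ios"
) -> List[List[torch.Tensor]]:
    """Hypothetical helper: bundle the inputs, push the .pte to a locally
    connected device, execute each input set there, and return the output
    tensors to the host as torch tensors."""
    raise NotImplementedError("illustrative interface sketch only")

# Intended usage (would fail today, since the helper is only a stub):
# batches = [(torch.randn(1, 3, 224, 224),) for _ in range(8)]
# outputs = run_on_device("model.pte", batches, device="android")
# Outputs land back in Python, so any existing evaluation pipeline
# (top-1 accuracy, perplexity, task-specific metrics) can consume them.
```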
Pros
Cons
Design (Performance Evaluation)
We can provide a way to capture inference time, load time, and process memory deltas for a given .pte.
Inputs
Outputs
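As a hypothetical illustration only (these field names are not an existing schema), the report might look something like the following, with timing and memory values filled in by the tool:

```python
# Hypothetical report structure; values shown are placeholders.
example_report = {
    "model": "model.pte",
    "device": "Pixel 8 (Android 14)",
    "iterations": 50,
    "load_time_ms": 0.0,
    "inference_time_ms": {"mean": 0.0, "p50": 0.0, "p90": 0.0},
    "process_memory_delta_bytes": {"load": 0, "inference": 0},
}
```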
Option 1 - CLI Tooling
Provide a command-line utility to build, deploy, and run a benchmark runner on a locally connected Android or iOS device. This would be an all-in-one tool that handles the build and deployment steps to the extent possible.
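For Android, the build-deploy-run sequence the tool would automate might look roughly like the sketch below. The benchmark runner binary name, its flags, and the on-device scratch directory are assumptions, not an existing ExecuTorch interface.

```python
import subprocess

def benchmark_on_android(pte_path: str, runner_binary: str, iterations: int = 50) -> str:
    """Push a .pte and a prebuilt benchmark runner to a connected Android device
    via adb, run it, and return the raw stdout for the host to parse."""
    device_dir = "/data/local/tmp/et_bench"    # assumed scratch location
    runner_name = runner_binary.split("/")[-1]
    model_name = pte_path.split("/")[-1]
    subprocess.run(["adb", "shell", "mkdir", "-p", device_dir], check=True)
    subprocess.run(["adb", "push", pte_path, device_dir], check=True)
    subprocess.run(["adb", "push", runner_binary, device_dir], check=True)
    subprocess.run(["adb", "shell", "chmod", "755", f"{device_dir}/{runner_name}"], check=True)
    # The runner name and flags below are placeholders for whatever benchmark
    # binary the tool builds and deploys.
    result = subprocess.run(
        ["adb", "shell",
         f"{device_dir}/{runner_name} --model {device_dir}/{model_name} --iterations {iterations}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```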
Pros
Cons
Option 2 - Extend Demo / Benchmarking Apps
We can invest in our demo and/or benchmarking apps to provide a streamlined way for users to quickly load their own .pte files and capture data.
Pros
Cons
Note to reviewers: Existing or planned benchmarking work may cover some or all of this. If so, feel free to let me know. The more we can re-use existing work, the better.
Alternatives
If integrating a user evaluation pipeline is not required, the design for on-device execution can be simplified by reporting accuracy statistics against bundled reference results. These may include absolute and relative error, signal-to-quantization-noise ratio (SQNR), error L2 norm, cross-entropy loss, and other common error statistics.
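A sketch of these statistics, computed between device outputs and bundled reference outputs, is below. The definitions follow common conventions (e.g. SQNR in dB); the exact metric set would be part of the tool's design.

```python
import torch

def error_stats(reference: torch.Tensor, actual: torch.Tensor) -> dict:
    """Common elementwise error statistics between a reference output and an
    on-device (or pybindings) output, both as float tensors."""
    err = actual - reference
    eps = 1e-12  # guard against division by zero
    return {
        "max_abs_error": err.abs().max().item(),
        "max_rel_error": (err.abs() / (reference.abs() + eps)).max().item(),
        "l2_error_norm": err.norm().item(),
        "sqnr_db": (10 * torch.log10(reference.pow(2).mean() / (err.pow(2).mean() + eps))).item(),
        # Cross-entropy against reference logits applies specifically to
        # classifier outputs and is omitted here.
    }
```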
We can also do nothing and rely on the status quo. However, accuracy evaluation for hardware backends has been a blocker for adoption of hardware acceleration, and not providing a solution risks reducing adoption of hardware acceleration on ExecuTorch.