RFC: Local Benchmarking and Evaluation Tooling #11214
GregoryComer started this conversation in Ideas
Note: This initial draft is being actively iterated on. An updated draft will be posted within the next few days.
The purpose of this RFC is to solicit feedback from ExecuTorch users and contributors on on-device benchmarking and accuracy evaluation tooling. This document focuses on the high-level problem and goals, with some consideration given to the interface and high-level design.
Questions for Reviewers
Problem
It is difficult to understand the performance and accuracy implications of backend delegation and quantization. When deploying a model on-device, users need to know how lowering decisions affect both model accuracy and inference time. Modeling experts typically do not want to write app code or fiddle with complex builds to get these answers.
Backends can and do perform arbitrary transformations of the graph. They might, for example, convert delegated portions of the graph to run in 16-bit floating point or as integer-only arithmetic.
For accuracy evaluation, we provide Python bindings for some backends. These may be a solution, but they are not currently available for all backends, and we do not provide any guarantee of numerical parity with on-device execution.
Quantized models can also be run in normal PyTorch eager mode before lowering to ExecuTorch. However, it is widely reported that actual on-device kernels using true quantized arithmetic (often with integer accumulation in the inner loop) can differ by a material amount from eager-mode fake quantization.
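As a backend-agnostic illustration of one source of this divergence, the sketch below compares fake-quantized fp32 arithmetic against emulated integer arithmetic with a single requantization step. The scales, shapes, and symmetric int8 scheme are arbitrary examples, not values from any particular backend.

```python
# Backend-agnostic sketch of why eager fake quant and true quantized kernels
# can disagree: fake quant accumulates in fp32 and rounds the fp32 result,
# while a quantized kernel accumulates integers exactly and requantizes once.
import torch

torch.manual_seed(0)
x = torch.randn(128, 512)
w = torch.randn(512, 512)
x_scale, w_scale, y_scale = 0.05, 0.02, 0.1   # illustrative scales only

def quantize(t, scale):
    return torch.clamp((t / scale).round(), -127, 127)

# Path A: fake quant. Dequantize, matmul in fp32, fake-quantize the output.
y_fp32 = (quantize(x, x_scale) * x_scale) @ (quantize(w, w_scale) * w_scale)
y_fake = quantize(y_fp32, y_scale) * y_scale

# Path B: integer arithmetic. Exact integer accumulation (emulated in float64,
# which is exact at these magnitudes), then a single requantization step.
acc = quantize(x, x_scale).double() @ quantize(w, w_scale).double()
y_real = torch.clamp((acc * (x_scale * w_scale / y_scale)).round(), -127, 127) * y_scale

diff = (y_fake - y_real).abs()
print(f"max abs diff: {diff.max().item():.4f}, mismatched outputs: {(diff > 0).sum().item()}")
# Real backends add further divergence sources this toy example does not model:
# fp16 intermediates, fixed-point requantization multipliers, rounding modes,
# and fused activations.
```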
Goals
Note to reviewers: Accuracy and inference time may be captured by a single tool or by separate tools. I’ve included them in a single RFC because they are commonly evaluated together and might be addressed by the same tool. See the design sections below.
Non-Goals
Design (Accuracy Evaluation)
To meet the accuracy evaluation goals described above, there are a few options: we can leverage Python bindings to run inference from a Python environment, or we can provide a tool that runs on a locally connected device.
Option 1 - Python Runtime Bindings
We can recommend Python runtime bindings (pybindings) as the preferred approach to evaluate delegated model accuracy. This would require ensuring that Python bindings are available on the host for all P0 backends and updating documentation to recommend this approach for evaluation.
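For illustration, a minimal sketch of this workflow is below. It assumes the pybindings are built with the relevant backend registered; the module path and `forward()` signature have shifted between releases, and `model.pte` is a hypothetical path to the lowered version of the same model.

```python
import torch

# NOTE: the pybindings module path and forward() signature vary across
# ExecuTorch releases; treat this as an outline rather than a drop-in script.
from executorch.extension.pybindings.portable_lib import _load_for_executorch

# Reference eager model (stand-in for the user's real model) and a sample input.
eager_model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
sample_input = torch.randn(1, 16)

# "model.pte" is a hypothetical artifact produced by quantizing and/or
# delegating the same model through the normal export/lowering flow.
et_module = _load_for_executorch("model.pte")
et_out = et_module.forward((sample_input,))[0]
ref_out = eager_model(sample_input)

print("max abs error vs eager:", (ref_out - et_out).abs().max().item())
```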
Pros
Cons
Option 2 - Match Eager to Backend/Quantized Numerics
If we can better match on-device numerics in eager mode, users can simply use their quantized and/or delegated Python model for evaluation. For quantization, this might take the form of providing a way to run pt2e quantized operators using “real” quantized kernels.
For backends, this could involve updating backend logic to ensure that eager execution of lowered partitions (running the graph after to_edge_transform_and_lower) reflects major transformations, such as running in fp16 or integer-only arithmetic.
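For context, here is a sketch of today's pt2e eager evaluation path, which Option 2 would extend so that the converted model's eager execution uses real quantized kernels and/or reflects backend transformations. The quantizer import path and export entry point have moved between releases, so treat this as an outline.

```python
import torch
from torch.export import export_for_training
from torch.ao.quantization.quantize_pt2e import prepare_pt2e, convert_pt2e
# The quantizer import path has moved between torch.ao and executorch releases.
from executorch.backends.xnnpack.quantizer.xnnpack_quantizer import (
    XNNPACKQuantizer,
    get_symmetric_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 16),)

quantizer = XNNPACKQuantizer()
quantizer.set_global(get_symmetric_quantization_config())

exported = export_for_training(model, example_inputs).module()
prepared = prepare_pt2e(exported, quantizer)
prepared(*example_inputs)                # calibration pass(es)
quantized = convert_pt2e(prepared)

# Today this runs quantize/dequantize (simulated) ops in fp32 eager mode; the
# proposal is for this call to optionally dispatch to true integer kernels so
# the result tracks on-device numerics more closely.
eager_out = quantized(*example_inputs)
```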
Pros
Cons
Option 3 - Run On-Device
We can build an evaluation flow that runs on a local Android or iOS device. The user would bundle input tensors, trigger a run from the host machine, and get output tensors and/or accuracy statistics back.
The tool would ideally be triggered from either the CLI or a Python environment and would bring output tensors back to the host as torch tensors for evaluation.
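To make the intended workflow concrete, here is a purely hypothetical host-side interface sketch. The `run_on_device` helper does not exist today; it only stands in for whatever the tool would expose (CLI wrapper, Python module, or both).

```python
from typing import List, Sequence
import torch

def run_on_device(
    pte_path: str,
    inputs: Sequence[Sequence[torch.Tensor]],
    device: str = "android",   # or "ios"
) -> List[List[torch.Tensor]]:
    """Hypothetical helper: bundle the inputs, push the .pte to a locally
    connected device, execute each input set there, and return the output
    tensors to the host as torch tensors."""
    raise NotImplementedError("illustrative interface sketch only")

# Intended usage (would fail today, since the helper is only a stub):
# batches = [(torch.randn(1, 3, 224, 224),) for _ in range(8)]
# outputs = run_on_device("model.pte", batches, device="android")
# Outputs land back in Python, so any existing evaluation pipeline
# (top-1 accuracy, perplexity, task-specific metrics) can consume them.
```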
Pros
Cons
Design (Performance Evaluation)
We can provide a way to capture inference time, load time, and process memory deltas for a given .pte.
Inputs
Outputs
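As a hypothetical illustration only (these field names are not an existing schema), the report might look something like the following, with timing and memory values filled in by the tool:

```python
# Hypothetical report structure; values shown are placeholders.
example_report = {
    "model": "model.pte",
    "device": "Pixel 8 (Android 14)",
    "iterations": 50,
    "load_time_ms": 0.0,
    "inference_time_ms": {"mean": 0.0, "p50": 0.0, "p90": 0.0},
    "process_memory_delta_bytes": {"load": 0, "inference": 0},
}
```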
Option 1 - CLI Tooling
Provide a command-line utility to build, deploy, and run a benchmark runner on a locally connected Android or iOS device. This would be an all-in-one tool that handles the build and deployment steps to the extent possible.
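For Android, the build-deploy-run sequence the tool would automate might look roughly like the sketch below. The benchmark runner binary name, its flags, and the on-device scratch directory are assumptions, not an existing ExecuTorch interface.

```python
import subprocess

def benchmark_on_android(pte_path: str, runner_binary: str, iterations: int = 50) -> str:
    """Push a .pte and a prebuilt benchmark runner to a connected Android device
    via adb, run it, and return the raw stdout for the host to parse."""
    device_dir = "/data/local/tmp/et_bench"    # assumed scratch location
    runner_name = runner_binary.split("/")[-1]
    model_name = pte_path.split("/")[-1]
    subprocess.run(["adb", "shell", "mkdir", "-p", device_dir], check=True)
    subprocess.run(["adb", "push", pte_path, device_dir], check=True)
    subprocess.run(["adb", "push", runner_binary, device_dir], check=True)
    subprocess.run(["adb", "shell", "chmod", "755", f"{device_dir}/{runner_name}"], check=True)
    # The runner name and flags below are placeholders for whatever benchmark
    # binary the tool builds and deploys.
    result = subprocess.run(
        ["adb", "shell",
         f"{device_dir}/{runner_name} --model {device_dir}/{model_name} --iterations {iterations}"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```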
Pros
Cons
Option 2 - Extend Demo / Benchmarking Apps
We can invest in our demo and/or benchmarking apps to provide a streamlined way for users to quickly load their own .pte files and capture data.
Pros
Cons
Note to reviewers: Existing or planned benchmarking work may cover some or all of this. If so, feel free to let me know. The more we can re-use existing work, the better.
Alternatives
If integrating a user evaluation pipeline is not required, the design for on-device execution can be simplified by reporting accuracy statistics against bundled reference results. These may include absolute and relative error, signal-to-quantization-noise ratio (SQNR), error L2 norm, cross-entropy loss, and other common error statistics.
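A sketch of these statistics, computed between device outputs and bundled reference outputs, is below. The definitions follow common conventions (e.g. SQNR in dB); the exact metric set would be part of the tool's design.

```python
import torch

def error_stats(reference: torch.Tensor, actual: torch.Tensor) -> dict:
    """Common elementwise error statistics between a reference output and an
    on-device (or pybindings) output, both as float tensors."""
    err = actual - reference
    eps = 1e-12  # guard against division by zero
    return {
        "max_abs_error": err.abs().max().item(),
        "max_rel_error": (err.abs() / (reference.abs() + eps)).max().item(),
        "l2_error_norm": err.norm().item(),
        "sqnr_db": (10 * torch.log10(reference.pow(2).mean() / (err.pow(2).mean() + eps))).item(),
        # Cross-entropy against reference logits applies specifically to
        # classifier outputs and is omitted here.
    }
```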
We can also do nothing and rely on the status quo. However, accuracy evaluation for hardware backends has been a blocker for adoption of hardware acceleration, and not providing a solution risks reducing adoption of hardware acceleration on ExecuTorch.