diff --git a/docs/source/backend-arm-ethos-u-dedicated_sram.png b/docs/source/backend-arm-ethos-u-dedicated_sram.png
new file mode 100644
index 00000000000..43b17266cf1
Binary files /dev/null and b/docs/source/backend-arm-ethos-u-dedicated_sram.png differ
diff --git a/docs/source/backend-arm-ethos-u-shared_sram.png b/docs/source/backend-arm-ethos-u-shared_sram.png
new file mode 100644
index 00000000000..d52d685d6e7
Binary files /dev/null and b/docs/source/backend-arm-ethos-u-shared_sram.png differ
diff --git a/docs/source/backend-arm-ethos-u-sram_only.png b/docs/source/backend-arm-ethos-u-sram_only.png
new file mode 100644
index 00000000000..bc09addbffa
Binary files /dev/null and b/docs/source/backend-arm-ethos-u-sram_only.png differ
diff --git a/docs/source/backends-arm-ethos-u.md b/docs/source/backends-arm-ethos-u.md
index 8062f6ae1c5..ae14cb9901f 100644
--- a/docs/source/backends-arm-ethos-u.md
+++ b/docs/source/backends-arm-ethos-u.md
@@ -72,17 +72,116 @@ with open("mv2_arm_ethos_u55.pte", "wb") as file:
     edge_program_manager.write_to_file(file)
 ```

+### Ethos-U memory modes
+The Ethos-U NPU provides two distinct memory interfaces:
+- One interface for **low-latency, high-bandwidth memory**,
+typically on-chip memory such as **SRAM**.
+- One interface for **higher-latency, lower-bandwidth memory**,
+typically external (off-chip) memory such as **Flash** or **DRAM**.
+
+On all Ethos-U NPUs (Ethos-U55, Ethos-U65, Ethos-U85), the low-latency interface is usually connected to the SRAM of the SoC.
+The external memory type depends on the SoC:
+- On a low-power microcontroller, the external memory is usually Flash.
+- On systems with Cortex-A and a rich operating system, the external memory is typically DRAM.
+
+When running an inference, the Ethos-U compiler and Ethos-U driver make use of three logical memory regions:
+- Ethos-U scratch buffer - a contiguous block of memory used by the NPU to store the intermediate tensors produced and consumed during inference.
+- Neural Network - a contiguous block of memory holding constant data such as weights, biases and quantization parameters required to run an inference.
+- Ethos-U fast scratch buffer - a contiguous block of memory, assumed to reside in on-chip memory in order to hide the higher latency/lower bandwidth of the external memory. Only applicable to Ethos-U65 and Ethos-U85 on systems
+with Cortex-A, where the external memory is assumed to be DRAM.
+
+The placement of the scratch buffer and the Neural Network determines the memory mode to be used in the Ethos-U
+compile specification. We support three different placements of the scratch buffer and the ML model.
+
+#### 1. Sram-Only Memory Mode
+- Ethos-U scratch buffer resides in the SRAM.
+- Neural Network resides in the SRAM.
+- Ethos-U fast scratch buffer is not used.
+- Characteristics:
+  - Provides the best performance since all the memory traffic passes via the low-latency/high-bandwidth memory.
+  - The performance uplift is especially noticeable for workloads that are memory-bound on the external interface.
+  - Available on Ethos-U55, Ethos-U65 and Ethos-U85.
+- Limitations:
+  - Embedded SoCs often have limited SRAM and NNs are becoming larger, so this memory mode may be unsuitable for a system running a big model relative to the amount of SRAM available on the SoC.
+Below, you can see a visual representation of the placement of the two logical memory regions for the Sram-Only configuration.
+
+![](backend-arm-ethos-u-sram_only.png)
+
+#### 2. Shared-Sram Memory Mode
+- Ethos-U scratch buffer resides in the SRAM.
+- Neural Network resides in the External memory.
+- Ethos-U fast scratch buffer is not used.
+- Characteristics:
+  - Intermediate tensors are stored in the SRAM, leveraging its low latency and high bandwidth.
+  - The Ethos-U compiler can prefetch weights from the external memory to the SRAM ahead of time so that when the NPU needs the data, it is already available in the on-chip memory.
+  - In this mode, the external memory interface is read-only and the on-chip memory interface is read/write.
+  - Shared-Sram offers a good balance between performance and low SRAM usage.
+  - Available on Ethos-U55, Ethos-U65 and Ethos-U85.
+- Limitations:
+  - You need to have enough space in the SRAM to hold the peak intermediate tensor.
+Below, you can see a visual representation of the placement of the two logical memory regions for the Shared_Sram configuration.
+
+![](backend-arm-ethos-u-shared_sram.png)
+
+#### 3. Dedicated-Sram Memory Mode
+- Ethos-U scratch buffer resides in the External memory.
+- Neural Network resides in the External memory.
+- Ethos-U fast scratch buffer resides in the on-chip memory.
+- Characteristics:
+  - Used when the peak intermediate tensor is too big to fit into the on-chip memory.
+  - Enables silicon acceleration of large models.
+  - The NPU stores the results from the intermediate computations in the external memory.
+  - The dedicated SRAM acts as a software-managed cache, improving performance by prefetching frequently accessed tensors to the on-chip memory.
+  - Available on Ethos-U65 and Ethos-U85.
+- Limitations:
+  - The SRAM space must be dedicated exclusively to the Ethos-U (the host processor should not access it).
+  - Not available on Ethos-U55.
+Below, you can see a visual representation of the placement of the two logical memory regions for the Dedicated_Sram configuration.
+
+![](backend-arm-ethos-u-dedicated_sram.png)
+
+Here is a table comparing the three memory modes:
+
+| Memory Mode | Ethos-U Scratch Buffer Placement | Neural Network Placement | When to Use | Trade-off |
+|--------------------|----------------------------------|------------------------------|-------------|-----------|
+| **SRAM-Only** | On-chip SRAM | On-chip SRAM | When the ML model, the Ethos-U scratch buffer and the wider software stack fit within the SRAM of the SoC | Limited by SRAM size; often not feasible for larger NNs |
+| **Shared-SRAM** | On-chip SRAM | External memory (Flash/DRAM) | Most common mode on Cortex-M and Ethos-U systems; balances good performance and SRAM usage | Requires enough SRAM to hold the largest intermediate tensor |
+| **Dedicated-SRAM** | External memory | External memory (Flash/DRAM) | Most common mode for Cortex-A and Ethos-U systems; for very large models where the peak intermediates cannot fit in SRAM | Needs high-bandwidth external memory to deliver good performance |
+
+The memory modes are defined within the [vela.ini file](https://gitlab.arm.com/artificial-intelligence/ethos-u/ethos-u-vela/-/blob/main/ethosu/config_files/Arm/vela.ini?ref_type=heads). When you install
+ExecuTorch for the Ethos-U backend, you automatically install the Ethos-U compiler, which contains the vela.ini file, so you can directly create a compile specification with these memory modes.
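+
+To make this concrete, below is a minimal sketch of how a memory mode can be selected when building the Ethos-U compile specification in the AoT flow. It assumes the `ArmCompileSpecBuilder` helper and its `ethosu_compile_spec()` arguments; depending on your ExecuTorch version the compile specification may be built with a different helper (see the lowering example earlier on this page), but the `memory_mode` values are always the mode names defined in vela.ini.
+
+```
+# Sketch only: the exact compile-spec helper may differ between ExecuTorch versions.
+from executorch.backends.arm.arm_backend import ArmCompileSpecBuilder
+
+compile_spec = (
+    ArmCompileSpecBuilder()
+    .ethosu_compile_spec(
+        "ethos-u55-128",
+        system_config="Ethos_U55_High_End_Embedded",
+        # One of "Sram_Only", "Shared_Sram" or "Dedicated_Sram", as defined in vela.ini.
+        memory_mode="Shared_Sram",
+        # Extra Vela options, e.g. "--optimise Size" (see the next section), can be appended here.
+        extra_flags="--optimise Size",
+    )
+    .build()
+)
+```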
+
+#### Interpreting the output from the Ethos-U compiler regarding the memory footprint
+As part of the `to_edge_transform_and_lower` step, you will see memory footprint information presented as:
+
+```
+Total SRAM used 2467.27 KiB
+Total Off-chip Flash used 12.20 KiB
+```
+The `Total SRAM used` indicates the peak SRAM utilization needed by the NPU in order to perform an inference. In the snippet above, the Ethos-U compiler requires 2467.27 KiB of SRAM in order to schedule the inference.
+Therefore, from an application standpoint, you need to ensure you have at least 2467.27 KiB of SRAM on the SoC to run this model. The Ethos-U compiler provides a scheduling algorithm that can
+lower the peak SRAM usage within reasonable limits; to use it, add the `--optimise Size` or `--arena-cache-size` CLI options to the compile spec. You can read more about the options of the
+Ethos-U compiler in the documentation [here](https://gitlab.arm.com/artificial-intelligence/ethos-u/ethos-u-vela/-/blob/main/OPTIONS.md#optimise). If the peak SRAM usage remains too high in
+the Shared_Sram memory mode, you would need to use the Dedicated_Sram mode in order to store the Neural Network and the Ethos-U scratch buffer in the external memory.
+The main advantage of the Dedicated_Sram memory mode is that you can run large models and still benefit from the low latency/high bandwidth of the SRAM, which is used as a cache.
+
+It is important to highlight that when you specify a memory mode in the compile spec, the user is expected to place the scratch buffer and the NN in the correct memory locations at runtime.
+In other words, when you specify, for example, the Shared_Sram memory mode, the runtime application logic should place the Ethos-U scratch buffer in the on-chip memory and the NN in the external memory for optimal performance.
+
+You can see how this coupling between the memory mode and the runtime application is handled in the [Ethos-U porting guide](../../examples/arm/ethos-u-porting-guide.md).
+
 ### Partitioner API

 `EthosUPartitioner` tries to partition as much of the model as possible. It will never delegate unsupported operators, but a user can pass additional checks to the constructor to avoid partitioning additional
 operators. To do this, subclass `OperatorSupportBase` and implement the function `is_node_supported`. A few such checks exist in `executorch.exir.backend.operator_support`:

 - `DontPartition`: Don't partition operators based on operator type.
 - `DontPartitionModule`: Don't partition operators based on which python module the operator comes from.
-- `DontPartitionName`: Don't partition opertors based on the operator name.
+- `DontPartitionName`: Don't partition operators based on the operator name.

 ### Quantization

-A fully integer model is required for using the Arm Ethos-U backend. As discussed above, you can quantize floating point models with the the `EthosUQuantizer`. Quantizers are backend specific, which means the `EthosUQuantizer` is configured to quantize models correctly for the target.
+A fully integer model is required for using the Arm Ethos-U backend. As discussed above, you can quantize floating point models with the `EthosUQuantizer`. Quantizers are backend specific, which means the `EthosUQuantizer` is configured to quantize models correctly for the target.
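+
+For reference, below is a condensed sketch of that quantization flow. It is illustrative only: it assumes the `EthosUQuantizer` import path and the PT2E `prepare_pt2e`/`convert_pt2e` helpers of recent ExecuTorch/PyTorch releases, as well as a `model`, `example_inputs` and `compile_spec` defined as shown earlier on this page; adapt the imports to your installed version.
+
+```
+import torch
+from executorch.backends.arm.quantizer.arm_quantizer import (
+    EthosUQuantizer,
+    get_symmetric_quantization_config,
+)
+from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e
+
+# The quantizer is configured from the same compile spec used for lowering.
+quantizer = EthosUQuantizer(compile_spec)
+quantizer.set_global(get_symmetric_quantization_config())
+
+exported_module = torch.export.export_for_training(model, example_inputs).module()
+prepared_module = prepare_pt2e(exported_module, quantizer)
+prepared_module(*example_inputs)  # Calibrate with representative inputs.
+quantized_module = convert_pt2e(prepared_module)
+# quantized_module can now be exported and lowered with EthosUPartitioner as shown above.
+```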

 ## Runtime Integration

diff --git a/examples/arm/ethos-u-porting-guide.md b/examples/arm/ethos-u-porting-guide.md
new file mode 100644
index 00000000000..a5ff0c50c6c
--- /dev/null
+++ b/examples/arm/ethos-u-porting-guide.md
@@ -0,0 +1,128 @@
+# Introduction
+
+As you have seen in the [Arm® Ethos™-U NPU backend tutorial](../../docs/source/backends-arm-ethos-u.md), ExecuTorch has two distinct parts:
+- Ahead-of-time (AoT) compile flow
+- Ethos-U on-device runtime
+
+In this porting guide, we walk you through the main steps to port your SoC with an Ethos-U to the ExecuTorch Ethos-U backend in order to leverage the ExecuTorch Ethos-U enablement. We assume
+you are familiar with the concepts introduced in `backends-arm-ethos-u.md`, that you have already generated a pte file in the AoT flow, and that you want to deploy the ML model on device.
+Fundamentally, there are two main approaches you can take when porting a SoC with an Ethos-U NPU to the ExecuTorch runtime:
+- You can use the enablement we have done in ExecuTorch for the Arm® Corstone™-300 (Arm® Cortex®-M55 and Arm® Ethos™-U55 reference design) and
+Arm® Corstone™-320 (Arm® Cortex®-M85 and Arm® Ethos™-U85 reference design) and migrate from the Corstone platform towards a new platform.
+- If the SoC comes with an SDK that is not based on ExecuTorch, you can replace the runtime SDKs with the corresponding APIs from the ExecuTorch runtime.
+
+It is important to understand that, irrespective of whether the SoC comes with or without an SDK, there is
+[a single Ethos-U driver](https://gitlab.arm.com/artificial-intelligence/ethos-u/ethos-u-core-driver) that any SoC with an Ethos-U relies on. For that reason, there will be
+overlap between the two approaches when it comes to the enablement of the Ethos-U NPU.
+
+## Functioning of a system with an Ethos-U
+A system with an Ethos-U and a Cortex-M or Arm® Cortex®-A functions in the following way:
+- The CPU (Cortex-M or Cortex-A) dispatches an inference job to the Ethos-U NPU. The inference job is in the form of a command stream - a sequence of instructions that the NPU
+executes. The command stream is generated as part of the `to_edge_transform_and_lower` AoT compile stage. The command stream is embedded within the pte file and at runtime,
+the pte file is stored in the memory of the SoC.
+- The Ethos-U NPU autonomously reads the command stream from memory. When the NPU finishes processing the command stream, it raises an interrupt to the CPU to signal
+that the inference job is complete. The CPU executes the interrupt handler and resumes its normal execution.
+
+### Ethos-U memory regions
+In order to allow this functioning, the Ethos-U driver defines the following regions that the NPU hardware will access:
+- Ethos-U scratch buffer - a contiguous block of memory used by the NPU to store the intermediate tensors produced and consumed during inference. Applicable to any Ethos-U NPU.
+- Neural Network - a contiguous block of memory holding constant data such as weights, biases and quantization parameters required to run an inference. Applicable to
+any Ethos-U NPU.
+- Ethos-U fast scratch buffer - a contiguous block of memory for the case when the Ethos-U scratch buffer and Neural Network are both in the external memory.
+Applicable only to Ethos-U65 and Ethos-U85 in the Dedicated_Sram memory mode.
+
+### Ethos-U driver
+The key function of the Ethos-U driver enabling the interaction with the NPU is
+[ethosu_invoke_v3](https://github.com/pytorch/executorch/blob/main/backends/arm/runtime/EthosUBackend.cpp#L324).
+The `ethosu_invoke_v3` function takes as input a driver handle,
+a pointer to the command stream and the size of the command stream, and the base addresses as well as the sizes of the base addresses. For a system with Cortex-M, there is a 1:1
+mapping between base pointer and region, so we will pass three base pointers and each base pointer will correspond to one region. Then, as part of the compilation stage
+in `to_edge_transform_and_lower`, the Ethos-U compiler will generate a command stream taking the three regions into account. In other words, at runtime, the Ethos-U driver
+knows the address of the command stream, its size, as well as the addresses of the memory locations needed to store the intermediate tensors.
+The Ethos-U driver will pass these addresses to the NPU and the NPU will issue memory requests to the on-chip or external memories in order to access the necessary data
+(e.g. read weights, store an intermediate result into the scratch buffer, etc.). The `backends/arm/runtime/EthosUBackend.cpp` already integrates the Ethos-U driver and
+already supports the three memory modes of the Ethos-U. Therefore, you should reuse `backends/arm/runtime/EthosUBackend.cpp` as is, without modifications.
+The key question for any porting effort is how to initialize the NPU and
+make sure it works. Let's analyse this question in the following sections.
+
+**Note:** Interrupts work differently between Cortex-M and Cortex-A, and a system with Cortex-A will use more base pointers and won't have a 1:1 mapping between Ethos-U
+driver base pointers and Ethos-U regions. The `backends/arm/runtime/EthosUBackend.cpp` is written for a system with Cortex-M. Going forward, we assume we have a system with Cortex-M, similar to the Corstone platforms.
+
+## NPU initialization
+In order to initialize the NPU hardware, the software needs to provide correct information about:
+- The base address of the Ethos-U NPU in the memory map of the SoC.
+- The interrupt assignment for the Ethos-U. You also need to provide the interrupt priority.
+
+In the `executorch/examples/arm/executor_runner/arm_executor_runner.cpp` sample application, we inherit the Corstone-300/Corstone-320 NPU initialization done in the [core-platform project](https://gitlab.arm.com/artificial-intelligence/ethos-u/ethos-u-core-platform).
+We include the core-platform project as a dependency in the `backends/arm/scripts/corstone_utils.cmake` script. Note that in [corstone_utils.cmake](https://github.com/pytorch/executorch/blob/main/backends/arm/scripts/corstone_utils.cmake#L69),
+depending on whether we target Ethos-U55 or Ethos-U85, we include the corresponding target from core-platform. Then, inside core-platform, the NPU base address and interrupt assignment are defined in
+[target.cpp](https://gitlab.arm.com/artificial-intelligence/ethos-u/ethos-u-core-platform/-/blob/main/targets/corstone-320/target.cpp?ref_type=heads#L44) as per the memory map of the Corstone-300/Corstone-320.
+It's worth mentioning that the code in core-platform (code we reuse in the `examples/arm/executor_runner/arm_executor_runner.cpp`) also calls the `ethosu_init`
+function to initialize the NPU. The `ethosu_init` function is [defined in the Ethos-U driver](https://gitlab.arm.com/artificial-intelligence/ethos-u/ethos-u-core-driver/-/blob/main/src/ethosu_driver.c?ref_type=heads#L409). The Ethos-U driver itself is
+included within the core-platform CMake.
+In other words, to initialize the NPU in the ExecuTorch executor runner application, we reuse the Ethos-U initialization that has been done in the core-platform project.
+Core-platform includes a [tutorial](https://gitlab.arm.com/artificial-intelligence/ethos-u/ethos-u-core-platform/-/blob/main/PORTING.md?ref_type=heads) on porting a new target. If you port your target to
+core-platform, you can then easily reuse it in the ExecuTorch runtime.
+
+Also, as explained in the comments in `backends/arm/scripts/corstone_utils.cmake`, note that the REGIONCFG register of the Ethos-U controls the memory (on-chip or external memory) used by the NPU to
+access the Ethos-U scratch buffer, the ML model and the Ethos-U fast scratch buffer. The REGIONCFG register is defined in the Ethos-U driver and you need to configure it differently depending on the memory mode.
+You can see in `backends/arm/scripts/corstone_utils.cmake` how we overwrite the register with the correct value depending on the memory mode.
+
+For the Corstone platforms, we make use of the Timing Adapters to model different memory latencies. The Timing Adapters are only applicable to the Corstone targets and should not be included for another SoC.
+
+## Corstone linker scripts
+
+The linker scripts tell the linker where to place various objects in memory when the application is loaded onto the target.
+In the `arm_executor_runner.cpp` application, we reuse the linker scripts from the core-platform project. Note that the Global Offset Table (.got symbols) needs to be 16-byte aligned. The linker scripts are highly specific to the memory map of the system.
+For example, on the Corstone-300, in order to allow us to build a large number of portable kernels, we relocate the portable kernels from the .text section, which lives in the ITCM, to a bigger memory. Also on the Corstone-300, the linker script defines two
+load regions - rom_exec and rom_dram - corresponding to loading the application in the ITCM and in the DDR. When you deploy the application, the boot loader copies the two binaries from the rom_exec/rom_dram regions to their physical addresses in memory -
+a process known as scatter loading. Upon powering on the device, the very first code that the Cortex-M executes is the ResetHandler function. The ResetHandler is one of the first entries in the interrupt vector table, and the location of the vector
+table is specified in the linker script (the `KEEP(*(.vectors))` symbol). The assembly boot code powering on the Cortex-M is itself defined in `core_software/cmsis_6/CMSIS/CoreValidation/Layer/Target/CM55S/RTE/Device/ARMCM55/startup_ARMCM55.c`.
+The CMSIS start-up code for Cortex-M is added as part of the build system of the core-platform applications.
+
+## Coupling between the AoT compile specification memory mode, linker script and the application logic
+It is important to note that when you specify a memory mode in the Python script used to generate the pte file, the user is
+expected to place the scratch buffer and the NN in the correct memory locations at runtime.
+
+For example, if you generate a pte file with a compile specification for Shared_Sram, the scratch buffer should be placed in the SRAM and the NN in the external memory in the runtime application code. You can see we are following
+this approach in the `examples/arm/executor_runner/arm_executor_runner.cpp` example application.
+In the linker scripts for the application (`examples/arm/executor_runner/Corstone-320.ld` and
+`examples/arm/executor_runner/Corstone-300.ld`) we check the value of `ETHOSU_ARENA` to determine whether the Ethos-U scratch buffer is placed in the on-chip memory or in the external memory. In this
+way, depending on the `ETHOSU_ARENA` parameter, the linker knows whether the symbol is to be placed in the .ddr or the .sram.bss section. The `ETHOSU_ARENA` parameter is set in `backends/arm/scripts/corstone_utils.cmake` and
+its value is derived from the memory mode parameter that is passed to the `examples/arm/run.sh` shell script. Then, at link time, the .ddr section is always placed in the external memory and the .sram.bss section is always placed in the SRAM.
+Finally, note that in the `examples/arm/executor_runner/arm_executor_runner.cpp` application code, we place the buffers for the Ethos-U scratch buffer and the neural network in the correct sections defined in the linker script. For instance,
+the Ethos-U scratch buffer corresponds to the `.bss.tensor_arena` section in the linker script. In the application code, when we allocate memory for the Ethos-U scratch buffer, we place this array in the .bss.tensor_arena section in the memory map.
+
+```
+unsigned char __attribute__((
+    section(".bss.tensor_arena"),
+    aligned(16))) temp_allocation_pool[temp_allocation_pool_size];
+```
+The `.bss.tensor_arena` section is then placed in the correct location in the memory map thanks to
+the `ETHOSU_ARENA` parameter.
+
+There is a tight coupling between the memory mode for the Ethos-U and the placement of the Ethos-U scratch buffer,
+the Ethos-U fast scratch buffer (only applicable for Dedicated_Sram) and the neural network in the memory map of the
+SoC. The `arm_executor_runner.cpp` application built with the `examples/arm/run.sh` shell script, together with the corresponding linker scripts, is intended to serve as
+an example implementation of the correct placement of the various objects in memory.
+
+It's also worth mentioning that in the AoT Python flow, by default the input to the pte file is in FP32. Therefore, the pte file contains a Quantize node, an Ethos-U custom delegate and a Dequantize node.
+Sometimes, you may want to feed quantized input to the Ethos-U custom delegate straight away, for example if you have a camera input outputting RGB data in (u)int8. You can apply the `QuantizeInputs` and
+`QuantizeOutputs` passes in the AoT flow for that purpose. Here is a snippet showing how to achieve it:
+```
+from executorch.exir import ExecutorchBackendConfig
+from executorch.exir.passes.quantize_io_pass import QuantizeInputs
+from executorch.exir.passes.quantize_io_pass import QuantizeOutputs
+
+edge_program_manager = to_edge_transform_and_lower(...)
+# Apply the QuantizeInputs & QuantizeOutputs passes to input & output tensor 0
+edge_program_manager = edge_program_manager.transform(
+    passes=[QuantizeInputs(edge_program_manager, [0]),
+            QuantizeOutputs(edge_program_manager, [0])])
+# Convert edge program to executorch
+executorch_program_manager = edge_program_manager.to_executorch(
+    config=ExecutorchBackendConfig(extract_delegate_segments=False)
+)
+```
+If you apply the `QuantizeInputs` pass in the AoT flow, when you populate the input tensor in the runtime application logic, you need to use int8 values and not FP32. In the `examples/arm/executor_runner/arm_executor_runner.cpp` application, you can see
+how we populate the input tensor depending on its data type.
+
+## Conclusion
+The ExecuTorch project already provides an Arm Ethos-U backend in `executorch/backends/arm/runtime/` that you can reuse as is.
+The key steps to bring up a new platform are to reuse the Ethos-U driver
+and to ensure that the NPU base address and interrupt assignment match your SoC. The `examples/arm/executor_runner/arm_executor_runner.cpp` is an example application running on the Corstone platforms.
+For the `arm_executor_runner.cpp` application, we rely on the NPU initialization done in the core-platform project and we integrate core-platform in the
+`backends/arm/scripts/corstone_utils.cmake` script. Then, we inherit the core-platform integration of the Ethos-U driver and the CMSIS boot code for the Cortex-M core.
\ No newline at end of file