From 2762d8dadb90dd30b870996053dbf13ee0e04d46 Mon Sep 17 00:00:00 2001 From: davidberenstein1957 Date: Tue, 10 Jun 2025 15:57:13 +0200 Subject: [PATCH 1/9] Add Pruna optimization framework documentation - Introduced a new section for Pruna in the table of contents. - Added comprehensive documentation for Pruna, detailing its optimization techniques, installation instructions, and examples for optimizing and evaluating models --- docs/source/en/_toctree.yml | 2 + docs/source/en/optimization/pruna.md | 172 +++++++++++++++++++++++++++ 2 files changed, 174 insertions(+) create mode 100644 docs/source/en/optimization/pruna.md diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index f13b7d54aec4..b44c122030c2 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -180,6 +180,8 @@ title: Caching - local: optimization/memory title: Reduce memory usage + - local: optimization/pruna + title: Pruna - local: optimization/xformers title: xFormers - local: optimization/tome diff --git a/docs/source/en/optimization/pruna.md b/docs/source/en/optimization/pruna.md new file mode 100644 index 000000000000..84ad22b54f23 --- /dev/null +++ b/docs/source/en/optimization/pruna.md @@ -0,0 +1,172 @@ +# Pruna + +[Pruna](https://github.com/pruna-ai/pruna) is a powerful model optimization framework that helps you unlock maximum performance from your AI models. With Pruna, you can dramatically accelerate inference speeds, reduce memory usage, and optimize model efficiency, all while maintaining a similar output quality. + +Pruna provides a comprehensive suite of cutting-edge optimization algorithms, each carefully designed to address specific performance bottlenecks. From quantization and pruning to advanced caching and compilation techniques, Pruna gives you the tools to fine-tune your models for optimal performance. A general overview of the optimization methods supported by Pruna is shown as follows. + +| Technique | Description | Speed | Memory | Quality | +|--------------|-----------------------------------------------------------------------------------------------|:-----:|:------:|:-------:| +| `batcher` | Groups multiple inputs together to be processed simultaneously, improving computational efficiency and reducing processing time. | ✅ | ❌ | ➖ | +| `cacher` | Stores intermediate results of computations to speed up subsequent operations. | ✅ | ➖ | ➖ | +| `compiler` | Optimises the model with instructions for specific hardware. | ✅ | ➖ | ➖ | +| `distiller` | Trains a smaller, simpler model to mimic a larger, more complex model. | ✅ | ✅ | ❌ | +| `quantizer` | Reduces the precision of weights and activations, lowering memory requirements. | ✅ | ✅ | ❌ | +| `pruner` | Removes less important or redundant connections and neurons, resulting in a sparser, more efficient network. | ✅ | ✅ | ❌ | +| `recoverer` | Restores the performance of a model after compression. | ➖ | ➖ | ✅ | +| `factorizer` | Factorization batches several small matrix multiplications into one large fused operation. | ✅ | ➖ | ➖ | +| `enhancer` | Enhances the model output by applying post-processing algorithms such as denoising or upscaling. | ❌ | - | ✅ | + +✅ (improves), ➖ (approx. the same), ❌ (worsens) + +Explore the full range of optimization methods in [the Pruna documentation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html#configure-algorithms). 
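The technique names in the table double as `SmashConfig` keys — the FLUX.1-dev example later on this page sets `factorizer`, `compiler`, `cacher`, and `quantizer` exactly this way. As a minimal, illustrative sketch (the algorithm names are taken from that example, not a recommended default), selecting a few categories looks like this:

```python
from pruna import SmashConfig

# each key below is one technique category from the table above;
# the algorithm names are the same ones used in the FLUX.1-dev example later on this page
smash_config = SmashConfig()
smash_config["cacher"] = "fora"             # cacher: reuse intermediate results across diffusion steps
smash_config["compiler"] = "torch_compile"  # compiler: compile the model for the target hardware
smash_config["quantizer"] = "torchao"       # quantizer: lower weight/activation precision
```

The valid algorithm names for each category are listed in the configuration documentation linked above.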
+ +You can install Pruna using the following command: + +```bash +pip install pruna +``` + +Now that you have installed Pruna, you can start to use it to optimize your models. Let's start with optimizing a model. + +## Optimizing models + +After that you can easily optimize any diffusers model by defining a simple `SmashConfig`, which holds the configuration for the optimization. + +For diffusers models, we support a broad range of optimization algorithms. The overview of the supported optimization algorithms is shown as follows. + +
+ +
+ +Let's take a look at an example on how to optimize [black-forest-labs/FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) with Pruna. + +
+ +
+ +This combination accelerates inference by up to 4.2× and cuts peak GPU memory usage from 34.7 GB to 28.0 GB, all while maintaining virtually the same output quality. If you want to learn more about the optimization techniques used in this example, you can have a look at [the Pruna documentation on optimization](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html). + +```diff +import torch +from diffusers import FluxPipeline + +from pruna import PrunaModel, SmashConfig, smash + +# load the model +# Try segmind/Segmind-Vega or black-forest-labs/FLUX.1-schnell with a small GPU memory +pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.bfloat16 +).to("cuda") + +# define the configuration +smash_config = SmashConfig() +smash_config["factorizer"] = "qkv_diffusers" +smash_config["compiler"] = "torch_compile" +smash_config["torch_compile_target"] = "module_list" +smash_config["cacher"] = "fora" +smash_config["fora_interval"] = 2 + +# for the best results in terms of speed you can add these configs +# however they will increase your warmup time from 1.5 min to 10 min +# smash_config["torch_compile_mode"] = "max-autotune-no-cudagraphs" +# smash_config["quantizer"] = "torchao" +# smash_config["torchao_quant_type"] = "fp8dq" +# smash_config["torchao_excluded_modules"] = "norm+embedding" + +# optimize the model +smashed_pipe = smash(pipe, smash_config) + +# run the model +smashed_pipe("a knitted purple prune").images[0] + +# save the model +smashed_pipe.save_to_hub("/FLUX.1-dev-smashed") + +# load the model +smashed_pipe = PrunaModel.from_hub("/FLUX.1-dev-smashed") +``` + +The resulting generated image is shown as follows. + +
+ +
+ +As you can see, Pruna is a very simple and easy to use framework that allows you to optimize your models with minimal effort. We also saw the results look good to the naked eye but the cool thing is that you can also use Pruna to benchmark and evaluate your optimized models. + +## Evaluating and benchmarking optimized models + +Pruna provides a simple way to evaluate the quality of your optimized models. You can use the `EvaluationAgent` to evaluate the quality of your optimized models. If you want to learn more about the evaluation of optimized models, you can have a look at [the Pruna documentation on evaluation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/evaluate.html). + +Let's take a look at an example on how to evaluate the quality of the optimized model. + +```python +import torch +from diffusers import FluxPipeline + +from pruna import PrunaModel +from pruna.data.pruna_datamodule import PrunaDataModule +from pruna.evaluation.evaluation_agent import EvaluationAgent +from pruna.evaluation.metrics import ( + ThroughputMetric, + TorchMetricWrapper, + TotalTimeMetric, +) +from pruna.evaluation.task import Task + +# define the device +device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu" + +# load the model +# Try PrunaAI/Segmind-Vega-smashed or PrunaAI/FLUX.1-dev-smashed with a small GPU memory +pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.bfloat16 +).to("cpu") +smashed_pipe = PrunaModel.from_hub("PrunaAI/FLUX.1-dev-smashed") + +# Define the metrics +metrics = [ + TotalTimeMetric(n_iterations=20, n_warmup_iterations=5), + ThroughputMetric(n_iterations=20, n_warmup_iterations=5), + TorchMetricWrapper("clip"), +] + +# Define the datamodule +datamodule = PrunaDataModule.from_string("LAION256") +datamodule.limit_datasets(10) + +# Define the task and evaluation agent +task = Task(metrics, datamodule=datamodule, device=device) +eval_agent = EvaluationAgent(task) + +# Evaluate base model and offload it to CPU +wrapped_pipe = PrunaModel(model=pipe) +wrapped_pipe.move_to_device(device) +base_model_results = eval_agent.evaluate(wrapped_pipe) +wrapped_pipe.move_to_device("cpu") + +# Evaluate smashed model and offload it to CPU +smashed_pipe.move_to_device(device) +smashed_model_results = eval_agent.evaluate(smashed_pipe) +smashed_pipe.move_to_device("cpu") +``` + +Now that you have seen how to optimize and evaluate your models, you can start to use Pruna to optimize your own models. Luckily, we have a lot of examples for you to get started. + +## Supported models + +Pruna aims to support a wide range of diffusers models and even supports different modalities, like text, image, audio, video, and Pruna are constantly expanding its support. An overview of some great combinations of models and modalities that have been succesfully optimized can be found on [the Pruna tutorial page](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html). Finally, a good thing is that Pruna also support `transformers` models. 
+ +## Reference + +[Pruna](https://github.com/pruna-ai/pruna) + +[Pruna documentation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/index.html) + +[Pruna tutorial page](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html) + +[Pruna documentation on optimization](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html#configure-algorithms) + +[Pruna documentation on evaluation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/evaluate.html) From 19e1226ef1ebeb15850480e12216818b85597df0 Mon Sep 17 00:00:00 2001 From: davidberenstein1957 Date: Tue, 10 Jun 2025 16:00:08 +0200 Subject: [PATCH 2/9] Enhance Pruna documentation with image alt text and code block formatting - Added alt text to images for better accessibility and context. - Changed code block syntax from diff to python for improved clarity. --- docs/source/en/optimization/pruna.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/en/optimization/pruna.md b/docs/source/en/optimization/pruna.md index 84ad22b54f23..5481720d284d 100644 --- a/docs/source/en/optimization/pruna.md +++ b/docs/source/en/optimization/pruna.md @@ -35,18 +35,18 @@ After that you can easily optimize any diffusers model by defining a simple `Sma For diffusers models, we support a broad range of optimization algorithms. The overview of the supported optimization algorithms is shown as follows.
- + Overview of the supported optimization algorithms for diffusers models
Let's take a look at an example on how to optimize [black-forest-labs/FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) with Pruna.
- + Optimization techniques used for FLUX.1-dev showing the combination of factorizer, compiler, and cacher algorithms
This combination accelerates inference by up to 4.2× and cuts peak GPU memory usage from 34.7 GB to 28.0 GB, all while maintaining virtually the same output quality. If you want to learn more about the optimization techniques used in this example, you can have a look at [the Pruna documentation on optimization](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html). -```diff +```python import torch from diffusers import FluxPipeline From 323afe1844a343331853c6ca6f6e5bc23eba0772 Mon Sep 17 00:00:00 2001 From: davidberenstein1957 Date: Tue, 10 Jun 2025 16:12:10 +0200 Subject: [PATCH 3/9] Add installation section to Pruna documentation - Introduced a new installation section in the Pruna documentation to guide users on how to install the framework. - Enhanced the overall clarity and usability of the documentation for new users. --- docs/source/en/optimization/pruna.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/docs/source/en/optimization/pruna.md b/docs/source/en/optimization/pruna.md index 5481720d284d..0981398c0c30 100644 --- a/docs/source/en/optimization/pruna.md +++ b/docs/source/en/optimization/pruna.md @@ -20,6 +20,8 @@ Pruna provides a comprehensive suite of cutting-edge optimization algorithms, ea Explore the full range of optimization methods in [the Pruna documentation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html#configure-algorithms). +## Installation + You can install Pruna using the following command: ```bash From 703e3dd40fa3daa4f16a5f001bae5be0e579f874 Mon Sep 17 00:00:00 2001 From: David Berenstein Date: Wed, 11 Jun 2025 05:53:25 +0200 Subject: [PATCH 4/9] Update pruna.md --- docs/source/en/optimization/pruna.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/docs/source/en/optimization/pruna.md b/docs/source/en/optimization/pruna.md index 0981398c0c30..a5c4889c1907 100644 --- a/docs/source/en/optimization/pruna.md +++ b/docs/source/en/optimization/pruna.md @@ -37,13 +37,13 @@ After that you can easily optimize any diffusers model by defining a simple `Sma For diffusers models, we support a broad range of optimization algorithms. The overview of the supported optimization algorithms is shown as follows.
- Overview of the supported optimization algorithms for diffusers models + Overview of the supported optimization algorithms for diffusers models
Let's take a look at an example on how to optimize [black-forest-labs/FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) with Pruna.
- Optimization techniques used for FLUX.1-dev showing the combination of factorizer, compiler, and cacher algorithms + Optimization techniques used for FLUX.1-dev showing the combination of factorizer, compiler, and cacher algorithms
This combination accelerates inference by up to 4.2× and cuts peak GPU memory usage from 34.7 GB to 28.0 GB, all while maintaining virtually the same output quality. If you want to learn more about the optimization techniques used in this example, you can have a look at [the Pruna documentation on optimization](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html). From f68941d1904eda9c4c244ae779a3535cfe197b02 Mon Sep 17 00:00:00 2001 From: David Berenstein Date: Wed, 11 Jun 2025 06:19:52 +0200 Subject: [PATCH 5/9] Update pruna.md --- docs/source/en/optimization/pruna.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/en/optimization/pruna.md b/docs/source/en/optimization/pruna.md index a5c4889c1907..2ee185118dcd 100644 --- a/docs/source/en/optimization/pruna.md +++ b/docs/source/en/optimization/pruna.md @@ -126,6 +126,7 @@ pipe = FluxPipeline.from_pretrained( "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16 ).to("cpu") +wrapped_pipe = PrunaModel(model=pipe) smashed_pipe = PrunaModel.from_hub("PrunaAI/FLUX.1-dev-smashed") # Define the metrics @@ -144,7 +145,6 @@ task = Task(metrics, datamodule=datamodule, device=device) eval_agent = EvaluationAgent(task) # Evaluate base model and offload it to CPU -wrapped_pipe = PrunaModel(model=pipe) wrapped_pipe.move_to_device(device) base_model_results = eval_agent.evaluate(wrapped_pipe) wrapped_pipe.move_to_device("cpu") @@ -155,11 +155,11 @@ smashed_model_results = eval_agent.evaluate(smashed_pipe) smashed_pipe.move_to_device("cpu") ``` -Now that you have seen how to optimize and evaluate your models, you can start to use Pruna to optimize your own models. Luckily, we have a lot of examples for you to get started. +Now that you have seen how to optimize and evaluate your models, you can start using Pruna to optimize your own models. Luckily, we have many examples to help you get started. ## Supported models -Pruna aims to support a wide range of diffusers models and even supports different modalities, like text, image, audio, video, and Pruna are constantly expanding its support. An overview of some great combinations of models and modalities that have been succesfully optimized can be found on [the Pruna tutorial page](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html). Finally, a good thing is that Pruna also support `transformers` models. +Pruna aims to support a wide range of diffusers models and even supports different modalities, like text, image, audio, video, and Pruna is constantly expanding its support. An overview of some great combinations of models and modalities that have been succesfully optimized can be found on [the Pruna tutorial page](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html). Finally, a good thing is that Pruna also support `transformers` models. ## Reference From e44e1096b4441a57505516e3f8903724cbc3159d Mon Sep 17 00:00:00 2001 From: davidberenstein1957 Date: Wed, 11 Jun 2025 06:36:11 +0200 Subject: [PATCH 6/9] Update Pruna documentation for model optimization and evaluation - Changed section titles for consistency and clarity, from "Optimizing models" to "Optimize models" and "Evaluating and benchmarking optimized models" to "Evaluate and benchmark models". - Enhanced descriptions to clarify the use of `diffusers` models and the evaluation process. - Added a new example for evaluating standalone `diffusers` models. - Updated references and links for better navigation within the documentation. 
--- docs/source/en/optimization/pruna.md | 80 ++++++++++++++++++++++------ 1 file changed, 65 insertions(+), 15 deletions(-) diff --git a/docs/source/en/optimization/pruna.md b/docs/source/en/optimization/pruna.md index 2ee185118dcd..5ab0e92fc887 100644 --- a/docs/source/en/optimization/pruna.md +++ b/docs/source/en/optimization/pruna.md @@ -30,11 +30,11 @@ pip install pruna Now that you have installed Pruna, you can start to use it to optimize your models. Let's start with optimizing a model. -## Optimizing models +## Optimize diffusers models -After that you can easily optimize any diffusers model by defining a simple `SmashConfig`, which holds the configuration for the optimization. +After that you can easily optimize any `diffusers` model by defining a simple `SmashConfig`, which holds the configuration for the optimization. -For diffusers models, we support a broad range of optimization algorithms. The overview of the supported optimization algorithms is shown as follows. +For `diffusers` models, we support a broad range of optimization algorithms. The overview of the supported optimization algorithms is shown as follows.
Overview of the supported optimization algorithms for diffusers models @@ -89,15 +89,17 @@ smashed_pipe.save_to_hub("/FLUX.1-dev-smashed") smashed_pipe = PrunaModel.from_hub("/FLUX.1-dev-smashed") ``` -The resulting generated image is shown as follows. +The resulting generated image and inference per optimization configuration are shown as follows.
-As you can see, Pruna is a very simple and easy to use framework that allows you to optimize your models with minimal effort. We also saw the results look good to the naked eye but the cool thing is that you can also use Pruna to benchmark and evaluate your optimized models. +Besides the results shown above, we have also used Pruna to create [FLUX-juiced, the fastest image generation endpoint alive](https://www.pruna.ai/blog/flux-juiced-the-fastest-image-generation-endpoint). We benchmarked our model against, FLUX.1-dev versions provided by different inference frameworks and surpassed them all. Full results of this benchmark can be found in [our blog post](https://huggingface.co/blog/PrunaAI/flux-fastest-image-generation-endpoint) and [our InferBench space](https://huggingface.co/spaces/PrunaAI/InferBench). -## Evaluating and benchmarking optimized models +As you can see, Pruna is a very simple and easy to use framework that allows you to optimize your models with minimal effort. We already saw that the results look good to the naked eye but the cool thing is that you can also use Pruna to benchmark and evaluate your optimized models. + +## Evaluate and benchmark diffusers models Pruna provides a simple way to evaluate the quality of your optimized models. You can use the `EvaluationAgent` to evaluate the quality of your optimized models. If you want to learn more about the evaluation of optimized models, you can have a look at [the Pruna documentation on evaluation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/evaluate.html). @@ -155,20 +157,68 @@ smashed_model_results = eval_agent.evaluate(smashed_pipe) smashed_pipe.move_to_device("cpu") ``` -Now that you have seen how to optimize and evaluate your models, you can start using Pruna to optimize your own models. Luckily, we have many examples to help you get started. +### Evaluate and benchmark standalone diffusers models -## Supported models +Instead of comparing the optimized model to the base model, you can also evaluate the standalone `diffusers` model. This is useful if you want to evaluate the performance of the model without the optimization. We can do so by using the `PrunaModel` wrapper. -Pruna aims to support a wide range of diffusers models and even supports different modalities, like text, image, audio, video, and Pruna is constantly expanding its support. An overview of some great combinations of models and modalities that have been succesfully optimized can be found on [the Pruna tutorial page](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html). Finally, a good thing is that Pruna also support `transformers` models. +Let's take a look at an example on how to evaluate and benchmark a standalone `diffusers` model. 
-## Reference +```python +import torch +from diffusers import FluxPipeline -[Pruna](https://github.com/pruna-ai/pruna) +from pruna import PrunaModel +from pruna.data.pruna_datamodule import PrunaDataModule +from pruna.evaluation.evaluation_agent import EvaluationAgent +from pruna.evaluation.metrics import ( + ThroughputMetric, + TorchMetricWrapper, + TotalTimeMetric, +) +from pruna.evaluation.task import Task -[Pruna documentation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/index.html) +# define the device +device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu" -[Pruna tutorial page](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html) +# load the model +# Try PrunaAI/Segmind-Vega-smashed or PrunaAI/FLUX.1-dev-smashed with a small GPU memory +pipe = FluxPipeline.from_pretrained( + "black-forest-labs/FLUX.1-dev", + torch_dtype=torch.bfloat16 +).to("cpu") +wrapped_pipe = PrunaModel(model=pipe) + +# Define the metrics +metrics = [ + TotalTimeMetric(n_iterations=20, n_warmup_iterations=5), + ThroughputMetric(n_iterations=20, n_warmup_iterations=5), + TorchMetricWrapper("clip"), +] + +# Define the datamodule +datamodule = PrunaDataModule.from_string("LAION256") +datamodule.limit_datasets(10) + +# Define the task and evaluation agent +task = Task(metrics, datamodule=datamodule, device=device) +eval_agent = EvaluationAgent(task) + +# Evaluate base model and offload it to CPU +wrapped_pipe.move_to_device(device) +base_model_results = eval_agent.evaluate(wrapped_pipe) +wrapped_pipe.move_to_device("cpu") +``` + +Now that you have seen how to optimize and evaluate your models, you can start using Pruna to optimize your own models. Luckily, we have many examples to help you get started. + +## Supported models + +Pruna aims to support a wide range of `diffusers` models and even supports different modalities, like text, image, audio, video, and Pruna is constantly expanding its support. An overview of some great combinations of models and modalities that have been succesfully optimized can be found on [the Pruna tutorial page](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html). Finally, a good thing is that Pruna also support `transformers` models. + +## Reference -[Pruna documentation on optimization](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html#configure-algorithms) +- [Pruna](https://github.com/pruna-ai/pruna) +- [Pruna optimization](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html#configure-algorithms) +- [Pruna evaluation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/evaluate.html) +- [Pruna tutorials](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html) -[Pruna documentation on evaluation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/evaluate.html) From 6ea26446c889215b0196997d0fc73311f37f3cfd Mon Sep 17 00:00:00 2001 From: davidberenstein1957 Date: Wed, 11 Jun 2025 07:10:21 +0200 Subject: [PATCH 7/9] Refactor Pruna documentation for clarity and consistency - Removed outdated references to FLUX-juiced and streamlined the explanation of benchmarking. - Enhanced the description of evaluating standalone `diffusers` models. - Cleaned up code examples by removing unnecessary imports and comments for better readability. 
--- docs/source/en/optimization/pruna.md | 37 +++------------------------- 1 file changed, 3 insertions(+), 34 deletions(-) diff --git a/docs/source/en/optimization/pruna.md b/docs/source/en/optimization/pruna.md index 5ab0e92fc887..db2e61555094 100644 --- a/docs/source/en/optimization/pruna.md +++ b/docs/source/en/optimization/pruna.md @@ -95,8 +95,6 @@ The resulting generated image and inference per optimization configuration are s
-Besides the results shown above, we have also used Pruna to create [FLUX-juiced, the fastest image generation endpoint alive](https://www.pruna.ai/blog/flux-juiced-the-fastest-image-generation-endpoint). We benchmarked our model against, FLUX.1-dev versions provided by different inference frameworks and surpassed them all. Full results of this benchmark can be found in [our blog post](https://huggingface.co/blog/PrunaAI/flux-fastest-image-generation-endpoint) and [our InferBench space](https://huggingface.co/spaces/PrunaAI/InferBench). - As you can see, Pruna is a very simple and easy to use framework that allows you to optimize your models with minimal effort. We already saw that the results look good to the naked eye but the cool thing is that you can also use Pruna to benchmark and evaluate your optimized models. ## Evaluate and benchmark diffusers models @@ -155,9 +155,11 @@ smashed_model_results = eval_agent.evaluate(smashed_pipe) smashed_pipe.move_to_device("cpu") ``` +Besides the results we can get from the `EvaluationAgent` above, we have also used a similar approach to create and benchmark [FLUX-juiced, the fastest image generation endpoint alive](https://www.pruna.ai/blog/flux-juiced-the-fastest-image-generation-endpoint). We benchmarked our model against the FLUX.1-dev versions provided by different inference frameworks and surpassed them all. Full results of this benchmark can be found in [our blog post](https://huggingface.co/blog/PrunaAI/flux-fastest-image-generation-endpoint) and [our InferBench space](https://huggingface.co/spaces/PrunaAI/InferBench). + ### Evaluate and benchmark standalone diffusers models -Instead of comparing the optimized model to the base model, you can also evaluate the standalone `diffusers` model. This is useful if you want to evaluate the performance of the model without the optimization. We can do so by using the `PrunaModel` wrapper. +Instead of comparing the optimized model to the base model, you can also evaluate the standalone `diffusers` model. This is useful if you want to evaluate the performance of the model without the optimization. We can do so by using the `PrunaModel` wrapper and running the `EvaluationAgent` on it. Let's take a look at an example on how to evaluate and benchmark a standalone `diffusers` model. 
@@ -168,17 +168,6 @@ import torch from diffusers import FluxPipeline from pruna import PrunaModel -from pruna.data.pruna_datamodule import PrunaDataModule -from pruna.evaluation.evaluation_agent import EvaluationAgent -from pruna.evaluation.metrics import ( - ThroughputMetric, - TorchMetricWrapper, - TotalTimeMetric, -) -from pruna.evaluation.task import Task - -# define the device -device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is_available() else "cpu" # load the model # Try PrunaAI/Segmind-Vega-smashed or PrunaAI/FLUX.1-dev-smashed with a small GPU memory @@ -187,26 +176,6 @@ pipe = FluxPipeline.from_pretrained( torch_dtype=torch.bfloat16 ).to("cpu") wrapped_pipe = PrunaModel(model=pipe) - -# Define the metrics -metrics = [ - TotalTimeMetric(n_iterations=20, n_warmup_iterations=5), - ThroughputMetric(n_iterations=20, n_warmup_iterations=5), - TorchMetricWrapper("clip"), -] - -# Define the datamodule -datamodule = PrunaDataModule.from_string("LAION256") -datamodule.limit_datasets(10) - -# Define the task and evaluation agent -task = Task(metrics, datamodule=datamodule, device=device) -eval_agent = EvaluationAgent(task) - -# Evaluate base model and offload it to CPU -wrapped_pipe.move_to_device(device) -base_model_results = eval_agent.evaluate(wrapped_pipe) -wrapped_pipe.move_to_device("cpu") ``` Now that you have seen how to optimize and evaluate your models, you can start using Pruna to optimize your own models. Luckily, we have many examples to help you get started. From f6aeaad8dac30ca9124457ac48d23925459441d9 Mon Sep 17 00:00:00 2001 From: David Berenstein Date: Thu, 12 Jun 2025 09:52:40 +0200 Subject: [PATCH 8/9] Apply suggestions from code review Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> --- docs/source/en/optimization/pruna.md | 38 +++++++++++++--------------- 1 file changed, 18 insertions(+), 20 deletions(-) diff --git a/docs/source/en/optimization/pruna.md b/docs/source/en/optimization/pruna.md index db2e61555094..b4aa610d21f2 100644 --- a/docs/source/en/optimization/pruna.md +++ b/docs/source/en/optimization/pruna.md @@ -1,8 +1,7 @@ # Pruna -[Pruna](https://github.com/pruna-ai/pruna) is a powerful model optimization framework that helps you unlock maximum performance from your AI models. With Pruna, you can dramatically accelerate inference speeds, reduce memory usage, and optimize model efficiency, all while maintaining a similar output quality. +[Pruna](https://github.com/PrunaAI/pruna) is a model optimization framework that offers various optimization methods - quantization, pruning, caching, compilation - for accelerating inference and reducing memory usage. A general overview of the optimization methods are shown below. -Pruna provides a comprehensive suite of cutting-edge optimization algorithms, each carefully designed to address specific performance bottlenecks. From quantization and pruning to advanced caching and compilation techniques, Pruna gives you the tools to fine-tune your models for optimal performance. A general overview of the optimization methods supported by Pruna is shown as follows. | Technique | Description | Speed | Memory | Quality | |--------------|-----------------------------------------------------------------------------------------------|:-----:|:------:|:-------:| @@ -18,35 +17,36 @@ Pruna provides a comprehensive suite of cutting-edge optimization algorithms, ea ✅ (improves), ➖ (approx. 
the same), ❌ (worsens) -Explore the full range of optimization methods in [the Pruna documentation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html#configure-algorithms). +Explore the full range of optimization methods in the [Pruna documentation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html#configure-algorithms). ## Installation -You can install Pruna using the following command: +Install Pruna with the following command. ```bash pip install pruna ``` -Now that you have installed Pruna, you can start to use it to optimize your models. Let's start with optimizing a model. -## Optimize diffusers models +## Optimize Diffusers models -After that you can easily optimize any `diffusers` model by defining a simple `SmashConfig`, which holds the configuration for the optimization. - -For `diffusers` models, we support a broad range of optimization algorithms. The overview of the supported optimization algorithms is shown as follows. +A broad range of optimization algorithms are supported for Diffusers models as shown below.
Overview of the supported optimization algorithms for diffusers models
-Let's take a look at an example on how to optimize [black-forest-labs/FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) with Pruna. +The example below optimizes [black-forest-labs/FLUX.1-dev](https://huggingface.co/black-forest-labs/FLUX.1-dev) +with a combination of factorizer, compiler, and cacher algorithms. This combination accelerates inference by up to 4.2x and cuts peak GPU memory usage from 34.7GB to 28.0GB, all while maintaining virtually the same output quality. + +> [!TIP] +> Refer to the [Pruna optimization](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html) docs to learn more about the optimization techniques used in this example.
Optimization techniques used for FLUX.1-dev showing the combination of factorizer, compiler, and cacher algorithms
-This combination accelerates inference by up to 4.2× and cuts peak GPU memory usage from 34.7 GB to 28.0 GB, all while maintaining virtually the same output quality. If you want to learn more about the optimization techniques used in this example, you can have a look at [the Pruna documentation on optimization](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/configure.html). +Start by defining a `SmashConfig` with the optimization algorithms to use. To optimize the model, wrap the pipeline and the `SmashConfig` with `smash` and then use the pipeline as normal for inference. ```python import torch @@ -89,19 +89,19 @@ smashed_pipe.save_to_hub("/FLUX.1-dev-smashed") smashed_pipe = PrunaModel.from_hub("/FLUX.1-dev-smashed") ``` -The resulting generated image and inference per optimization configuration are shown as follows.
-As you can see, Pruna is a very simple and easy to use framework that allows you to optimize your models with minimal effort. We already saw that the results look good to the naked eye but the cool thing is that you can also use Pruna to benchmark and evaluate your optimized models. -## Evaluate and benchmark diffusers models +## Evaluate and benchmark Diffusers models -Pruna provides a simple way to evaluate the quality of your optimized models. You can use the `EvaluationAgent` to evaluate the quality of your optimized models. If you want to learn more about the evaluation of optimized models, you can have a look at [the Pruna documentation on evaluation](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/evaluate.html). +Pruna provides the [EvaluationAgent](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/evaluate.html) to evaluate the quality of your optimized models. -Let's take a look at an example on how to evaluate the quality of the optimized model. +Define the metrics, such as total time and throughput, and the dataset to evaluate on. Then pass them to `Task` to create a task and pass it to the `EvaluationAgent`. + +Call `evaluate` on the pipeline to execute the task passed to the `EvaluationAgent`. ```python import torch @@ -155,7 +155,8 @@ smashed_model_results = eval_agent.evaluate(smashed_pipe) smashed_pipe.move_to_device("cpu") ``` -Besides the results we can get from the `EvaluationAgent` above, we have also used a similar approach to create and benchmark [FLUX-juiced, the fastest image generation endpoint alive](https://www.pruna.ai/blog/flux-juiced-the-fastest-image-generation-endpoint). We benchmarked our model against, FLUX.1-dev versions provided by different inference frameworks and surpassed them all. Full results of this benchmark can be found in [our blog post](https://huggingface.co/blog/PrunaAI/flux-fastest-image-generation-endpoint) and [our InferBench space](https://huggingface.co/spaces/PrunaAI/InferBench). +> [!TIP] +> For more details about benchmarking Flux, check out the [Announcing FLUX-Juiced: The Fastest Image Generation Endpoint (2.6 times faster)!](https://huggingface.co/blog/PrunaAI/flux-fastest-image-generation-endpoint) blog post and the [InferBench](https://huggingface.co/spaces/PrunaAI/InferBench) Space. ### Evaluate and benchmark standalone diffusers models @@ -180,9 +181,6 @@ wrapped_pipe = PrunaModel(model=pipe) Now that you have seen how to optimize and evaluate your models, you can start using Pruna to optimize your own models. Luckily, we have many examples to help you get started. -## Supported models - -Pruna aims to support a wide range of `diffusers` models and even supports different modalities, like text, image, audio, video, and Pruna is constantly expanding its support. An overview of some great combinations of models and modalities that have been succesfully optimized can be found on [the Pruna tutorial page](https://docs.pruna.ai/en/stable/docs_pruna/tutorials/index.html). Finally, a good thing is that Pruna also support `transformers` models. ## Reference From b680f1612df3f9935a9ec48419e7f49972304de7 Mon Sep 17 00:00:00 2001 From: davidberenstein1957 Date: Thu, 12 Jun 2025 10:09:01 +0200 Subject: [PATCH 9/9] Enhance Pruna documentation with new examples and clarifications - Added an image to illustrate the optimization process. - Updated the explanation for sharing and loading optimized models on the Hugging Face Hub. - Clarified the evaluation process for optimized models using the EvaluationAgent. 
- Improved descriptions for defining metrics and evaluating standalone diffusers models. --- docs/source/en/optimization/pruna.md | 46 +++++++++++++--------------- 1 file changed, 21 insertions(+), 25 deletions(-) diff --git a/docs/source/en/optimization/pruna.md b/docs/source/en/optimization/pruna.md index b4aa610d21f2..56c1f3af5957 100644 --- a/docs/source/en/optimization/pruna.md +++ b/docs/source/en/optimization/pruna.md @@ -81,7 +81,15 @@ smashed_pipe = smash(pipe, smash_config) # run the model smashed_pipe("a knitted purple prune").images[0] +``` +
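If you want a quick sanity check of the speedup on your own hardware, a plain timing loop over the optimized pipeline is enough. The sketch below is hypothetical (the prompt and iteration counts are arbitrary) and uses nothing beyond the `smashed_pipe` defined above and the standard library; exact numbers depend on your GPU, and the first calls are slower because compilation and caching happen during warmup.

```python
import time

# warmup: compilation and cache setup happen during the first calls
for _ in range(2):
    smashed_pipe("a knitted purple prune")

# rough timing over a handful of runs
n_runs = 5
start = time.perf_counter()
for _ in range(n_runs):
    smashed_pipe("a knitted purple prune")
print(f"average latency: {(time.perf_counter() - start) / n_runs:.2f} s")
```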
+ +
+ +After optimization, we can share and load the optimized model using the Hugging Face Hub. + +```python # save the model smashed_pipe.save_to_hub("/FLUX.1-dev-smashed") @@ -89,19 +97,16 @@ smashed_pipe.save_to_hub("/FLUX.1-dev-smashed") smashed_pipe = PrunaModel.from_hub("/FLUX.1-dev-smashed") ``` - -
- -
- - ## Evaluate and benchmark Diffusers models Pruna provides the [EvaluationAgent](https://docs.pruna.ai/en/stable/docs_pruna/user_manual/evaluate.html) to evaluate the quality of your optimized models. -Define the metrics, such as total time and throughput, and the dataset to evaluate on. Then pass them to `Task` to create a task and pass it to the `EvaluationAgent`. +We can metrics we care about, such as total time and throughput, and the dataset to evaluate on. We can define a model and pass it to the `EvaluationAgent`. + + + -Call `evaluate` on the pipeline to execute the task passed to the `EvaluationAgent`. +We can load and evaluate an optimized model by using the `EvaluationAgent` and pass it to the `Task`. ```python import torch @@ -122,11 +127,6 @@ device = "cuda" if torch.cuda.is_available() else "mps" if torch.backends.mps.is # load the model # Try PrunaAI/Segmind-Vega-smashed or PrunaAI/FLUX.1-dev-smashed with a small GPU memory -pipe = FluxPipeline.from_pretrained( - "black-forest-labs/FLUX.1-dev", - torch_dtype=torch.bfloat16 -).to("cpu") -wrapped_pipe = PrunaModel(model=pipe) smashed_pipe = PrunaModel.from_hub("PrunaAI/FLUX.1-dev-smashed") # Define the metrics @@ -144,26 +144,17 @@ datamodule.limit_datasets(10) task = Task(metrics, datamodule=datamodule, device=device) eval_agent = EvaluationAgent(task) -# Evaluate base model and offload it to CPU -wrapped_pipe.move_to_device(device) -base_model_results = eval_agent.evaluate(wrapped_pipe) -wrapped_pipe.move_to_device("cpu") - # Evaluate smashed model and offload it to CPU smashed_pipe.move_to_device(device) -smashed_model_results = eval_agent.evaluate(smashed_pipe) +smashed_pipe_results = eval_agent.evaluate(smashed_pipe) smashed_pipe.move_to_device("cpu") ``` -> [!TIP] -> For more details about benchmarking Flux, check out the [Announcing FLUX-Juiced: The Fastest Image Generation Endpoint (2.6 times faster)!](https://huggingface.co/blog/PrunaAI/flux-fastest-image-generation-endpoint) blog post and the [InferBench](https://huggingface.co/spaces/PrunaAI/InferBench) Space. - -### Evaluate and benchmark standalone diffusers models + + Instead of comparing the optimized model to the base model, you can also evaluate the standalone `diffusers` model. This is useful if you want to evaluate the performance of the model without the optimization. We can do so by using the `PrunaModel` wrapper and run the `EvaluationAgent` on it. -Let's take a look at an example on how to evaluate and benchmark a standalone `diffusers` model. - ```python import torch from diffusers import FluxPipeline @@ -179,8 +170,13 @@ pipe = FluxPipeline.from_pretrained( wrapped_pipe = PrunaModel(model=pipe) ``` + + + Now that you have seen how to optimize and evaluate your models, you can start using Pruna to optimize your own models. Luckily, we have many examples to help you get started. +> [!TIP] +> For more details about benchmarking Flux, check out the [Announcing FLUX-Juiced: The Fastest Image Generation Endpoint (2.6 times faster)!](https://huggingface.co/blog/PrunaAI/flux-fastest-image-generation-endpoint) blog post and the [InferBench](https://huggingface.co/spaces/PrunaAI/InferBench) Space. ## Reference