diff --git a/docs/source/en/api/pipelines/marigold.md b/docs/source/en/api/pipelines/marigold.md
index 93ca39e77b9c..e9ca0df067ba 100644
--- a/docs/source/en/api/pipelines/marigold.md
+++ b/docs/source/en/api/pipelines/marigold.md
@@ -1,4 +1,6 @@
-
-# Marigold Pipelines for Computer Vision Tasks
+# Marigold Computer Vision

-Marigold was proposed in [Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation](https://huggingface.co/papers/2312.02145), a CVPR 2024 Oral paper by [Bingxin Ke](http://www.kebingxin.com/), [Anton Obukhov](https://www.obukhov.ai/), [Shengyu Huang](https://shengyuh.github.io/), [Nando Metzger](https://nandometzger.github.io/), [Rodrigo Caye Daudt](https://rcdaudt.github.io/), and [Konrad Schindler](https://scholar.google.com/citations?user=FZuNgqIAAAAJ&hl=en).
-The idea is to repurpose the rich generative prior of Text-to-Image Latent Diffusion Models (LDMs) for traditional computer vision tasks.
-Initially, this idea was explored to fine-tune Stable Diffusion for Monocular Depth Estimation, as shown in the teaser above.
-Later,
-- [Tianfu Wang](https://tianfwang.github.io/) trained the first Latent Consistency Model (LCM) of Marigold, which unlocked fast single-step inference;
-- [Kevin Qu](https://www.linkedin.com/in/kevin-qu-b3417621b/?locale=en_US) extended the approach to Surface Normals Estimation;
-- [Anton Obukhov](https://www.obukhov.ai/) contributed the pipelines and documentation into diffusers (enabled and supported by [YiYi Xu](https://yiyixuxu.github.io/) and [Sayak Paul](https://sayak.dev/)).
+Marigold was proposed in
+[Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation](https://huggingface.co/papers/2312.02145),
+a CVPR 2024 Oral paper by
+[Bingxin Ke](http://www.kebingxin.com/),
+[Anton Obukhov](https://www.obukhov.ai/),
+[Shengyu Huang](https://shengyuh.github.io/),
+[Nando Metzger](https://nandometzger.github.io/),
+[Rodrigo Caye Daudt](https://rcdaudt.github.io/), and
+[Konrad Schindler](https://scholar.google.com/citations?user=FZuNgqIAAAAJ&hl=en).
+The core idea is to **repurpose the generative prior of Text-to-Image Latent Diffusion Models (LDMs) for traditional
+computer vision tasks**.
+This approach was explored by fine-tuning Stable Diffusion for **Monocular Depth Estimation**, as demonstrated in the
+teaser above.
+
+Marigold was later extended in the follow-up paper,
+[Marigold: Affordable Adaptation of Diffusion-Based Image Generators for Image Analysis](https://huggingface.co/papers/2505.09358),
+authored by
+[Bingxin Ke](http://www.kebingxin.com/),
+[Kevin Qu](https://www.linkedin.com/in/kevin-qu-b3417621b/?locale=en_US),
+[Tianfu Wang](https://tianfwang.github.io/),
+[Nando Metzger](https://nandometzger.github.io/),
+[Shengyu Huang](https://shengyuh.github.io/),
+[Bo Li](https://www.linkedin.com/in/bobboli0202/),
+[Anton Obukhov](https://www.obukhov.ai/), and
+[Konrad Schindler](https://scholar.google.com/citations?user=FZuNgqIAAAAJ&hl=en).
+This work expanded Marigold to support new modalities such as **Surface Normals** and **Intrinsic Image Decomposition**
+(IID), introduced a training protocol for **Latent Consistency Models** (LCM), and demonstrated **High-Resolution** (HR)
+processing capability.
-The abstract from the paper is:
+
-*Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout, since their knowledge of the visual world is restricted by the data seen during training, and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases. Project page: https://marigoldmonodepth.github.io.*
+The early Marigold models (`v1-0` and earlier) were optimized for best results with at least 10 inference steps.
+LCM models were later developed to enable high-quality inference in just 1 to 4 steps.
+Marigold models `v1-1` and later use the DDIM scheduler to achieve optimal
+results in as few as 1 to 4 steps.
-## Available Pipelines
+
-Each pipeline supports one Computer Vision task, which takes an input RGB image as input and produces a *prediction* of the modality of interest, such as a depth map of the input image.
-Currently, the following tasks are implemented:
+## Available Pipelines
-| Pipeline | Predicted Modalities | Demos |
-|---------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------:|
-| [MarigoldDepthPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py) | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-lcm), [Slow Original Demo (DDIM)](https://huggingface.co/spaces/prs-eth/marigold) |
-| [MarigoldNormalsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-normals-lcm) |
+Each pipeline is tailored for a specific computer vision task, processing an input RGB image and generating a
+corresponding prediction.
+Currently, the following computer vision tasks are implemented:
+
+| Pipeline | Recommended Model Checkpoints | Spaces (Interactive Apps) | Predicted Modalities |
+|---------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [MarigoldDepthPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py) | [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1) | [Depth Estimation](https://huggingface.co/spaces/prs-eth/marigold) | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity) |
+| [MarigoldNormalsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py) | [prs-eth/marigold-normals-v1-1](https://huggingface.co/prs-eth/marigold-normals-v1-1) | [Surface Normals Estimation](https://huggingface.co/spaces/prs-eth/marigold-normals) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping) |
+| [MarigoldIntrinsicsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py) | [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1),<br>[prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1) | [Intrinsic Image Decomposition](https://huggingface.co/spaces/prs-eth/marigold-iid) | [Albedo](https://en.wikipedia.org/wiki/Albedo), [Materials](https://www.n.aiq3d.com/wiki/roughnessmetalnessao-map), [Lighting](https://en.wikipedia.org/wiki/Diffuse_reflection) |
## Available Checkpoints
-The original checkpoints can be found under the [PRS-ETH](https://huggingface.co/prs-eth/) Hugging Face organization.
+All original checkpoints are available under the [PRS-ETH](https://huggingface.co/prs-eth/) organization on Hugging Face.
+They are designed for use with diffusers pipelines and the [original codebase](https://github.com/prs-eth/marigold), which can also be used to train
+new model checkpoints.
+The following is a summary of the recommended checkpoints, all of which produce reliable results with 1 to 4 steps.
+
+| Checkpoint | Modality | Comment |
+|-----------------------------------------------------------------------------------------------------|--------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1) | Depth | Affine-invariant depth prediction assigns each pixel a value between 0 (near plane) and 1 (far plane), with both planes determined by the model during inference. |
+| [prs-eth/marigold-normals-v1-1](https://huggingface.co/prs-eth/marigold-normals-v1-1) | Normals | The surface normals predictions are unit-length 3D vectors in the screen space of the camera, with values in the range from -1 to 1. |
+| [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1) | Intrinsics | InteriorVerse decomposition comprises Albedo and two BRDF material properties: Roughness and Metallicity. |
+| [prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1) | Intrinsics | HyperSim decomposition of an image \\(I\\) comprises Albedo \\(A\\), Diffuse shading \\(S\\), and Non-diffuse residual \\(R\\): \\(I = A \cdot S + R\\). |
-Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines. Also, to know more about reducing the memory usage of this pipeline, refer to the ["Reduce memory usage"] section [here](../../using-diffusers/svd#reduce-memory-usage).
+Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff
+between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to
+efficiently load the same components into multiple pipelines.
+Also, to learn more about reducing the memory usage of this pipeline, refer to the
+["Reduce memory usage"](../../using-diffusers/svd#reduce-memory-usage) section.
-Marigold pipelines were designed and tested only with `DDIMScheduler` and `LCMScheduler`.
-Depending on the scheduler, the number of inference steps required to get reliable predictions varies, and there is no universal value that works best across schedulers.
-Because of that, the default value of `num_inference_steps` in the `__call__` method of the pipeline is set to `None` (see the API reference).
-Unless set explicitly, its value will be taken from the checkpoint configuration `model_index.json`.
-This is done to ensure high-quality predictions when calling the pipeline with just the `image` argument.
+Marigold pipelines were designed and tested with the scheduler embedded in the model checkpoint.
+The optimal number of inference steps varies by scheduler, with no universal value that works best across all cases.
+To accommodate this, the `num_inference_steps` parameter in the pipeline's `__call__` method defaults to `None` (see the
+API reference).
+Unless set explicitly, it inherits the value from the `default_denoising_steps` field in the checkpoint configuration
+file (`model_index.json`).
+This ensures high-quality predictions when invoking the pipeline with only the `image` argument.
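+
+Below is a minimal sketch of both invocation styles, assuming the recommended depth checkpoint from the table above;
+the explicit step count is only an illustrative override:
+
+```python
+import diffusers
+import torch
+
+pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
+    "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
+).to("cuda")
+
+image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+
+# `num_inference_steps=None` (the default) falls back to `default_denoising_steps`
+# from the checkpoint configuration file (`model_index.json`).
+depth = pipe(image)
+
+# An explicit value overrides the checkpoint default.
+depth = pipe(image, num_inference_steps=4)
+```
+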
-See also Marigold [usage examples](marigold_usage).
+See also Marigold [usage examples](../../using-diffusers/marigold_usage).
+
+## Marigold Depth Prediction API
-## MarigoldDepthPipeline
[[autodoc]] MarigoldDepthPipeline
- - all
- __call__
-## MarigoldNormalsPipeline
+[[autodoc]] pipelines.marigold.pipeline_marigold_depth.MarigoldDepthOutput
+
+[[autodoc]] pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_depth
+
+## Marigold Normals Estimation API
[[autodoc]] MarigoldNormalsPipeline
- - all
- __call__
-## MarigoldDepthOutput
-[[autodoc]] pipelines.marigold.pipeline_marigold_depth.MarigoldDepthOutput
+[[autodoc]] pipelines.marigold.pipeline_marigold_normals.MarigoldNormalsOutput
+
+[[autodoc]] pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_normals
+
+## Marigold Intrinsic Image Decomposition API
+
+[[autodoc]] MarigoldIntrinsicsPipeline
+ - __call__
+
+[[autodoc]] pipelines.marigold.pipeline_marigold_intrinsics.MarigoldIntrinsicsOutput
-## MarigoldNormalsOutput
-[[autodoc]] pipelines.marigold.pipeline_marigold_normals.MarigoldNormalsOutput
\ No newline at end of file
+[[autodoc]] pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_intrinsics
diff --git a/docs/source/en/api/pipelines/overview.md b/docs/source/en/api/pipelines/overview.md
index ece3ebb4c340..6a8e82a692e0 100644
--- a/docs/source/en/api/pipelines/overview.md
+++ b/docs/source/en/api/pipelines/overview.md
@@ -65,7 +65,7 @@ The table below lists all the pipelines currently available in 🤗 Diffusers an
| [Latte](latte) | text2image |
| [LEDITS++](ledits_pp) | image editing |
| [Lumina-T2X](lumina) | text2image |
-| [Marigold](marigold) | depth |
+| [Marigold](marigold) | depth-estimation, normals-estimation, intrinsic-decomposition |
| [MultiDiffusion](panorama) | text2image |
| [MusicLDM](musicldm) | text2audio |
| [PAG](pag) | text2image |
diff --git a/docs/source/en/using-diffusers/marigold_usage.md b/docs/source/en/using-diffusers/marigold_usage.md
index e9756b7f1c8e..b8e9a5838e8d 100644
--- a/docs/source/en/using-diffusers/marigold_usage.md
+++ b/docs/source/en/using-diffusers/marigold_usage.md
@@ -1,4 +1,6 @@
-
-# Marigold Pipelines for Computer Vision Tasks
+# Marigold Computer Vision
-[Marigold](../api/pipelines/marigold) is a novel diffusion-based dense prediction approach, and a set of pipelines for various computer vision tasks, such as monocular depth estimation.
+**Marigold** is a diffusion-based [method](https://huggingface.co/papers/2312.02145) and a collection of [pipelines](../api/pipelines/marigold) designed for
+dense computer vision tasks, including **monocular depth prediction**, **surface normals estimation**, and **intrinsic
+image decomposition**.
-This guide will show you how to use Marigold to obtain fast and high-quality predictions for images and videos.
+This guide will walk you through using Marigold to generate fast and high-quality predictions for images and videos.
-Each pipeline supports one Computer Vision task, which takes an input RGB image as input and produces a *prediction* of the modality of interest, such as a depth map of the input image.
-Currently, the following tasks are implemented:
+Each pipeline is tailored for a specific computer vision task, processing an input RGB image and generating a
+corresponding prediction.
+Currently, the following computer vision tasks are implemented:
+
-| Pipeline | Predicted Modalities | Demos |
-|---------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------:|
-| [MarigoldDepthPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py) | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-lcm), [Slow Original Demo (DDIM)](https://huggingface.co/spaces/prs-eth/marigold) |
-| [MarigoldNormalsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-normals-lcm) |
+| Pipeline | Recommended Model Checkpoints | Spaces (Interactive Apps) | Predicted Modalities |
+|---------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------:|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [MarigoldDepthPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py) | [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1) | [Depth Estimation](https://huggingface.co/spaces/prs-eth/marigold) | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity) |
+| [MarigoldNormalsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py) | [prs-eth/marigold-normals-v1-1](https://huggingface.co/prs-eth/marigold-normals-v1-1) | [Surface Normals Estimation](https://huggingface.co/spaces/prs-eth/marigold-normals) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping) |
+| [MarigoldIntrinsicsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py) | [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1),<br>[prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1) | [Intrinsic Image Decomposition](https://huggingface.co/spaces/prs-eth/marigold-iid) | [Albedo](https://en.wikipedia.org/wiki/Albedo), [Materials](https://www.n.aiq3d.com/wiki/roughnessmetalnessao-map), [Lighting](https://en.wikipedia.org/wiki/Diffuse_reflection) |
-The original checkpoints can be found under the [PRS-ETH](https://huggingface.co/prs-eth/) Hugging Face organization.
-These checkpoints are meant to work with diffusers pipelines and the [original codebase](https://github.com/prs-eth/marigold).
-The original code can also be used to train new checkpoints.
+All original checkpoints are available under the [PRS-ETH](https://huggingface.co/prs-eth/) organization on Hugging Face.
+They are designed for use with diffusers pipelines and the [original codebase](https://github.com/prs-eth/marigold), which can also be used to train
+new model checkpoints.
+The following is a summary of the recommended checkpoints, all of which produce reliable results with 1 to 4 steps.
+
-| Checkpoint | Modality | Comment |
-|-----------------------------------------------------------------------------------------------|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| [prs-eth/marigold-v1-0](https://huggingface.co/prs-eth/marigold-v1-0) | Depth | The first Marigold Depth checkpoint, which predicts *affine-invariant depth* maps. The performance of this checkpoint in benchmarks was studied in the original [paper](https://huggingface.co/papers/2312.02145). Designed to be used with the `DDIMScheduler` at inference, it requires at least 10 steps to get reliable predictions. Affine-invariant depth prediction has a range of values in each pixel between 0 (near plane) and 1 (far plane); both planes are chosen by the model as part of the inference process. See the `MarigoldImageProcessor` reference for visualization utilities. |
-| [prs-eth/marigold-depth-lcm-v1-0](https://huggingface.co/prs-eth/marigold-depth-lcm-v1-0) | Depth | The fast Marigold Depth checkpoint, fine-tuned from `prs-eth/marigold-v1-0`. Designed to be used with the `LCMScheduler` at inference, it requires as little as 1 step to get reliable predictions. The prediction reliability saturates at 4 steps and declines after that. |
-| [prs-eth/marigold-normals-v0-1](https://huggingface.co/prs-eth/marigold-normals-v0-1) | Normals | A preview checkpoint for the Marigold Normals pipeline. Designed to be used with the `DDIMScheduler` at inference, it requires at least 10 steps to get reliable predictions. The surface normals predictions are unit-length 3D vectors with values in the range from -1 to 1. *This checkpoint will be phased out after the release of `v1-0` version.* |
-| [prs-eth/marigold-normals-lcm-v0-1](https://huggingface.co/prs-eth/marigold-normals-lcm-v0-1) | Normals | The fast Marigold Normals checkpoint, fine-tuned from `prs-eth/marigold-normals-v0-1`. Designed to be used with the `LCMScheduler` at inference, it requires as little as 1 step to get reliable predictions. The prediction reliability saturates at 4 steps and declines after that. *This checkpoint will be phased out after the release of `v1-0` version.* |
-The examples below are mostly given for depth prediction, but they can be universally applied with other supported modalities.
+| Checkpoint | Modality | Comment |
+|-----------------------------------------------------------------------------------------------------|--------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| [prs-eth/marigold-depth-v1-1](https://huggingface.co/prs-eth/marigold-depth-v1-1) | Depth | Affine-invariant depth prediction assigns each pixel a value between 0 (near plane) and 1 (far plane), with both planes determined by the model during inference. |
+| [prs-eth/marigold-normals-v1-1](https://huggingface.co/prs-eth/marigold-normals-v1-1) | Normals | The surface normals predictions are unit-length 3D vectors in the screen space of the camera, with values in the range from -1 to 1. |
+| [prs-eth/marigold-iid-appearance-v1-1](https://huggingface.co/prs-eth/marigold-iid-appearance-v1-1) | Intrinsics | InteriorVerse decomposition comprises Albedo and two BRDF material properties: Roughness and Metallicity. |
+| [prs-eth/marigold-iid-lighting-v1-1](https://huggingface.co/prs-eth/marigold-iid-lighting-v1-1) | Intrinsics | HyperSim decomposition of an image \\(I\\) comprises Albedo \\(A\\), Diffuse shading \\(S\\), and Non-diffuse residual \\(R\\): \\(I = A \cdot S + R\\). |
+
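+The HyperSim decomposition formula above is elementwise; here is a small sketch with stand-in NumPy arrays
+(illustrative values, not actual model outputs) that makes the arithmetic explicit:
+
+```python
+import numpy as np
+
+# Stand-ins for HyperSim-style predictions in linear space, shaped [H, W, 3].
+albedo = np.random.rand(4, 4, 3)
+shading = np.random.rand(4, 4, 3)
+residual = np.random.rand(4, 4, 3)
+
+# I = A * S + R: the elementwise product of albedo and diffuse shading plus the
+# non-diffuse residual reconstructs the image.
+image = albedo * shading + residual
+```
+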
+The examples below are mostly given for depth prediction, but they can be universally applied to other supported
+modalities.
We showcase the predictions using the same input image of Albert Einstein generated by Midjourney.
This makes it easier to compare visualizations of the predictions across various modalities and checkpoints.
@@ -47,19 +56,21 @@ This makes it easier to compare visualizations of the predictions across various
-### Depth Prediction Quick Start
+## Depth Prediction
-To get the first depth prediction, load `prs-eth/marigold-depth-lcm-v1-0` checkpoint into `MarigoldDepthPipeline` pipeline, put the image through the pipeline, and save the predictions:
+To get a depth prediction, load the `prs-eth/marigold-depth-v1-1` checkpoint into [`MarigoldDepthPipeline`],
+put the image through the pipeline, and save the predictions:
```python
import diffusers
import torch
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
- "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
+ "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")
image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+
depth = pipe(image)
vis = pipe.image_processor.visualize_depth(depth.prediction)
@@ -69,10 +80,13 @@ depth_16bit = pipe.image_processor.export_depth_to_16bit_png(depth.prediction)
depth_16bit[0].save("einstein_depth_16bit.png")
```
-The visualization function for depth [`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_depth`] applies one of [matplotlib's colormaps](https://matplotlib.org/stable/users/explain/colors/colormaps.html) (`Spectral` by default) to map the predicted pixel values from a single-channel `[0, 1]` depth range into an RGB image.
-With the `Spectral` colormap, pixels with near depth are painted red, and far pixels are assigned blue color.
+The [`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_depth`] function applies one of
+[matplotlib's colormaps](https://matplotlib.org/stable/users/explain/colors/colormaps.html) (`Spectral` by default) to map the predicted pixel values from a single-channel `[0, 1]`
+depth range into an RGB image.
+With the `Spectral` colormap, near pixels are painted red and far pixels blue.
The 16-bit PNG file stores the single channel values mapped linearly from the `[0, 1]` range into `[0, 65535]`.
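+
+As a minimal sketch (with a random array standing in for a single depth prediction), the same linear mapping can be
+written as:
+
+```python
+import numpy as np
+
+prediction = np.random.rand(480, 640).astype(np.float32)  # stand-in for one depth map in [0, 1]
+
+# Map [0, 1] linearly into the full 16-bit range [0, 65535].
+depth_16bit = (prediction.clip(0, 1) * 65535).round().astype(np.uint16)
+```
+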
-Below are the raw and the visualized predictions; as can be seen, dark areas (mustache) are easier to distinguish in the visualization:
+Below are the raw and the visualized predictions. The darker and closer areas (mustache) are easier to distinguish in
+the visualization.
@@ -89,28 +103,33 @@ Below are the raw and the visualized predictions; as can be seen, dark areas (mu
-### Surface Normals Prediction Quick Start
+## Surface Normals Estimation
-Load `prs-eth/marigold-normals-lcm-v0-1` checkpoint into `MarigoldNormalsPipeline` pipeline, put the image through the pipeline, and save the predictions:
+Load the `prs-eth/marigold-normals-v1-1` checkpoint into [`MarigoldNormalsPipeline`], put the image through the
+pipeline, and save the predictions:
```python
import diffusers
import torch
pipe = diffusers.MarigoldNormalsPipeline.from_pretrained(
- "prs-eth/marigold-normals-lcm-v0-1", variant="fp16", torch_dtype=torch.float16
+ "prs-eth/marigold-normals-v1-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")
image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+
normals = pipe(image)
vis = pipe.image_processor.visualize_normals(normals.prediction)
vis[0].save("einstein_normals.png")
```
-The visualization function for normals [`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_normals`] maps the three-dimensional prediction with pixel values in the range `[-1, 1]` into an RGB image.
-The visualization function supports flipping surface normals axes to make the visualization compatible with other choices of the frame of reference.
-Conceptually, each pixel is painted according to the surface normal vector in the frame of reference, where `X` axis points right, `Y` axis points up, and `Z` axis points at the viewer.
+The [`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_normals`] function maps a
+three-dimensional prediction with pixel values in the range `[-1, 1]` into an RGB image.
+The visualization function supports flipping surface normals axes to make the visualization compatible with other
+choices of the frame of reference.
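+
+For example, the sketch below flips the `Y` axis so that it points down instead of up; this is an illustrative choice
+for matching another frame of reference, not a required step:
+
+```python
+vis = pipe.image_processor.visualize_normals(normals.prediction, flip_y=True)
+vis[0].save("einstein_normals_flipped.png")
+```
+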
+Conceptually, each pixel is painted according to the surface normal vector in the frame of reference, where `X` axis
+points right, `Y` axis points up, and `Z` axis points at the viewer.
Below is the visualized prediction:
@@ -122,208 +141,226 @@ Below is the visualized prediction:
-In this example, the nose tip almost certainly has a point on the surface, in which the surface normal vector points straight at the viewer, meaning that its coordinates are `[0, 0, 1]`.
+In this example, the nose tip almost certainly contains a surface point at which the normal vector points straight at
+the viewer, meaning that its coordinates are `[0, 0, 1]`.
This vector maps to the RGB `[128, 128, 255]`, which corresponds to the violet-blue color.
-Similarly, a surface normal on the cheek in the right part of the image has a large `X` component, which increases the red hue.
+Similarly, a surface normal on the cheek in the right part of the image has a large `X` component, which increases the
+red hue.
Points on the shoulders pointing up with a large `Y` promote green color.
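+
+The mapping from a unit normal to RGB is an affine rescale from `[-1, 1]` to `[0, 255]`; here is a small sketch of
+that arithmetic:
+
+```python
+import numpy as np
+
+normal = np.array([0.0, 0.0, 1.0])  # a surface point facing the viewer
+
+rgb = np.round((normal + 1.0) / 2.0 * 255.0).astype(np.uint8)
+print(rgb)  # [128 128 255], the violet-blue color described above
+```
+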
-### Speeding up inference
+## Intrinsic Image Decomposition
-The above quick start snippets are already optimized for speed: they load the LCM checkpoint, use the `fp16` variant of weights and computation, and perform just one denoising diffusion step.
-The `pipe(image)` call completes in 280ms on RTX 3090 GPU.
-Internally, the input image is encoded with the Stable Diffusion VAE encoder, then the U-Net performs one denoising step, and finally, the prediction latent is decoded with the VAE decoder into pixel space.
-In this case, two out of three module calls are dedicated to converting between pixel and latent space of LDM.
-Because Marigold's latent space is compatible with the base Stable Diffusion, it is possible to speed up the pipeline call by more than 3x (85ms on RTX 3090) by using a [lightweight replacement of the SD VAE](../api/models/autoencoder_tiny):
+Marigold provides two models for Intrinsic Image Decomposition (IID): "Appearance" and "Lighting".
+Both models predict Albedo maps, derived from InteriorVerse and HyperSim annotations, respectively.
-```diff
- import diffusers
- import torch
+- The "Appearance" model also estimates Material properties: Roughness and Metallicity.
+- The "Lighting" model generates Diffuse Shading and Non-diffuse Residual.
- pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
- "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
- ).to("cuda")
+Here is sample code that saves the predictions made by the "Appearance" model:
-+ pipe.vae = diffusers.AutoencoderTiny.from_pretrained(
-+ "madebyollin/taesd", torch_dtype=torch.float16
-+ ).cuda()
+```python
+import diffusers
+import torch
- image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
- depth = pipe(image)
+pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained(
+ "prs-eth/marigold-iid-appearance-v1-1", variant="fp16", torch_dtype=torch.float16
+).to("cuda")
+
+image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+
+intrinsics = pipe(image)
+
+vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties)
+vis[0]["albedo"].save("einstein_albedo.png")
+vis[0]["roughness"].save("einstein_roughness.png")
+vis[0]["metallicity"].save("einstein_metallicity.png")
```
-As suggested in [Optimizations](../optimization/torch2.0#torch.compile), adding `torch.compile` may squeeze extra performance depending on the target hardware:
+Another example demonstrating the predictions made by the "Lighting" model:
-```diff
- import diffusers
- import torch
+```python
+import diffusers
+import torch
- pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
- "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
- ).to("cuda")
+pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained(
+ "prs-eth/marigold-iid-lighting-v1-1", variant="fp16", torch_dtype=torch.float16
+).to("cuda")
-+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
- image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
- depth = pipe(image)
+intrinsics = pipe(image)
+
+vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties)
+vis[0]["albedo"].save("einstein_albedo.png")
+vis[0]["shading"].save("einstein_shading.png")
+vis[0]["residual"].save("einstein_residual.png")
```
-## Qualitative Comparison with Depth Anything
+Both models share the same pipeline while supporting different decomposition types.
+The exact decomposition parameterization (e.g., sRGB vs. linear space) is stored in the
+`pipe.target_properties` dictionary, which is passed into the
+[`~pipelines.marigold.marigold_image_processing.MarigoldImageProcessor.visualize_intrinsics`] function.
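+
+For a quick look at what a given checkpoint decomposes into, the dictionary can be inspected directly; a sketch,
+assuming one of the IID pipelines loaded above:
+
+```python
+# `target_names` lists the predicted targets; the remaining entries describe each
+# target's parameterization (e.g., sRGB vs. linear space, up-to-scale flags).
+print(pipe.target_properties["target_names"])
+```
+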
-With the above speed optimizations, Marigold delivers predictions with more details and faster than [Depth Anything](https://huggingface.co/docs/transformers/main/en/model_doc/depth_anything) with the largest checkpoint [LiheYoung/depth-anything-large-hf](https://huggingface.co/LiheYoung/depth-anything-large-hf):
+Below are some examples showcasing the predicted decomposition outputs.
+All modalities can be inspected in the
+[Intrinsic Image Decomposition](https://huggingface.co/spaces/prs-eth/marigold-iid) Space.
-

+
- Marigold LCM fp16 with Tiny AutoEncoder
+ Predicted albedo ("Appearance" model)
-

+
- Depth Anything Large
+ Predicted diffuse shading ("Lighting" model)
-## Maximizing Precision and Ensembling
+## Speeding up inference
-Marigold pipelines have a built-in ensembling mechanism combining multiple predictions from different random latents.
-This is a brute-force way of improving the precision of predictions, capitalizing on the generative nature of diffusion.
-The ensembling path is activated automatically when the `ensemble_size` argument is set greater than `1`.
-When aiming for maximum precision, it makes sense to adjust `num_inference_steps` simultaneously with `ensemble_size`.
-The recommended values vary across checkpoints but primarily depend on the scheduler type.
-The effect of ensembling is particularly well-seen with surface normals:
+The above quick start snippets are already optimized for quality and speed: they load the recommended checkpoint, use
+the `fp16` variant of weights and computation, and perform the default number (4) of denoising diffusion steps.
+The first step to accelerate inference, at the expense of prediction quality, is to reduce the number of denoising
+diffusion steps to the minimum:
-```python
-import diffusers
+```diff
+ import diffusers
+ import torch
-model_path = "prs-eth/marigold-normals-v1-0"
+ pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
+ "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
+ ).to("cuda")
-model_paper_kwargs = {
- diffusers.schedulers.DDIMScheduler: {
- "num_inference_steps": 10,
- "ensemble_size": 10,
- },
- diffusers.schedulers.LCMScheduler: {
- "num_inference_steps": 4,
- "ensemble_size": 5,
- },
-}
+ image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+
+- depth = pipe(image)
++ depth = pipe(image, num_inference_steps=1)
+```
-image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+With this change, the `pipe` call completes in 280ms on an RTX 3090 GPU.
+Internally, the input image is first encoded using the Stable Diffusion VAE encoder, followed by a single denoising
+step performed by the U-Net.
+Finally, the prediction latent is decoded with the VAE decoder into pixel space.
+In this setup, two out of three module calls are dedicated to converting between the pixel and latent spaces of the LDM.
+Since Marigold's latent space is compatible with Stable Diffusion 2.0, inference can be accelerated by more than 3x,
+reducing the call time to 85ms on an RTX 3090, by using a [lightweight replacement of the SD VAE](../api/models/autoencoder_tiny).
+Note that using a lightweight VAE may slightly reduce the visual quality of the predictions.
-pipe = diffusers.MarigoldNormalsPipeline.from_pretrained(model_path).to("cuda")
-pipe_kwargs = model_paper_kwargs[type(pipe.scheduler)]
+```diff
+ import diffusers
+ import torch
-depth = pipe(image, **pipe_kwargs)
+ pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
+ "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
+ ).to("cuda")
-vis = pipe.image_processor.visualize_normals(depth.prediction)
-vis[0].save("einstein_normals.png")
++ pipe.vae = diffusers.AutoencoderTiny.from_pretrained(
++ "madebyollin/taesd", torch_dtype=torch.float16
++ ).cuda()
+
+ image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+
+ depth = pipe(image, num_inference_steps=1)
```
-
-
-

-
- Surface normals, no ensembling
-
-
-
-

-
- Surface normals, with ensembling
-
-
-
+So far, we have optimized the number of diffusion steps and model components.
+Self-attention operations account for a significant portion of the remaining computation, and they can be sped up by
+switching to a more efficient attention processor:
-As can be seen, all areas with fine-grained structurers, such as hair, got more conservative and on average more correct predictions.
-Such a result is more suitable for precision-sensitive downstream tasks, such as 3D reconstruction.
+```diff
+ import diffusers
+ import torch
++ from diffusers.models.attention_processor import AttnProcessor2_0
-## Quantitative Evaluation
+ pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
+ "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
+ ).to("cuda")
-To evaluate Marigold quantitatively in standard leaderboards and benchmarks (such as NYU, KITTI, and other datasets), follow the evaluation protocol outlined in the paper: load the full precision fp32 model and use appropriate values for `num_inference_steps` and `ensemble_size`.
-Optionally seed randomness to ensure reproducibility. Maximizing `batch_size` will deliver maximum device utilization.
++ pipe.vae.set_attn_processor(AttnProcessor2_0())
++ pipe.unet.set_attn_processor(AttnProcessor2_0())
-```python
-import diffusers
-import torch
+ image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
-device = "cuda"
-seed = 2024
-model_path = "prs-eth/marigold-v1-0"
-
-model_paper_kwargs = {
- diffusers.schedulers.DDIMScheduler: {
- "num_inference_steps": 50,
- "ensemble_size": 10,
- },
- diffusers.schedulers.LCMScheduler: {
- "num_inference_steps": 4,
- "ensemble_size": 10,
- },
-}
+ depth = pipe(image, num_inference_steps=1)
+```
-image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+Finally, as suggested in [Optimizations](../optimization/torch2.0#torch.compile), enabling `torch.compile` can further enhance performance depending on
+the target hardware.
+However, compilation incurs a significant overhead during the first pipeline invocation, making it beneficial only when
+the same pipeline instance is called repeatedly, such as within a loop.
-generator = torch.Generator(device=device).manual_seed(seed)
-pipe = diffusers.MarigoldDepthPipeline.from_pretrained(model_path).to(device)
-pipe_kwargs = model_paper_kwargs[type(pipe.scheduler)]
+```diff
+ import diffusers
+ import torch
+ from diffusers.models.attention_processor import AttnProcessor2_0
+
+ pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
+ "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
+ ).to("cuda")
-depth = pipe(image, generator=generator, **pipe_kwargs)
+ pipe.vae.set_attn_processor(AttnProcessor2_0())
+ pipe.unet.set_attn_processor(AttnProcessor2_0())
-# evaluate metrics
++ pipe.vae = torch.compile(pipe.vae, mode="reduce-overhead", fullgraph=True)
++ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+
+ image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+
+ depth = pipe(image, num_inference_steps=1)
```
-## Using Predictive Uncertainty
+## Maximizing Precision and Ensembling
-The ensembling mechanism built into Marigold pipelines combines multiple predictions obtained from different random latents.
-As a side effect, it can be used to quantify epistemic (model) uncertainty; simply specify `ensemble_size` greater than 1 and set `output_uncertainty=True`.
-The resulting uncertainty will be available in the `uncertainty` field of the output.
-It can be visualized as follows:
+Marigold pipelines have a built-in ensembling mechanism combining multiple predictions from different random latents.
+This is a brute-force way of improving the precision of predictions, capitalizing on the generative nature of diffusion.
+The ensembling path is activated automatically when the `ensemble_size` argument is set to `3` or greater.
+When aiming for maximum precision, it makes sense to adjust `num_inference_steps` simultaneously with `ensemble_size`.
+The recommended values vary across checkpoints but primarily depend on the scheduler type.
+The effect of ensembling is particularly well-seen with surface normals:
-```python
-import diffusers
-import torch
+```diff
+ import diffusers
-pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
- "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
-).to("cuda")
+ pipe = diffusers.MarigoldNormalsPipeline.from_pretrained("prs-eth/marigold-normals-v1-1").to("cuda")
-image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
-depth = pipe(
- image,
- ensemble_size=10, # any number greater than 1; higher values yield higher precision
- output_uncertainty=True,
-)
+ image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
-uncertainty = pipe.image_processor.visualize_uncertainty(depth.uncertainty)
-uncertainty[0].save("einstein_depth_uncertainty.png")
+- normals = pipe(image)
++ normals = pipe(image, num_inference_steps=10, ensemble_size=5)
+
+ vis = pipe.image_processor.visualize_normals(normals.prediction)
+ vis[0].save("einstein_normals.png")
```
-

+
- Depth uncertainty
+ Surface normals, no ensembling
-

+
- Surface normals uncertainty
+ Surface normals, with ensembling
-The interpretation of uncertainty is easy: higher values (white) correspond to pixels, where the model struggles to make consistent predictions.
-Evidently, the depth model is the least confident around edges with discontinuity, where the object depth changes drastically.
-The surface normals model is the least confident in fine-grained structures, such as hair, and dark areas, such as the collar.
+As can be seen, all areas with fine-grained structures, such as hair, receive more conservative and, on average, more
+correct predictions.
+Such a result is more suitable for precision-sensitive downstream tasks, such as 3D reconstruction.
## Frame-by-frame Video Processing with Temporal Consistency
-Due to Marigold's generative nature, each prediction is unique and defined by the random noise sampled for the latent initialization.
-This becomes an obvious drawback compared to traditional end-to-end dense regression networks, as exemplified in the following videos:
+Due to Marigold's generative nature, each prediction is unique and defined by the random noise sampled for the latent
+initialization.
+This becomes an obvious drawback compared to traditional end-to-end dense regression networks, as exemplified in the
+following videos:
@@ -336,26 +373,32 @@ This becomes an obvious drawback compared to traditional end-to-end dense regres
-To address this issue, it is possible to pass `latents` argument to the pipelines, which defines the starting point of diffusion.
-Empirically, we found that a convex combination of the very same starting point noise latent and the latent corresponding to the previous frame prediction give sufficiently smooth results, as implemented in the snippet below:
+To address this issue, it is possible to pass the `latents` argument to the pipelines, which defines the starting
+point of diffusion.
+Empirically, we found that a convex combination of the very same starting point noise latent and the latent
+corresponding to the previous frame prediction gives sufficiently smooth results, as implemented in the snippet below:
```python
import imageio
-from PIL import Image
-from tqdm import tqdm
import diffusers
import torch
+from diffusers.models.attention_processor import AttnProcessor2_0
+from PIL import Image
+from tqdm import tqdm
device = "cuda"
-path_in = "obama.mp4"
+path_in = "https://huggingface.co/spaces/prs-eth/marigold-lcm/resolve/c7adb5427947d2680944f898cd91d386bf0d4924/files/video/obama.mp4"
path_out = "obama_depth.gif"
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
- "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
+ "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
).to(device)
pipe.vae = diffusers.AutoencoderTiny.from_pretrained(
"madebyollin/taesd", torch_dtype=torch.float16
).to(device)
+pipe.unet.set_attn_processor(AttnProcessor2_0())
+pipe.vae = torch.compile(pipe.vae, mode="reduce-overhead", fullgraph=True)
+pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
pipe.set_progress_bar_config(disable=True)
with imageio.get_reader(path_in) as reader:
@@ -373,7 +416,11 @@ with imageio.get_reader(path_in) as reader:
latents = 0.9 * latents + 0.1 * last_frame_latent
depth = pipe(
- frame, match_input_resolution=False, latents=latents, output_latent=True
+ frame,
+ num_inference_steps=1,
+ match_input_resolution=False,
+ latents=latents,
+ output_latent=True,
)
last_frame_latent = depth.latent
out.append(pipe.image_processor.visualize_depth(depth.prediction)[0])
@@ -382,7 +429,8 @@ with imageio.get_reader(path_in) as reader:
```
Here, the diffusion process starts from the given computed latent.
-The pipeline sets `output_latent=True` to access `out.latent` and computes its contribution to the next frame's latent initialization.
+The pipeline sets `output_latent=True` to access `out.latent` and computes its contribution to the next frame's latent
+initialization.
The result is much more stable now:
@@ -414,7 +462,7 @@ image = diffusers.utils.load_image(
)
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
- "prs-eth/marigold-depth-lcm-v1-0", torch_dtype=torch.float16, variant="fp16"
+ "prs-eth/marigold-depth-v1-1", torch_dtype=torch.float16, variant="fp16"
).to(device)
depth_image = pipe(image, generator=generator).prediction
@@ -463,4 +511,95 @@ controlnet_out[0].save("motorcycle_controlnet_out.png")
-Hopefully, you will find Marigold useful for solving your downstream tasks, be it a part of a more broad generative workflow, or a perception task, such as 3D reconstruction.
+## Quantitative Evaluation
+
+To evaluate Marigold quantitatively in standard leaderboards and benchmarks (such as NYU, KITTI, and other datasets),
+follow the evaluation protocol outlined in the paper: load the full precision fp32 model and use appropriate values
+for `num_inference_steps` and `ensemble_size`.
+Optionally seed randomness to ensure reproducibility.
+Maximizing `batch_size` will deliver maximum device utilization.
+
+```python
+import diffusers
+import torch
+
+device = "cuda"
+seed = 2024
+
+generator = torch.Generator(device=device).manual_seed(seed)
+pipe = diffusers.MarigoldDepthPipeline.from_pretrained("prs-eth/marigold-depth-v1-1").to(device)
+
+image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+
+depth = pipe(
+ image,
+ num_inference_steps=4, # set according to the evaluation protocol from the paper
+ ensemble_size=10, # set according to the evaluation protocol from the paper
+ generator=generator,
+)
+
+# evaluate metrics
+```
+
+## Using Predictive Uncertainty
+
+The ensembling mechanism built into Marigold pipelines combines multiple predictions obtained from different random
+latents.
+As a side effect, it can be used to quantify epistemic (model) uncertainty; simply specify an `ensemble_size` of `3`
+or greater and set `output_uncertainty=True`.
+The resulting uncertainty will be available in the `uncertainty` field of the output.
+It can be visualized as follows:
+
+```python
+import diffusers
+import torch
+
+pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
+ "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
+).to("cuda")
+
+image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+
+depth = pipe(
+ image,
+ ensemble_size=10, # any number >= 3
+ output_uncertainty=True,
+)
+
+uncertainty = pipe.image_processor.visualize_uncertainty(depth.uncertainty)
+uncertainty[0].save("einstein_depth_uncertainty.png")
+```
+
+
+
+

+
+ Depth uncertainty
+
+
+
+

+
+ Surface normals uncertainty
+
+
+
+

+
+ Albedo uncertainty
+
+
+
+
+The interpretation of uncertainty is easy: higher values (white) correspond to pixels where the model struggles to
+make consistent predictions.
+- The depth model exhibits the most uncertainty around discontinuities, where object depth changes abruptly.
+- The surface normals model is least confident in fine-grained structures like hair and in dark regions such as the
+collar area.
+- Albedo uncertainty is represented as an RGB image, as it captures uncertainty independently for each color channel,
+unlike depth and surface normals. It is also higher in shaded regions and at discontinuities.
+
+## Conclusion
+
+We hope Marigold proves valuable for your downstream tasks, whether as part of a broader generative workflow or for
+perception-based applications like 3D reconstruction.
\ No newline at end of file
diff --git a/src/diffusers/__init__.py b/src/diffusers/__init__.py
index f4d395c7d011..71dd49886f6f 100644
--- a/src/diffusers/__init__.py
+++ b/src/diffusers/__init__.py
@@ -345,6 +345,7 @@
"Lumina2Text2ImgPipeline",
"LuminaText2ImgPipeline",
"MarigoldDepthPipeline",
+ "MarigoldIntrinsicsPipeline",
"MarigoldNormalsPipeline",
"MochiPipeline",
"MusicLDMPipeline",
@@ -845,6 +846,7 @@
Lumina2Text2ImgPipeline,
LuminaText2ImgPipeline,
MarigoldDepthPipeline,
+ MarigoldIntrinsicsPipeline,
MarigoldNormalsPipeline,
MochiPipeline,
MusicLDMPipeline,
diff --git a/src/diffusers/pipelines/__init__.py b/src/diffusers/pipelines/__init__.py
index 0410fef30e7e..8e7f9d68a5d4 100644
--- a/src/diffusers/pipelines/__init__.py
+++ b/src/diffusers/pipelines/__init__.py
@@ -261,6 +261,7 @@
_import_structure["marigold"].extend(
[
"MarigoldDepthPipeline",
+ "MarigoldIntrinsicsPipeline",
"MarigoldNormalsPipeline",
]
)
@@ -603,6 +604,7 @@
from .lumina2 import Lumina2Text2ImgPipeline
from .marigold import (
MarigoldDepthPipeline,
+ MarigoldIntrinsicsPipeline,
MarigoldNormalsPipeline,
)
from .mochi import MochiPipeline
diff --git a/src/diffusers/pipelines/marigold/__init__.py b/src/diffusers/pipelines/marigold/__init__.py
index b5ae03adfc11..168a8276be4e 100644
--- a/src/diffusers/pipelines/marigold/__init__.py
+++ b/src/diffusers/pipelines/marigold/__init__.py
@@ -23,6 +23,7 @@
else:
_import_structure["marigold_image_processing"] = ["MarigoldImageProcessor"]
_import_structure["pipeline_marigold_depth"] = ["MarigoldDepthOutput", "MarigoldDepthPipeline"]
+ _import_structure["pipeline_marigold_intrinsics"] = ["MarigoldIntrinsicsOutput", "MarigoldIntrinsicsPipeline"]
_import_structure["pipeline_marigold_normals"] = ["MarigoldNormalsOutput", "MarigoldNormalsPipeline"]
if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT:
@@ -35,6 +36,7 @@
else:
from .marigold_image_processing import MarigoldImageProcessor
from .pipeline_marigold_depth import MarigoldDepthOutput, MarigoldDepthPipeline
+ from .pipeline_marigold_intrinsics import MarigoldIntrinsicsOutput, MarigoldIntrinsicsPipeline
from .pipeline_marigold_normals import MarigoldNormalsOutput, MarigoldNormalsPipeline
else:
diff --git a/src/diffusers/pipelines/marigold/marigold_image_processing.py b/src/diffusers/pipelines/marigold/marigold_image_processing.py
index 51b9983db6f6..0723014ad37b 100644
--- a/src/diffusers/pipelines/marigold/marigold_image_processing.py
+++ b/src/diffusers/pipelines/marigold/marigold_image_processing.py
@@ -1,4 +1,22 @@
-from typing import List, Optional, Tuple, Union
+# Copyright 2023-2025 Marigold Team, ETH Zürich. All rights reserved.
+# Copyright 2024-2025 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# --------------------------------------------------------------------------
+# More information and citation instructions are available on the
+# Marigold project website: https://marigoldcomputervision.github.io
+# --------------------------------------------------------------------------
+from typing import Any, Dict, List, Optional, Tuple, Union
import numpy as np
import PIL
@@ -379,7 +397,7 @@ def visualize_depth(
val_min: float = 0.0,
val_max: float = 1.0,
color_map: str = "Spectral",
- ) -> Union[PIL.Image.Image, List[PIL.Image.Image]]:
+ ) -> List[PIL.Image.Image]:
"""
Visualizes depth maps, such as predictions of the `MarigoldDepthPipeline`.
@@ -391,7 +409,7 @@ def visualize_depth(
color_map (`str`, *optional*, defaults to `"Spectral"`): Color map used to convert a single-channel
depth prediction into colored representation.
- Returns: `PIL.Image.Image` or `List[PIL.Image.Image]` with depth maps visualization.
+ Returns: `List[PIL.Image.Image]` with depth maps visualization.
"""
if val_max <= val_min:
raise ValueError(f"Invalid values range: [{val_min}, {val_max}].")
@@ -436,7 +454,7 @@ def export_depth_to_16bit_png(
depth: Union[np.ndarray, torch.Tensor, List[np.ndarray], List[torch.Tensor]],
val_min: float = 0.0,
val_max: float = 1.0,
- ) -> Union[PIL.Image.Image, List[PIL.Image.Image]]:
+ ) -> List[PIL.Image.Image]:
def export_depth_to_16bit_png_one(img, idx=None):
prefix = "Depth" + (f"[{idx}]" if idx else "")
if not isinstance(img, np.ndarray) and not torch.is_tensor(img):
@@ -478,7 +496,7 @@ def visualize_normals(
flip_x: bool = False,
flip_y: bool = False,
flip_z: bool = False,
- ) -> Union[PIL.Image.Image, List[PIL.Image.Image]]:
+ ) -> List[PIL.Image.Image]:
"""
Visualizes surface normals, such as predictions of the `MarigoldNormalsPipeline`.
@@ -492,7 +510,7 @@ def visualize_normals(
flip_z (`bool`, *optional*, defaults to `False`): Flips the Z axis of the normals frame of reference.
Default direction is facing the observer.
- Returns: `PIL.Image.Image` or `List[PIL.Image.Image]` with surface normals visualization.
+ Returns: `List[PIL.Image.Image]` with surface normals visualization.
"""
flip_vec = None
if any((flip_x, flip_y, flip_z)):
@@ -528,6 +546,99 @@ def visualize_normals_one(img, idx=None):
else:
raise ValueError(f"Unexpected input type: {type(normals)}")
+ @staticmethod
+ def visualize_intrinsics(
+ prediction: Union[
+ np.ndarray,
+ torch.Tensor,
+ List[np.ndarray],
+ List[torch.Tensor],
+ ],
+ target_properties: Dict[str, Any],
+ color_map: Union[str, Dict[str, str]] = "binary",
+ ) -> List[Dict[str, PIL.Image.Image]]:
+ """
+ Visualizes intrinsic image decomposition, such as predictions of the `MarigoldIntrinsicsPipeline`.
+
+ Args:
+ prediction (`Union[np.ndarray, torch.Tensor, List[np.ndarray], List[torch.Tensor]]`):
+ Intrinsic image decomposition.
+ target_properties (`Dict[str, Any]`):
+                Decomposition properties. Expected entries: `target_names: List[str]`, and a dictionary with the keys
+                `prediction_space: str`, `sub_target_names: List[Union[str, None]]` (must have 3 entries, `None` for
+                missing modalities), and `up_to_scale: bool`, one for each target and sub-target.
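+                For illustration only (not a shipped configuration), a two-target setup could look like:
+                `{"target_names": ["albedo", "material"], "albedo": {"prediction_space": "srgb"}, "material":
+                {"prediction_space": "stack", "sub_target_names": ["roughness", "metallicity", None]}, "roughness":
+                {"prediction_space": "linear", "up_to_scale": False}, "metallicity": {"prediction_space": "linear",
+                "up_to_scale": False}}`.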
+            color_map (`Union[str, Dict[str, str]]`, *optional*, defaults to `"binary"`):
+                Color map used to convert single-channel predictions into colored representations. When a dictionary
+ is passed, each modality can be colored with its own color map.
+
+ Returns: `List[Dict[str, PIL.Image.Image]]` with intrinsic image decomposition visualization.
+ """
+ if "target_names" not in target_properties:
+ raise ValueError("Missing `target_names` in target_properties")
+ if not isinstance(color_map, str) and not (
+ isinstance(color_map, dict)
+ and all(isinstance(k, str) and isinstance(v, str) for k, v in color_map.items())
+ ):
+ raise ValueError("`color_map` must be a string or a dictionary of strings")
+ n_targets = len(target_properties["target_names"])
+
+ def visualize_targets_one(images, idx=None):
+            # images: [T,3,H,W]
+ out = {}
+ for target_name, img in zip(target_properties["target_names"], images):
+ img = img.permute(1, 2, 0) # [H, W, 3]
+ prediction_space = target_properties[target_name].get("prediction_space", "srgb")
+ if prediction_space == "stack":
+ sub_target_names = target_properties[target_name]["sub_target_names"]
+ if len(sub_target_names) != 3 or any(
+ not (isinstance(s, str) or s is None) for s in sub_target_names
+ ):
+ raise ValueError(f"Unexpected target sub-names {sub_target_names} in {target_name}")
+ for i, sub_target_name in enumerate(sub_target_names):
+ if sub_target_name is None:
+ continue
+ sub_img = img[:, :, i]
+ sub_prediction_space = target_properties[sub_target_name].get("prediction_space", "srgb")
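+                        # Tone-map linear-space predictions for display: normalize up-to-scale ones by their max,
+                        # then apply the approximate sRGB gamma (1/2.2).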
+ if sub_prediction_space == "linear":
+ sub_up_to_scale = target_properties[sub_target_name].get("up_to_scale", False)
+ if sub_up_to_scale:
+ sub_img = sub_img / max(sub_img.max().item(), 1e-6)
+ sub_img = sub_img ** (1 / 2.2)
+ cmap_name = (
+ color_map if isinstance(color_map, str) else color_map.get(sub_target_name, "binary")
+ )
+ sub_img = MarigoldImageProcessor.colormap(sub_img, cmap=cmap_name, bytes=True)
+ sub_img = PIL.Image.fromarray(sub_img.cpu().numpy())
+ out[sub_target_name] = sub_img
+ elif prediction_space == "linear":
+ up_to_scale = target_properties[target_name].get("up_to_scale", False)
+ if up_to_scale:
+ img = img / max(img.max().item(), 1e-6)
+ img = img ** (1 / 2.2)
+ elif prediction_space == "srgb":
+ pass
+ img = (img * 255).to(dtype=torch.uint8, device="cpu").numpy()
+ img = PIL.Image.fromarray(img)
+ out[target_name] = img
+ return out
+
+        if prediction is None or (isinstance(prediction, list) and any(o is None for o in prediction)):
+            raise ValueError("Input prediction is `None` or contains `None` entries.")
+ if isinstance(prediction, (np.ndarray, torch.Tensor)):
+ prediction = MarigoldImageProcessor.expand_tensor_or_array(prediction)
+ if isinstance(prediction, np.ndarray):
+ prediction = MarigoldImageProcessor.numpy_to_pt(prediction) # [N*T,3,H,W]
+ if not (prediction.ndim == 4 and prediction.shape[1] == 3 and prediction.shape[0] % n_targets == 0):
+ raise ValueError(f"Unexpected input shape={prediction.shape}, expecting [N*T,3,H,W].")
+ N_T, _, H, W = prediction.shape
+ N = N_T // n_targets
+ prediction = prediction.reshape(N, n_targets, 3, H, W)
+ return [visualize_targets_one(img, idx) for idx, img in enumerate(prediction)]
+ elif isinstance(prediction, list):
+ return [visualize_targets_one(img, idx) for idx, img in enumerate(prediction)]
+ else:
+ raise ValueError(f"Unexpected input type: {type(prediction)}")
+
@staticmethod
def visualize_uncertainty(
uncertainty: Union[
@@ -537,9 +648,10 @@ def visualize_uncertainty(
List[torch.Tensor],
],
saturation_percentile=95,
- ) -> Union[PIL.Image.Image, List[PIL.Image.Image]]:
+ ) -> List[PIL.Image.Image]:
"""
- Visualizes dense uncertainties, such as produced by `MarigoldDepthPipeline` or `MarigoldNormalsPipeline`.
+ Visualizes dense uncertainties, such as produced by `MarigoldDepthPipeline`, `MarigoldNormalsPipeline`, or
+ `MarigoldIntrinsicsPipeline`.
Args:
uncertainty (`Union[np.ndarray, torch.Tensor, List[np.ndarray], List[torch.Tensor]]`):
@@ -547,14 +659,15 @@ def visualize_uncertainty(
saturation_percentile (`int`, *optional*, defaults to `95`):
Specifies the percentile uncertainty value visualized with maximum intensity.
- Returns: `PIL.Image.Image` or `List[PIL.Image.Image]` with uncertainty visualization.
+ Returns: `List[PIL.Image.Image]` with uncertainty visualization.
"""
def visualize_uncertainty_one(img, idx=None):
prefix = "Uncertainty" + (f"[{idx}]" if idx else "")
if img.min() < 0:
- raise ValueError(f"{prefix}: unexected data range, min={img.min()}.")
- img = img.squeeze(0).cpu().numpy()
+ raise ValueError(f"{prefix}: unexpected data range, min={img.min()}.")
+ img = img.permute(1, 2, 0) # [H,W,C]
+ img = img.squeeze(2).cpu().numpy() # [H,W] or [H,W,3]
saturation_value = np.percentile(img, saturation_percentile)
img = np.clip(img * 255 / saturation_value, 0, 255)
img = img.astype(np.uint8)
@@ -566,9 +679,9 @@ def visualize_uncertainty_one(img, idx=None):
if isinstance(uncertainty, (np.ndarray, torch.Tensor)):
uncertainty = MarigoldImageProcessor.expand_tensor_or_array(uncertainty)
if isinstance(uncertainty, np.ndarray):
- uncertainty = MarigoldImageProcessor.numpy_to_pt(uncertainty) # [N,1,H,W]
- if not (uncertainty.ndim == 4 and uncertainty.shape[1] == 1):
- raise ValueError(f"Unexpected input shape={uncertainty.shape}, expecting [N,1,H,W].")
+ uncertainty = MarigoldImageProcessor.numpy_to_pt(uncertainty) # [N,C,H,W]
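+        # Depth and normals pipelines produce single-channel uncertainty; the intrinsics pipeline produces
+        # three-channel uncertainty, hence C in (1, 3).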
+ if not (uncertainty.ndim == 4 and uncertainty.shape[1] in (1, 3)):
+ raise ValueError(f"Unexpected input shape={uncertainty.shape}, expecting [N,C,H,W] with C in (1,3).")
return [visualize_uncertainty_one(img, idx) for idx, img in enumerate(uncertainty)]
elif isinstance(uncertainty, list):
return [visualize_uncertainty_one(img, idx) for idx, img in enumerate(uncertainty)]
diff --git a/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py b/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py
index e5cd62e35773..da991aefbd4a 100644
--- a/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py
+++ b/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py
@@ -1,5 +1,5 @@
-# Copyright 2024 Marigold authors, PRS ETH Zurich. All rights reserved.
-# Copyright 2024 The HuggingFace Team. All rights reserved.
+# Copyright 2023-2025 Marigold Team, ETH Zürich. All rights reserved.
+# Copyright 2024-2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -14,7 +14,7 @@
# limitations under the License.
# --------------------------------------------------------------------------
# More information and citation instructions are available on the
-# Marigold project website: https://marigoldmonodepth.github.io
+# Marigold project website: https://marigoldcomputervision.github.io
# --------------------------------------------------------------------------
from dataclasses import dataclass
from functools import partial
@@ -64,7 +64,7 @@
>>> import torch
>>> pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
-... "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
+... "prs-eth/marigold-depth-v1-1", variant="fp16", torch_dtype=torch.float16
... ).to("cuda")
>>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
@@ -86,11 +86,12 @@ class MarigoldDepthOutput(BaseOutput):
Args:
prediction (`np.ndarray`, `torch.Tensor`):
- Predicted depth maps with values in the range [0, 1]. The shape is always $numimages \times 1 \times height
- \times width$, regardless of whether the images were passed as a 4D array or a list.
+ Predicted depth maps with values in the range [0, 1]. The shape is $numimages \times 1 \times height \times
+ width$ for `torch.Tensor` or $numimages \times height \times width \times 1$ for `np.ndarray`.
uncertainty (`None`, `np.ndarray`, `torch.Tensor`):
Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is $numimages
- \times 1 \times height \times width$.
+ \times 1 \times height \times width$ for `torch.Tensor` or $numimages \times height \times width \times 1$
+ for `np.ndarray`.
latent (`None`, `torch.Tensor`):
Latent features corresponding to the predictions, compatible with the `latents` argument of the pipeline.
The shape is $numimages * numensemble \times 4 \times latentheight \times latentwidth$.
@@ -208,6 +209,11 @@ def check_inputs(
output_type: str,
output_uncertainty: bool,
) -> int:
+ actual_vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
+ if actual_vae_scale_factor != self.vae_scale_factor:
+ raise ValueError(
+ f"`vae_scale_factor` computed at initialization ({self.vae_scale_factor}) differs from the actual one ({actual_vae_scale_factor})."
+ )
if num_inference_steps is None:
raise ValueError("`num_inference_steps` is not specified and could not be resolved from the model config.")
if num_inference_steps < 1:
@@ -320,6 +326,7 @@ def check_inputs(
return num_images
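+    # Excluded from torch.compile, presumably because the tqdm progress bar's side effects would cause graph breaks.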
+ @torch.compiler.disable
def progress_bar(self, iterable=None, total=None, desc=None, leave=True):
if not hasattr(self, "_progress_bar_config"):
self._progress_bar_config = {}
@@ -370,11 +377,9 @@ def __call__(
same width and height.
num_inference_steps (`int`, *optional*, defaults to `None`):
Number of denoising diffusion steps during inference. The default value `None` results in automatic
- selection. The number of steps should be at least 10 with the full Marigold models, and between 1 and 4
- for Marigold-LCM models.
+ selection.
ensemble_size (`int`, defaults to `1`):
- Number of ensemble predictions. Recommended values are 5 and higher for better precision, or 1 for
- faster inference.
+            Number of ensemble predictions. Higher values measurably improve accuracy, but may also smooth out fine
+            visual detail.
processing_resolution (`int`, *optional*, defaults to `None`):
Effective processing resolution. When set to `0`, matches the larger input image dimension. This
produces crisper predictions, but may also lead to the overall loss of global context. The default
@@ -486,9 +491,7 @@ def __call__(
# `pred_latent` variable. The variable `image_latent` is of the same shape: it contains each input image encoded
# into latent space and replicated `E` times. The latents can be either generated (see `generator` to ensure
# reproducibility), or passed explicitly via the `latents` argument. The latter can be set outside the pipeline
- # code. For example, in the Marigold-LCM video processing demo, the latents initialization of a frame is taken
- # as a convex combination of the latents output of the pipeline for the previous frame and a newly-sampled
- # noise. This behavior can be achieved by setting the `output_latent` argument to `True`. The latent space
+ # code. This behavior can be achieved by setting the `output_latent` argument to `True`. The latent space
# dimensions are `(h, w)`. Encoding into latent space happens in batches of size `batch_size`.
# Model invocation: self.vae.encoder.
image_latent, pred_latent = self.prepare_latents(
@@ -733,6 +736,7 @@ def init_param(depth: torch.Tensor):
param = init_s.cpu().numpy()
else:
raise ValueError("Unrecognized alignment.")
+ param = param.astype(np.float64)
return param
@@ -775,7 +779,7 @@ def cost_fn(param: np.ndarray, depth: torch.Tensor) -> float:
if regularizer_strength > 0:
prediction, _ = ensemble(depth_aligned, return_uncertainty=False)
- err_near = (0.0 - prediction.min()).abs().item()
+ err_near = prediction.min().abs().item()
err_far = (1.0 - prediction.max()).abs().item()
cost += (err_near + err_far) * regularizer_strength
diff --git a/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py b/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py
new file mode 100644
index 000000000000..c809de18f469
--- /dev/null
+++ b/src/diffusers/pipelines/marigold/pipeline_marigold_intrinsics.py
@@ -0,0 +1,721 @@
+# Copyright 2023-2025 Marigold Team, ETH Zürich. All rights reserved.
+# Copyright 2024-2025 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# --------------------------------------------------------------------------
+# More information and citation instructions are available on the
+# Marigold project website: https://marigoldcomputervision.github.io
+# --------------------------------------------------------------------------
+from dataclasses import dataclass
+from typing import Any, Dict, List, Optional, Tuple, Union
+
+import numpy as np
+import torch
+from PIL import Image
+from tqdm.auto import tqdm
+from transformers import CLIPTextModel, CLIPTokenizer
+
+from ...image_processor import PipelineImageInput
+from ...models import (
+ AutoencoderKL,
+ UNet2DConditionModel,
+)
+from ...schedulers import (
+ DDIMScheduler,
+ LCMScheduler,
+)
+from ...utils import (
+ BaseOutput,
+ is_torch_xla_available,
+ logging,
+ replace_example_docstring,
+)
+from ...utils.torch_utils import randn_tensor
+from ..pipeline_utils import DiffusionPipeline
+from .marigold_image_processing import MarigoldImageProcessor
+
+
+if is_torch_xla_available():
+ import torch_xla.core.xla_model as xm
+
+ XLA_AVAILABLE = True
+else:
+ XLA_AVAILABLE = False
+
+logger = logging.get_logger(__name__) # pylint: disable=invalid-name
+
+
+EXAMPLE_DOC_STRING = """
+Examples:
+```py
+>>> import diffusers
+>>> import torch
+
+>>> pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained(
+... "prs-eth/marigold-iid-appearance-v1-1", variant="fp16", torch_dtype=torch.float16
+... ).to("cuda")
+
+>>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+>>> intrinsics = pipe(image)
+
+>>> vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties)
+>>> vis[0]["albedo"].save("einstein_albedo.png")
+>>> vis[0]["roughness"].save("einstein_roughness.png")
+>>> vis[0]["metallicity"].save("einstein_metallicity.png")
+```
+```py
+>>> import diffusers
+>>> import torch
+
+>>> pipe = diffusers.MarigoldIntrinsicsPipeline.from_pretrained(
+... "prs-eth/marigold-iid-lighting-v1-1", variant="fp16", torch_dtype=torch.float16
+... ).to("cuda")
+
+>>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
+>>> intrinsics = pipe(image)
+
+>>> vis = pipe.image_processor.visualize_intrinsics(intrinsics.prediction, pipe.target_properties)
+>>> vis[0]["albedo"].save("einstein_albedo.png")
+>>> vis[0]["shading"].save("einstein_shading.png")
+>>> vis[0]["residual"].save("einstein_residual.png")
+```
+"""
+
+
+@dataclass
+class MarigoldIntrinsicsOutput(BaseOutput):
+ """
+ Output class for Marigold Intrinsic Image Decomposition pipeline.
+
+ Args:
+ prediction (`np.ndarray`, `torch.Tensor`):
+ Predicted image intrinsics with values in the range [0, 1]. The shape is $(numimages * numtargets) \times 3
+ \times height \times width$ for `torch.Tensor` or $(numimages * numtargets) \times height \times width
+ \times 3$ for `np.ndarray`, where `numtargets` corresponds to the number of predicted target modalities of
+ the intrinsic image decomposition.
+ uncertainty (`None`, `np.ndarray`, `torch.Tensor`):
+ Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is $(numimages *
+ numtargets) \times 3 \times height \times width$ for `torch.Tensor` or $(numimages * numtargets) \times
+ height \times width \times 3$ for `np.ndarray`.
+ latent (`None`, `torch.Tensor`):
+ Latent features corresponding to the predictions, compatible with the `latents` argument of the pipeline.
+ The shape is $(numimages * numensemble) \times (numtargets * 4) \times latentheight \times latentwidth$.
+ """
+
+ prediction: Union[np.ndarray, torch.Tensor]
+ uncertainty: Union[None, np.ndarray, torch.Tensor]
+ latent: Union[None, torch.Tensor]
+
+
+class MarigoldIntrinsicsPipeline(DiffusionPipeline):
+ """
+ Pipeline for Intrinsic Image Decomposition (IID) using the Marigold method:
+ https://marigoldcomputervision.github.io.
+
+ This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods the
+ library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
+
+ Args:
+ unet (`UNet2DConditionModel`):
+            Conditional U-Net to denoise the target latents, conditioned on the image latent.
+ vae (`AutoencoderKL`):
+ Variational Auto-Encoder (VAE) Model to encode and decode images and predictions to and from latent
+ representations.
+ scheduler (`DDIMScheduler` or `LCMScheduler`):
+ A scheduler to be used in combination with `unet` to denoise the encoded image latents.
+ text_encoder (`CLIPTextModel`):
+ Text-encoder, for empty text embedding.
+ tokenizer (`CLIPTokenizer`):
+ CLIP tokenizer.
+ prediction_type (`str`, *optional*):
+ Type of predictions made by the model.
+ target_properties (`Dict[str, Any]`, *optional*):
+ Properties of the predicted modalities, such as `target_names`, a `List[str]` used to define the number,
+ order and names of the predicted modalities, and any other metadata that may be required to interpret the
+ predictions.
+ default_denoising_steps (`int`, *optional*):
+ The minimum number of denoising diffusion steps that are required to produce a prediction of reasonable
+ quality with the given model. This value must be set in the model config. When the pipeline is called
+ without explicitly setting `num_inference_steps`, the default value is used. This is required to ensure
+ reasonable results with various model flavors compatible with the pipeline, such as those relying on very
+ short denoising schedules (`LCMScheduler`) and those with full diffusion schedules (`DDIMScheduler`).
+ default_processing_resolution (`int`, *optional*):
+ The recommended value of the `processing_resolution` parameter of the pipeline. This value must be set in
+ the model config. When the pipeline is called without explicitly setting `processing_resolution`, the
+ default value is used. This is required to ensure reasonable results with various model flavors trained
+ with varying optimal processing resolution values.
+ """
+
+ model_cpu_offload_seq = "text_encoder->unet->vae"
+ supported_prediction_types = ("intrinsics",)
+
+ def __init__(
+ self,
+ unet: UNet2DConditionModel,
+ vae: AutoencoderKL,
+ scheduler: Union[DDIMScheduler, LCMScheduler],
+ text_encoder: CLIPTextModel,
+ tokenizer: CLIPTokenizer,
+ prediction_type: Optional[str] = None,
+ target_properties: Optional[Dict[str, Any]] = None,
+ default_denoising_steps: Optional[int] = None,
+ default_processing_resolution: Optional[int] = None,
+ ):
+ super().__init__()
+
+ if prediction_type not in self.supported_prediction_types:
+ logger.warning(
+ f"Potentially unsupported `prediction_type='{prediction_type}'`; values supported by the pipeline: "
+ f"{self.supported_prediction_types}."
+ )
+
+ self.register_modules(
+ unet=unet,
+ vae=vae,
+ scheduler=scheduler,
+ text_encoder=text_encoder,
+ tokenizer=tokenizer,
+ )
+ self.register_to_config(
+ prediction_type=prediction_type,
+ target_properties=target_properties,
+ default_denoising_steps=default_denoising_steps,
+ default_processing_resolution=default_processing_resolution,
+ )
+
+ self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1) if getattr(self, "vae", None) else 8
+
+ self.target_properties = target_properties
+ self.default_denoising_steps = default_denoising_steps
+ self.default_processing_resolution = default_processing_resolution
+
+ self.empty_text_embedding = None
+
+ self.image_processor = MarigoldImageProcessor(vae_scale_factor=self.vae_scale_factor)
+
+ @property
+ def n_targets(self):
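+        # Number of predicted modalities T: the U-Net outputs T stacked latent targets of `latent_channels` each.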
+ return self.unet.config.out_channels // self.vae.config.latent_channels
+
+ def check_inputs(
+ self,
+ image: PipelineImageInput,
+ num_inference_steps: int,
+ ensemble_size: int,
+ processing_resolution: int,
+ resample_method_input: str,
+ resample_method_output: str,
+ batch_size: int,
+ ensembling_kwargs: Optional[Dict[str, Any]],
+ latents: Optional[torch.Tensor],
+ generator: Optional[Union[torch.Generator, List[torch.Generator]]],
+ output_type: str,
+ output_uncertainty: bool,
+ ) -> int:
+ actual_vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
+ if actual_vae_scale_factor != self.vae_scale_factor:
+ raise ValueError(
+ f"`vae_scale_factor` computed at initialization ({self.vae_scale_factor}) differs from the actual one ({actual_vae_scale_factor})."
+ )
+ if num_inference_steps is None:
+ raise ValueError("`num_inference_steps` is not specified and could not be resolved from the model config.")
+ if num_inference_steps < 1:
+ raise ValueError("`num_inference_steps` must be positive.")
+ if ensemble_size < 1:
+ raise ValueError("`ensemble_size` must be positive.")
+ if ensemble_size == 2:
+ logger.warning(
+ "`ensemble_size` == 2 results are similar to no ensembling (1); "
+ "consider increasing the value to at least 3."
+ )
+ if ensemble_size == 1 and output_uncertainty:
+ raise ValueError(
+ "Computing uncertainty by setting `output_uncertainty=True` also requires setting `ensemble_size` "
+ "greater than 1."
+ )
+ if processing_resolution is None:
+ raise ValueError(
+ "`processing_resolution` is not specified and could not be resolved from the model config."
+ )
+ if processing_resolution < 0:
+ raise ValueError(
+ "`processing_resolution` must be non-negative: 0 for native resolution, or any positive value for "
+ "downsampled processing."
+ )
+ if processing_resolution % self.vae_scale_factor != 0:
+ raise ValueError(f"`processing_resolution` must be a multiple of {self.vae_scale_factor}.")
+ if resample_method_input not in ("nearest", "nearest-exact", "bilinear", "bicubic", "area"):
+ raise ValueError(
+ "`resample_method_input` takes string values compatible with PIL library: "
+ "nearest, nearest-exact, bilinear, bicubic, area."
+ )
+ if resample_method_output not in ("nearest", "nearest-exact", "bilinear", "bicubic", "area"):
+ raise ValueError(
+ "`resample_method_output` takes string values compatible with PIL library: "
+ "nearest, nearest-exact, bilinear, bicubic, area."
+ )
+ if batch_size < 1:
+ raise ValueError("`batch_size` must be positive.")
+ if output_type not in ["pt", "np"]:
+ raise ValueError("`output_type` must be one of `pt` or `np`.")
+ if latents is not None and generator is not None:
+ raise ValueError("`latents` and `generator` cannot be used together.")
+ if ensembling_kwargs is not None:
+ if not isinstance(ensembling_kwargs, dict):
+ raise ValueError("`ensembling_kwargs` must be a dictionary.")
+ if "reduction" in ensembling_kwargs and ensembling_kwargs["reduction"] not in ("median", "mean"):
+ raise ValueError("`ensembling_kwargs['reduction']` can be either `'median'` or `'mean'`.")
+
+ # image checks
+ num_images = 0
+ W, H = None, None
+ if not isinstance(image, list):
+ image = [image]
+ for i, img in enumerate(image):
+ if isinstance(img, np.ndarray) or torch.is_tensor(img):
+ if img.ndim not in (2, 3, 4):
+ raise ValueError(f"`image[{i}]` has unsupported dimensions or shape: {img.shape}.")
+ H_i, W_i = img.shape[-2:]
+ N_i = 1
+ if img.ndim == 4:
+ N_i = img.shape[0]
+ elif isinstance(img, Image.Image):
+ W_i, H_i = img.size
+ N_i = 1
+ else:
+ raise ValueError(f"Unsupported `image[{i}]` type: {type(img)}.")
+ if W is None:
+ W, H = W_i, H_i
+ elif (W, H) != (W_i, H_i):
+ raise ValueError(
+ f"Input `image[{i}]` has incompatible dimensions {(W_i, H_i)} with the previous images {(W, H)}"
+ )
+ num_images += N_i
+
+ # latents checks
+ if latents is not None:
+ if not torch.is_tensor(latents):
+ raise ValueError("`latents` must be a torch.Tensor.")
+ if latents.dim() != 4:
+ raise ValueError(f"`latents` has unsupported dimensions or shape: {latents.shape}.")
+
+ if processing_resolution > 0:
+ max_orig = max(H, W)
+ new_H = H * processing_resolution // max_orig
+ new_W = W * processing_resolution // max_orig
+ if new_H == 0 or new_W == 0:
+ raise ValueError(f"Extreme aspect ratio of the input image: [{W} x {H}]")
+ W, H = new_W, new_H
+ w = (W + self.vae_scale_factor - 1) // self.vae_scale_factor
+ h = (H + self.vae_scale_factor - 1) // self.vae_scale_factor
+ shape_expected = (num_images * ensemble_size, self.unet.config.out_channels, h, w)
+
+ if latents.shape != shape_expected:
+ raise ValueError(f"`latents` has unexpected shape={latents.shape} expected={shape_expected}.")
+
+ # generator checks
+ if generator is not None:
+ if isinstance(generator, list):
+ if len(generator) != num_images * ensemble_size:
+ raise ValueError(
+ "The number of generators must match the total number of ensemble members for all input images."
+ )
+ if not all(g.device.type == generator[0].device.type for g in generator):
+ raise ValueError("`generator` device placement is not consistent in the list.")
+ elif not isinstance(generator, torch.Generator):
+ raise ValueError(f"Unsupported generator type: {type(generator)}.")
+
+ return num_images
+
+ @torch.compiler.disable
+ def progress_bar(self, iterable=None, total=None, desc=None, leave=True):
+ if not hasattr(self, "_progress_bar_config"):
+ self._progress_bar_config = {}
+ elif not isinstance(self._progress_bar_config, dict):
+ raise ValueError(
+ f"`self._progress_bar_config` should be of type `dict`, but is {type(self._progress_bar_config)}."
+ )
+
+ progress_bar_config = dict(**self._progress_bar_config)
+ progress_bar_config["desc"] = progress_bar_config.get("desc", desc)
+ progress_bar_config["leave"] = progress_bar_config.get("leave", leave)
+ if iterable is not None:
+ return tqdm(iterable, **progress_bar_config)
+ elif total is not None:
+ return tqdm(total=total, **progress_bar_config)
+ else:
+ raise ValueError("Either `total` or `iterable` has to be defined.")
+
+ @torch.no_grad()
+ @replace_example_docstring(EXAMPLE_DOC_STRING)
+ def __call__(
+ self,
+ image: PipelineImageInput,
+ num_inference_steps: Optional[int] = None,
+ ensemble_size: int = 1,
+ processing_resolution: Optional[int] = None,
+ match_input_resolution: bool = True,
+ resample_method_input: str = "bilinear",
+ resample_method_output: str = "bilinear",
+ batch_size: int = 1,
+ ensembling_kwargs: Optional[Dict[str, Any]] = None,
+ latents: Optional[Union[torch.Tensor, List[torch.Tensor]]] = None,
+ generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
+ output_type: str = "np",
+ output_uncertainty: bool = False,
+ output_latent: bool = False,
+ return_dict: bool = True,
+ ):
+ """
+ Function invoked when calling the pipeline.
+
+ Args:
+            image (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`,
+                `List[torch.Tensor]`): An input image or images used as input for the intrinsic decomposition task.
+ For arrays and tensors, the expected value range is between `[0, 1]`. Passing a batch of images is
+ possible by providing a four-dimensional array or a tensor. Additionally, a list of images of two- or
+ three-dimensional arrays or tensors can be passed. In the latter case, all list elements must have the
+ same width and height.
+ num_inference_steps (`int`, *optional*, defaults to `None`):
+ Number of denoising diffusion steps during inference. The default value `None` results in automatic
+ selection.
+ ensemble_size (`int`, defaults to `1`):
+                Number of ensemble predictions. Higher values measurably improve accuracy, but may also smooth out
+                fine visual detail.
+ processing_resolution (`int`, *optional*, defaults to `None`):
+ Effective processing resolution. When set to `0`, matches the larger input image dimension. This
+ produces crisper predictions, but may also lead to the overall loss of global context. The default
+ value `None` resolves to the optimal value from the model config.
+ match_input_resolution (`bool`, *optional*, defaults to `True`):
+ When enabled, the output prediction is resized to match the input dimensions. When disabled, the longer
+                side of the output will equal `processing_resolution`.
+ resample_method_input (`str`, *optional*, defaults to `"bilinear"`):
+ Resampling method used to resize input images to `processing_resolution`. The accepted values are:
+ `"nearest"`, `"nearest-exact"`, `"bilinear"`, `"bicubic"`, or `"area"`.
+ resample_method_output (`str`, *optional*, defaults to `"bilinear"`):
+ Resampling method used to resize output predictions to match the input resolution. The accepted values
+ are `"nearest"`, `"nearest-exact"`, `"bilinear"`, `"bicubic"`, or `"area"`.
+ batch_size (`int`, *optional*, defaults to `1`):
+ Batch size; only matters when setting `ensemble_size` or passing a tensor of images.
+ ensembling_kwargs (`dict`, *optional*, defaults to `None`)
+ Extra dictionary with arguments for precise ensembling control. The following options are available:
+ - reduction (`str`, *optional*, defaults to `"median"`): Defines the ensembling function applied in
+ every pixel location, can be either `"median"` or `"mean"`.
+ latents (`torch.Tensor`, *optional*, defaults to `None`):
+ Latent noise tensors to replace the random initialization. These can be taken from the previous
+ function call's output.
+ generator (`torch.Generator`, or `List[torch.Generator]`, *optional*, defaults to `None`):
+ Random number generator object to ensure reproducibility.
+ output_type (`str`, *optional*, defaults to `"np"`):
+ Preferred format of the output's `prediction` and the optional `uncertainty` fields. The accepted
+ values are: `"np"` (numpy array) or `"pt"` (torch tensor).
+ output_uncertainty (`bool`, *optional*, defaults to `False`):
+ When enabled, the output's `uncertainty` field contains the predictive uncertainty map, provided that
+ the `ensemble_size` argument is set to a value above 2.
+ output_latent (`bool`, *optional*, defaults to `False`):
+ When enabled, the output's `latent` field contains the latent codes corresponding to the predictions
+ within the ensemble. These codes can be saved, modified, and used for subsequent calls with the
+ `latents` argument.
+ return_dict (`bool`, *optional*, defaults to `True`):
+ Whether or not to return a [`~pipelines.marigold.MarigoldIntrinsicsOutput`] instead of a plain tuple.
+
+ Examples:
+
+ Returns:
+ [`~pipelines.marigold.MarigoldIntrinsicsOutput`] or `tuple`:
+ If `return_dict` is `True`, [`~pipelines.marigold.MarigoldIntrinsicsOutput`] is returned, otherwise a
+ `tuple` is returned where the first element is the prediction, the second element is the uncertainty
+ (or `None`), and the third is the latent (or `None`).
+ """
+
+ # 0. Resolving variables.
+ device = self._execution_device
+ dtype = self.dtype
+
+ # Model-specific optimal default values leading to fast and reasonable results.
+ if num_inference_steps is None:
+ num_inference_steps = self.default_denoising_steps
+ if processing_resolution is None:
+ processing_resolution = self.default_processing_resolution
+
+ # 1. Check inputs.
+ num_images = self.check_inputs(
+ image,
+ num_inference_steps,
+ ensemble_size,
+ processing_resolution,
+ resample_method_input,
+ resample_method_output,
+ batch_size,
+ ensembling_kwargs,
+ latents,
+ generator,
+ output_type,
+ output_uncertainty,
+ )
+
+ # 2. Prepare empty text conditioning.
+ # Model invocation: self.tokenizer, self.text_encoder.
+ if self.empty_text_embedding is None:
+ prompt = ""
+ text_inputs = self.tokenizer(
+ prompt,
+ padding="do_not_pad",
+ max_length=self.tokenizer.model_max_length,
+ truncation=True,
+ return_tensors="pt",
+ )
+ text_input_ids = text_inputs.input_ids.to(device)
+ self.empty_text_embedding = self.text_encoder(text_input_ids)[0] # [1,2,1024]
+
+ # 3. Preprocess input images. This function loads input image or images of compatible dimensions `(H, W)`,
+ # optionally downsamples them to the `processing_resolution` `(PH, PW)`, where
+ # `max(PH, PW) == processing_resolution`, and pads the dimensions to `(PPH, PPW)` such that these values are
+ # divisible by the latent space downscaling factor (typically 8 in Stable Diffusion). The default value `None`
+ # of `processing_resolution` resolves to the optimal value from the model config. It is a recommended mode of
+ # operation and leads to the most reasonable results. Using the native image resolution or any other processing
+ # resolution can lead to loss of either fine details or global context in the output predictions.
+ image, padding, original_resolution = self.image_processor.preprocess(
+ image, processing_resolution, resample_method_input, device, dtype
+ ) # [N,3,PPH,PPW]
+
+ # 4. Encode input image into latent space. At this step, each of the `N` input images is represented with `E`
+ # ensemble members. Each ensemble member is an independent diffused prediction, just initialized independently.
+ # Latents of each such predictions across all input images and all ensemble members are represented in the
+ # `pred_latent` variable. The variable `image_latent` contains each input image encoded into latent space and
+ # replicated `E` times. The variable `pred_latent` contains latents initialization, where the latent space is
+ # replicated `T` times relative to the single latent space of `image_latent`, where `T` is the number of the
+ # predicted targets. The latents can be either generated (see `generator` to ensure reproducibility), or passed
+ # explicitly via the `latents` argument. The latter can be set outside the pipeline code. This behavior can be
+ # achieved by setting the `output_latent` argument to `True`. The latent space dimensions are `(h, w)`. Encoding
+ # into latent space happens in batches of size `batch_size`.
+ # Model invocation: self.vae.encoder.
+ image_latent, pred_latent = self.prepare_latents(
+ image, latents, generator, ensemble_size, batch_size
+ ) # [N*E,4,h,w], [N*E,T*4,h,w]
+
+ del image
+
+ batch_empty_text_embedding = self.empty_text_embedding.to(device=device, dtype=dtype).repeat(
+ batch_size, 1, 1
+        ) # [B,2,1024]
+
+ # 5. Process the denoising loop. All `N * E` latents are processed sequentially in batches of size `batch_size`.
+ # The unet model takes concatenated latent spaces of the input image and the predicted modality as an input, and
+ # outputs noise for the predicted modality's latent space. The number of denoising diffusion steps is defined by
+ # `num_inference_steps`. It is either set directly, or resolves to the optimal value specific to the loaded
+ # model.
+ # Model invocation: self.unet.
+ pred_latents = []
+
+ for i in self.progress_bar(
+ range(0, num_images * ensemble_size, batch_size), leave=True, desc="Marigold predictions..."
+ ):
+ batch_image_latent = image_latent[i : i + batch_size] # [B,4,h,w]
+ batch_pred_latent = pred_latent[i : i + batch_size] # [B,T*4,h,w]
+ effective_batch_size = batch_image_latent.shape[0]
+ text = batch_empty_text_embedding[:effective_batch_size] # [B,2,1024]
+
+ self.scheduler.set_timesteps(num_inference_steps, device=device)
+ for t in self.progress_bar(self.scheduler.timesteps, leave=False, desc="Diffusion steps..."):
+ batch_latent = torch.cat([batch_image_latent, batch_pred_latent], dim=1) # [B,(1+T)*4,h,w]
+ noise = self.unet(batch_latent, t, encoder_hidden_states=text, return_dict=False)[0] # [B,T*4,h,w]
+ batch_pred_latent = self.scheduler.step(
+ noise, t, batch_pred_latent, generator=generator
+ ).prev_sample # [B,T*4,h,w]
+
+ if XLA_AVAILABLE:
+ xm.mark_step()
+
+ pred_latents.append(batch_pred_latent)
+
+ pred_latent = torch.cat(pred_latents, dim=0) # [N*E,T*4,h,w]
+
+ del (
+ pred_latents,
+ image_latent,
+ batch_empty_text_embedding,
+ batch_image_latent,
+ batch_pred_latent,
+ text,
+ batch_latent,
+ noise,
+ )
+
+ # 6. Decode predictions from latent into pixel space. The resulting `N * E` predictions have shape `(PPH, PPW)`,
+ # which requires slight postprocessing. Decoding into pixel space happens in batches of size `batch_size`.
+ # Model invocation: self.vae.decoder.
+ pred_latent_for_decoding = pred_latent.reshape(
+ num_images * ensemble_size * self.n_targets, self.vae.config.latent_channels, *pred_latent.shape[2:]
+ ) # [N*E*T,4,PPH,PPW]
+ prediction = torch.cat(
+ [
+ self.decode_prediction(pred_latent_for_decoding[i : i + batch_size])
+ for i in range(0, pred_latent_for_decoding.shape[0], batch_size)
+ ],
+ dim=0,
+ ) # [N*E*T,3,PPH,PPW]
+
+ del pred_latent_for_decoding
+ if not output_latent:
+ pred_latent = None
+
+ # 7. Remove padding. The output shape is (PH, PW).
+ prediction = self.image_processor.unpad_image(prediction, padding) # [N*E*T,3,PH,PW]
+
+ # 8. Ensemble and compute uncertainty (when `output_uncertainty` is set). This code treats each of the `N*T`
+ # groups of `E` ensemble predictions independently. For each group it computes an ensembled prediction of shape
+ # `(PH, PW)` and an optional uncertainty map of the same dimensions. After computing this pair of outputs for
+ # each group independently, it stacks them respectively into batches of `N*T` almost final predictions and
+ # uncertainty maps.
+ uncertainty = None
+ if ensemble_size > 1:
+ prediction = prediction.reshape(
+ num_images, ensemble_size, self.n_targets, *prediction.shape[1:]
+ ) # [N,E,T,3,PH,PW]
+ prediction = [
+ self.ensemble_intrinsics(prediction[i], output_uncertainty, **(ensembling_kwargs or {}))
+ for i in range(num_images)
+ ] # [ [[T,3,PH,PW], [T,3,PH,PW]], ... ]
+ prediction, uncertainty = zip(*prediction) # [[T,3,PH,PW], ... ], [[T,3,PH,PW], ... ]
+ prediction = torch.cat(prediction, dim=0) # [N*T,3,PH,PW]
+ if output_uncertainty:
+ uncertainty = torch.cat(uncertainty, dim=0) # [N*T,3,PH,PW]
+ else:
+ uncertainty = None
+
+ # 9. If `match_input_resolution` is set, the output prediction and the uncertainty are upsampled to match the
+ # input resolution `(H, W)`. This step may introduce upsampling artifacts, and therefore can be disabled.
+ # Depending on the downstream use-case, upsampling can be also chosen based on the tolerated artifacts by
+ # setting the `resample_method_output` parameter (e.g., to `"nearest"`).
+ if match_input_resolution:
+ prediction = self.image_processor.resize_antialias(
+ prediction, original_resolution, resample_method_output, is_aa=False
+ ) # [N*T,3,H,W]
+ if uncertainty is not None and output_uncertainty:
+ uncertainty = self.image_processor.resize_antialias(
+ uncertainty, original_resolution, resample_method_output, is_aa=False
+                ) # [N*T,3,H,W]
+
+ # 10. Prepare the final outputs.
+ if output_type == "np":
+ prediction = self.image_processor.pt_to_numpy(prediction) # [N*T,H,W,3]
+ if uncertainty is not None and output_uncertainty:
+ uncertainty = self.image_processor.pt_to_numpy(uncertainty) # [N*T,H,W,3]
+
+ # 11. Offload all models
+ self.maybe_free_model_hooks()
+
+ if not return_dict:
+ return (prediction, uncertainty, pred_latent)
+
+ return MarigoldIntrinsicsOutput(
+ prediction=prediction,
+ uncertainty=uncertainty,
+ latent=pred_latent,
+ )
+
+ def prepare_latents(
+ self,
+ image: torch.Tensor,
+ latents: Optional[torch.Tensor],
+ generator: Optional[torch.Generator],
+ ensemble_size: int,
+ batch_size: int,
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
+ def retrieve_latents(encoder_output):
+ if hasattr(encoder_output, "latent_dist"):
+ return encoder_output.latent_dist.mode()
+ elif hasattr(encoder_output, "latents"):
+ return encoder_output.latents
+ else:
+ raise AttributeError("Could not access latents of provided encoder_output")
+
+ image_latent = torch.cat(
+ [
+ retrieve_latents(self.vae.encode(image[i : i + batch_size]))
+ for i in range(0, image.shape[0], batch_size)
+ ],
+ dim=0,
+ ) # [N,4,h,w]
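+        # Scale by the VAE's scaling factor so the image latents match the distribution the U-Net expects.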
+ image_latent = image_latent * self.vae.config.scaling_factor
+ image_latent = image_latent.repeat_interleave(ensemble_size, dim=0) # [N*E,4,h,w]
+ N_E, C, H, W = image_latent.shape
+
+ pred_latent = latents
+ if pred_latent is None:
+ pred_latent = randn_tensor(
+ (N_E, self.n_targets * C, H, W),
+ generator=generator,
+ device=image_latent.device,
+ dtype=image_latent.dtype,
+ ) # [N*E,T*4,h,w]
+
+ return image_latent, pred_latent
+
+ def decode_prediction(self, pred_latent: torch.Tensor) -> torch.Tensor:
+ if pred_latent.dim() != 4 or pred_latent.shape[1] != self.vae.config.latent_channels:
+ raise ValueError(
+ f"Expecting 4D tensor of shape [B,{self.vae.config.latent_channels},H,W]; got {pred_latent.shape}."
+ )
+
+ prediction = self.vae.decode(pred_latent / self.vae.config.scaling_factor, return_dict=False)[0] # [B,3,H,W]
+
+ prediction = torch.clip(prediction, -1.0, 1.0) # [B,3,H,W]
+ prediction = (prediction + 1.0) / 2.0
+
+ return prediction # [B,3,H,W]
+
+ @staticmethod
+ def ensemble_intrinsics(
+ targets: torch.Tensor,
+ output_uncertainty: bool = False,
+ reduction: str = "median",
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+ """
+ Ensembles the intrinsic decomposition represented by the `targets` tensor with expected shape `(B, T, 3, H,
+ W)`, where B is the number of ensemble members for a given prediction of size `(H x W)`, and T is the number of
+ predicted targets.
+
+ Args:
+ targets (`torch.Tensor`):
+ Input ensemble of intrinsic image decomposition maps.
+ output_uncertainty (`bool`, *optional*, defaults to `False`):
+ Whether to output uncertainty map.
+            reduction (`str`, *optional*, defaults to `"median"`):
+                Reduction method used to ensemble the predictions. The accepted values are: `"median"` and
+                `"mean"`.
+
+ Returns:
+            A tensor of ensembled intrinsic decomposition maps with shape `(T, 3, H, W)` and optionally a tensor of
+            uncertainties of shape `(T, 3, H, W)`.
+ """
+ if targets.dim() != 5 or targets.shape[2] != 3:
+ raise ValueError(f"Expecting 4D tensor of shape [B,T,3,H,W]; got {targets.shape}.")
+ if reduction not in ("median", "mean"):
+ raise ValueError(f"Unrecognized reduction method: {reduction}.")
+
+ B, T, _, H, W = targets.shape
+ uncertainty = None
+ if reduction == "mean":
+ prediction = torch.mean(targets, dim=0) # [T,3,H,W]
+ if output_uncertainty:
+ uncertainty = torch.std(targets, dim=0) # [T,3,H,W]
+ elif reduction == "median":
+ prediction = torch.median(targets, dim=0, keepdim=True).values # [1,T,3,H,W]
+ if output_uncertainty:
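+                # Uncertainty as the median absolute deviation of ensemble members from the ensemble median.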
+ uncertainty = torch.abs(targets - prediction) # [B,T,3,H,W]
+ uncertainty = torch.median(uncertainty, dim=0).values # [T,3,H,W]
+ prediction = prediction.squeeze(0) # [T,3,H,W]
+ else:
+ raise ValueError(f"Unrecognized reduction method: {reduction}.")
+ return prediction, uncertainty
diff --git a/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py b/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py
index 22f155f92022..192ed590a489 100644
--- a/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py
+++ b/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py
@@ -1,5 +1,5 @@
-# Copyright 2024 Marigold authors, PRS ETH Zurich. All rights reserved.
-# Copyright 2024 The HuggingFace Team. All rights reserved.
+# Copyright 2023-2025 Marigold Team, ETH Zürich. All rights reserved.
+# Copyright 2024-2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -14,7 +14,7 @@
# limitations under the License.
# --------------------------------------------------------------------------
# More information and citation instructions are available on the
-# Marigold project website: https://marigoldmonodepth.github.io
+# Marigold project website: https://marigoldcomputervision.github.io
# --------------------------------------------------------------------------
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple, Union
@@ -62,7 +62,7 @@
>>> import torch
>>> pipe = diffusers.MarigoldNormalsPipeline.from_pretrained(
-... "prs-eth/marigold-normals-lcm-v0-1", variant="fp16", torch_dtype=torch.float16
+... "prs-eth/marigold-normals-v1-1", variant="fp16", torch_dtype=torch.float16
... ).to("cuda")
>>> image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
@@ -81,11 +81,12 @@ class MarigoldNormalsOutput(BaseOutput):
Args:
prediction (`np.ndarray`, `torch.Tensor`):
- Predicted normals with values in the range [-1, 1]. The shape is always $numimages \times 3 \times height
- \times width$, regardless of whether the images were passed as a 4D array or a list.
+ Predicted normals with values in the range [-1, 1]. The shape is $numimages \times 3 \times height \times
+ width$ for `torch.Tensor` or $numimages \times height \times width \times 3$ for `np.ndarray`.
uncertainty (`None`, `np.ndarray`, `torch.Tensor`):
Uncertainty maps computed from the ensemble, with values in the range [0, 1]. The shape is $numimages
- \times 1 \times height \times width$.
+ \times 1 \times height \times width$ for `torch.Tensor` or $numimages \times height \times width \times 1$
+ for `np.ndarray`.
latent (`None`, `torch.Tensor`):
Latent features corresponding to the predictions, compatible with the `latents` argument of the pipeline.
The shape is $numimages * numensemble \times 4 \times latentheight \times latentwidth$.
@@ -164,6 +165,7 @@ def __init__(
tokenizer=tokenizer,
)
self.register_to_config(
+ prediction_type=prediction_type,
use_full_z_range=use_full_z_range,
default_denoising_steps=default_denoising_steps,
default_processing_resolution=default_processing_resolution,
@@ -194,6 +196,11 @@ def check_inputs(
output_type: str,
output_uncertainty: bool,
) -> int:
+ actual_vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1)
+ if actual_vae_scale_factor != self.vae_scale_factor:
+ raise ValueError(
+ f"`vae_scale_factor` computed at initialization ({self.vae_scale_factor}) differs from the actual one ({actual_vae_scale_factor})."
+ )
if num_inference_steps is None:
raise ValueError("`num_inference_steps` is not specified and could not be resolved from the model config.")
if num_inference_steps < 1:
@@ -304,6 +311,7 @@ def check_inputs(
return num_images
+ @torch.compiler.disable
def progress_bar(self, iterable=None, total=None, desc=None, leave=True):
if not hasattr(self, "_progress_bar_config"):
self._progress_bar_config = {}
@@ -354,11 +362,9 @@ def __call__(
same width and height.
num_inference_steps (`int`, *optional*, defaults to `None`):
Number of denoising diffusion steps during inference. The default value `None` results in automatic
- selection. The number of steps should be at least 10 with the full Marigold models, and between 1 and 4
- for Marigold-LCM models.
+ selection.
ensemble_size (`int`, defaults to `1`):
- Number of ensemble predictions. Recommended values are 5 and higher for better precision, or 1 for
- faster inference.
+            Number of ensemble predictions. Higher values measurably improve accuracy, but may also smooth out fine
+            visual detail.
processing_resolution (`int`, *optional*, defaults to `None`):
Effective processing resolution. When set to `0`, matches the larger input image dimension. This
produces crisper predictions, but may also lead to the overall loss of global context. The default
@@ -394,7 +400,7 @@ def __call__(
within the ensemble. These codes can be saved, modified, and used for subsequent calls with the
`latents` argument.
return_dict (`bool`, *optional*, defaults to `True`):
- Whether or not to return a [`~pipelines.marigold.MarigoldDepthOutput`] instead of a plain tuple.
+ Whether or not to return a [`~pipelines.marigold.MarigoldNormalsOutput`] instead of a plain tuple.
Examples:
@@ -462,9 +468,7 @@ def __call__(
# `pred_latent` variable. The variable `image_latent` is of the same shape: it contains each input image encoded
# into latent space and replicated `E` times. The latents can be either generated (see `generator` to ensure
# reproducibility), or passed explicitly via the `latents` argument. The latter can be set outside the pipeline
- # code. For example, in the Marigold-LCM video processing demo, the latents initialization of a frame is taken
- # as a convex combination of the latents output of the pipeline for the previous frame and a newly-sampled
- # noise. This behavior can be achieved by setting the `output_latent` argument to `True`. The latent space
+ # code. This behavior can be achieved by setting the `output_latent` argument to `True`. The latent space
# dimensions are `(h, w)`. Encoding into latent space happens in batches of size `batch_size`.
# Model invocation: self.vae.encoder.
image_latent, pred_latent = self.prepare_latents(
diff --git a/src/diffusers/utils/dummy_torch_and_transformers_objects.py b/src/diffusers/utils/dummy_torch_and_transformers_objects.py
index e80c07424608..8bb9ec1cb321 100644
--- a/src/diffusers/utils/dummy_torch_and_transformers_objects.py
+++ b/src/diffusers/utils/dummy_torch_and_transformers_objects.py
@@ -1217,6 +1217,21 @@ def from_pretrained(cls, *args, **kwargs):
requires_backends(cls, ["torch", "transformers"])
+class MarigoldIntrinsicsPipeline(metaclass=DummyObject):
+ _backends = ["torch", "transformers"]
+
+ def __init__(self, *args, **kwargs):
+ requires_backends(self, ["torch", "transformers"])
+
+ @classmethod
+ def from_config(cls, *args, **kwargs):
+ requires_backends(cls, ["torch", "transformers"])
+
+ @classmethod
+ def from_pretrained(cls, *args, **kwargs):
+ requires_backends(cls, ["torch", "transformers"])
+
+
class MarigoldNormalsPipeline(metaclass=DummyObject):
_backends = ["torch", "transformers"]
diff --git a/tests/pipelines/marigold/test_marigold_depth.py b/tests/pipelines/marigold/test_marigold_depth.py
index fcb9adca7a7b..a5700bae7bb5 100644
--- a/tests/pipelines/marigold/test_marigold_depth.py
+++ b/tests/pipelines/marigold/test_marigold_depth.py
@@ -1,5 +1,5 @@
-# Copyright 2024 Marigold authors, PRS ETH Zurich. All rights reserved.
-# Copyright 2024 The HuggingFace Team. All rights reserved.
+# Copyright 2023-2025 Marigold Team, ETH Zürich. All rights reserved.
+# Copyright 2024-2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -14,7 +14,7 @@
# limitations under the License.
# --------------------------------------------------------------------------
# More information and citation instructions are available on the
-# Marigold project website: https://marigoldmonodepth.github.io
+# Marigold project website: https://marigoldcomputervision.github.io
# --------------------------------------------------------------------------
import gc
import random
diff --git a/tests/pipelines/marigold/test_marigold_intrinsics.py b/tests/pipelines/marigold/test_marigold_intrinsics.py
new file mode 100644
index 000000000000..b24e686a4dfe
--- /dev/null
+++ b/tests/pipelines/marigold/test_marigold_intrinsics.py
@@ -0,0 +1,571 @@
+# Copyright 2023-2025 Marigold Team, ETH Zürich. All rights reserved.
+# Copyright 2024-2025 The HuggingFace Team. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# --------------------------------------------------------------------------
+# More information and citation instructions are available on the
+# Marigold project website: https://marigoldcomputervision.github.io
+# --------------------------------------------------------------------------
+import gc
+import random
+import unittest
+
+import numpy as np
+import torch
+from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer
+
+import diffusers
+from diffusers import (
+ AutoencoderKL,
+ AutoencoderTiny,
+ DDIMScheduler,
+ MarigoldIntrinsicsPipeline,
+ UNet2DConditionModel,
+)
+from diffusers.utils.testing_utils import (
+ enable_full_determinism,
+ floats_tensor,
+ load_image,
+ require_torch_gpu,
+ slow,
+ torch_device,
+)
+
+from ..test_pipelines_common import PipelineTesterMixin, to_np
+
+
+enable_full_determinism()
+
+
+class MarigoldIntrinsicsPipelineTesterMixin(PipelineTesterMixin):
+ def _test_inference_batch_single_identical(
+ self,
+ batch_size=2,
+ expected_max_diff=1e-4,
+ additional_params_copy_to_batched_inputs=["num_inference_steps"],
+ ):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ for components in pipe.components.values():
+ if hasattr(components, "set_default_attn_processor"):
+ components.set_default_attn_processor()
+
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+ inputs = self.get_dummy_inputs(torch_device)
+        # Reset generator in case it has been used in self.get_dummy_inputs
+ inputs["generator"] = self.get_generator(0)
+
+ logger = diffusers.logging.get_logger(pipe.__module__)
+ logger.setLevel(level=diffusers.logging.FATAL)
+
+ # batchify inputs
+ batched_inputs = {}
+ batched_inputs.update(inputs)
+
+ for name in self.batch_params:
+ if name not in inputs:
+ continue
+
+ value = inputs[name]
+ if name == "prompt":
+ len_prompt = len(value)
+ batched_inputs[name] = [value[: len_prompt // i] for i in range(1, batch_size + 1)]
+ batched_inputs[name][-1] = 100 * "very long"
+
+ else:
+ batched_inputs[name] = batch_size * [value]
+
+ if "generator" in inputs:
+ batched_inputs["generator"] = [self.get_generator(i) for i in range(batch_size)]
+
+ if "batch_size" in inputs:
+ batched_inputs["batch_size"] = batch_size
+
+ for arg in additional_params_copy_to_batched_inputs:
+ batched_inputs[arg] = inputs[arg]
+
+ output = pipe(**inputs)
+ output_batch = pipe(**batched_inputs)
+
+ assert output_batch[0].shape[0] == batch_size * output[0].shape[0] # only changed here
+
+ max_diff = np.abs(to_np(output_batch[0][0]) - to_np(output[0][0])).max()
+ assert max_diff < expected_max_diff
+
+ def _test_inference_batch_consistent(
+ self, batch_sizes=[2], additional_params_copy_to_batched_inputs=["num_inference_steps"], batch_generator=True
+ ):
+ components = self.get_dummy_components()
+ pipe = self.pipeline_class(**components)
+ pipe.to(torch_device)
+ pipe.set_progress_bar_config(disable=None)
+
+ inputs = self.get_dummy_inputs(torch_device)
+ inputs["generator"] = self.get_generator(0)
+
+ logger = diffusers.logging.get_logger(pipe.__module__)
+ logger.setLevel(level=diffusers.logging.FATAL)
+
+ # prepare batched inputs
+ batched_inputs = []
+ for batch_size in batch_sizes:
+ batched_input = {}
+ batched_input.update(inputs)
+
+ for name in self.batch_params:
+ if name not in inputs:
+ continue
+
+ value = inputs[name]
+ if name == "prompt":
+ len_prompt = len(value)
+ # make unequal batch sizes
+ batched_input[name] = [value[: len_prompt // i] for i in range(1, batch_size + 1)]
+
+ # make last batch super long
+ batched_input[name][-1] = 100 * "very long"
+
+ else:
+ batched_input[name] = batch_size * [value]
+
+ if batch_generator and "generator" in inputs:
+ batched_input["generator"] = [self.get_generator(i) for i in range(batch_size)]
+
+ if "batch_size" in inputs:
+ batched_input["batch_size"] = batch_size
+
+ batched_inputs.append(batched_input)
+
+ logger.setLevel(level=diffusers.logging.WARNING)
+ for batch_size, batched_input in zip(batch_sizes, batched_inputs):
+ output = pipe(**batched_input)
+ assert len(output[0]) == batch_size * pipe.n_targets # only changed here
+
+
+class MarigoldIntrinsicsPipelineFastTests(MarigoldIntrinsicsPipelineTesterMixin, unittest.TestCase):
+ pipeline_class = MarigoldIntrinsicsPipeline
+ params = frozenset(["image"])
+ batch_params = frozenset(["image"])
+ image_params = frozenset(["image"])
+ image_latents_params = frozenset(["latents"])
+ callback_cfg_params = frozenset([])
+ test_xformers_attention = False
+ required_optional_params = frozenset(
+ [
+ "num_inference_steps",
+ "generator",
+ "output_type",
+ ]
+ )
+
+ def get_dummy_components(self, time_cond_proj_dim=None):
+ torch.manual_seed(0)
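+        # Tiny UNet for fast tests. The channel counts presumably reflect the
+        # intrinsics conditioning: in_channels=12 = 4 (image latent) + 2 * 4
+        # (two target latents), matching out_channels=8.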
+ unet = UNet2DConditionModel(
+ block_out_channels=(32, 64),
+ layers_per_block=2,
+ time_cond_proj_dim=time_cond_proj_dim,
+ sample_size=32,
+ in_channels=12,
+ out_channels=8,
+ down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
+ up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
+ cross_attention_dim=32,
+ )
+ torch.manual_seed(0)
+ scheduler = DDIMScheduler(
+ beta_start=0.00085,
+ beta_end=0.012,
+ prediction_type="v_prediction",
+ set_alpha_to_one=False,
+ steps_offset=1,
+ beta_schedule="scaled_linear",
+ clip_sample=False,
+ thresholding=False,
+ )
+ torch.manual_seed(0)
+ vae = AutoencoderKL(
+ block_out_channels=[32, 64],
+ in_channels=3,
+ out_channels=3,
+ down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
+ up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+ latent_channels=4,
+ )
+ torch.manual_seed(0)
+ text_encoder_config = CLIPTextConfig(
+ bos_token_id=0,
+ eos_token_id=2,
+ hidden_size=32,
+ intermediate_size=37,
+ layer_norm_eps=1e-05,
+ num_attention_heads=4,
+ num_hidden_layers=5,
+ pad_token_id=1,
+ vocab_size=1000,
+ )
+ text_encoder = CLIPTextModel(text_encoder_config)
+ tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
+
+ components = {
+ "unet": unet,
+ "scheduler": scheduler,
+ "vae": vae,
+ "text_encoder": text_encoder,
+ "tokenizer": tokenizer,
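+            # presumably selects the intrinsic image decomposition task in the pipeline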
+ "prediction_type": "intrinsics",
+ }
+ return components
+
+ def get_dummy_tiny_autoencoder(self):
+ return AutoencoderTiny(in_channels=3, out_channels=3, latent_channels=4)
+
+ def get_dummy_inputs(self, device, seed=0):
+ image = floats_tensor((1, 3, 32, 32), rng=random.Random(seed)).to(device)
+ image = image / 2 + 0.5
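+        # rescale so all values lie safely inside the [0, 1] range expected for images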
+ if str(device).startswith("mps"):
+ generator = torch.manual_seed(seed)
+ else:
+ generator = torch.Generator(device=device).manual_seed(seed)
+ inputs = {
+ "image": image,
+ "num_inference_steps": 1,
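+            # 0 keeps processing at the native input resolution (no internal resize)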
+ "processing_resolution": 0,
+ "generator": generator,
+ "output_type": "np",
+ }
+ return inputs
+
+ def _test_marigold_intrinsics(
+ self,
+ generator_seed: int = 0,
+ expected_slice: np.ndarray = None,
+ atol: float = 1e-4,
+ **pipe_kwargs,
+ ):
+ device = "cpu"
+ components = self.get_dummy_components()
+
+ pipe = self.pipeline_class(**components)
+ pipe.to(device)
+ pipe.set_progress_bar_config(disable=None)
+
+ pipe_inputs = self.get_dummy_inputs(device, seed=generator_seed)
+ pipe_inputs.update(**pipe_kwargs)
+
+ prediction = pipe(**pipe_inputs).prediction
+
+ prediction_slice = prediction[0, -3:, -3:, -1].flatten()
+
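+        # the leading dimension of 2 below is the number of intrinsic targets
+        # stacked by the pipeline for each input image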
+ if pipe_inputs.get("match_input_resolution", True):
+ self.assertEqual(prediction.shape, (2, 32, 32, 3), "Unexpected output resolution")
+ else:
+ self.assertTrue(prediction.shape[0] == 2 and prediction.shape[3] == 3, "Unexpected output dimensions")
+ self.assertEqual(
+ max(prediction.shape[1:3]),
+ pipe_inputs.get("processing_resolution", 768),
+ "Unexpected output resolution",
+ )
+
+ np.set_printoptions(precision=5, suppress=True)
+ msg = f"{prediction_slice}"
+ self.assertTrue(np.allclose(prediction_slice, expected_slice, atol=atol), msg)
+
+    def test_marigold_intrinsics_dummy_defaults(self):
+ self._test_marigold_intrinsics(
+ expected_slice=np.array([0.6423, 0.40664, 0.41185, 0.65832, 0.63935, 0.43971, 0.51786, 0.55216, 0.47683]),
+ )
+
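+    # Suffix convention in the test names below (inferred from the arguments):
+    # G=generator_seed, S=num_inference_steps, P=processing_resolution,
+    # E=ensemble_size, B=batch_size, M=match_input_resolution (1=True, 0=False).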
+    def test_marigold_intrinsics_dummy_G0_S1_P32_E1_B1_M1(self):
+ self._test_marigold_intrinsics(
+ generator_seed=0,
+ expected_slice=np.array([0.6423, 0.40664, 0.41185, 0.65832, 0.63935, 0.43971, 0.51786, 0.55216, 0.47683]),
+ num_inference_steps=1,
+ processing_resolution=32,
+ ensemble_size=1,
+ batch_size=1,
+ match_input_resolution=True,
+ )
+
+    def test_marigold_intrinsics_dummy_G0_S1_P16_E1_B1_M1(self):
+ self._test_marigold_intrinsics(
+ generator_seed=0,
+ expected_slice=np.array([0.53132, 0.44487, 0.40164, 0.5326, 0.49073, 0.46979, 0.53324, 0.51366, 0.50387]),
+ num_inference_steps=1,
+ processing_resolution=16,
+ ensemble_size=1,
+ batch_size=1,
+ match_input_resolution=True,
+ )
+
+    def test_marigold_intrinsics_dummy_G2024_S1_P32_E1_B1_M1(self):
+ self._test_marigold_intrinsics(
+ generator_seed=2024,
+ expected_slice=np.array([0.40257, 0.39468, 0.51373, 0.4161, 0.40162, 0.58535, 0.43581, 0.47834, 0.48951]),
+ num_inference_steps=1,
+ processing_resolution=32,
+ ensemble_size=1,
+ batch_size=1,
+ match_input_resolution=True,
+ )
+
+    def test_marigold_intrinsics_dummy_G0_S2_P32_E1_B1_M1(self):
+ self._test_marigold_intrinsics(
+ generator_seed=0,
+ expected_slice=np.array([0.49636, 0.4518, 0.42722, 0.59044, 0.6362, 0.39011, 0.53522, 0.55153, 0.48699]),
+ num_inference_steps=2,
+ processing_resolution=32,
+ ensemble_size=1,
+ batch_size=1,
+ match_input_resolution=True,
+ )
+
+    def test_marigold_intrinsics_dummy_G0_S1_P64_E1_B1_M1(self):
+ self._test_marigold_intrinsics(
+ generator_seed=0,
+ expected_slice=np.array([0.55547, 0.43511, 0.4887, 0.56399, 0.63867, 0.56337, 0.47889, 0.52925, 0.49235]),
+ num_inference_steps=1,
+ processing_resolution=64,
+ ensemble_size=1,
+ batch_size=1,
+ match_input_resolution=True,
+ )
+
+    def test_marigold_intrinsics_dummy_G0_S1_P32_E3_B1_M1(self):
+ self._test_marigold_intrinsics(
+ generator_seed=0,
+ expected_slice=np.array([0.57249, 0.49824, 0.54438, 0.57733, 0.52404, 0.5255, 0.56493, 0.56336, 0.48579]),
+ num_inference_steps=1,
+ processing_resolution=32,
+ ensemble_size=3,
+ ensembling_kwargs={"reduction": "mean"},
+ batch_size=1,
+ match_input_resolution=True,
+ )
+
+    def test_marigold_intrinsics_dummy_G0_S1_P32_E4_B2_M1(self):
+ self._test_marigold_intrinsics(
+ generator_seed=0,
+ expected_slice=np.array([0.6294, 0.5575, 0.53414, 0.61077, 0.57156, 0.53974, 0.52956, 0.55467, 0.48751]),
+ num_inference_steps=1,
+ processing_resolution=32,
+ ensemble_size=4,
+ ensembling_kwargs={"reduction": "mean"},
+ batch_size=2,
+ match_input_resolution=True,
+ )
+
+    def test_marigold_intrinsics_dummy_G0_S1_P16_E1_B1_M0(self):
+ self._test_marigold_intrinsics(
+ generator_seed=0,
+ expected_slice=np.array([0.63511, 0.68137, 0.48783, 0.46689, 0.58505, 0.36757, 0.58465, 0.54302, 0.50387]),
+ num_inference_steps=1,
+ processing_resolution=16,
+ ensemble_size=1,
+ batch_size=1,
+ match_input_resolution=False,
+ )
+
+    def test_marigold_intrinsics_dummy_no_num_inference_steps(self):
+ with self.assertRaises(ValueError) as e:
+ self._test_marigold_intrinsics(
+ num_inference_steps=None,
+ expected_slice=np.array([0.0]),
+ )
+        self.assertIn("num_inference_steps", str(e.exception))
+
+    def test_marigold_intrinsics_dummy_no_processing_resolution(self):
+ with self.assertRaises(ValueError) as e:
+ self._test_marigold_intrinsics(
+ processing_resolution=None,
+ expected_slice=np.array([0.0]),
+ )
+        self.assertIn("processing_resolution", str(e.exception))
+
+
+@slow
+@require_torch_gpu
+class MarigoldIntrinsicsPipelineIntegrationTests(unittest.TestCase):
+ def setUp(self):
+ super().setUp()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def tearDown(self):
+ super().tearDown()
+ gc.collect()
+ torch.cuda.empty_cache()
+
+ def _test_marigold_intrinsics(
+ self,
+ is_fp16: bool = True,
+ device: str = "cuda",
+ generator_seed: int = 0,
+ expected_slice: np.ndarray = None,
+ model_id: str = "prs-eth/marigold-iid-appearance-v1-1",
+ image_url: str = "https://marigoldmonodepth.github.io/images/einstein.jpg",
+ atol: float = 1e-4,
+ **pipe_kwargs,
+ ):
+ from_pretrained_kwargs = {}
+ if is_fp16:
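+            # load the half-precision ("fp16") weight variant to cut memory and latency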
+ from_pretrained_kwargs["variant"] = "fp16"
+ from_pretrained_kwargs["torch_dtype"] = torch.float16
+
+ pipe = MarigoldIntrinsicsPipeline.from_pretrained(model_id, **from_pretrained_kwargs)
+ if device == "cuda":
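+            # offload idle submodules to CPU, keeping only the active one on the GPU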
+ pipe.enable_model_cpu_offload()
+ pipe.set_progress_bar_config(disable=None)
+
+ generator = torch.Generator(device=device).manual_seed(generator_seed)
+
+ image = load_image(image_url)
+ width, height = image.size
+
+ prediction = pipe(image, generator=generator, **pipe_kwargs).prediction
+
+ prediction_slice = prediction[0, -3:, -3:, -1].flatten()
+
+ if pipe_kwargs.get("match_input_resolution", True):
+ self.assertEqual(prediction.shape, (2, height, width, 3), "Unexpected output resolution")
+ else:
+ self.assertTrue(prediction.shape[0] == 2 and prediction.shape[3] == 3, "Unexpected output dimensions")
+ self.assertEqual(
+ max(prediction.shape[1:3]),
+ pipe_kwargs.get("processing_resolution", 768),
+ "Unexpected output resolution",
+ )
+
+ msg = f"{prediction_slice}"
+ self.assertTrue(np.allclose(prediction_slice, expected_slice, atol=atol), msg)
+
+ def test_marigold_intrinsics_einstein_f32_cpu_G0_S1_P32_E1_B1_M1(self):
+ self._test_marigold_intrinsics(
+ is_fp16=False,
+ device="cpu",
+ generator_seed=0,
+ expected_slice=np.array([0.9162, 0.9162, 0.9162, 0.9162, 0.9162, 0.9162, 0.9162, 0.9162, 0.9162]),
+ num_inference_steps=1,
+ processing_resolution=32,
+ ensemble_size=1,
+ batch_size=1,
+ match_input_resolution=True,
+ )
+
+ def test_marigold_intrinsics_einstein_f32_cuda_G0_S1_P768_E1_B1_M1(self):
+ self._test_marigold_intrinsics(
+ is_fp16=False,
+ device="cuda",
+ generator_seed=0,
+ expected_slice=np.array([0.62127, 0.61906, 0.61687, 0.61946, 0.61903, 0.61961, 0.61808, 0.62099, 0.62894]),
+ num_inference_steps=1,
+ processing_resolution=768,
+ ensemble_size=1,
+ batch_size=1,
+ match_input_resolution=True,
+ )
+
+ def test_marigold_intrinsics_einstein_f16_cuda_G0_S1_P768_E1_B1_M1(self):
+ self._test_marigold_intrinsics(
+ is_fp16=True,
+ device="cuda",
+ generator_seed=0,
+ expected_slice=np.array([0.62109, 0.61914, 0.61719, 0.61963, 0.61914, 0.61963, 0.61816, 0.62109, 0.62891]),
+ num_inference_steps=1,
+ processing_resolution=768,
+ ensemble_size=1,
+ batch_size=1,
+ match_input_resolution=True,
+ )
+
+ def test_marigold_intrinsics_einstein_f16_cuda_G2024_S1_P768_E1_B1_M1(self):
+ self._test_marigold_intrinsics(
+ is_fp16=True,
+ device="cuda",
+ generator_seed=2024,
+ expected_slice=np.array([0.64111, 0.63916, 0.63623, 0.63965, 0.63916, 0.63965, 0.6377, 0.64062, 0.64941]),
+ num_inference_steps=1,
+ processing_resolution=768,
+ ensemble_size=1,
+ batch_size=1,
+ match_input_resolution=True,
+ )
+
+ def test_marigold_intrinsics_einstein_f16_cuda_G0_S2_P768_E1_B1_M1(self):
+ self._test_marigold_intrinsics(
+ is_fp16=True,
+ device="cuda",
+ generator_seed=0,
+ expected_slice=np.array([0.60254, 0.60059, 0.59961, 0.60156, 0.60107, 0.60205, 0.60254, 0.60449, 0.61133]),
+ num_inference_steps=2,
+ processing_resolution=768,
+ ensemble_size=1,
+ batch_size=1,
+ match_input_resolution=True,
+ )
+
+ def test_marigold_intrinsics_einstein_f16_cuda_G0_S1_P512_E1_B1_M1(self):
+ self._test_marigold_intrinsics(
+ is_fp16=True,
+ device="cuda",
+ generator_seed=0,
+ expected_slice=np.array([0.64551, 0.64453, 0.64404, 0.64502, 0.64844, 0.65039, 0.64502, 0.65039, 0.65332]),
+ num_inference_steps=1,
+ processing_resolution=512,
+ ensemble_size=1,
+ batch_size=1,
+ match_input_resolution=True,
+ )
+
+ def test_marigold_intrinsics_einstein_f16_cuda_G0_S1_P768_E3_B1_M1(self):
+ self._test_marigold_intrinsics(
+ is_fp16=True,
+ device="cuda",
+ generator_seed=0,
+ expected_slice=np.array([0.61572, 0.61377, 0.61182, 0.61426, 0.61377, 0.61426, 0.61279, 0.61572, 0.62354]),
+ num_inference_steps=1,
+ processing_resolution=768,
+ ensemble_size=3,
+ ensembling_kwargs={"reduction": "mean"},
+ batch_size=1,
+ match_input_resolution=True,
+ )
+
+ def test_marigold_intrinsics_einstein_f16_cuda_G0_S1_P768_E4_B2_M1(self):
+ self._test_marigold_intrinsics(
+ is_fp16=True,
+ device="cuda",
+ generator_seed=0,
+ expected_slice=np.array([0.61914, 0.6167, 0.61475, 0.61719, 0.61719, 0.61768, 0.61572, 0.61914, 0.62695]),
+ num_inference_steps=1,
+ processing_resolution=768,
+ ensemble_size=4,
+ ensembling_kwargs={"reduction": "mean"},
+ batch_size=2,
+ match_input_resolution=True,
+ )
+
+ def test_marigold_intrinsics_einstein_f16_cuda_G0_S1_P512_E1_B1_M0(self):
+ self._test_marigold_intrinsics(
+ is_fp16=True,
+ device="cuda",
+ generator_seed=0,
+ expected_slice=np.array([0.65332, 0.64697, 0.64648, 0.64844, 0.64697, 0.64111, 0.64941, 0.64209, 0.65332]),
+ num_inference_steps=1,
+ processing_resolution=512,
+ ensemble_size=1,
+ batch_size=1,
+ match_input_resolution=False,
+ )
diff --git a/tests/pipelines/marigold/test_marigold_normals.py b/tests/pipelines/marigold/test_marigold_normals.py
index c86c600be8e5..bc2662196c38 100644
--- a/tests/pipelines/marigold/test_marigold_normals.py
+++ b/tests/pipelines/marigold/test_marigold_normals.py
@@ -1,5 +1,5 @@
-# Copyright 2024 Marigold authors, PRS ETH Zurich. All rights reserved.
-# Copyright 2024 The HuggingFace Team. All rights reserved.
+# Copyright 2023-2025 Marigold Team, ETH Zürich. All rights reserved.
+# Copyright 2024-2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -14,7 +14,7 @@
# limitations under the License.
# --------------------------------------------------------------------------
# More information and citation instructions are available on the
-# Marigold project website: https://marigoldmonodepth.github.io
+# Marigold project website: https://marigoldcomputervision.github.io
# --------------------------------------------------------------------------
import gc
import random