
Commit b3d10d6

toshas, yiyixuxu, and sayakpaul authored
[Pipeline] Marigold depth and normals estimation (#7847)
* implement marigold depth and normals pipelines in diffusers core
* remove bibtex
* remove deprecations
* remove save_memory argument
* remove validate_vae
* remove config output
* remove batch_size autodetection
* remove presets logic
  move default denoising_steps and processing_resolution into the model config
  make default ensemble_size 1
* remove no_grad
* add fp16 to the example usage
* implement is_matplotlib_available
  use is_matplotlib_available, is_scipy_available for conditional imports in the marigold depth pipeline
* move colormap, visualize_depth, and visualize_normals into export_utils.py
* make the denoising loop more lucid
  fix the outputs to always be 4d tensors or lists of pil images
  support a 4d input_image case
  attempt to support model_cpu_offload_seq
  move check_inputs into a separate function
  change default batch_size to 1, remove any logic to make it bigger implicitly
* style
* rename denoising_steps into num_inference_steps
* rename input_image into image
* rename input_latent into latents
* remove decode_image
  change decode_prediction to use the AutoencoderKL.decode method
* move clean_latent outside of progress_bar
* refactor marigold-reusable image processing bits into MarigoldImageProcessor class
* clean up the usage example docstring
* make ensemble functions members of the pipelines
* add early checks in check_inputs
  rename E into ensemble_size in depth ensembling
* fix vae_scale_factor computation
* better compatibility with torch.compile
  better variable naming
* move export_depth_to_png to export_utils
* remove encode_prediction
* improve visualize_depth and visualize_normals to accept multi-dimensional data and lists
  remove visualization functions from the pipelines
  move exporting depth as 16-bit PNGs functionality from the depth pipeline
  update example docstrings
* do not shortcut vae.config variables
* change all asserts to raise ValueError
* rename output_prediction_type to output_type
* better variable names
  clean up variable deletion code
* better variable names
* pass desc and leave kwargs into the diffusers progress_bar
  implement nested progress bar for images and steps loops
* implement scale_invariant and shift_invariant flags in the ensemble_depth function
  add scale_invariant and shift_invariant flags readout from the model config
  further refactor ensemble_depth
  support ensembling without alignment
  add ensemble_depth docstring
* fix generator device placement checks
* move encode_empty_text body into the pipeline call
* minor empty text encoding simplifications
* adjust pipelines' class docstrings to explain the added construction arguments
* improve the scipy failure condition
  add comments
  improve docstrings
  change the default use_full_z_range to True
* make input image values range check configurable in the preprocessor
  refactor load_image_canonical in preprocessor to reject unknown types and return the image in the expected 4D format of tensor and on right device
  support a list of everything as inputs to the pipeline, change type to PipelineImageInput
  implement a check that all input list elements have the same dimensions
  improve docstrings of pipeline outputs
  remove check_input pipeline argument
* remove forgotten print
* add prediction_type model config
* add uncertainty visualization into export utils
  fix NaN values in normals uncertainties
* change default of output_uncertainty to False
  better handle the case of an attempt to export or visualize none
* fix `output_uncertainty=False`
* remove kwargs
  fix check_inputs according to the new inputs of the pipeline
* rename prepare_latent into prepare_latents as in other pipelines
  annotate prepare_latents in normals pipeline with "Copied from"
  annotate encode_image in normals pipeline with "Copied from"
* move nested-capable `progress_bar` method into the pipelines
  revert the original `progress_bar` method in pipeline_utils
* minor message improvement
* fix cpu offloading
* move colormap, visualize_depth, export_depth_to_16bit_png, visualize_normals, visualize_uncertainty to marigold_image_processing.py
  update example docstrings
* fix missing comma
* change torch.FloatTensor to torch.Tensor
* fix importing of MarigoldImageProcessor
* fix vae offloading
  fix batched image encoding
  remove separate encode_image function and use vae.encode instead
* implement marigold's initial tests
  relax generator checks in line with other pipelines
  implement return_dict __call__ argument in line with other pipelines
* fix num_images computation
* remove MarigoldImageProcessor and outputs from import structure
  update tests
* update docstrings
* update init
* update
* style
* fix
* fix
* up
* up
* up
* add simple test
* up
* update expected np input/output to be channel last
* move expand_tensor_or_array into the MarigoldImageProcessor
* rewrite tests to follow conventions - hardcoded slices instead of image artifacts
  write more smoke tests
* add basic docs.
* add anton's contribution statement
* remove todos.
* fix assertion values for marigold depth slow tests
* fix assertion values for depth normals.
* remove print
* support AutoencoderTiny in the pipelines
* update documentation page
  add Available Pipelines section
  add Available Checkpoints section
  add warning about num_inference_steps
* fix missing import in docstring
  fix wrong value in visualize_depth docstring
* [doc] add marigold to pipelines overview
* [doc] add section "usage examples"
* fix an issue with latents check in the pipelines
* add "Frame-by-frame Video Processing with Consistency" section
* grammarly
* replace tables with images with css-styled images (blindly)
* style
* print
* fix the assertions.
* take from the github runner.
* take the slices from action artifacts
* style.
* update with the slices from the runner.
* remove unnecessary code blocks.
* Revert "[doc] add marigold to pipelines overview"
  This reverts commit a505165150afd8dab23c474d1a054ea505a56a5f.
* remove invitation for new modalities
* split out marigold usage examples
* doc cleanup

---------

Co-authored-by: yiyixuxu <[email protected]>
Co-authored-by: yiyixuxu <yixu310@gmail.com>
Co-authored-by: sayakpaul <[email protected]>
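A rough sketch of how the renamed arguments in the log above fit together after this commit. The checkpoint name, the example image URL, and the comments are illustrative assumptions, not verbatim from the diff:

```python
# Sketch of the post-rename call surface described in the log above:
# `image` (was input_image), `num_inference_steps` (was denoising_steps),
# `latents` (was input_latent), `output_type` (was output_prediction_type),
# `ensemble_size` (default 1), `output_uncertainty` (default False).
import torch
import diffusers
from diffusers.utils import load_image

# Checkpoint name is an assumption; see the PRS-ETH Hugging Face organization.
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

image = load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
out = pipe(
    image,
    num_inference_steps=4,
    ensemble_size=5,          # aggregated by ensemble_depth
    output_type="np",
    output_uncertainty=True,  # meaningful only with ensemble_size > 1
)

# Visualization helpers now live in MarigoldImageProcessor
depth_vis = pipe.image_processor.visualize_depth(out.prediction)
unc_vis = pipe.image_processor.visualize_uncertainty(out.uncertainty)
```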
1 parent b82f9f5 commit b3d10d6

File tree

15 files changed: +3569 −1 lines changed


docs/source/en/_toctree.yml

Lines changed: 5 additions & 1 deletion
```diff
@@ -93,6 +93,8 @@
       title: Trajectory Consistency Distillation-LoRA
     - local: using-diffusers/svd
       title: Stable Video Diffusion
+    - local: using-diffusers/marigold_usage
+      title: Marigold Computer Vision
     title: Specific pipeline examples
 - sections:
   - local: training/overview
@@ -295,6 +297,8 @@
       title: Latent Diffusion
     - local: api/pipelines/ledits_pp
       title: LEDITS++
+    - local: api/pipelines/marigold
+      title: Marigold
     - local: api/pipelines/panorama
       title: MultiDiffusion
     - local: api/pipelines/musicldm
@@ -445,4 +449,4 @@
       title: Video Processor
     title: Internal classes
   isExpanded: false
-  title: API
+  title: API
```
docs/source/en/api/pipelines/marigold.md (new file)

Lines changed: 76 additions & 0 deletions

@@ -0,0 +1,76 @@
<!--Copyright 2024 Marigold authors and The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Marigold Pipelines for Computer Vision Tasks

![marigold](https://marigoldmonodepth.github.io/images/teaser_collage_compressed.jpg)

Marigold was proposed in [Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation](https://huggingface.co/papers/2312.02145), a CVPR 2024 Oral paper by [Bingxin Ke](http://www.kebingxin.com/), [Anton Obukhov](https://www.obukhov.ai/), [Shengyu Huang](https://shengyuh.github.io/), [Nando Metzger](https://nandometzger.github.io/), [Rodrigo Caye Daudt](https://rcdaudt.github.io/), and [Konrad Schindler](https://scholar.google.com/citations?user=FZuNgqIAAAAJ&hl=en).
The idea is to repurpose the rich generative prior of text-to-image Latent Diffusion Models (LDMs) for traditional computer vision tasks.
Initially, this idea was explored to fine-tune Stable Diffusion for monocular depth estimation, as shown in the teaser above.
Later,
- [Tianfu Wang](https://tianfwang.github.io/) trained the first Latent Consistency Model (LCM) of Marigold, which unlocked fast single-step inference;
- [Kevin Qu](https://www.linkedin.com/in/kevin-qu-b3417621b/?locale=en_US) extended the approach to surface normals estimation;
- [Anton Obukhov](https://www.obukhov.ai/) contributed the pipelines and documentation to diffusers (enabled and supported by [YiYi Xu](https://yiyixuxu.github.io/) and [Sayak Paul](https://sayak.dev/)).

The abstract from the paper is:

*Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding, so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity, from relatively modest CNNs to large Transformer architectures. Still, monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout, since their knowledge of the visual world is restricted by the data seen during training, and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better, more generalizable depth estimation. We introduce Marigold, a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets, including over 20% performance gains in specific cases. Project page: https://marigoldmonodepth.github.io.*

## Available Pipelines

Each pipeline supports one computer vision task, which takes an RGB image as input and produces a *prediction* of the modality of interest, such as a depth map of the input image.
Currently, the following tasks are implemented (a minimal usage sketch follows the table):

| Pipeline | Predicted Modalities | Demos |
|----------|----------------------|:-----:|
| [MarigoldDepthPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_depth.py) | [Depth](https://en.wikipedia.org/wiki/Depth_map), [Disparity](https://en.wikipedia.org/wiki/Binocular_disparity) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-lcm), [Slow Original Demo (DDIM)](https://huggingface.co/spaces/prs-eth/marigold) |
| [MarigoldNormalsPipeline](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/marigold/pipeline_marigold_normals.py) | [Surface normals](https://en.wikipedia.org/wiki/Normal_mapping) | [Fast Demo (LCM)](https://huggingface.co/spaces/prs-eth/marigold-normals-lcm) |

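Both pipelines follow the standard diffusers loading pattern. Below is a minimal sketch of depth estimation; the checkpoint name and input image URL are illustrative assumptions, and the visualization and export helpers are the `MarigoldImageProcessor` methods exposed on the pipeline:

```python
# Minimal depth estimation sketch; checkpoint name and image URL are assumptions.
import torch
import diffusers
from diffusers.utils import load_image

pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
).to("cuda")

image = load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
depth = pipe(image)  # num_inference_steps is taken from the checkpoint config

# Colorize the affine-invariant prediction, then export raw values losslessly
vis = pipe.image_processor.visualize_depth(depth.prediction)
vis[0].save("einstein_depth.png")
png = pipe.image_processor.export_depth_to_16bit_png(depth.prediction)
png[0].save("einstein_depth_16bit.png")
```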
## Available Checkpoints

The original checkpoints can be found under the [PRS-ETH](https://huggingface.co/prs-eth/) Hugging Face organization.

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. Also, to learn more about reducing the memory usage of this pipeline, refer to the [Reduce memory usage](../../using-diffusers/svd#reduce-memory-usage) section of the Stable Video Diffusion guide.

</Tip>

<Tip warning={true}>

Marigold pipelines were designed and tested only with `DDIMScheduler` and `LCMScheduler`.
Depending on the scheduler, the number of inference steps required to get reliable predictions varies, and there is no universal value that works best across schedulers.
Because of that, the default value of `num_inference_steps` in the `__call__` method of the pipeline is set to `None` (see the API reference).
Unless set explicitly, its value is taken from the checkpoint configuration `model_index.json`.
This ensures high-quality predictions when calling the pipeline with just the `image` argument.

</Tip>

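For example, here is a sketch of overriding the checkpoint default with an explicit `num_inference_steps`, shown with the surface normals pipeline (the checkpoint name is an assumption):

```python
# Surface normals sketch with an explicit step count; checkpoint name assumed.
import torch
import diffusers
from diffusers.utils import load_image

pipe = diffusers.MarigoldNormalsPipeline.from_pretrained(
    "prs-eth/marigold-normals-lcm-v0-1", variant="fp16", torch_dtype=torch.float16
).to("cuda")

image = load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
normals = pipe(image, num_inference_steps=4)  # overrides the model_index.json default

vis = pipe.image_processor.visualize_normals(normals.prediction)
vis[0].save("einstein_normals.png")
```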
See also Marigold [usage examples](marigold_usage).

## MarigoldDepthPipeline
[[autodoc]] MarigoldDepthPipeline
	- all
	- __call__

## MarigoldNormalsPipeline
[[autodoc]] MarigoldNormalsPipeline
	- all
	- __call__

## MarigoldDepthOutput
[[autodoc]] pipelines.marigold.pipeline_marigold_depth.MarigoldDepthOutput

## MarigoldNormalsOutput
[[autodoc]] pipelines.marigold.pipeline_marigold_normals.MarigoldNormalsOutput
