
@sayakpaul (Member) commented Nov 26, 2025

What does this PR do?

Test code:

from diffusers import Flux2Pipeline
import torch

pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

prompt = "a photo of a forest with mist swirling around the tree trunks. The word 'FLUX.2' is painted over it in big, red brush strokes with visible texture"
image = pipe(
    prompt=prompt,
    height=768,
    width=1360,
    generator=torch.Generator(device="cuda").manual_seed(42),
    # Passing a temperature (defaults to None) enables prompt upsampling.
    caption_upsample_temperature=0.15,
).images[0]
image.save("upsampling.png")
print(f"{torch.cuda.max_memory_reserved() / (1024 ** 3)=}")
Generated upsampled prompt:
A serene and atmospheric forest scene, captured in a high-resolution photograph, showcases towering, ancient trees with thick, gnarled trunks and sprawling branches that create a dense canopy overhead. The forest floor is carpeted with a lush layer of moss, ferns, and fallen leaves, adding a sense of depth and texture to the image. Mist swirls gently around the tree trunks, creating a dreamy, ethereal atmosphere. The mist is illuminated by soft, diffused light that filters through the canopy, casting dappled shadows and highlighting the intricate details of the bark and foliage. The overall color palette is muted and natural, with shades of green, brown, and gray dominating the scene. In the foreground, the word "FLUX.2" is painted in bold, red brush strokes with visible texture, standing out against the natural backdrop. The paint appears wet and glossy, with visible brushstrokes and subtle drips, adding a dynamic and artistic element to the image. The text is centrally placed and slightly tilted, drawing the viewer's eye and adding a sense of movement to the otherwise tranquil scene.

Output

[Side-by-side comparison images: left, output without prompt upsampling; right, output with prompt upsampling.]

Notes

  • I decided to create a system_messages.py script under src/diffusers/pipelines/flux2 so that other pipelines derived from Flux2 can easily use it.
  • If caption_upsample_temperature is set (it defaults to None), we perform prompt upsampling; see the sketch after this list.
  • The image processor changes are there to accommodate this method as it exists in the original codebase. Open to other suggestions, of course, on how best to accommodate them.
  • The original codebase implements an OpenRouter client with a larger Pixtral model for doing prompt upsampling remotely. I think we can do that through Inference Endpoints? (cc: @apolinario @ariG23498)
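
A minimal sketch of the gating described in the second bullet, as it could look inside the pipeline; maybe_upsample_prompt and _upsample_prompt are hypothetical names for illustration, not the actual implementation:

def maybe_upsample_prompt(self, prompt: str, caption_upsample_temperature=None) -> str:
    # Hypothetical sketch: the default (None) leaves the prompt untouched.
    if caption_upsample_temperature is None:
        return prompt
    # Otherwise, sample an expanded caption at the requested temperature.
    # `_upsample_prompt` is an assumed helper, not the merged API.
    return self._upsample_prompt(prompt, temperature=caption_upsample_temperature)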

@sayakpaul requested review from dg845 and yiyixuxu, November 26, 2025 06:59
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -0,0 +1,29 @@
"""
These system prompts come from:
@sayakpaul (Member Author) commented:

As discussed internally, the injected newline characters degrade output quality a bit. Hence, I have decided to keep these system messages identical to the original implementation linked above.

If we run make style && make quality, this formatting will be completely destroyed. We could change pyproject.toml to exclude this path from formatting (a sketch of such an exclusion follows). But before we do that, let's see if this is the best we have.
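
A minimal sketch of such an exclusion, assuming the formatter is ruff configured in pyproject.toml (the exact table name depends on the repo's existing config):

[tool.ruff]
# Hypothetical: keep the verbatim system messages out of formatting/linting.
extend-exclude = ["src/diffusers/pipelines/flux2/system_messages.py"]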

@yiyixuxu (Collaborator) left a comment:

Thanks a lot for working on this! I left some feedback!


@staticmethod
def _resize_to_target_area(image: PIL.Image.Image, target_area: int = 1024 * 1024) -> Tuple[int, int]:
def _resize_to_target_area(
@yiyixuxu (Collaborator) commented:

Ohh, do you want to add a new method called something like _resize_if_exceeds_area? Or rename this one, if we only use it this way?

@sayakpaul (Member Author) replied:

Yup.

I created _resize_if_exceeds_area() which is basically:

def _resize_if_exceeds_area(image: PIL.Image.Image, target_area: int = 1024 * 1024) -> PIL.Image.Image:
    # Images at or below the pixel budget pass through untouched.
    image_width, image_height = image.size
    pixel_count = image_width * image_height
    if pixel_count <= target_area:
        return image
    # Otherwise, downscale toward the target area.
    return Flux2ImageProcessor._resize_to_target_area(image, target_area)
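
For example (hypothetical usage; assumes the method lives on Flux2ImageProcessor as a staticmethod like its sibling):

# Hypothetical: images at or below ~1 MP are returned unchanged.
processed = [Flux2ImageProcessor._resize_if_exceeds_area(img) for img in images]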


# Adapted from
# https://github.com/black-forest-labs/flux2/blob/5a5d316b1b42f6b59a8c9194b77c8256be848432/src/flux2/text_encoder.py#L49C5-L66C19
def _validate_and_process_images(
@yiyixuxu (Collaborator) commented:

Can we have a separate step to validate and process the images, and then run format_input?

@sayakpaul (Member Author) replied:

Yes. We now first run _validate_and_process_images() and then pass the resulting images to format_input(), roughly as sketched below.
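
A minimal sketch of that two-step flow; the variable names and exact signatures are assumptions for illustration, not the merged code:

# Hypothetical: validate/resize the reference images first ...
images = processor._validate_and_process_images(images)
# ... then build the text-encoder input from the processed images.
inputs = processor.format_input(prompt=prompt, images=images)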

@ariG23498 (Contributor) commented Nov 27, 2025

@sayakpaul this is really nice. Do you want me to start working on an Endpoint? We can take this conversation to a private Slack and see how this works.

Update:

After inspecting, it turns out that the model needs upwards of 300 GB to run. From the official model card:

note that running this model on GPU requires over 300 GB of GPU RAM

Building a free Inference Endpoint does not seem feasible for me and @sayakpaul, hence we are benching this project. Another option would be to route through Inference Providers, but we have not seen a need (other than this specific one) to have our providers host this model.
