clip-guided stable diffusion correctness #596

@keturn

Describe the bug

The CLIP-guided pipeline uses these text embeddings:

# perform clip guidance
if clip_guidance_scale > 0:
    text_embeddings_for_guidance = (
        text_embeddings.chunk(2)[0] if do_classifier_free_guidance else text_embeddings
    )

defined earlier as:

# For classifier free guidance, we need to do two forward passes.
# Here we concatenate the unconditional and text embeddings into a single batch
# to avoid doing two forward passes
text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

which I read as using the unconditional (i.e. null-prompt) embeddings, since chunk(2)[0] takes the first half of that concatenated batch.
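To check that reading, here is a minimal sketch with stand-in tensors. The shapes and the fill values (zeros for the null prompt, ones for the text prompt) are placeholders chosen for illustration and are not taken from the pipeline. Only the concatenation order and the chunk(2) call mirror the quoted code.

```python
import torch

# Placeholder tensors (hypothetical shapes and values, for illustration only).
uncond_embeddings = torch.zeros(1, 4, 8)  # stands in for the null-prompt embeddings
prompt_embeddings = torch.ones(1, 4, 8)   # stands in for the text-prompt embeddings

# Same order as the quoted pipeline code: unconditional first, then text.
text_embeddings = torch.cat([uncond_embeddings, prompt_embeddings])

# chunk(2) splits along dim 0 into two equal halves.
first_half, second_half = text_embeddings.chunk(2)

picks_uncond = torch.equal(first_half, uncond_embeddings)
picks_prompt = torch.equal(second_half, prompt_embeddings)
print(picks_uncond, picks_prompt)
```

With this ordering, index 0 of the chunked batch is the unconditional half and index 1 is the text-prompt half.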

Is that the way it's supposed to work? That doesn't feel like how it's supposed to work. If normal classifier-free guidance is turned off, the same code uses the embeddings from the text prompt, not the nulls, so the two branches disagree about which embeddings guide CLIP.

But this pipeline was added without any tests or samples or other reference material, so I really don't know.
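If the text-prompt half is what was intended (an assumption on my part, not something the pipeline documents), one possible change is to take index 1 instead of index 0. The helper name select_guidance_embeddings below is hypothetical and exists only for this sketch. The assertions at the end are the kind of small check that would pin the behavior down either way.

```python
import torch

def select_guidance_embeddings(text_embeddings, do_classifier_free_guidance):
    """Hypothetical helper: choose the embeddings that CLIP guidance targets."""
    if do_classifier_free_guidance:
        # Batch layout is [unconditional, text], so index 1 is the prompt half.
        return text_embeddings.chunk(2)[1]
    # Without classifier-free guidance the batch holds only the prompt embeddings.
    return text_embeddings

# Placeholder tensors (hypothetical shapes and values, for illustration only).
uncond = torch.zeros(1, 4, 8)
prompt = torch.ones(1, 4, 8)
batched = torch.cat([uncond, prompt])

with_cfg = select_guidance_embeddings(batched, True)
without_cfg = select_guidance_embeddings(prompt, False)

# Both branches should now agree on the prompt embeddings.
assert torch.equal(with_cfg, prompt)
assert torch.equal(without_cfg, prompt)
```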

Reproduction

No response

Logs

No response

System Info

👀
