Move IP Adapter Face ID to core #7186
Conversation
sayakpaul left a comment:
Refreshing change!
I'm getting an error when I try to use this.
You cannot use Face ID with SDXL; the current changes only affect the Stable Diffusion pipeline.
@jfischoff you can use it now, I also updated the example code |
@fabiorigano is this ready for a review? |
@yiyixuxu I have to add some checks on the inputs, but I would appreciate your feedback. thanks :) |
Since both the Face ID adapter and Face ID XL don't use an image encoder, I tested the multi-adapter feature by separately extracting and then concatenating the image embeddings of Face ID XL and another IP Adapter, Plus Face SDXL. Here is the code of the test:

```python
import cv2
import numpy as np
import torch
from insightface.app import FaceAnalysis
from diffusers.utils import load_image

# Create a SDXL pipeline
# ...

# Load sample images
image1 = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ai_face2.png")
image2 = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/women_input.png")

# Extract face features using insightface
ref_images = []
app = FaceAnalysis(name="buffalo_l", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))
for im in [image1, image2]:
    image = cv2.cvtColor(np.asarray(im), cv2.COLOR_BGR2RGB)
    faces = app.get(image)
    image = torch.from_numpy(faces[0].normed_embedding)
    ref_images.append(image.unsqueeze(0))
ref_images = torch.cat(ref_images, dim=0)

# Load the Face ID XL adapter into the pipeline
pipeline.load_ip_adapter(
    "h94/IP-Adapter-FaceID",
    subfolder=None,
    weight_name="ip-adapter-faceid_sdxl.bin",
    image_encoder_folder=None,
)

# Generate Face ID image embeddings and save them locally
image_embeds = pipeline.prepare_ip_adapter_image_embeds(
    ip_adapter_image=ref_images,
    ip_adapter_image_embeds=None,
    device="cuda",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
)
torch.save(image_embeds, "faceid_xl.ipadpt")

# Unload the IP adapter and LoRA
# ...

# Load the Plus Face SDXL adapter into the pipeline
pipeline.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name="ip-adapter-plus-face_sdxl_vit-h.safetensors",
)

# Generate Plus Face SDXL image embeddings and save them locally
ip_images = [[image1, image2]]
image_embeds = pipeline.prepare_ip_adapter_image_embeds(
    ip_adapter_image=ip_images,
    ip_adapter_image_embeds=None,
    device="cuda",
    num_images_per_prompt=1,
    do_classifier_free_guidance=True,
)
torch.save(image_embeds, "plus_face_xl.ipadpt")

# Unload the IP adapter
# ...

# Load both IP Adapters
pipeline.load_ip_adapter(
    ["h94/IP-Adapter", "h94/IP-Adapter-FaceID"],
    subfolder=["sdxl_models", None],
    weight_name=["ip-adapter-plus-face_sdxl_vit-h.safetensors", "ip-adapter-faceid_sdxl.bin"],
)
pipeline.set_ip_adapter_scale([0.7] * 2)

# Load image embeddings and run inference
generator = torch.Generator(device="cpu").manual_seed(42)
t1 = torch.load("plus_face_xl.ipadpt")
t2 = torch.load("faceid_xl.ipadpt")
t = [t1[0], t2[0]]
num_images = 1  # not defined in the original snippet; set here so the call runs
images = pipeline(
    prompt="A photo of a girl wearing a black dress, holding red roses in hand, upper body, behind is the Eiffel Tower",
    ip_adapter_image_embeds=t,
    guidance_scale=7.5,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=30,
    num_images_per_prompt=num_images,
    width=1024,
    height=1024,
    generator=generator,
).images
```
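The multi-adapter trick above boils down to passing one embedding tensor per adapter, each carrying an unconditioned half for classifier-free guidance. A minimal numpy sketch of that layout (the shapes and zero unconditioned half are illustrative assumptions, not the exact diffusers format):

```python
import numpy as np

# Illustrative shapes only; the exact diffusers layout may differ.
batch, dim_faceid, seq, dim_clip = 1, 512, 257, 1280

# Hypothetical stand-ins for the two saved .ipadpt tensors:
# Face ID embeds are one flat vector per face, Plus Face embeds are CLIP token sequences.
rng = np.random.default_rng(0)
faceid_cond = rng.standard_normal((batch, 1, dim_faceid))
plus_cond = rng.standard_normal((batch, seq, dim_clip))

# With classifier-free guidance, each tensor stacks an unconditioned half
# (zeros here) before the conditioned half.
def with_cfg(cond):
    return np.concatenate([np.zeros_like(cond), cond], axis=0)

# One list entry per adapter, in the same order the adapters were loaded.
image_embeds = [with_cfg(plus_cond), with_cfg(faceid_cond)]
```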
The good news is that we do not want to support [...]; also, to make it easier to test, can you upload the embeddings?
Can you combine Face ID with other IP-Adapter models? I thought it required its own attention processor.
@yiyixuxu I used PEFT to load the LoRA weights, so we don't need additional attention processors :) |
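What PEFT does here can be pictured as folding a low-rank update into the existing attention weights instead of swapping in a custom attention processor. A minimal numpy sketch (the shapes and scale are illustrative assumptions):

```python
import numpy as np

# Illustrative sizes; real attention projections are much larger.
d, r = 8, 2
rng = np.random.default_rng(0)
W = rng.standard_normal((d, d))   # base attention projection weight
A = rng.standard_normal((r, d))   # LoRA down-projection
B = rng.standard_normal((d, r))   # LoRA up-projection
scale = 1.0                       # matches adapter_weights=[1.0]

# Applying the adapter just adds a rank-r delta to the base weight,
# so the stock attention processor can be reused unchanged.
W_adapted = W + scale * (B @ A)
```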
I uploaded some tensors here: https://huggingface.co/datasets/fabiorigano/testing-images/tree/main

Some of my tests and the results (input image embeddings are computed from "https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/ai_face2.png"):

Face ID SD 1.5 only

```python
pipeline.load_ip_adapter("h94/IP-Adapter-FaceID", subfolder=None, weight_name="ip-adapter-faceid_sd15.bin", image_encoder_folder=None)
pipeline.set_ip_adapter_scale(0.6)
image_embeds = load_pt("https://huggingface.co/datasets/fabiorigano/testing-images/resolve/main/ai_face2.ipadpt")
images = pipeline(
    prompt="A photo of a girl wearing a black dress, holding red roses in hand, upper body, behind is the Eiffel Tower",
    ip_adapter_image_embeds=image_embeds,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=20, num_images_per_prompt=1, width=512, height=704,
    generator=torch.Generator(device="cpu").manual_seed(0)
).images
```

Plus Face SD 1.5 only

```python
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter-plus-face_sd15.bin")
pipeline.set_ip_adapter_scale(0.6)
image_embeds = load_pt("https://huggingface.co/datasets/fabiorigano/testing-images/resolve/main/clip_ai_face2.ipadpt")
images = pipeline(
    prompt="A photo of a girl wearing a black dress, holding red roses in hand, upper body, behind is the Eiffel Tower",
    ip_adapter_image_embeds=image_embeds,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=20, num_images_per_prompt=1, width=512, height=704,
    generator=torch.Generator(device="cpu").manual_seed(0)
).images
```

Plus Face SD 1.5 + Face ID SD 1.5

```python
pipeline.load_ip_adapter(["h94/IP-Adapter", "h94/IP-Adapter-FaceID"], subfolder=["models", None], weight_name=["ip-adapter-plus-face_sd15.safetensors", "ip-adapter-faceid_sd15.bin"])
pipeline.set_ip_adapter_scale([0.5, 0.5])
t1 = load_pt("https://huggingface.co/datasets/fabiorigano/testing-images/resolve/main/clip_ai_face2.ipadpt")
t2 = load_pt("https://huggingface.co/datasets/fabiorigano/testing-images/resolve/main/ai_face2.ipadpt")
image_embeds = [t1[0], t2[0]]
images = pipeline(
    prompt="A photo of a girl wearing a black dress, holding red roses in hand, upper body, behind is the Eiffel Tower",
    ip_adapter_image_embeds=image_embeds,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=20, num_images_per_prompt=1, width=512, height=704,
    generator=torch.Generator(device="cpu").manual_seed(0)
).images
```
@yiyixuxu it is ready for review
yiyixuxu left a comment:
thanks! I left some comments and questions.
src/diffusers/loaders/ip_adapter.py
Outdated
```diff
  logger.warning(
-     "image_encoder is not loaded since `image_encoder_folder=None` passed. You will not be able to use `ip_adapter_image` when calling the pipeline with IP-Adapter."
-     "Use `ip_adapter_image_embeds` to pass pre-generated image embedding instead."
+     "image_encoder is not loaded since `image_encoder_folder=None` passed. `ip_adapter_image` is allowed only if you are loading an IP-Adapter Face ID model."
```
A little bit confused here - I thought it was the opposite, i.e. we do not allow using `ip_adapter_image` with the Face ID model.
Conceptually, Face ID embeddings are image embeddings, but the tensor as-is doesn't have the unconditioned part, so `encode_image` updates it into the expected format.
Do you think it is better to leave this to the user?
Yes - let's make it clear in the docs how to create the `ip_adapter_image_embeds` for Face ID.
src/diffusers/loaders/unet.py
Outdated
```python
        ]
    }
)
key_id += 1
```
Is there more than one Face ID checkpoint right now? Does it make sense for us to support more than one?
Face ID and Face ID XL are both supported by this PR
Face ID Plus models have different image projection layers
```python
"""Forward pass.

Args:
----
    id_embeds (torch.Tensor): Input Tensor (ID embeds).

Returns:
-------
    torch.Tensor: Output Tensor.
"""
```
I think this needs to follow our doc-string format?
OK, I will update it (and also `IPAdapterPlusImageProjection`, for code consistency).
src/diffusers/models/embeddings.py
Outdated
```python
nn.LayerNorm(embed_dims),
nn.LayerNorm(embed_dims),
Attention(
    query_dim=embed_dims,
    dim_head=dim_head,
    heads=heads,
    out_bias=False,
),
nn.Sequential(
    nn.LayerNorm(embed_dims),
    FeedForward(embed_dims, embed_dims, activation_fn="gelu", mult=ffn_ratio, bias=False),
),
```
I don't have strong opinions here, but perhaps we could create a small block consisting of these layers and use that block here instead. Then

```python
for ln0, ln1, attn, ff in self.layers:
    residual = latents
    encoder_hidden_states = ln0(x)
    latents = ln1(latents)
    encoder_hidden_states = torch.cat([encoder_hidden_states, latents], dim=-2)
    latents = attn(latents, encoder_hidden_states) + residual
    latents = ff(latents) + latents
```

could become:

```python
for block in self.blocks:
    ...
```

If the checkpoint needs to be rejigged to match this structure, we could have a load state dict hook to deal with the modifications. But I would wait for @yiyixuxu to comment further before making any changes.
nice but I don't think it is a big deal
if it requires a lot of effort from @fabiorigano I don't think it's worth it
Yeah totally fine by me. It was just a suggestion.
Hi, I added `IPAdapterPlusImageProjectionBlock`, let me know if it works for you.
src/diffusers/loaders/ip_adapter.py
Outdated
```diff
  # load ip-adapter into unet
  unet = getattr(self, self.unet_name) if not hasattr(self, "unet") else self.unet
- unet._load_ip_adapter_weights(state_dicts, low_cpu_mem_usage=low_cpu_mem_usage)
+ extra_loras = unet._load_ip_adapter_weights(state_dicts, low_cpu_mem_usage=low_cpu_mem_usage)
```
Interesting. To reduce the maintenance burden and to promote better readability, perhaps we could separate out the LoRA-related code from _load_ip_adapter_weights()?
sayakpaul left a comment:
Thank you! Left a couple of comments.
src/diffusers/loaders/ip_adapter.py
Outdated
```python
extra_loras = unet._load_ip_adapter_loras(state_dicts)
if extra_loras != {}:
    # apply the IP Adapter Face ID LoRA weights
    peft_config = getattr(unet, "peft_config", {})
    for k, lora in extra_loras.items():
        if f"faceid_{k}" not in peft_config:
            self.load_lora_weights(lora, adapter_name=f"faceid_{k}")
            self.set_adapters([f"faceid_{k}"], adapter_weights=[1.0])
```
Sleek!
src/diffusers/loaders/unet.py
Outdated
```python
    heads=heads,
    id_embeddings_dim=id_embeddings_dim,
)
print(state_dict.keys())
```
Needs to go away.
src/diffusers/loaders/unet.py
Outdated
```python
print(updated_state_dict.keys())
print(image_projection.state_dict().keys())
```
Needs to go away.
```python
max_diff = numpy_cosine_similarity_distance(image_slice, expected_slice)
assert max_diff < 5e-4

def test_text_to_image_face_id(self):
```
Can we add a fast test as well?
Yes
The PR is looking quite nice to me. Thanks a lot for working on it. Also, do we need to add a check like the one in `src/diffusers/loaders/lora.py` (line 108 in cf6e040):

```python
if not USE_PEFT_BACKEND:
```

when there's a call to use the IP Adapter Face ID weights?
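The guard being asked for could follow the same pattern; a minimal sketch, where the locally defined flag and helper stand in for diffusers' real `USE_PEFT_BACKEND` check (the flag value is an assumption for illustration):

```python
# Stand-in for diffusers' USE_PEFT_BACKEND flag; False here to show the error path.
USE_PEFT_BACKEND = False

def require_peft_backend():
    # Fail early if Face ID LoRA weights cannot be applied without the PEFT backend.
    if not USE_PEFT_BACKEND:
        raise ValueError("PEFT backend is required to load IP-Adapter Face ID LoRA weights.")

try:
    require_peft_backend()
    raised = False
except ValueError:
    raised = True
```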
I will defer to @yiyixuxu to merge this. I would just run the concerned slow tests on our CI infrastructure as well to ensure nothing's breaking. @yiyixuxu could you do that before merging?
I will add it
great work as always! thanks a lot :) @fabiorigano |
* Switch to peft and multi proj layers
* Move Face ID loading and inference to core

Co-authored-by: Sayak Paul <[email protected]>



What does this PR do?
Fixes #7014 #6935
@yiyixuxu @sayakpaul
Create face embeddings
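The face embeddings for these adapters come from insightface's recognition model; `normed_embedding` is simply the 512-d feature vector divided by its L2 norm. A numpy sketch of that property (the random vector is a stand-in for a real face feature, not the insightface API):

```python
import numpy as np

# Hypothetical raw recognition feature; insightface's buffalo_l model outputs 512-d vectors.
rng = np.random.default_rng(0)
raw = rng.standard_normal(512)

# faces[0].normed_embedding corresponds to the feature scaled to unit L2 norm.
normed_embedding = raw / np.linalg.norm(raw)
```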
IP Adapter Face ID (SD 1.5)
IP Adapter Face ID XL (SDXL)