@JPPhoto
Contributor

@JPPhoto JPPhoto commented Mar 8, 2023

When the user picks a strength of 1 for img2img or inpainting, the current code slightly mixes in the initial image with random noise. Strength 1 should be a complete replacement and should yield the same results as txt2img in the img2img case.

@JPPhoto
Contributor Author

JPPhoto commented Mar 8, 2023

(Note that txt2img technically mixes in all-zeros for its initial latent, which is what we now do for img2img and inpainting when strength is 1.)

@JPPhoto JPPhoto marked this pull request as ready for review March 8, 2023 17:16
@JPPhoto JPPhoto changed the title from "Make strength 1 behave the same as txt2img" to "Make img2img strength 1 behave the same as txt2img" Mar 8, 2023
Contributor

@damian0815 damian0815 left a comment

looks good

@damian0815 damian0815 merged commit 2db180d into main Mar 8, 2023
@damian0815 damian0815 deleted the JPPhoto-fix-strength-1 branch March 8, 2023 21:50
@keturn
Contributor

keturn commented Mar 8, 2023

This is where I have to eat my words about "Inpaint Replace," as you've disproven my hypothesis that adding noise at full strength is effectively equivalent to overwriting the input entirely.

This is because no matter how much noise we're adding, it's still a Gaussian distribution we are adding to what was there before, so if the input image is significantly far away from the default case (latent (0, 0, 0, 0)), you'll be able to tell the noise is centered around that instead of zero.
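
A small numeric sketch of this point, using plain `input + noise` as described in the comment. The function name, sample count, and values are illustrative assumptions, not InvokeAI code. However wide the Gaussian is, the noised samples stay centered on the input value rather than on zero:

```python
import random
import statistics


def noised_mean(input_value, noise_std, samples=20_000, seed=0):
    """Average of input_value + N(0, noise_std^2) over many draws."""
    rng = random.Random(seed)
    return statistics.fmean(
        input_value + rng.gauss(0.0, noise_std) for _ in range(samples)
    )


# Same seed, same noise draws: only the input value differs.
far_from_zero = noised_mean(input_value=3.0, noise_std=5.0)
at_zero = noised_mean(input_value=0.0, noise_std=5.0)
print(f"input 3.0 -> mean {far_from_zero:.2f}; input 0.0 -> mean {at_zero:.2f}")
```

The two means differ by the input offset, which is the leftover signal the comment says you can still detect.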

The proposal you have here is to abruptly change the behavior of the function as soon as the strength hits 1.0. I'm not a fan of the way that introduces a big discontinuity between 0.999 and 1.0.

[Oops, this has been merged before I finished my thoughts. Oh well. Now it's my turn to tell JP "omg something just changed the mechanics of main!" 😉]

I'm assuming the use case here is "I want to inpaint the masked area and use only the prompt without any consideration of what was underneath at all."

I think the most straightforward way to do that in terms of the tools the canvas already has would be to use the eraser, not the mask layer, setting infill=solid.

In fact, if that works, I think that's an argument against this (now already-merged) proposal, as it makes the two tools equivalent but only under a specific condition, yet not under things that are nearly-but-not-quite that condition.

If we need more finesse over this control, I can imagine experimenting with alternatives like nudging the noise distribution so that when we do input + noise, the noise is more likely to nudge the sum toward zero (e.g. if the input is high, the noise is likely negative, instead of being balanced +/-). But having thought about it a bit, I'm not sure that would have any advantage over an eraser that can do partial transparency.

(There would be some difference, because the eraser tool operates in pixel-space and the img2img noising is in latent space, but I don't know if that would be an important difference.)

@JPPhoto
Contributor Author

JPPhoto commented Mar 9, 2023

@keturn I appreciate your comments, so first thanks for taking the time to write all of that.

I operated under the assumption (based on what we all discussed) that in UI-land 1.0 would replace inpaint replace, i.e. a complete replacement of the erased region. That's why 0.99 was removed as the upper limit and we simplified the interface to no longer have inpaint replace as an option. I think this is a great UI adjustment, but the problem (as you identified) is that we didn't actually implement inpaint replace at 1.0.

That's what I set out to do. With the new code, img2img with strength 1.0 completely replaces the underlying image with latent noise based on the seed you choose. This now works for img2img, inpainting, masking, you name it. It is a true inpaint replace.

Extracting this, or reverting it, and moving it to a "noise" infill method would probably leave people pretty confused. "Why would I want to replace the selected region with noise?" A bigger issue at the moment is that all of the infill methods operate in image space, not latent space. There is no real way to get an infill method to do a true replacement today.

So, if (as you're suggesting) 1.0 is not a complete replacement of the image (img2img) or mask/erased region (inpainting), what does 1.0 actually mean? (And while we're here, I would also argue that we're working with discrete steps and it's a bad UX to offer up strength that isn't directly tied to steps. Why can I set it to 0.98 when I have 30 steps? Do I expect that to look different than 0.97? This is precisely why I didn't want a floating point value as the symmetry point in the UI even if the CLI supports it.)

As for the partial transparency eraser, at least as I understand what you wrote, would that not be equivalent to masking with a reduced strength? If not, would you blend the actual pixels in that region with the selected infill method at the desired transparency level and then use strength to set the schedule and noise? This seems very complicated and I'm not sure how beneficial it would be.

@psychedelicious
Contributor

I agree with the intention of this PR.

But, can we please not have magic values? If I (as a UI) want to do a true inpaint replace, let me be explicit about it. Having different code paths for <1 and 1 is really not fun to reason about and design a functional UI interface for. The change in this PR is a disservice to "UI-land".

Let's have a flag to enable this code path, which uses true noise. Don't make the end user have to remember that 1 isn't actually 1. Provide an explicit API and let the UI expose it in a human-friendly way to the end user.

@psychedelicious
Contributor

Also, this PR includes a change that causes the bug in #2931

ElrikUnderlake pushed a commit to ElrikUnderlake/InvokeAI that referenced this pull request Mar 12, 2023
* Fix img2img and inpainting code so a strength of 1 behaves the same as txt2img.

* Make generated images identical to their txt2img counterparts when strength is 1.
ElrikUnderlake pushed a commit to ElrikUnderlake/InvokeAI that referenced this pull request Mar 13, 2023
* Fix img2img and inpainting code so a strength of 1 behaves the same as txt2img.

* Make generated images identical to their txt2img counterparts when strength is 1.
@JPPhoto
Contributor Author

JPPhoto commented Mar 13, 2023

> But, can we please not have magic values? If I (as a UI) want to do a true inpaint replace, let me be explicit about it. Having different code paths for <1 and 1 is really not fun to reason about and design a functional UI interface for. The change in this PR is a disservice to "UI-land".

So what does "1" mean? If you want a separate UI element, then we really shouldn't allow "1" at all as it holds no meaning, and setting it to "1" should throw an exception as it's not valid.

@JPPhoto
Contributor Author

JPPhoto commented Mar 13, 2023

> Also, this PR includes a change that causes the bug in #2931

Once I get some clarification about what "1" means, I can look into that. I suspect we were relying on special behavior from img2img (talk about magic) and "fixing" this exposed it. A layer above the img2img/inpainting code should explicitly randomize the seed.

@psychedelicious
Contributor

> So what does "1" mean? If you want a separate UI element, then we really shouldn't allow "1" at all as it holds no meaning, and setting it to "1" should throw an exception as it's not valid.

It doesn't matter what it means (and I won't pretend to really understand what it means on a deep technical level).

There are two contexts in which I am disagreeing with this PR.

The first context is general and perhaps philosophical. The change in this PR means a change in a particular dimension (the strength parameter) has a direct effect in some other very different dimension (the logic and code path used). That's a "magic value". It is inherently counterintuitive.

This PR is going deep enough that no consumer of the InvokeAI API can reasonably bypass the change. That is exactly the level at which we should not be using magic values. Let the user-facing parts of the code handle that - that is what UI is for. Provide an explicit API so the UI can wrap that up in a bow for the user.

The second context is that of the specific change at hand.

Based on the discussion in this thread, it sounds like 1 is a valid value. The math works and images are generated, and an image generated with 0.999... is different from an image with 1. This suggests that the "meaning" of 1 is something that is not easily expressible without math. That's fine. This is a complicated mathematical system. But 1 is not invalid.

If 1 is truly invalid and the math is breaking down in such a way that it just happens to produce something, ok, no worries. We can just prevent the user from setting it to 1. I had set it like this originally (max was 0.99 IIRC) and I guess it was changed at some point.

@JPPhoto
Contributor Author

JPPhoto commented Mar 13, 2023

I think of this as the same as linear interpolation, where a value of 0 means starting with "all initial image" and a value of 1 means "all noise". Perhaps that interpretation is wrong.

I understand that you are philosophically upset about the discontinuity at 1, and you feel that mathematically it's dishonest to have a value of 1 that no longer includes the underlying latents derived from the image. Regardless of the math, my interpretation of the user's intent is that they want to replace the image/inpainting region completely with noise and not preserve any characteristics of the underlying image.

We removed inpaint replace and at that time set the maximum from 0.99... to 1. Whatever the reason, when we did that, we left no alternative for pure noise generation to replace an area. So that has to go in, IMO, if the particular change to generate pure noise here is undone.

I also think we should think about whether 0.9999 makes sense mathematically when we're using 50, 100, 200 discrete steps. Does the scheduler take that into account? Should we quantize strength based on the number of steps? I don't have a good answer and I didn't dig deeply enough.

@keturn
Contributor

keturn commented Mar 13, 2023

The strength is the portion of the diffusion process that runs on the image.

time=0 is when the image is pristine, time=1000 is when it's as noisy as it gets. A linear schedule (which is what we usually use) for inference steps=8 looks like this:

  • step 1=time 875
  • step 2=time 750
  • step 3=time 625
  • step 4=time 500
  • step 5=time 375
  • step 6=time 250
  • step 7=time 125
  • step 8=time 0

There are two ways of talking about this: one is the length of time the image has been diffusing for, the other puts a number on how noisy it is and calls it 𝜎. Those two things are directly correlated and you may see a scheduler talking about one or the other, depending on how the author formulated the problem.

The "strength" is applied in a straightforward manner by taking that fraction of the timesteps: strength=1/4 takes the last quarter of those, steps 7 and 8. strength=7/8 would be steps 2–8.

strength=8/8 (aka 1.0) is all eight steps.
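
The mapping described above can be sketched in a few lines. This is a minimal illustration that assumes an evenly spaced schedule over 1000 training timesteps, matching the eight-step list; the function names are hypothetical, and real schedulers may offset the timesteps slightly.

```python
def linear_timesteps(num_inference_steps, num_train_timesteps=1000):
    """Evenly spaced timesteps, noisiest first (8 steps -> 875, 750, ..., 0)."""
    stride = num_train_timesteps // num_inference_steps
    return [stride * i for i in reversed(range(num_inference_steps))]


def timesteps_for_strength(strength, num_inference_steps, num_train_timesteps=1000):
    """Keep only the final `strength` fraction of the schedule."""
    schedule = linear_timesteps(num_inference_steps, num_train_timesteps)
    steps_to_run = min(int(num_inference_steps * strength), num_inference_steps)
    return schedule[num_inference_steps - steps_to_run:]


print(timesteps_for_strength(1 / 4, 8))  # steps 7 and 8 of the list above
print(timesteps_for_strength(7 / 8, 8))  # steps 2 through 8
print(timesteps_for_strength(1.0, 8))    # all eight steps
```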

The steps at higher time values can be very important to determining the broad features of the composition, so making sure you're including that first step is important if you're using the input to set the average tone of the image but need the composition to come from the prompt.

I agree that this is a pretty crummy interface, as the caller can't be certain if different strength values will produce different results unless they are also attending to the step count and reverse-engineering the math. It also severely limits the precision based on step count.

I'd prefer setting time_max and the number of steps to actually run, but that'll be a major departure from both the current UI and any code built on top of the existing Img2Img pipelines.

@JPPhoto
Contributor Author

JPPhoto commented Mar 13, 2023

> The "strength" is applied in a straightforward manner by taking that fraction of the timesteps: strength=1/4 takes the last quarter of those, steps 7 and 8. strength=7/8 would be steps 2–8.

To clarify, we allow a floating point value of 0.9 or 0.99, but that ultimately gets quantized/mapped onto a specific number of steps. 0.9999 is the same as 0.9 is the same as 0.875 in your example above with 8 steps, which means that 7/8 of the steps will be noised. 0.874 gets 6/8. Quick testing shows that to be the case, with no difference in output images for those first 3 values and a big jump with 0.874.

This means that we're already presenting it to the user as something it isn't, and I don't like that, either. A good compromise that doesn't affect the UX at all would be to have some kind of interface hint that does the quick calculation to tell the user that their run will be for floor(strength * total_steps) steps.
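
A sketch of what such a hint could compute, assuming the `floor(strength * total_steps)` rule stated above. The helper names and the hint wording are made up for illustration and are not part of the InvokeAI UI.

```python
import math


def effective_steps(strength, total_steps):
    """Number of denoising steps a given strength actually buys."""
    return min(math.floor(strength * total_steps), total_steps)


def strength_hint(strength, total_steps):
    """Human-readable hint a UI could show next to the strength slider."""
    steps = effective_steps(strength, total_steps)
    return f"strength {strength} runs {steps} of {total_steps} steps"


for value in (0.9999, 0.9, 0.875, 0.874):
    print(strength_hint(value, 8))
```

With 8 steps, the first three values land on the same step count and only 0.874 drops to the next one down, which matches the jump seen in the quick test.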

> strength=8/8 (aka 1.0) is all eight steps.

So, does that mean that we should use pure noise or let it be influenced in some way by the init_image?

@psychedelicious
Contributor

psychedelicious commented Mar 13, 2023

Thanks for the explanation y'all.

If I understand correctly, here is what I would like to see:

  • The API accepts a specific step count and transforms that according to the schedule. By API, I mean the thing this PR changes. Maybe it also accepts the strength value but I don't think it should.
  • The UIs make this user friendly by accepting the totally confusing and misleading but socially normalized "strength" value.
  • The UIs also provide a way to set the steps explicitly. In the web UI, there would be a toggle that changes the strength slider to a steps slider.

(The internals of the UI need to know how many true steps are used (because some samplers do not use the number of steps selected directly) in order to dynamically change the scale of the img2img steps.)

  • Both API and UI provide a flag for "pure noise" or "inpaint replace" where appropriate - I guess this only makes sense in the context of inpainting, because pure noise on img2img would be the same as txt2img, right?
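
One possible shape for the proposal in this list, as a hedged sketch only. `Img2ImgRequest`, `request_from_strength`, and every field name are hypothetical and do not exist in InvokeAI. The low-level request takes an explicit step count and an explicit pure-noise flag, and a UI-side helper translates the familiar strength value into steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Img2ImgRequest:
    """Hypothetical low-level request: explicit steps, explicit noise flag."""
    total_steps: int
    steps_to_run: int
    pure_noise: bool = False  # opt-in "inpaint replace"; never implied by a value

    def __post_init__(self):
        if not 0 < self.steps_to_run <= self.total_steps:
            raise ValueError("steps_to_run must be in 1..total_steps")


def request_from_strength(strength, total_steps, pure_noise=False):
    """UI-layer helper: map a 0..1 strength onto an explicit step count."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Clamping to at least one step is an assumption of this sketch.
    steps_to_run = max(1, min(int(strength * total_steps), total_steps))
    return Img2ImgRequest(total_steps, steps_to_run, pure_noise)


print(request_from_strength(0.9, 8))
print(request_from_strength(1.0, 8, pure_noise=True))
```

In this shape, a strength of 1.0 only means "run every step"; replacing the input with pure noise happens solely when the caller sets the flag.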

(I would also love to expose more of the scheduling to the user but that's a different topic)

Let's not think of "strength" as the "traditional" or "established" way to interface with step selection. This is a brand new, rapidly evolving technology, and the current API or parameter set was decided by a bunch of mathematicians who made some debatable decisions on how to allow people to use the system. We can improve this and help to set a better precedent.

@JPPhoto
Contributor Author

JPPhoto commented Mar 13, 2023

> The UIs also provide a way to set the steps explicitly. In the web UI, there would be a toggle that changes the strength slider to a steps slider.

Sure, that could work to replace what's there now - like what the UI currently does for symmetry:

[image: screenshot of the UI's symmetry controls]

I don't know if the UI supports it, but it might be a good transitional interface to have both steps and strength labeled.

> Both API and UI provide a flag for "pure noise" or "inpaint replace" where appropriate - I guess this only makes sense in the context of inpainting, because pure noise on img2img would be the same as txt2img, right?

Correct. If 8 of 8 steps is pure noise (and thus a strength of 1.0 is pure noise), then that should yield the same image as txt2img. That's what set me off down this path in the first place.

If it's not pure noise, then a value of 1.0 or total_steps represents something else that we don't have the language for. And should that something else be available in any way?

@JPPhoto
Contributor Author

JPPhoto commented Mar 13, 2023

@keturn I'm not sure that the underlying math syncs up with your explanation. Maybe it does. Take a look at this line:

latents = self.scheduler.add_noise(latents, noise, batched_t)

In the case of ddim, batched_t when called with strength 1.0 is [876]. That leads to a multiplier of [[[[0.1344]]]] for the initial latents and [[[[0.9907]]]] for the noise.

If we set a strength of 0.01 (the lowest I can set in the UI), we get [1], [[[[0.9990]]]], and [[[[0.0383]]]], respectively.
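
For context, diffusers-style schedulers compute `add_noise` as `sqrt(alphas_cumprod[t]) * latents + sqrt(1 - alphas_cumprod[t]) * noise`, so the two multipliers above are those square roots. The sketch below recomputes them in plain Python under an assumed scaled-linear beta schedule (`beta_start=0.00085`, `beta_end=0.012`, 1000 training timesteps, the usual Stable Diffusion v1 settings). The function name is hypothetical, and the exact figures depend on the scheduler configuration, so they may not match the numbers quoted here digit for digit.

```python
import math


def add_noise_coefficients(t, num_train_timesteps=1000,
                           beta_start=0.00085, beta_end=0.012):
    """Return (latent multiplier, noise multiplier) at timestep index t."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    alpha_cumprod = 1.0
    for i in range(t + 1):
        # Scaled-linear schedule: linear in sqrt(beta), then squared.
        beta = (lo + (hi - lo) * i / (num_train_timesteps - 1)) ** 2
        alpha_cumprod *= 1.0 - beta
    return math.sqrt(alpha_cumprod), math.sqrt(1.0 - alpha_cumprod)


for t in (1, 876, 999):
    latent_scale, noise_scale = add_noise_coefficients(t)
    print(f"t={t}: latents x {latent_scale:.4f}, noise x {noise_scale:.4f}")
```

Under these assumptions the latent multiplier shrinks as `t` grows but stays above zero even at the last timestep, so `add_noise` on its own never discards the initial latents completely.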

Any guidance?

@psychedelicious
Contributor

The UI can support anything; we have full control. Let's not let the UI direct the backend.

Very little of SD translates easily to non technical language. Clearly, "strength" is not a good translation, and sounds to me like an ultimately unsuccessful attempt to abstract some complicated math to something a normal human can understand.

But again this is a job for a UI, not a technical API.

Admittedly there are a few levels below what this PR touches, but this is the lowest-level API we offer, so IMO it should not abstract things like that.
