in SD3

same prompt in SDXL niji

same prompt in SD1.5

  • @j4k3

    You can’t use the same prompt. SD3 requires a different toolchain, one more complicated than that of any other accessible diffusion model. There are three separate prompts, and each needs its own tailored input. SD1.x only uses CLIP-L. SDXL under the hood uses both CLIP-L and CLIP-G, though in practice there are only minor benefits from prompting those two embedding models separately.
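    Concretely, here is a minimal sketch of the three-prompt split using the Hugging Face diffusers StableDiffusion3Pipeline (the repo id and prompts are placeholders, and the third prompt feeds the T5 model described next):

    ```python
    # Sketch: SD3 takes one prompt per text encoder.
    # prompt -> CLIP-L, prompt_2 -> CLIP-G, prompt_3 -> T5xxl
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3-medium-diffusers",  # placeholder repo id
        torch_dtype=torch.float16,
    ).to("cuda")

    image = pipe(
        prompt="portrait photo of an astronaut, 35mm, film grain",    # CLIP-L: tag-style
        prompt_2="portrait photo of an astronaut, 35mm, film grain",  # CLIP-G: tag-style
        prompt_3=("A close-up 35mm photograph of an astronaut standing "
                  "in a sunlit greenhouse aboard a space station."),  # T5: natural language
        num_inference_steps=28,
    ).images[0]
    image.save("sd3_three_prompts.png")
    ```

    As far as I can tell, if prompt_2 or prompt_3 is left out the pipeline just reuses the first prompt for them, which is part of why pasting an SD1.5 or SDXL prompt straight into SD3 underuses the T5 branch.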

    SD3 adds a T5xxl LLM in addition to the embedding models, and they are using a weird way of tuning it: when the T5 model is loaded, an entire model layer is swapped out using PyTorch. It is not a LoRA or a custom-trained model; it is the same T5xxl from Google, but with that one layer replaced.
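    I have not verified which layer gets replaced, but the mechanism described is plain PyTorch module surgery. A hypothetical sketch of that kind of load-time swap (the layer index and patch file are made up):

    ```python
    # Hypothetical sketch of patching one block of a stock T5 encoder at load time.
    import torch
    from transformers import T5EncoderModel

    # Start from the unmodified Google checkpoint.
    t5 = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl")

    # Pretend the SD3 distribution ships replacement weights for a single block.
    patch = torch.load("sd3_t5_layer_patch.pt")  # made-up file name

    # Overwrite just that block; every other parameter stays stock T5.
    t5.encoder.block[11].load_state_dict(patch)  # the index 11 is arbitrary here
    ```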

    ComfyUI also has an SD3 model node that is required, and it works kind of like setting CLIP to -2 in SDXL models.
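    For context, "CLIP -2" in SDXL just means reading the penultimate hidden layer of the text encoder instead of the final one. A rough sketch of that idea outside of ComfyUI:

    ```python
    # Sketch: "CLIP skip -2" = take the second-to-last hidden layer of the text encoder.
    from transformers import CLIPTextModel, CLIPTokenizer

    tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

    ids = tok(["a photo of a cat"], return_tensors="pt").input_ids
    out = enc(ids, output_hidden_states=True)

    final_layer = out.hidden_states[-1]   # the usual conditioning source
    penultimate = out.hidden_states[-2]   # the "-2" layer the CLIP-skip trick uses
    ```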

    Lastly, SD3 is hypersensitive to the negative prompt. You must use conditioning nodes to invert the negative-prompt conditioning and merge it back into the non-inverted conditioning after the first 0.100-0.200 of the generation schedule.
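    As I understand the reference workflow (and the exact fractions may be off), this is roughly what it amounts to during sampling; the denoiser, tensors, and update step below are placeholders, not a real sampler:

    ```python
    # Conceptual sketch, not the actual ComfyUI graph: the real negative conditioning
    # only drives guidance over an early slice of the schedule; after that it is
    # replaced by a zeroed-out ("inverted") conditioning tensor.
    import torch

    def sample(denoise, x, pos, neg, steps=28, cfg=4.5, neg_window=(0.0, 0.2)):
        start, end = neg_window
        for i in range(steps):
            t = i / steps                       # progress through the schedule, 0..1
            eps_pos = denoise(x, i, pos)        # positive-conditioned prediction
            if start <= t < end:
                eps_neg = denoise(x, i, neg)    # real negative prompt, early steps only
            else:
                eps_neg = denoise(x, i, torch.zeros_like(neg))  # zeroed conditioning later
            eps = eps_neg + cfg * (eps_pos - eps_neg)           # standard CFG mix
            x = x - eps / steps                 # toy update, stands in for a real sampler
        return x

    # Shape check only: a stand-in "denoiser" so the sketch actually runs.
    dummy = lambda x, i, c: 0.1 * x + 0.01 * c
    x0 = torch.randn(1, 16, 64, 64)
    cond, uncond = torch.zeros_like(x0), torch.ones_like(x0)
    _ = sample(dummy, x0, cond, uncond)
    ```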

    I’m not sure how the entire thing works, but there are 16 layers of generation with SD3, and much (if not all) of the strict alignment lives in the model loader code. I have not figured it out in useful detail and am not super motivated to do so. However, there are no mechanisms in the diffusion model structure itself that can create the periodicity present in SD3. For instance, if you push SD3 to the point where it should show male or female breasts and other genitalia, these features are simply not present at all. If one explores whether this omission comes from overtraining, by building strong prompt momentum against the avoidance, one is likely to find that the behavior never has any probability factor at all.

    One way to test this is to start by describing nipple tattoos, then use the way the history of previous prompts is cached (to avoid rerunning parts of generation) to move the tattoos into the place where real nipples should be. If the omission were due to training, this forced redundancy of a natural feature would overcome the bias. Logically it should be easier to create male nipples than female, given the likely larger training bias and dataset size. When I tested this, the behavior was deterministic and never susceptible to statistical influence; that behavior is only possible in the model loader code.

    The code for running SD3 in ComfyUI is considerable. If it were reverse engineered, it should be possible to do some really remarkable things, like swapping out the LLM for a smarter one or training its behavior in other directions. It might even become possible to make it conversational.
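    On the point about swapping out the LLM: the T5 branch is at least a separable component today. As a small illustration (assuming diffusers’ documented option to load SD3 without the T5 encoder; the repo id is a placeholder):

    ```python
    # Sketch: SD3's T5 branch can simply be dropped at load time, which suggests
    # replacing it with something else is at least structurally plausible.
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_pretrained(
        "stabilityai/stable-diffusion-3-medium-diffusers",  # placeholder repo id
        text_encoder_3=None,   # leave out the T5xxl encoder entirely
        tokenizer_3=None,
        torch_dtype=torch.float16,
    ).to("cuda")

    # Only the two CLIP encoders contribute conditioning in this configuration.
    image = pipe(prompt="a red bicycle leaning against a brick wall",
                 num_inference_steps=28).images[0]
    ```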

    Edit: I failed to fully describe the periodicity mentioned above. The tattoos were added at a point in the generation that was always the same. Generation errors out into nonsense if the tattoos are described as starting where nipples should be. However, if the prompt starts by describing them in another area, is long and descriptive enough to engage the T5, and later specifies a revised location, it becomes possible to watch them move across subsequent preview steps. These features are not coming out of the noise; they are deterministically placed later in generation, which hints that genitalia are removed deterministically at an earlier point.

    This is just the easiest subject with which to test alignment in practice. There are all kinds of other areas where this behavior is relevant but harder to detect or understand, like my long-running, back-burner quest to make O’Neill cylinder interiors with centrifugal gravity.