A common feature in T2I generation is the option to skip the final layer (the last matrix calculation) of the CLIP text-encoding model.

This "distorts" the text encoding slightly, which SD users have discovered works to their benefit when prompting with common English words like "banana", "car", "anime", "woman", "tree", etc.

Being able to select between a CLIP skip 2 text encoder and the default text encoder would be an appreciated feature for perchance users.

For exotic tokens like emojis, or other tokens with high IDs in vocab.json, the unmodified CLIP configuration (CLIP skip 1) is far superior.

But for "boring normal English word" prompts, CLIP skip 2 will often improve the output.

The thread below shows how one can load an SD1.5 CLIP text encoder configured for CLIP skip 2:

https://github.com/huggingface/diffusers/issues/3212
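
For illustration, here is a minimal sketch of the approach from that thread, assuming the diffusers/transformers libraries and the runwayml/stable-diffusion-v1-5 checkpoint (not whatever perchance actually runs):

```python
# Sketch only: load an SD1.5 text encoder with the last transformer layer
# dropped, which corresponds to "CLIP skip 2" in common UI terminology.
import torch
from transformers import CLIPTextModel
from diffusers import StableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # assumed SD1.5 checkpoint

# SD1.5 ships a 12-layer CLIP text encoder; keeping 11 layers skips the final one.
text_encoder = CLIPTextModel.from_pretrained(
    model_id,
    subfolder="text_encoder",
    num_hidden_layers=11,
    torch_dtype=torch.float16,
)

pipe = StableDiffusionPipeline.from_pretrained(
    model_id,
    text_encoder=text_encoder,
    torch_dtype=torch.float16,
).to("cuda")

image = pipe("a woman standing next to a tree, anime style").images[0]
image.save("clip_skip_2.png")
```

If I remember correctly, newer diffusers versions also expose a clip_skip argument directly on the pipeline call, which would make this even simpler.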

//—//

Sidenote: Personally, I'd love to see the

text prompt -> tokenizer -> embedding -> text encoding -> image generation

pipeline split into separate modules on perchance.

That way, instead of sending text to the perchance server, the user could send an embedding (many are available to download online), a text + embedding mix, or a text encoding configured for either CLIP skip 1 or 2, and get an image back.
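
As a rough sketch of that split, diffusers already lets you run the stages separately and feed a precomputed encoding into the image step via prompt_embeds (the model id and prompt here are just placeholders):

```python
# Sketch: tokenize and encode the prompt yourself, then hand the resulting
# encoding to the image-generation step instead of a text prompt.
import torch
from diffusers import StableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint
pipe = StableDiffusionPipeline.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

prompt = "a banana car"

# Stage 1: text prompt -> token IDs
tokens = pipe.tokenizer(
    prompt,
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    truncation=True,
    return_tensors="pt",
).input_ids.to("cuda")

# Stage 2: token IDs -> text encoding (per-token hidden states)
with torch.no_grad():
    prompt_embeds = pipe.text_encoder(tokens)[0]

# Stage 3: text encoding -> image. The same call would accept an encoding
# the user downloaded or mixed, instead of one computed from text.
image = pipe(prompt_embeds=prompt_embeds).images[0]
image.save("from_embeds.png")
```

(Downloaded textual-inversion embeddings are a slightly different beast: they are usually loaded with pipe.load_textual_inversion(), which adds a new token to the tokenizer, but the idea of sending something other than plain text is the same.)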

The CLIP model is unique in that it can create both text and image encodings. By checking the cosine similarity between text and image encodings, you can generate a text prompt for any given input image that, when prompted, will generate "that kind of image".

Note that in either of these cases there won't be a text prompt stored for the image. The pipeline is a "one-way" process.
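
As a toy example of that text/image similarity idea (the candidate prompts are made up, and a real CLIP "interrogator" searches a far larger vocabulary of words and phrases):

```python
# Sketch: rank candidate prompt texts by CLIP cosine similarity to an input image.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip_id = "openai/clip-vit-large-patch14"  # same CLIP family SD1.5 uses for text
model = CLIPModel.from_pretrained(clip_id)
processor = CLIPProcessor.from_pretrained(clip_id)

image = Image.open("input.png")
candidates = ["a photo of a woman", "an anime drawing of a tree", "a banana on a car"]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    text_features = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    image_features = model.get_image_features(pixel_values=inputs["pixel_values"])

# Cosine similarity between the image encoding and each candidate text encoding.
sims = torch.nn.functional.cosine_similarity(image_features, text_features)
print(candidates[sims.argmax().item()])  # closest candidate to the image
```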

//—//

The main thing to consider here is adding a CLIP skip 2 option, as I think a lot of "standard" text-to-image generators on perchance would benefit from it.