Stable Diffusion

Discuss matters related to our favourite AI Art generation technology

Making a better CLIP interrogator with the FLUX T5 encoder? (lemmy.world)

submitted 1 month ago* (last edited 1 month ago) by [email protected] to c/[email protected]

This is an open-ended question.

I'm not looking for a specific answer, just what people know about this topic.

I've asked this question on the Hugging Face Discord as well.

But hey, asking on Lemmy is always good, right? No need to answer here. This is a repost, essentially.

This might serve as an "update" of sorts from the previous post: https://lemmy.world/post/19509682

//---//

Question:

The FLUX model uses a combination of CLIP and T5 to create a text_encoding.

CLIP is capable of doing both image_encoding and text_encoding.

The T5 model seems to be strictly text-to-text.

So I can't use the T5 to create image_encodings. Right?

https://huggingface.co/docs/transformers/model_doc/t5

But nonetheless, the T5 encoder is used in text-to-image generation.

So surely, there must be good uses for the T5 in creating a better CLIP interrogator?

Ideas/examples on how to do this?

I have zero knowledge of the T5, so feel free to just send me a link somewhere if you don't want to type out an essay.

//----//

For context:

I'm making my own version of a CLIP interrogator: https://colab.research.google.com/#fileId=https%3A//huggingface.co/codeShare/JupyterNotebooks/blob/main/sd_token_similarity_calculator.ipynb

The key difference is that this one samples the CLIP-vit-large-patch14 tokens directly instead of using pre-written prompts.

I text_encode the tokens individually and store them in a list for later use.

I'm using the method shown in this paper, "NND" (nearest-neighbor decoding).

Methods for making better CLIP interrogators: https://arxiv.org/pdf/2303.03032

T5 encoder paper: https://arxiv.org/pdf/1910.10683
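
The nearest-neighbor lookup described above can be sketched roughly like this. This is only a minimal illustration, not the notebook's actual code: it assumes each token already has a stored text encoding and that the image encoding lives in the same embedding space, and it ranks tokens by cosine similarity. The function names and the tiny 3-dimensional vectors are made up for the example; real CLIP-vit-large-patch14 encodings are much larger.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def nearest_tokens(image_encoding, token_encodings, top_k=3):
    """Return the top_k (token, score) pairs closest to the image encoding."""
    scored = [
        (cosine_similarity(image_encoding, encoding), token)
        for token, encoding in token_encodings.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(token, score) for score, token in scored[:top_k]]


# Hypothetical toy data standing in for precomputed token text encodings.
TOY_TOKEN_ENCODINGS = {
    "blue": [0.9, 0.1, 0.0],
    "war": [0.1, 0.9, 0.1],
    "cat": [0.0, 0.1, 0.9],
}

# Hypothetical image encoding in the same (toy) embedding space.
toy_image_encoding = [0.8, 0.5, 0.0]

ranked = nearest_tokens(toy_image_encoding, TOY_TOKEN_ENCODINGS, top_k=2)
for token, score in ranked:
    print(f"{token}: {score:.3f}")
```

With real encodings, the dictionary would hold one entry per vocabulary token, and the sort would usually be replaced by a batched matrix product for speed.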

Example from the notebook where I'm using the NND method on 49K CLIP tokens (Roman girl image):

Most similar suffix tokens: "{vfx |cleanup |warcraft |defend |avatar |wall |blu |indigo |dfs |bluetooth |orian |alliance |defence |defenses |defense |guardians |descendants |navis |raid |avengersendgame }"

Most similar prefix tokens: "{imperi-|blue-|bluec-|war-|blau-|veer-|blu-|vau-|bloo-|taun-|kavan-|kair-|storm-|anarch-|purple-|honor-|spartan-|swar-|raun-|andor-}"
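
For readers wondering how a prefix/suffix split like the one above could be produced, here is a small sketch. It rests on one assumption: that vocabulary entries ending in the `</w>` end-of-word marker are treated as suffix tokens and all others as prefix tokens. The helper names and the toy vocabulary are hypothetical, and the notebook may do this differently.

```python
END_OF_WORD = "</w>"  # end-of-word marker, assumed to distinguish suffix tokens


def split_prefix_suffix(vocab):
    """Split vocabulary entries into (prefixes, suffixes) by the end-of-word marker."""
    prefixes, suffixes = [], []
    for token in vocab:
        if token.endswith(END_OF_WORD):
            suffixes.append(token[: -len(END_OF_WORD)])
        else:
            prefixes.append(token)
    return prefixes, suffixes


def format_group(tokens, trailer):
    """Join tokens into the {a|b|c} style, appending a trailer to each token."""
    return "{" + "|".join(token + trailer for token in tokens) + "}"


# Hypothetical mini-vocabulary; a real one has roughly 49K entries.
toy_vocab = ["vfx</w>", "imperi", "cleanup</w>", "blue"]

prefixes, suffixes = split_prefix_suffix(toy_vocab)
print(format_group(suffixes, " "))
print(format_group(prefixes, "-"))
```

The trailing space for suffix tokens and the trailing hyphen for prefix tokens are chosen to mirror the formatting of the example output above.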

[–] [email protected] 1 points 1 month ago (1 children)

T5 is a very mediocre LM. why do people keep using it?

[–] [email protected] 1 points 1 month ago* (last edited 1 month ago) (1 children)

Hmm. I mean, the FLUX model looks good, so maybe there is some magic with the T5?

I have no clue, so any insights are welcome.

T5 on Hugging Face: https://huggingface.co/docs/transformers/model_doc/t5

T5 paper: https://arxiv.org/pdf/1910.10683

Any suggestions on what LLM I ought to use instead of T5?

[–] [email protected] 1 points 1 month ago (1 children)

aya based llms are extremely powerful. same with qwen

[–] [email protected] 1 points 1 month ago

That's good to know. I'll try them out. Thanks.