Text-to-image prompting

tyto4tme4l

Something of an artist
@MareStare
Great idea! Img2img and inpainting are invaluable tools and it would be useful to describe them in detail. Things like denoising strength and the difference between “Whole picture” vs “Only masked” for the inpaint area are extremely important here.
Scarlet Ribbon

@Thoryn
I have the same GPU. I can generate a 1024x1024 image in Comfy UI in less than 15 seconds. I don’t know what was up with automatic1111, but I was getting similarly glacial performance on it.
Strongly recommend you just get rid of it and learn a different front end.
Zerowinger

3-3/4" Army Man Fan
So, I tried out Pony Diffusion on Civitai to some success, and part of the prompt was copy-pasting the score_x score_up prompts that I had seen elsewhere. However, I’m a little confused as to exactly how those prompts work, the whole text to image format is very different to the style I’m familiar with.
Could I get some insider info on just exactly how this format in Pony Diffusion and similar checkpoints works?
MareStare

Mare is very curious👀
@Zerowinger
The score_* tags are specific to Pony Diffusion. Their original idea was that you’d be able to write score_7_up tag only (just a single tag), and you’d get an image based on the dataset of images of quality 7 or higher.
However, the way this was implemented during training was wrong, and completely broken. The developers discovered this bug only in the middle of training, at which point fixing that bug would be too expensive (they’d need to restart training from scratch again, which would cost them potentially several tens or even hundreds of thousands of dollars). So, they kept the bug, and made a guideline to include that lengthy score_9, score_8_up, ... etc. string at the start of the prompt to work around it.
There is more detail on this training fiasco in this article: https://civitai.com/articles/4248/what-is-score9-and-how-to-use-it-in-pony-diffusion
Zerowinger

3-3/4" Army Man Fan
@MareStare
So basically, including that string is necessary for higher quality images then? What about the rest of the prompting? On Imagen, I’m used to using full sentences and phrases to describe exactly what I want the output to be, with Pony Diffusion it seems the go to format is to list each individual aspect as a prompt separated by a comma.
Scarlet Ribbon

@Zerowinger
Different models are trained in different ways, leading to some models being better for natural language, and others better for tag-based prompting. Pony doesn’t completely fail with natural language prompting, but in my experience it performs much better with tag-based. If you add source_pony to your prompt, you can damn near just use Derpi/Tanta tags to get most of the results you’re looking for.
MareStare

Mare is very curious👀
@Zerowinger
You can use full sentences to describe the prompt with Pony Diffusion as well. Citing from their page on civitai the recommended prompt format:
score_9, score_8_up, score_7_up, score_6_up, score_5_up, score_4_up, just describe what you want, tag1, tag2
where tag1, tag2 are simple words/word combinations similar to derpibooru tags like “unicorn, blushing, trio, duo”, etc
MareStare

Mare is very curious👀
The quantity of steps depends on the sampler, for Euler it’s 25+ sampling steps, but sometimes it can be lower. I guess it depends on composition and it’s never constant. I recommend just trying different settings and checking if increasing the steps substantially improves the image
Thoryn

Latter Liaison
I’ve seen some guides mention to use BREAK in prompts to help guide the model. E.g.
Description of scenery
BREAK
Character 1 wearing denim jeans and red sweater sitting on a bench
BREAK
Character 2 wearing black suit with bowtie walking in the background
But I’m not having much success with it, it still gets confused as to who wears/does what.
Any of you using it successfully?
Lord Waite

I always feel like it helps to have a bit of a base understanding on how the models work on these things.
Initially, someone created a large dataset of images and descriptions. The descriptions were tokenized, and the images cut up into squares. Then, random noise was generated based on a seed. It took one square, generated random noise based on a seed, and attempted to denoise the noise into the image on the square. Once it got something close, it discarded the square and grabbed another one. At the end, all of this was saved in a model.
Now, what happens when you are generating an image is that your prompt is reduced to tokens by a text encoder (XL based models use CLIP-L and CLIP-G), random noise is generated by the specified seed, and then the sampler and noise schedule is how it denoises, with as many steps as you specify.
Some schedulers introduce a bit of noise at every steps, namely the ancestral ones (with an a at the end), and sde, but there may be others. With those ones, the image is going to change more between steps and they’ll be more chaotic. Also, some will take less steps then others to get to a good image, and how long each step takes will vary a bit. I believe some are just better at dealing with certain things in the image, too, so it’ll take some playing around.
Now, the clip text encoder actually can’t cope with anything more than 77 tokens at once, and that includes a start and end token, so effectively 75. So if your prompt is more than 75 tokens, it gets broken up into chunks of 75.
The idea behind “BREAK” is that you are telling it to end the current chunk right there and just pad it out with null tokens at the end. The point is just that you’re making sure that particular part of the prompt is all in the same chunk. I’ve had mixed results on it, so I try doing it that way occasionally, but also don’t a lot of the time. It is going to have trouble with getting confused anyways. This is just an attempt to minimize it a bit.
(Text encoding is one of the differences between model architectures, too. 1.* & 2.* had one clip, XL has two, then when you start getting into things like flux and 3, you start dealing with things like two clips and a t5 encoder, and the t5 encoder accepts more like 154 tokens. I also didn’t get into the vae, which is actually what turns the results into an image…)
mp40

@Lord Waite
Have you done anything else with this? I’m looking for resources on how to build or curate a llm of my own but the “uncensored” model is still denying some of my prompt requests, do I just need to try other jailbreak prompts till somthning works or?
Lord Waite

@mp40
I haven’t done more with it, but with ollama, the key was making a custom “Modelfile” file, and creating a model from that Modelfile.
What you can do is copy the modelfile of an existing model and modify it.
So, first:
ollama pull rolandroland/llama3.1-uncensored
to install the modelfile you are going to base it off of. Then, if you run:
ollama show rolandroland/llama3.1-uncensored --modelfile
It’ll print out on the console that models Modelfile, so just copy that to a file named Modelfile.
Then change the FROM section to say:
FROM rolandroland/llama3.1-uncensored:latest
and add a section at the bottom that says:
SYSTEM"""<your prompt here>"""
And just write a prompt for how the AI is going to act there. You basically want to describe to it what its purpose is, and let it know that it’s uncensored and can describe sexual acts and such, tell it not to add in disclaimers, tell it the exact format that a prompt should be in and the type of words it should use, and give it a few examples of real prompts.
(I’d give one here, but looking at it, I really want to clean it up and improve it. I was explicitly telling it to add the line of score tags, then a source and rating tag, then a description, then several paragraphs of danbooru tags.)
Then run:
ollama create <new mode name> --file Modelfile
Keep the modelfile, try using the model you generated, and if you want to tweak it, do:
ollama rm <model>
change the modelfile, and rerun the create command.
That’s basically how to do it, in any case, the key is going to be playing with creating a prompt until something sticks, and basing it off the right model, as I remember trying it with a different model or two and not having as much luck…
Lord Waite

@mp40
No problem. It’s one of these spots where I really need to play more with it, and there might be better ways to do some of it, but that’s what was getting me results.
I remember one oddity was that autocomplete on what I was typing kept giving I can’t talk about this topic type lines, but the actual response was uncensored.
There could easily be better models, too. I just remember trying two or three and this was the one that was giving decent results.
Lord Waite

@mp40
Oh, also, one thing worth mentioning is that I think the longer the system prompt is, the more likely it is for the system prompt to start going out of the context window. I’ve noticed that since the instructions to uncensor it are at the beginning, it tends to start becoming censored again if you put too much in the system prompt.
mp40

Thanks for the info! I was using a much simpler setup – I had Mistral small installed via pinoko and was just trying jailbreak prompts, even though I thought Mistral small was uncensored.
Lord Waite

@mp40
No problem.
Nice thing about ollama is that it’s a service running, so you install that and install whatever models you want though it, then you can use any program that talks to ollama to interact with the models. (And there are ComfyUI nodes that can talk to it.)
Ollama itself is over here, as it lists of all the various models you can install with it (Though bear in mind the size. 1b, 3b, or 8b is fine. Don’t download 70b models…):
https://ollama.com/
You can technically talk to the models directly with ollama, but that’s chatting through a command line, so you really do want another program to use with it as an interface.
When installing it with docker, you can choose to install a version that has ollama as well, but I did them separately.
Open WebUI gives you a nice web interface where you can chat with any of the models you install, and even has a way to set it up to talk to ComfyUI, so you can send text from a chat directly to comfyui to generate an image using it as a prompt. It’s fairly fun to play with.
Syntax quick reference: **bold** *italic* ||hide text|| `code` __underline__ ~~strike~~ ^sup^ ~sub~

Detailed syntax guide