#🔧|finetune

tropic iron
#

And then when I use my lora, instead of using the old checkpoint and the lora I instead use the sdxl base and the lora

#

Because the lora should in theory take the place of the checkpoint?

dusky urchin
#

do a new fine tuning on SDXL.

#

do whatever you have RAM and data for. you can start by trying a LoRA fine tuning on SDXL with the same data you used for your other fine tunings.

tropic iron
#

Rockin

tropic iron
#

What happens if I use a LoRA trained with SDXL alongside a checkpoint based in 1.5?

dusky urchin
#

you can always turn 1.5 latents into pixels, then into XL latents

#

and run img2img

#

there isn't really a point in that though. simpler workflows, in my experience, have always been better, whereas improving prompts yields better results

stiff dust
#

yes that works. But I would do that only if you don't have proper training data for sdxl.

tropic iron
#

It's that the only checkpoint (for furry art) I've found which produces anything other than crap has 1.5 as a base model

tropic iron
#

Hey finetuning people

#

I'm back with 1000 questions

#

I'm trying to understand the machinery. That's how I've always done best. So okay...

#

Y'all have made pretty clear that if I'm training on top of a checkpoint based on SD 1.5, which has a native resolution of 512x512, then all my images should have that resolution. I can see in my mind's eye (I think) why - the base library has x output neurons, where x is 512x512

#

Assuming that's correct - what happens when I ask a 1.5 based model to give me something with a different resolution? Does it distort? How does "native resolution" become "arbitrary resolution"?

steady seal
#

Hi everyone. I am looking for someone who can train a LoRA model to convert a selfie into an AI grillz image. Please DM if someone is available

dusky urchin
neat fox
#

Ask for much beyond the standard resolutions and aspect ratios and you get mutated limbs and duplicated objects

#

Generally 768x512 is fine, 768x768 is usually just over the line

lime ivy
#

Hi everyone. I finetuned SD1.5 on a dataset with mostly half-body images. The clothes trained well, but the face is kinda distorted. Can I improve it by further training on a face-only or close-up dataset (like a regularization)? Or is there any other way to improve the face quality?

worn imp
#

Hi, I'm trynna train a model on art from Yabujin. In total I wanna do training on drain gang too, stuff like my profile picture basically, but I wanted to take things slowly first and see how it goes. But I'm already having problems obviously, so if anybody could help me out to get some good results, any help would be appreciated. :)

#

this is my code

#

and this the current training data

#

I used to have a way larger one, like 200 images basically, but I didn't caption them all so well. I thought the ai would figure out the style itself, without much captioning from me, due to the sheer size Lol

simple wave
#

Hi all!

I am learning to finetune a model on dreambooth. I just need a small dataset for getting through basics. If anyone has any resource where I can find small image datasets of the same object, pls share them

Ex: 25 images of the same animal/thing/place

jade hornet
#

why would you be doing finetuning without knowing what you are finetuning? figure out what you want to do, then solve the problem

jade hornet
#

no

worn imp
#

but can you help me

#

opa ╱|、
(˚ˎ 。7
|、˜〵
じしˍ,)ノ

stiff dust
# tropic iron Y'all have made pretty clear that if I'm using training upon a checkpoint based ...

it's not working that way.
In general, SD supports any resolution. But the way objects are composed and arranged relative to each other (probably by the convolution layers) is learned at a specific resolution. The model knows how to place a face in 512x512. If you give it 1024x1024, it will get confused and create multiple faces. That's why you should rather stick to the resolution it was trained for.
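
To make the "any resolution" point concrete, here is a minimal sketch. It assumes the usual SD 1.5 setup, where the VAE shrinks each side by a factor of 8 and produces 4 latent channels; the function name is only illustrative. The network accepts any of these grid sizes, it has just seen far more 64x64 grids during training than anything else.

```python
def latent_shape(width: int, height: int, factor: int = 8, channels: int = 4):
    """Return (channels, latent_height, latent_width) for a pixel resolution.

    Assumes an SD 1.5-style VAE: 8x spatial downsampling, 4 latent channels.
    """
    if width % factor or height % factor:
        raise ValueError(f"{width}x{height} is not divisible by {factor}")
    return (channels, height // factor, width // factor)


for w, h in [(512, 512), (768, 512), (1024, 1024)]:
    # The pixel resolution only changes the size of the latent grid.
    print(f"{w}x{h} -> latent {latent_shape(w, h)}")
```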

jade hornet
#

there are ways to be clever and maybe outpaint the image, combined with controlnet or other tools to achieve the original intended result...or just use XL which was intended for higher resolutions

craggy sierra
#

Is there a discord community for training loras and such?

jade hornet
#

I've found a couple others, but frankly they're pretty low volume

#

the r/stable diffusion reddit has a discord, for example

craggy sierra
#

My dataset has images like 1.jpg, 1.txt, 2.jpg, 2.txt, and so on. The .txt files contain tags for each image. Do I need to set this setting to .txt, or are .caption files something else?
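
Whatever the caption extension setting is called in your trainer, it has to match the files that are actually on disk. A small sketch for checking that every image has a caption file next to it; the function name and the list of image extensions are assumptions for illustration, not part of any trainer.

```python
from pathlib import Path

# Image extensions to look for (an assumption; adjust to your dataset).
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}


def find_missing_captions(dataset_dir, caption_ext=".txt"):
    """Return the names of images that have no caption file with caption_ext."""
    missing = []
    for path in sorted(Path(dataset_dir).iterdir()):
        is_image = path.suffix.lower() in IMAGE_EXTS
        if is_image and not path.with_suffix(caption_ext).exists():
            missing.append(path.name)
    return missing
```

Running it once with `caption_ext=".txt"` and once with `caption_ext=".caption"` shows quickly which extension the dataset really uses.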

craggy sierra
viral jackal
#

having a lot of trouble getting cascade to train

stiff dust
#

me, too. My feeling is that Cascade is really bad for training.

hollow spruce
#

same :/
like it works with high effort. but it doesn't ever work perfectly. Some things are easy to train, but most are hard.
really depends on what you're aiming for. Just dont ever aim for face/person loras XD

somber thorn
#

We created an index for the datacomp-12.8M dataset using Fondant and published it on the huggingface hub. You can find more details and info on how to use it in this short post.
You could use the dataset to fine-tune your own controlnet models.

errant scarab
#

Hello Everyone,
I need some help getting started with training my own Dreambooth or Lora.
I have a good local system with 24GB GPU Vram and know how to use ComfyUI (& automatic).

I want to train on a very old comic book style and I have a couple of those comic books with me.
This is what I found on the resources section on discord https://huggingface.co/docs/diffusers/training/lora
Is there a video you'd recommend I follow to get started on this journey.
It would help me a lot. Much appreciated.

neat fox
quick anchor
#

Hello -- hope this is the right place for this question: I am attempting to run a Python script that requests tokenizer/tokenizer_config.json from https://huggingface.co/base_model/resolve/main/tokenizer/tokenizer_config.json but it returns a 404 error, and checking the link leads to a "Repository not found" page. Is this likely a temporary outage with HF, or did I do something wrong or overlook something else? Log snippet, if helpful:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
response.raise_for_status()
File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/base_model/resolve/main/tokenizer/tokenizer_config.json

errant scarab
neat fox
#

i'm far from an expert but i've gotten a few pretty decent ones in my limited experience

dusky urchin
thick bear
#

If anyone can share their lora settings for onetrainer I'd love to see, especially for training a style. The UI is so much nicer than kohya's mess of nested tabs but there are almost no resources for it

#

I tried to replicate prodigy settings I've seen around for kohya but it didn't do too well in OT

neat fox
thick bear
#

That'd help, I think the biggest difference in character vs style is in captioning anyway. I've actually had okish results just using the XL default settings in OT, but I don't think they're entirely appropriate for pony models which is my problem

hollow spruce
#

For everyone who has captioning issues:
https://github.com/jhc13/taggui
added moondream1 as a model for auto captioning.
Definitely not the best model, but it runs on a toaster

so as long as you have 6gb vram or more, you can do (relatively good) auto tagging
(if you're at 14 or more... use cogvlm!)

hollow spruce
hollow spruce
# thick bear 24 here
[[subsets]]
num_repeats = 1
caption_extension = ".txt"
shuffle_caption = false
flip_aug = false
is_reg = false
image_dir = "A:/Datasets/npcp/source"
keep_tokens = 0

[noise_args]

[sample_args]

[logging_args]

[general_args.args]
pretrained_model_name_or_path = "B:/SD_models/checkpoints/sdxl/sd_xl_base_1.0_0.9vae.safetensors"
mixed_precision = "bf16"
seed = 23
max_data_loader_n_workers = 1
persistent_data_loader_workers = true
max_token_length = 225
prior_loss_weight = 1.0
sdxl = true
xformers = true
cache_latents = true
cache_latents_to_disk = true
no_half_vae = true
gradient_checkpointing = true
max_train_epochs = 60

[general_args.dataset_args]
resolution = 1024
batch_size = 7

[network_args.args]
network_dim = 64
network_alpha = 1.0
min_timestep = 0
max_timestep = 1000

[optimizer_args.args]
optimizer_type = "AdamW"
lr_scheduler = "constant_with_warmup"
learning_rate = 0.0001
max_grad_norm = 1.0
text_encoder_lr = 5e-5
warmup_ratio = 0.05
min_snr_gamma = 5

[saving_args.args]
output_dir = "A:/Datasets/npcp/output"
save_precision = "bf16"
save_model_as = "safetensors"
output_name = "npcportrait_v2"
save_every_n_epochs = 5
save_last_n_epochs_state = 1
save_state = true
save_toml = true

[bucket_args.dataset_args]
enable_bucket = false
min_bucket_reso = 512
max_bucket_reso = 2048
bucket_reso_steps = 64

[optimizer_args.args.optimizer_args]
weight_decay = "0.1"
betas = "0.9,0.99"
#

my settings for derrian distro. but they translate 1:1 into onetrainer
These were my settings for -> https://civitai.com/models/336145/npc-portrait-xl-for-basedreamshaperlightning
which was also a pretty high effort style lora. so I can vouch that these settings work for 3090/4090 users

This is a pen & paper npc character portrait generator lora. Its mainly optimized for DND, Pathfinder, and equivalent fantasy games, but it can...

#

I had 1.2k images for my dataset. (but the settings barely/dont change unless you have a sub 100 dataset)
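
For a rough sense of scale, the settings above can be turned into step counts. This is only a back-of-the-envelope sketch: it assumes exactly 1200 images (the message just says "1.2k"), no bucketing, no gradient accumulation, and that the trainer rounds a partial last batch up. The function name is made up for illustration.

```python
import math


def training_steps(num_images, num_repeats, batch_size, epochs, warmup_ratio):
    """Estimate steps per epoch, total steps and warmup steps for a run."""
    # A partial final batch still counts as one step (assumption).
    steps_per_epoch = math.ceil(num_images * num_repeats / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = round(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps


# num_repeats = 1, batch_size = 7, max_train_epochs = 60, warmup_ratio = 0.05
print(training_steps(1200, 1, 7, 60, 0.05))  # (172, 10320, 516)
```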

#

npcp, simple background, round ears, human, caucasian, girl, a human with a contemplative expression draped in a blue scarf gazing into the distance. <- had captions like this, to reinforce the style (basically do a trigger word, then tag everything that is happening in the image, and nothing about the style)

thick bear
#

Thanks I'll test it out on the next run, probably be more like a 50-100 dataset though so I'll fiddle

#

For style captions in kohya I had pretty good results having only "by artistname" as the caption, but I'll try adding some non style caps like you did as well

hollow spruce
thick bear
#

Llava 1.6 is pretty amazing for describing images

#

Looks like taggui doesn't have 1.6 support yet but I was really impressed testing it out in comfy, I'll def try tagging a dataset with that soon

dense abyss
#

ha

polar smelt
#

Hi guys, I want to use the openai clip model and a classification model to format a simple image description (for example a user who wants to create an image) into an sd-prompt.

My thought was to use clip to capture the text features and then train the classification model on the features and labels (which would be the sd-prompts).
Could anybody with more experience tell me if this is a viable way to achieve my goal?

foggy inlet
#

@polar smelt just use CogVLM or ShareGPT4V, or check out some other models available at vision arena (https://huggingface.co/spaces/WildVision/vision-arena) - some of them are pretty good, and you'll probably get better results if you just use one (or more) of those and figure out best prompts for them to get what you want, instead of trying to create new model from scratch

last panther
#

Morning folks. Does anyone pls know any good tutorial on how to set the right parameters for Dreambooth SDXL on Diffusers "train_dreambooth_lora_sdxl.py"? My training is working but the results are not great 🙂

sacred grail
#

does anyone know how I can do Aesthetic score finetuning?

slim gyro
hollow spruce
#

probably some dumb mistake like using .caption instead of .txt for captions... or some pretty basic setting that always needs to be enabled... but wasnt

#

any reason you want to use dreambooth specifically? rather than lora?
if you're determined to use dreambooth original or kohya implementation, you're in for a bit of troubleshooting ^^'

#

ah yeah. then it checks out.
there's a million things that can go wrong if you're doing finetuning.
best start with an existing preset, to test your dataset if its working at all. then adjust from there
(Onetrainer has a basic finetune preset for each major sd version, to get you started)

#

if you dont wanna switch trainers, then just take inspiration from the preset, and redo it in your trainer

#

oh. ooooohhhh.
if you're dealing with datasets under 10k images, then best stick with LoRA

if anything, then you should just take a simple lora preset that people recommend, and spend 80% of your time on making your captions better and better

raw dirge
#

a lora would work better for u

#

not enough images for you to train a whole helicopter checkpoint

#

with people is easier because the model already knows how ppl look

#

but for machinery you would probably need 10k to 50k

#

since it already knows the basics of helicopters a lora would be better,u could get like 50 imgs of each heli model and train a lora for each one of them

#

id say give lora a try,it will save u time and energy if u dont like it u can always train the checkpoint with lots of imgs

#

for 1.5 i used to do it on kohya but idk whats changed its been a while

silver dawn
minor shale
jade hornet
#

that output looks normal, some of it was missing though

minor shale
#

wdym

minor shale
jade hornet
#

Nevermind, I see it now. It threw an attribute error, which means one of your package versions is wrong. I'd compare your installed packages vs the ones it needs

#

I can't tell from that which one

minor shale
#

amazing...

#

what command do i run to check

jade hornet
#

I'm a Linux guy, try pip list? You might have to go to the tech support channel. The requirements text files have the versions you need

minor shale
#

im on macos but it's pretty similar to linux

#

based on unix kernel

jade hornet
#

There's a pip command to force reinstall and you can point it at those version files, it'd be easier... Something like pip install --force-reinstall -r file.txt... Google it though, that was from memory

tame vortex
#

(and you'd need to use the venv's pip, not your system's one, assuming whatever you're using has a venv)
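
Before force-reinstalling everything, it can help to see which pins actually differ. A minimal sketch using the standard library's `importlib.metadata`; it only understands simple `name==version` lines (no extras, no environment markers), and `check_pins` is a hypothetical helper, not part of pip. Run it with the venv's Python so it inspects the right environment.

```python
from importlib.metadata import PackageNotFoundError, version


def check_pins(requirement_lines):
    """Return (name, wanted, installed) for every exact pin that does not match.

    installed is None when the package is not installed at all.
    """
    problems = []
    for line in requirement_lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # only exact pins are handled in this sketch
        name, wanted = (part.strip() for part in line.split("==", 1))
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems.append((name, wanted, None))
            continue
        if installed != wanted:
            problems.append((name, wanted, installed))
    return problems
```

Typical use would be `check_pins(open("requirements.txt").read().splitlines())`, where the file name is whatever requirements file the project ships.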

dusky urchin
#

i think you've been at it for a while. didn't we discuss that you can already achieve this pretty straightforwardly?

#

you don't have to fine tune at all

#

for stuff like this: it isn't meaningful. think about what is visible, not what it's called

#

if it can't be seen it's not going to fine tune either

#

if the differences are subtle it's not going to show up in fine tuning unless it is similar to something that has already existed in sdxl.

hoary ember
#

I have about 2.5 million adult photos (mostly 1280x720 resolution) that I scraped from a large pic gallery site. All of these images include metadata such as a one-line text description of the scene and categories/tags. What would be the best software for making a fine-tune with this scale of dataset?

#

(I would ideally like to train something based on SDXL or a similarly fast model, because part of what I want to do is video generation, but I'm open to alternatives)

dusky urchin
hoary ember
#

Do I need video samples for finetuning, or can it work with just images? I could also easily scrape a video site, but the dataset size would be massively larger (my image set takes up nearly 500GB). I could always compress/convert the video to lower-res, but wonder how this would affect training/output quality

#

Also, does SVD let me use large image sets (millions) or does it just take a single image?

hollow spruce
# hoary ember I have about 2.5 million adult photos (mostly 1280x720 resolution) that I scrap...

A.) invest hundreds of hours, to learn the skills to take on a project of this magnitude
B.) invest thousands of dollars, to pay someone/or a group/ who already has the skills, to make a finetune of this level for you, using this dataset

what you're asking is the equivalent to "I got a hold of lots of spare car pieces and raw metal. How do I use this to tune my car to look better and go faster. Preferably I want to use less gas as well, since I want to compete with supercars"

its not that there's an issue with the ambition, but you're best off starting small, and learning how to make a small lora based on 400 images. then on 3k images. then on 10k images. then take what you've learned, and start from new again, since genuine full finetuning is even more destructive/hard to do right, and work on making 10k finetunes until you get good enough, to slowly scale up.

Expect your final finetune on 2.5M images to cost several thousand dollars and to need dedicated rented cloud hardware (since batch size matters, meaning you'll need an A100 cluster to do it in a reasonable amount of time, or else you'll wait 3 months for a single A100 to get through it), and you only get one chance unless you're willing to invest that kind of money again and again

hoary ember
# hollow spruce A.) invest hundreds of hours, to learn the skills to take on a project of this m...

I understand that this is a big project, and that I will need to learn skills, and that this will take time. I'm fine with that.

I'm just trying to determine which skills I need to learn. I am trying to determine specifically which tools would be used for this type of project, so that I can learn how to use them. I am planning on starting out small as you said, and working towards the larger goal (I wasn't planning on just immediately trying it with 2.5m images without trying small batches first lol)

Also, as far as hardware, is there really no way that this could be done on a local machine (AMD 7970X CPU, 512 GB DDR5, and 2 x RTX 4090)? Given this hardware, what scale of training would be possible / in what kind of time period?

hollow spruce
# hoary ember I understand that this is a big project, and that I will need to learn skills, a...

well you're in a bit of trouble.
hardware wise you're good to go including standard full finetuning (unet only). so thats about 80% of the way.

But your main issue will be:
A.) Guides <- lots of guides say different things, while saying "this is the best way". 95% of them, are in fact, geared towards ultra small datasets. Things change significantly once you hit the 400 & 3k dataset size mark.
B.) Captioning <- while there are tools that help automate this, you also gain their bias. (Example: cogvlm always saying "serene", or LLava often mixing up arm locations, which results in wrong arm anatomy if you train for too long on many images)
C.) Dataset management. Anything beyond 1k images needs a dedicated tool for dataset management. Currently there exists no true free or even paid dataset management software. You'll definitely find many if you google, but you'll also despair once you get more and more images...
basically, there exists "Dataset architecture" which is genuinely complicated. And this becomes unavoidable once you hit 100k

and you might think... well do I really need that? I can just autocaption everything and accept the bias from x or y tool.
Which would then result in your training not actually improving in quality beyond a certain level. Meaning you'd benefit more from a trimmed dataset that is well managed. <- downloading huge datasets of millions of images is easy... but this is the core reason why no one actually trains that large. its for the simple reason that improperly managed datasets only cost more to train, but dont infinitely improve quality nor adaptability

#

if you just want to start training, while being ok that this might be too big of a goal, I can heartily recommend:
• onetrainer <- for the easiest training of sdxl
• taggui <- for tagging of your images
• hydrus network <- for real (but very very painful) dataset management

hoary ember
# hollow spruce well you're in a bit of trouble. hardware wise you're good to go including stand...

Thanks so much for the recommendations! That's a great starting point 🙂 ... as far as dataset management, I'm very comfortable with Python and wrote the tools to scrape the images myself, and made sure to generate JSON metadata for all of the images (title, descriptions, keywords, names, etc) that makes it pretty easy to work with. Basically, the entire dataset is already tagged / organized quite well.

... I'm glad you brought up CogVLM because that is something I had actually been looking into. I have one line image descriptions (~100 chars / 10-15 words) and 5-10 tag words that I scraped along with each image, but I was considering using CogVLM to expand on these descriptions even more. But I am hearing what you're saying re: biasing the dataset. ... Maybe I could work on making a fine-tune of CogVLM first, and then work with it?

hollow spruce
hoary ember
#

Oh damn, I didn't realize the limit was that short ... that suuucks ... welp, I guess I won't bother spending the time with all the CogVLM stuff then, because my idea had been to try to generate a detailed paragraph describing each image to append to the end of the original one line description

hollow spruce
# hoary ember Oh damn, I didn't realize the limit was that short ... that suuucks ... welp, I ...

one thing that works well for me, since I have complex custom tags for all my datasets, is to make a custom prompt for each image for cogagent. (or in your case, for llava 1.6 in order to support anatomy knowledge, as cogvlm is 100% sfw)

basically I retrieve the tags for the image, then use them to build a custom prompt like:
"This is an <artwork|photo|drawing|render> of <tracer|ahri|batman> from the show <gotham>. Caption the woman|man|animal, pose and background" while also limiting it to 77 tokens

#

obviously you'd extend this to make the most use of your existing information

#

this also helps you avoid most vlm issues, as you reduce hallucinations to an absolute minimum

#

this will give you natural language captions which dont work well for small dataset lora training, but really shines if you do 4k images or more
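
The per-image prompt idea above can be sketched roughly like this. The function and parameter names are hypothetical, and the tag values are just the examples from the message. Trimming the resulting caption to 77 tokens is not shown, since that needs the actual tokenizer.

```python
def build_caption_prompt(medium, character=None, source=None, subject="woman"):
    """Build a captioning prompt for one image from its existing tags."""
    article = "an" if medium[0].lower() in "aeiou" else "a"
    parts = [f"This is {article} {medium}"]
    if character:
        parts.append(f"of {character}")
    if source:
        parts.append(f"from the show {source}")
    # Known facts go in the first sentence so the VLM has less to guess.
    return " ".join(parts) + f". Caption the {subject}, pose and background"


print(build_caption_prompt("photo", character="batman", source="gotham", subject="man"))
```

Leaving `character` or `source` empty simply drops that part of the sentence, so the same helper works for images with partial metadata.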

hoary ember
#

Thanks so much for taking the time to explain all of that! ✨ I'm gonna go do some research into what you've told me so far, and mess around with some test batches and see what I can come up with.

jade hornet
#

<50 images

hollow spruce
#

different answers depending on which one you're training

jade hornet
#

for me it would be XL, realistic

hollow spruce
# jade hornet for me it would be XL, realistic

option A.) <trigger word>, ask cog to generate a short caption of everything you dont want your lora to learn.
option B.) since its under 50, <trigger word>, then manually keyword tag everything you dont want the model to learn. do like 1 ~2 word descriptions. <- then enable shuffle captions + keep 1 token. enable TE training.
never mention any word related to anatomy, like "neck", "boobs", "stomach", "hands", "arm", "feet" etc... <- these will always make your lora worse since you dont have enough examples for it to learn an actual improvement, meaning you will most likely cause a senseless offset, unless it is the actual concept you're training. (and even then, you'll probably have to rely on pure overfitting... as sub 400 images isnt enough to make an actual positive contribution to anatomy knowledge)

depending on situation make use of the mask feature in onetrainer

option b will usually give better results, since its more targeted
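
For anyone wondering what "shuffle captions + keep 1 token" does to a keyword caption, here is a rough sketch of the idea. It is not the trainer's own code, just an illustration under the assumption that tags are comma separated and the first `keep_tokens` tags (the trigger word) stay in place while the rest are reordered each time the image is seen. `mytrigger` is a placeholder.

```python
import random


def shuffle_caption(caption, keep_tokens=1, rng=random):
    """Shuffle comma-separated tags, leaving the first keep_tokens tags fixed."""
    tags = [tag.strip() for tag in caption.split(",") if tag.strip()]
    fixed, rest = tags[:keep_tokens], tags[keep_tokens:]
    rng.shuffle(rest)  # rest is a fresh list, so the input is not modified
    return ", ".join(fixed + rest)


print(shuffle_caption("mytrigger, simple background, blue scarf, looking away"))
```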

jade hornet
#

cool, regarding anatomy knowledge, I've found some derived models that already have that trained in work somewhat better for nsfw type training...but of course those are easier to overfit or already are in some cases

hollow spruce
#

my lora (of a specific dnd character in my campaign, that I generated in dalle3) + base | dreamshaper turbo. taught just the face without messing up the hands XD (and no background bias or weird skin color offset)
trained on 8 images. works on every model except pony & other foundational ones

#

one of the 8 training images + mask I added

pale hawk
#

I have a question about my first attempt at full fine-tuning SDXL 1.0. Here's what I did:

Used 370 high-quality, advertisement-style text-image pairs with the kohya sd-script.
Set the batch size to 16 and the learning rate to 3e-4, leaving other parameters at default.
Observation during training:
Instead of gradually adapting the existing SDXL 1.0 outputs to fit the custom dataset style, the image generation process seemed to start from scratch. The images began in a distorted state and slowly formed over time.

Generated Images:
Below, you'll find the image generation results for a given prompt, captured every 2000 steps from 0 to 18,000 steps.

prompts:
A bottle of Paul Medison White Musk shampoo is prominently featured against a soft purple backdrop, complemented by an elegantly draped white chiffon fabric. The vibrant red bottle with white and black text stands out, highlighting the product's sophisticated appearance and suggesting a luxurious hair cleansing experience.

I'm new to the Discord community culture, and I want to ensure I'm respecting the rules and norms here. If my question is not appropriate for this community, please let me know, and I'll promptly delete it. Thank you.

#

most left image is ground truth image

jade hornet
#

the question is appropriate, though I must have missed what the question actually was. I saw your method and results

pale hawk
# jade hornet the question is appropriate, though I must have missed what the question actuall...

I'm curious about the typical process of full fine-tuning for SDXL. Is it normal for it not to gradually modify existing SDXL outputs to match the desired custom dataset style, but instead to start from a distorted state as shown in the attachment?

From other research, I've noticed that fine-tuning and quality-tuning often stop around 15K~30K steps. I'm wondering if it's okay to continue training beyond this point.

jade hornet
#

with the caveat that my understanding is very basic and high level, full finetuning is retraining all the parameters, and typically works better with thousands of images. dreambooth is more well suited for a smaller dataset. as for how many steps in general, it'll start to overtrain at some point and the results will degrade. It's difficult to say in advance where that will happen

pale hawk
#

Got it, thanks for the response.

full remnant
#

How long would it take to finetune SD3 on one P100?

hollow spruce
# full remnant How long would it take to finetune SD3 on one P100?

A.) we dont have access to SD3, so we know close to nothing, other than what the paper has told us
B.) that's not how that works. You can have all the compute in the world... what you need is a dataset and a good understanding of dataset architecture & captioning
C.) due to this being the official SAI server, talking about nsfw topics or how to circumvent censoring isnt allowed on this server.

full remnant
#

Oh, sorry then

dusky urchin
pale hawk
dusky urchin
#

it doesn't really even make sense as an application. you are only going to use like 5 creatives for an ad campaign, which would take less than an hour to make.

pale hawk
# dusky urchin sdxl isn't capable of generating fine typography like this. it will be a bajilli...

Thank you for the insightful comments. However, what I’m curious about is, I understand that models like LoRA or ControlNet, which freeze the backbone model and only train the adapters, maintain the capabilities of the backbone model from the beginning and gradually proceed with the generation process towards the style of the custom dataset. On the other hand, I’m wondering if, in the case of full fine-tuning, the original model’s capabilities are lost and the image generation starts off in a disrupted state from the beginning.

dusky urchin
sonic narwhal
#

Which is best finetuning SDXL vs Cascade when it comes to realism?

stiff dust
#

also you should always use a few warmup iterations. The AdamW optimizer is in an unstable state in the beginning and needs some time to adapt to the data. With a warmup of, say, 50 steps, you let the learning rate increase gradually to your desired value over the first 50 steps, giving the optimizer time to collect statistics
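
A minimal sketch of what such a warmup looks like as a function of the step number. The base learning rate of 1e-4 is only an illustrative value, the rate stays constant after the warmup, and real schedulers may ramp slightly differently (for example starting from exactly zero).

```python
def warmup_lr(step, base_lr=1e-4, warmup_steps=50):
    """Learning rate at a given step: linear ramp, then constant."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    # step is 0-based, so the last warmup step lands exactly on base_lr.
    return base_lr * (step + 1) / warmup_steps


for step in (0, 24, 49, 500):
    print(step, warmup_lr(step))
```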

#

beyond that there is no huge difference between Lora and full finetuning (beyond Lora being more parameter efficient).

pale hawk
stone garden
#

Nice to meet you, I'm Japanese.
Let me ask you a question.

I decided to give style learning a try, and I did so while referring to this site.
https://romptn.com/article/22757
The SD Ver is local environment WebUI 1.5, Windows 11-64, memory 16GB, GPU: Ge Force RTX 3060Ti 8GB, but even if I press the learning button, it stays in "standby" and no images are created. (This is the "Train 'embedding'" part of the above site)
I restarted my PC and closed unnecessary programs, but if there is any solution, please let me know.

*If possible, it would be easier to notice if you write in a reply.

romptn Magazine

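"DreamArtist" lets you create your own original "embedding". This article walks through the whole process, from training a character embedding on a single image of "Zundamon" to generating "Zundamon-style" images.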

bronze igloo
#

Hello all. I'm working on training a ControlNet to remove furniture from furnished room photos. Wondering if anyone has done anything similar - but after training for about 5 days, it seems to have plateaued. I've posted details here in case anyone can help: https://github.com/lllyasviel/ControlNet/issues/659

GitHub

I'm working on a project to take images of furnished rooms and remove all the furniture. I've got a large dataset of image pairs. I'm not using any preprocessing on the images so as to ...

tame vortex
shut pike
#

So SD3 got captioned with CogVLM - is there a source for good captioning prompts (that detail the image, subjects, their clothes, pose etc. & also judges the image quality) ?

jade hornet
#

I wouldnt have thought SD would be a suitable application for that

#

somehow the AI needs to understand the difference between what constitutes the empty room and the "stuff"

stone garden
tame vortex
#

yes

#

but the whole log please.

#

the whole text

#

copy paste it in a .txt. Then drop that file in here

stone garden
# tame vortex yes

Thank you, I understand.
(If you reply with a quote, you will receive a notification, so it will be easier to notice, so please help us)

Next time I teach the style, I will paste the log.

stone garden
hollow spruce
# stone garden I have copied the text from the console screen, so please check it. I think the...

| AssertionError: Training models with lowvram not possible

The final line here describes the issue.

Here's an explanation for why this is not working, on the PC you are currently using:
You are using a GeForce RTX 3060Ti 8GB.

8GB VRAM is not a lot when it comes to AI Image generation. It is basically the bare minimum to generate.
While generating, the program uses tricks to reduce the amount of VRAM that it needs, which allows you to generate.

For training, it needs all parts of the model loaded at the same time, in order to train it.
8GB are not enough to do this while using stable-diffusion-webui

There are ways to still do it on your pc, by using the program kohya or onetrainer. But it wont be easy nor very fun due to the issues you will face on an 8GB vram card.

16GB VRAM will let you do all kinds of trainings efficiently.

Else, you can also use online service to train your dataset for you. (Civitai has a free version of this. Some other sites also provide such services.)

hollow spruce
# bronze igloo Hello all. I'm working on training a ControlNet to remove furniture from furnish...

this needs a few more examples, using new pictures which weren't in your training dataset, to see what your model has learned so far.
There's a chance it's working. But there's also a chance that it learned to work only on your training dataset... which is another way to say, you've overfit it to hell and back.
(If however it is working as intended, then there are a lot of things which can be done to improve from your current situation.)

hollow spruce
# shut pike So SD3 got captioned with CogVLM - is there a source for good captioning prompts...

no one-fits-all solution sadly.

there are a few basic prompts, which will work on all images, at the cost of worse captions.

but you'll be much better off if you can segment your datasets into categories.
For example, this is the prompt I used to generate captions for all images that have one woman in them: "Caption the woman, pose and background"
It's given me the best results.
(Protip: use cogagent! better results than with cogvlm, barely more vram used)
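The idea of segmenting a dataset into categories and using one captioning prompt per category can be sketched as follows. The category name `single_woman` and the helper names are hypothetical; only the two prompt strings are taken from this channel. The sketch only plans which prompt goes with which image; calling the actual captioning model is left out.

```python
# Hypothetical category names; only the prompt strings come from the chat.
CAPTION_PROMPTS = {
    "single_woman": "Caption the woman, pose and background",
}

# Generic prompt that works on any image, at the cost of worse captions.
FALLBACK_PROMPT = "caption this image"


def prompt_for(category: str) -> str:
    """Return the category-specific prompt, or the generic one if none exists."""
    return CAPTION_PROMPTS.get(category, FALLBACK_PROMPT)


def plan_captioning(images_by_category: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Pair every image path with the prompt chosen for its category."""
    jobs = []
    for category, paths in images_by_category.items():
        prompt = prompt_for(category)
        for path in paths:
            jobs.append((path, prompt))
    return jobs


if __name__ == "__main__":
    dataset = {"single_woman": ["a.png", "b.png"], "landscape": ["c.png"]}
    for path, prompt in plan_captioning(dataset):
        print(path, "->", prompt)
```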

stone garden
hollow spruce
stone garden
hollow spruce
shut pike
bronze igloo
unreal trench
#

Hi! is there a place like Civit where I could download captioned datasets for Dreambooth LoRA? How to create regularisation datasets?

jade hornet
hollow spruce
#

but use the filter and find just what you need. 120000 datasets currently exist. common topics will be easy to find. niche topics require luck, or you can be the difference you want to see!

hollow spruce
restive bobcat
#

I would like to try and replicate the green screen LoRA https://civitai.com/models/240019/green-screen, for gaining training experience, further model generalization, and own use cases.

The end goal would be to have a green screen LoRA to generate multiple characters, in any pose, on a green screen background.

For training I would rent GPU VMs on Google Cloud, perhaps with a budget of ~ $100 per month. I have for example a 16 GB VRAM instance (stopped atm), just for occasional testing generation/ port forwarding for local Web UIs etc. I do know a bit of Python.

From all the super cool discussions in this channel, I understand that I need to start from curating a good dataset, with good captioning. It should be feasible to collect enough green screen images from Shutterstock etc, in addition to generating from the mentioned LoRA. Based on a minimum of 400 (?) images, I would try and generate prompts with Cog-something. Then I should caption the images like grScr anime old man in coat standing, basically describing all features I don't want the model to learn.

Am I somehow on the right track? Based on my budget, how many images should I realistically aim for in my training dataset?

EDIT: most green screen stock photo is actually video. Could I split each video into its still images, with each image from the video given identical caption describing everything but the background, and use all of those images in the training data set? The background is "all" the model needs to learn, right?

Example: 2 seconds of 25 fps video of woman dancing in front of green screen. Split video to 50 images with caption gr_scr a woman in casual clothing dancing.
Too easy?
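The frame-splitting idea above can be sketched like this, assuming the frames have already been extracted into a folder and that the trainer reads captions from a `.txt` file with the same base name as each image (a common convention, but check your trainer's settings). The function names are hypothetical.

```python
from pathlib import Path


def expected_frame_count(duration_s: float, fps: float) -> int:
    """Number of still images a clip yields if every frame is kept."""
    return round(duration_s * fps)


def write_shared_captions(frame_dir: Path, caption: str,
                          image_suffix: str = ".png") -> int:
    """Write the same caption next to every frame; return how many were written."""
    count = 0
    for frame in sorted(frame_dir.glob(f"*{image_suffix}")):
        # frame_0001.png -> frame_0001.txt
        frame.with_suffix(".txt").write_text(caption + "\n", encoding="utf-8")
        count += 1
    return count


if __name__ == "__main__":
    # 2 seconds at 25 fps, as in the example above.
    print(expected_frame_count(2, 25))
    # write_shared_captions(Path("frames"), "gr_scr a woman in casual clothing dancing")
```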

shut pike
#

I think I'll go crazy for SD3 and use CogVLM to caption my images with several natural language tags for different areas (like subject, composition/lighting, etc.) - what do you think? Too much or a good idea for a Model with a 512 token limit?

foggy inlet
#

@shut pike you might want to try out MoAI, too - it got pretty nice results in benchmarks (I posted info about it a few days ago, on #1003207327203209236)

shut pike
foggy inlet
#

yeah, figuring out a good prompt might be really important - that will result in an accurate description of both the content and all the important visual aspects

#

just mind even the best of currently available VLMs can struggle with some cases, even simple ones:

shut pike
#

I would never trust auto-tags.

foggy inlet
#

so it might be worth figuring out some safety system for that - ideally automated (for example, ask both CogVLM and MoAI for a description, and then ask GPT-4 or Claude whether the output from both models describes the same image)
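A minimal sketch of that cross-check, assuming the two captioners and the judge are supplied as plain Python callables. The names `cross_checked_caption`, `Captioner` and `Judge` are hypothetical, and no real model API is called here; in practice each callable would wrap whatever model you run.

```python
from typing import Callable, NamedTuple

Captioner = Callable[[str], str]      # image path -> caption
Judge = Callable[[str, str], bool]    # two captions -> do they describe the same image?


class CheckedCaption(NamedTuple):
    caption: str
    needs_review: bool


def cross_checked_caption(image_path: str,
                          captioner_a: Captioner,
                          captioner_b: Captioner,
                          judge: Judge) -> CheckedCaption:
    """Caption with two models and flag the image if the judge sees a mismatch."""
    caption_a = captioner_a(image_path)
    caption_b = captioner_b(image_path)
    agrees = judge(caption_a, caption_b)
    # Keep the first caption either way, but mark disagreements for a human.
    return CheckedCaption(caption=caption_a, needs_review=not agrees)


if __name__ == "__main__":
    # Stub callables standing in for real models.
    result = cross_checked_caption(
        "img.png",
        lambda p: "a red car",
        lambda p: "a blue boat",
        lambda a, b: a == b,
    )
    print(result)
```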

#

and if you want to do good finetuning, don't forget about regularization - there are multiple models on CivitAI that are terrible at that (turning males into females, making all faces look the same, etc.)

dusky urchin
#

you also need a 24GB GPU if you want to run CogVLM at 4bit, or get an 80GB GPU. it doesn't perform as well quantized.

hollow spruce
#

cogvlm, with the prompt: caption this image

hollow spruce
#

noteworthy mention. this is specifically in regards to using cogvlm for captioning datasets. If you wanna talk with it, or iterate on a conversation... then yeah. 4bit is terrible

restive bobcat
# dusky urchin what is the use case?

My use case is to generate characters and backgrounds separately, and blend them together in a video editing program like DaVinci Resolve. I believe this process is called chroma keying? I just got a hunch that it should work, but how good I don't know.

foggy inlet
#

@hollow spruce I just tested moondream2 (https://huggingface.co/spaces/vikhyatk/moondream2) and indeed it handled digits test nicely. I had to change your prompt a bit to produce better captions for more complex images, but I like this model - it seems it might be ready for SD3 era (at least for simple use cases) - thx for sharing it 🙂

foggy inlet
#

(phi-1.5 might be the limiting factor for more complex scenes. something similar to it, but based on phi-2 or T5-XXL + CLIP + OpenCLIP could increase compatibility with SD3 style prompting)

broken mulch
#

Hey everyone, I'm curious about something. If I use specific keywords and elements to train an SD Lora for creating images, and then later change up these keywords to design clothes, do you think the designs and elements on the clothes will come out consistently? Has anyone experimented with this kind of thing before?

drifting mirage
#

Hi! It's my first time preparing a dataset for Kohya SS. This will be a photo style, for realistic portraits. The set includes photos from 750 x 1000 to 2500 x 3700. Please help me understand a few points:

  1. What resolution is optimal for Lora SDXL? Maximum quality is important to me. Obviously, for 750 x 1000 there needs to be an upscale, and 2500 x 3700 needs to be downscaled, but to what extent?
  2. Should the dimensions be multiples of 16/32/64? Does it matter?
  3. Is it worth even at x1 to increase the detail, clarity, and remove jpeg artifacts, for example, using SUPIR?
  4. Is it worth compressing with 100% quality using some jpeg optimizer? The dataset is large, 200+ photos, this may affect the speed of training.
dusky urchin
#

and an sd 1.5 model. use the "joint" workflow

dusky urchin
dusky urchin
#

fine tuning will generally cause outputs to be less creative, not more.

dusky urchin
#

you cannot "see" unease and mystery

#

looking critically at how many words it uses that are not visible things, and how few actually are, it is not good at all at the task you need it to do

#

"but i don't know though"

foggy inlet
# dusky urchin you cannot "see" unease and mystery

Some models can see "unease" - here you have sample caption from llava-v1.6-34b (last paragraph):

In the image, there's a character with a futuristic appearance, seated in a contemplative pose on a rocky outcropping. The character is wearing a black body armor with pink lights that suggest technological functions. The armor's design is sleek and polished, with a headpiece that includes a visor and what appears to be a communication device.

The character is facing towards the right side of the image, where a large, towering structure looms in the distance. This structure is complex and appears to be a fusion of organic and mechanical elements, with tentacles extending outward. It stands against a backdrop of a turbulent sky, where dark clouds or perhaps an otherworldly atmosphere gathers.

The ground is rugged and strewn with debris, suggesting a place that has been through significant events or perhaps a battle. The entire scene is awash in a palette of dark, moody colors with accents of pink and purple, contributing to a somber and mysterious atmosphere.

The art style is detailed and realistic with a touch of surrealism, given the fantastical elements present. The image quality is high, with a texture that suggests a digital painting. The level of detail is impressive, from the individual strands of the character's hair to the intricate patterns on the armor.

The overall atmosphere of the image is one of solitude and introspection, with a sense of anticipation or unease. The juxtaposition of the character's calm demeanor with the chaotic and threatening environment creates a powerful visual narrative.
#

Prompt was

Please describe this image, so another person could imagine the same picture. Include all the relevant information about the content, artistic style, image quality, interesting visual aspects, and the general atmosphere of the image. Be accurate, and concise.
#

hm, but even the tiny moondream2 noticed the sense of unease in this image - I am not sure why you've said it cannot be seen

dusky urchin
#

and you can call that unease

#

and i'm trying to tell you that using the concrete visible words make a useful caption

#

for these purposes, but also for the purposes of making visual art, and prompting

dusky urchin
#

it gave you the opposite of concise

foggy inlet
#

not really - it was as concise as it could to describe the things I've asked for

#

and the T5-XXL in SD3 should be able to comprehend long prompts like that

#

and with SD3 model trained on longer and more detailed captions from CogVLM I would expect SD3 to be able to generate visualizations of more complex and abstract ideas pretty well, too - we'll probably see in a month or so, when SAI will publish the weights and we'll be able to experiment with various parts of the model

dusky urchin
#

if you wanted to use this for training, maybe a caption would be

a digital airbrush illustration of a rear wide angle shot of a skinny brunette woman wearing black spandex, black science fiction futuristic skinny body armor with pink energy coming from a window on the armor in the back; pink fringed neon lights on her armor; illuminated pink headphones with a pink band behind instead of on top of her head with short antenna; she is in a "thunderbolt" pose seated on her knees with her legs more apart than the traditional pose, and her legs appear fused into black webbing and tissue with an octopus tentacle tip going from her butt to the foreground bottom edge of the frame; the black tissue webbing is fused with red grass and branches of a large black tentacle form evocative of a tree, with black and purple branches above her at the top of the frame; a pink waxlike substance is melting from these branches like cables hanging. in the midground is a series of low mountaintop or wave crest forms behind and to the right of the woman; and in the upper right corner deep in the background is a platform superstructure building high in the sky, with curved cable forms as piers into the mountaintop/ocean elements. it is illuminated by small red lights; the structure appears to be a few large boxy platforms with darker greebles at the top, using a pink-purple-black palette. the sky is gesturally rendered clouds illuminated by a mostly set sun in the background, the atmosphere is light orange.

#

i think science fiction futuristic skinny body armor is not good

#

i would have to think of a better way to describe what she is wearing but honestly it's kind of vague in the image too

#

@foggy inlet do you see my point?

#

these captions are helpful because if you want the model to generalize from conditioning correctly, it has to recognize when "the same thing" appears in different contexts

#

so if you wanted to generate an image of something out of sample - technically you always want to do this

#

okay, let's say the model has never seen "spiderman" before

#

like that word has never been used

#

it would be nice if it had seen a lot of examples of different costumed people correctly captioned with the elements of the costume

#

merely saying "batman" does not help you at all tackle the problem of rendering "spiderman"

#

am i making sense?

#

you can't "see" batman either, in the same way you can't "see" a person who isn't a celebrity

#

batman is a name for a collection of real visual artifacts.

#

so listen, you can "see" unease, but it's not helpful for training or generating out of sample images, which is like, the problem you are trying to solve

#

take it or leave it, this is my professional opinion

#

you could ask the LLM to "describe the collection of concrete visual elements used to express the emotions in this text"

#

i don't know how llava will work with that, but that's what the goal would be

foggy inlet
#

well I agree model should see "the same thing" in a lot of different context to understand it, but I am not sure if we'd need that kind of description like you provided for that.

imagine you were talking to an artist - would you describe what you want to him in every tiny detail, or rather describe the high-level concept and composition, atmosphere, style etc. and then let him do his job as a domain expert and figure out the details based on his vast experience?

dusky urchin
#

i hear what you are saying

#

i don't think the model is creative, in the same way ChatGPT struggles to be creative

foggy inlet
#

SD3 is getting there

dusky urchin
# foggy inlet SD3 is getting there

they can only go so far with the resources they have. for all the things they set up in unreal engine, such as millions of images of different three object juxtapositions and placements, it doesn't help with five objects. the model is totally capable of correctly generating images with five objects, but the state of the art approach to this stuff is limited by the generalizability of the conditioning

#

i think this is also why dall-e3 has such a "look" whereas SDXL does not

foggy inlet
#

yeah, but I've seen MJ responding pretty well to longer descriptions even a long time ago, back in v4 times - so it's surely possible

dusky urchin
#

it is "undertrained" but also less conditioned

foggy inlet
#

and current alpha SD3 revisions can respond pretty well to poetry, too:
https://twitter.com/thibaudz/status/1768009402970263667

long prompt on sd3: Tomorrow, at dawn, at the hour when the countryside whitens,
I will set out.  You see, I know that you wait for me.
I will go by the forest, I will go by the mountain.
I can no longer remain far from you.

I will walk with my eyes fixed on my thoughts,
Seeing…

dusky urchin
#

i think we have a similar experience

dusky urchin
#

they're just so hard to make and train

#

someone will have to publish it first

#

lots of work

#

dall-e3 is going to be SOTA for a while. they have telemetry. they know what people prompted for dall-e2 and 1, and hence could generate training data in whatever to improve results for those prompts

#

sd3 does not

#

stability has no meaningful telemetry

foggy inlet
#

true, but there are also new methods to increase speed of training, like BTX - stuff like that will help

dusky urchin
#

time will tell, but midjourney and dall-e4 have better odds

dusky urchin
#

they also need to transition to pixel diffusion like IF

foggy inlet
#

then BitNet 1.58 - but that needs quantization during training to work, if I recall correctly

dusky urchin
#

lots of things that need to happen

#

yeah

#

i don't know if midjourney published on their model

#

i figure they use pixel diffusion because of how well it recreates scenes from movies

#

the movies it is trained on lol

#

you know that meme where it's like a woman sending an email "look it wrote a whole email from a bullet point" and the recipient is saying "look it turned a whole email into a bullet point"?

#

that's midjourney

foggy inlet
#

progressive training is done in the realm of pixels, 256x256, then 512x512 etc. - why can't we do the same in the realm of complexity (simple tasks, medium difficulty tasks, hard tasks), like we train our own brains (no one asks toddlers to learn quantum physics, right?)

dusky urchin
#

it's people saying something vague, and then they're really happy that midjourney is remixing scenes from famous movies

dusky urchin
#

and wuerstchen too, just not pixels

foggy inlet
#

there are also a lot of interesting papers flying around every day, more than I can read - I guess we'll soon need AIs to read all of this with comprehension (at least at a high level), and then help us pick the most promising ideas that could give the biggest improvements - maybe not GPT-4, but somewhat smarter models like Claude 3 Opus or GPT-5, with properly working bigger contexts - those might be able to help us with such tasks, too 🙂

hollow spruce
# drifting mirage Hi! It's my first time preparing a dataset for Kohya SS. This will be a photo st...

Optimal resolution is as close to 1 Megapixel as possible.
That could be 1024x1024, or 1152x896, or similar.

Kohya and onetrainer have this built in natively if you enable resizing + buckets. Then you don't need to worry about cropping or sizes in any way, as long as the image is over 1 Megapixel, as it will just be scaled down to the optimal size.

If you wanna go for absolute quality, then do the resizing yourself via an automated Photoshop process, and save as a png file, so that it's lossless. (That way you avoid jpg artifacts. It's a complicated topic... But the quality increase is incredibly small if you do all this extra work manually)
Better captioning and datasets are always more effective at increasing quality
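To make the "as close to 1 Megapixel as possible" rule concrete, here is a minimal sketch of aspect-ratio bucketing. It is not the exact algorithm used by kohya or onetrainer; the step of 64 pixels, the side limits and the function names are assumptions chosen for illustration.

```python
def candidate_buckets(max_area: int = 1024 * 1024, step: int = 64,
                      min_side: int = 512, max_side: int = 2048) -> list[tuple[int, int]]:
    """Enumerate (width, height) pairs, multiples of `step`, with area <= max_area."""
    buckets = set()
    for width in range(min_side, max_side + 1, step):
        # Largest height (a multiple of step) that keeps the area within budget.
        height = (max_area // width) // step * step
        if min_side <= height <= max_side:
            buckets.add((width, height))
            buckets.add((height, width))  # also keep the rotated orientation
    return sorted(buckets)


def closest_bucket(width: int, height: int,
                   buckets: list[tuple[int, int]]) -> tuple[int, int]:
    """Pick the bucket whose aspect ratio is nearest to the image's."""
    aspect = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - aspect))


if __name__ == "__main__":
    buckets = candidate_buckets()
    # The two extremes of the dataset described in the question.
    print(closest_bucket(750, 1000, buckets))
    print(closest_bucket(2500, 3700, buckets))
```

With these assumed settings, both portrait sizes from the question land in a portrait bucket near 1 Megapixel, so neither needs manual cropping.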

drifting mirage
dusky urchin
jade hornet
#

heh, yah more photo realistic seems to be like an endless journey, doesnt it

small eagle
#

anyone have a current, up-to-date version of this but for local (instead of Google Colab)?

#

have a pair of 4090 24G cards with 64GRam, i wanna give training an sdxl lora a try

small eagle
#

got kohya setup, same machine i have comfy installed, seems like the cuda version is different
can kohya force the version comfy is using? or am i gonna need to fully upgrade everything to get kohya to run?

The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib64')}
DEBUG: Possible options found for libcudart.so: set()
CUDA SETUP: PyTorch settings found: CUDA_VERSION=118, Highest Compute Capability: 8.9.
CUDA SETUP: To manually override the PyTorch CUDA version please see:https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
CUDA SETUP: Loading binary /home/vender3d/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
libcusparse.so.11: cannot open shared object file: No such file or directory
CUDA SETUP: Problem: The main issue seems to be that the main CUDA runtime library was not detected.
CUDA SETUP: Solution 1: To solve the issue the libcudart.so location needs to be added to the LD_LIBRARY_PATH variable
CUDA SETUP: Solution 1a): Find the cuda runtime library via: find / -name
drifting mirage
dusky urchin
dusky urchin
#

in terms of interacting with python and completing basic programming and IT tasks, it's better to ask chatgpt

drifting mirage
dusky urchin
#

you need a lot of photographs and processing per person if you want to recreate a non-celebrity likeness in SDXL flexibly. are these going to be photos of real people that you take on a stage? do you have hundreds of photos per person? did you caption it?

neon quail
#

Hi guys, SDXL finetune question: I have 5000+ large high-resolution images and I don't want to crop and lose part of the image, so I resized them to fit 1024×1024 and now I have a black border frame. Will this frame be in the final result, or will it be ignored during training?

#

Example

dusky urchin
#

the black border frame, all else being equal, will 100% appear in anything created by your fine tuning

neon quail
#

My goal is training on the full image without losing any part, because the images are mostly tall rather than square, and I will lose the body or the clothes if I crop them to 1024×1024. For now I resized all the images to fit in 1024×1024 but with black borders. Will buckets work if I resize only the height?

ruby pond
neon quail
#

Thank you 🙏 🌹

dusky urchin
neon quail
dusky urchin
neon quail
#

So i will try to resize the height to 1024 and let the buckets deal with width 🤔
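The difference between padding to a square and resizing only the height can be sketched with a little arithmetic. The 2000x3000 source size is a hypothetical example (no image dimensions were given), and the function names are made up for illustration.

```python
def resize_to_height(width: int, height: int, target_height: int = 1024) -> tuple[int, int]:
    """Scale an image so its height matches target_height, keeping the aspect ratio."""
    scale = target_height / height
    return round(width * scale), target_height


def padded_fraction(width: int, height: int, canvas: int = 1024) -> float:
    """Share of a square canvas that would be black border if the image were padded."""
    return 1 - (width * height) / (canvas * canvas)


if __name__ == "__main__":
    w, h = resize_to_height(2000, 3000)   # hypothetical tall source image
    print(w, h)
    print(f"{padded_fraction(w, h):.1%} of a 1024x1024 canvas would be border")
```

For this hypothetical image, about a third of a padded square would be black border, which is the part bucketing avoids training on.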

gentle flame
river umbra
#

Hey guys, need some help
I'm training my stable diffusion model with Dreambooth. I have 100 images and I go for 20 steps per image, so I have 20600 steps in total and it tells me that it will take 50 hours. How can I decrease the time that the training will take? Is it better to get rid of some images or decrease the amount of steps? Thanks in advance!
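Since training time grows roughly linearly with the number of steps, the trade-off can be sketched with simple arithmetic. This takes the 20600 steps and 50 hours quoted above at face value; the helper names are hypothetical, and real speed also depends on batch size, resolution and hardware.

```python
def seconds_per_step(total_hours: float, total_steps: int) -> float:
    """Average time per training step implied by a total-time estimate."""
    return total_hours * 3600 / total_steps


def estimated_hours(total_steps: int, sec_per_step: float) -> float:
    """Total training time for a given step count at a fixed speed."""
    return total_steps * sec_per_step / 3600


if __name__ == "__main__":
    speed = seconds_per_step(50, 20600)
    print(f"{speed:.2f} s/step")
    # Halving the step count (fewer images or fewer steps per image) halves the time.
    print(f"{estimated_hours(10300, speed):.1f} h")
```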

#

I'm creating my own architecture model

dusky urchin
#

you mean architecture as in Architecture

#

designing buildings

#

share the output of nvidia-smi

jade hornet
charred socket
#

I'm training a style LoRA for SDXL, which preset should I use? I'm a beginner T-T

hollow spruce
#

Try using kohya or onetrainer. They built in a lot of optimizations, so it will run a lot faster.

jade hornet
celest pier
#

Hi everyone! I want to finetune stable diffusion but after loading its components separately from huggingface. I have loaded them separately and am using them for inference but I am struggling with how to finetune it now

#

as far as my intuition goes, I am supposed to freeze the VAE and text encoder and train the UNet?

#

This is the inference pipeline from their documentation

#

but how do i finetune this now

hollow spruce
little dust
hollow spruce
# little dust Does that work for stable video diffusion?

community sentiment goes more towards animatediff, since it's compatible with most existing workflows and app pipelines

check that out if you wanna help improve video generation.

if your goal is actually SVD specifically, then there are only a few groups and companies actively working on it. It's not easy, nor plug & play. You probably won't find much help online, as it's too complicated to casually walk into, and it also requires significant hardware and compute to even make sense

little dust
hollow spruce
little dust
#

I mean I can just load it in huggingface and run a pipeline from stability's gen-AI, no? @hollow spruce

bold geyser
#

Any tips for epoch number / batch size / repeats? Could any of these settings be the reason I'm getting a json file instead of checkpoints or s_ ? I'm training a small model on a few project-based photos, around 20. Any link with settings? I have plenty of RAM (128 GB, latest M3 chip). Base model LoRA?

jade hornet
#

the json file is normal, every time you run training it'll output the config you used for that specific batch

#

if you want a beginner level kohya_ss tutorial I can link a decent one

#

try this: it's not super new, but it's very thorough https://youtu.be/xXNr9mrdV7s?si=xAfvABoxq1VzWndY

LORA training guide/tutorial so you can understand how to use the important parameters on KohyaSS.

bold geyser
drifting mirage
#

Hi! I am trying to train a LoRA for the first time. I seem to have done everything according to the Aitrepreneur tutorial, but Kohya stops right at the beginning. Honestly, I don't even understand which line exactly is the error here. Please help me understand what I'm doing wrong (there's a lot of repetition between the 2nd and 3rd screenshots).

hollow spruce
#

next to where you set the path to the base model

queen matrix
#

That was my guess too.

drifting mirage
#

On my Win11 machine I have CUDA 12.4 installed over 11.8. Does this affect Kohya? Kohya was installed a week ago as a clean, unmodified latest version. Does it have its own CUDA inside, not connected to the global one?

hollow spruce
#

it does look like a cuda error tbh x_x

#

I'd first try onetrainer, see if that works, so you don't have to mess with your installs

#

assuming you're just getting started, use the included preset for sdxl lora. that one is nearly flawless for starting out

drifting mirage
regal harbor
#

so I wonder how well SDXL understands relativity. Say I'm training an IRL Homer, so I'm finding various men who look like Homer might if he were real.

In one photo, he looks like homer, but not as fat. Should I tag it "skinny" even though he's not really skinny? But he's skinny compared to Homer

dusky urchin
# regal harbor so I wonder how well SDXL understands relativity. Say I'm training an IRL Homer,...

you have to define what is essential about "an in real life homer," which is full of valid subjective judgements about the art that you have to express formally.

adding the skinny caption will improve the performance of the fine tuning. imagine there are two ways for the training process to "learn" how to generate your image:

  • the caption closely resembles your image. this means that the image it generates "starts out" looking like your image early on. from then on, only small changes to the parameters are needed.
  • there is no caption. the training process will make a lot of changes to a lot of parameters. if you don't have other training data that depends on those parameters, those changes will stick, along with whatever changes actually improve its ability to generate your particular image.

so under the theory that only a small subset of parameters are needed to improve generation of your images (true), poorly described captions will cause a "loss of generalization" aka lots of other spurious parameter changes are "kept" along with the small number of changes that actually improve performance.

#

as long as you are using captions that describe what you can concretely see, you will get good results.

#

when training styles, people omit the captions because they want a lot of changes to a lot of parameters.

#

a full fine tune versus lora fine tune also helps. a lora fine tuning has so many fewer parameters that the effect of having bad captions or no captions is diminished.

dusky urchin
drifting mirage
#

Hey guys! What is your Kohya training speed on a 4090? On Windows. Yesterday I was able to run Kohya and trained a couple of models for the first time, everything works ok but the speed... 2.30-2.50s/it on XL training, xformers, batch size 5. It's not okay, right?

hollow spruce
#

I get about 5.5s/it using batch 8 + adamW (no normalizing, on windows with overhead)

there are a bunch of settings that make it go slightly faster or slower. but they're marginal. so your speed looks pretty normal
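Seconds per iteration are only comparable across setups once batch size is taken into account. A minimal sketch, using the two figures quoted above (2.4 s/it is assumed here as the midpoint of the 2.30-2.50 range):

```python
def images_per_second(batch_size: int, seconds_per_iteration: float) -> float:
    """Throughput in images per second for a given batch size and step time."""
    return batch_size / seconds_per_iteration


if __name__ == "__main__":
    print(f"batch 5 @ 2.4 s/it: {images_per_second(5, 2.4):.2f} img/s")
    print(f"batch 8 @ 5.5 s/it: {images_per_second(8, 5.5):.2f} img/s")
```

Other settings (optimizer, resolution, precision) differ between the two runs, so this is only a rough way to put both numbers on the same scale.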

regal harbor
#

Are you saying tagging like that will screw the model?

dusky urchin
#

you are asking for a "brother, i "just" need an answer" answer, which is impossible

#

if your captions accurately describe what can be seen in the image, your model training will occur "faster" in the sense that the fine tuning will be able to create the images in your dataset with fewer iterations of backpropagation aka in less time

#

homer simpson is already in the sdxl training dataset

#

you are not teaching it a new concept

#

sdxl already knows what homer simpson specifically is. it might not know "the complete collection of concrete visual elements that make up homer simpson" are related to "homer simpson" 100%, but it might

dusky urchin
# regal harbor I was thinking of e.g. "Homer sitting in Moe's_Tavern drinking beer wearing a p...

I don't tag "bald, beard, fat" because these elements are essential to Homer.
CLIP already knows what "homer simpson" is. however the distance between bald, beard and fat to homer simpson is probably larger than you would assume. you could probably improve the speed at which the model trains by including bald beard and fat. but since you are using the word essential, i think you are still dancing around the hard task of deciding what subjectively defines "an in real life depiction of homer simpson"

#

if you spent 5 minutes writing down the concrete visual elements that describe an in real life homer simpson, you will be able to make much better captions

primal pawn
#

Hi !
Hope everyone's fine.

So, I'm developing a diffusion model for a project that converts text inputs into image outputs (Text to layouts). The stable diffusion model seems to be the most suitable option for this task. My datasets consist of 4003, 256x256 images, each accompanied by detailed captions (Roughly 250 words) in text format. These datasets are hosted on Hugging Face : https://huggingface.co/datasets/jkanishkha0305/text-based-layout-generation-dataset.

However, during training (using the Keras implementation: https://keras.io/examples/generative/finetune_stable_diffusion/), the model encounters an issue related to CLIP embedding, specifically mentioning a "ValueError" due to a shape mismatch. The error message states: "Cannot assign value to variable 'clip_embedding_1/embedding_3/embeddings:0': Shape mismatch. The variable shape (1000, 768), and the assigned value shape (77, 768) are incompatible." This problem, I guess, arises because my captions are very detailed, containing roughly 250 words each.

Additionally, when attempting to train the model with a simpler dataset on platforms like Colab or Kaggle, I encounter "OOM" (Out Of Memory) issues, likely due to limited GPU memory (15GB).

I have additionally tried the method specified here: https://github.com/huggingface/diffusers/tree/main/examples/text_to_image. But it runs into "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 114.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 101.06 MiB is free." So can someone suggest better ways, please? For example, should I go with a TPU v3?

I need assistance in resolving these issues. So any help or guidance related to fine-tuning a Stable Diffusion model using custom text (captions) with an image dataset would be greatly appreciated.

Thank you.

primal pawn
stiff dust
#

that's a CLIP issue, which can only deal with 77 tokens at max
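To illustrate the 77-token limit: each CLIP input holds a start token, up to 75 content tokens, and at least one end token. One workaround some tools use for longer prompts is to split the token ids into several 77-token windows. The sketch below shows only that splitting step on plain integer ids; the special token ids are assumed to be those of the standard CLIP tokenizer, and how a model would combine the resulting chunks is outside this sketch.

```python
BOS_ID = 49406   # assumed <|startoftext|> id in the CLIP tokenizer
EOS_ID = 49407   # assumed <|endoftext|> id, also used here as padding
MAX_LEN = 77     # CLIP text encoder context length


def chunk_token_ids(token_ids: list[int], max_len: int = MAX_LEN) -> list[list[int]]:
    """Split content token ids into windows of exactly max_len, with BOS/EOS added."""
    body = max_len - 2  # room left for content tokens in each window
    chunks = []
    # max(..., 1) ensures an empty prompt still yields one padded window.
    for start in range(0, max(len(token_ids), 1), body):
        piece = token_ids[start:start + body]
        padded = [BOS_ID] + piece + [EOS_ID] * (max_len - 1 - len(piece))
        chunks.append(padded)
    return chunks


if __name__ == "__main__":
    fake_ids = list(range(1000, 1250))  # stand-in for a 250-token caption
    chunks = chunk_token_ids(fake_ids)
    print(len(chunks), [len(c) for c in chunks])
```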

#

also, I don't think that CLIP can deal with your images at all...

#

I would say a general diffusion model is just not the right tool for your task

#

your problem has nothing to do with image generation at all. You want to extract room geometries from a text prompt. So instead of making images, just make geometries like <roomtype, x, y, width, height>. So either train a custom model on top of a text foundation model or train an instruct model that generates these geometries from a text
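The `<roomtype, x, y, width, height>` representation suggested above could look like this as a minimal sketch. The `Room` class and the helper names are hypothetical; the point is only that a layout becomes plain text that a text model can be trained to produce and that can be parsed back into geometry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Room:
    roomtype: str
    x: int
    y: int
    width: int
    height: int


def format_room(room: Room) -> str:
    """Serialize a room as '<roomtype, x, y, width, height>'."""
    return f"<{room.roomtype}, {room.x}, {room.y}, {room.width}, {room.height}>"


def parse_room(text: str) -> Room:
    """Parse the text form back into a Room, raising ValueError if malformed."""
    inner = text.strip()
    if not (inner.startswith("<") and inner.endswith(">")):
        raise ValueError(f"not a room tuple: {text!r}")
    parts = [p.strip() for p in inner[1:-1].split(",")]
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    roomtype, *numbers = parts
    x, y, width, height = (int(n) for n in numbers)
    return Room(roomtype, x, y, width, height)


if __name__ == "__main__":
    layout = [Room("kitchen", 0, 0, 120, 96), Room("bedroom", 120, 0, 136, 96)]
    text = " ".join(format_room(r) for r in layout)
    print(text)
```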

regal harbor
#

However, if he were missing an arm, I might tag "amputated" or whatever descriptive word for that the model already knows.

gentle flame
gentle flame
dusky urchin
# regal harbor But baldness is intrinsic to Homer, isn't tagging bald redundant? You don't tag...

You don't tag in every photo "person, man, Homer, eyes, nose, mouth, lips, teeth, beard, chin, neck, shoulders, arms, hands, fingers" etc, right?
you kind of should. it has nothing to do with being intrinsic or extrinsic to homer. it's that when you train on a dataset of billions of images from the public, and for hundreds to tens of thousands of epochs, there are going to be weights in the unet that are "reused" both for generating homer and for generating bald men with a complete set of organs and limbs. when you are training for some BS number of epochs on a vanishingly small number of parameters and very little data, you are hijacking preexisting stuff and "multiplying" it a little bit to bias the whole, complex process towards making copies of your image. if you want the process to find those multiplying coefficients faster, you should say person, man, homer, etc., because an image with those details is more likely to look like your goal per step of denoising.

dusky urchin
#

you still haven't said what an IRL homer should look like

#

until you specify that you are just stabbing in the dark following guides

hollow spruce
# regal harbor But baldness is intrinsic to Homer, isn't tagging bald redundant? You don't tag...

That's the logic of tagging "what is unique to this image"

Pangloss' method of tagging every feature is similar to how Anime models are trained. Every feature is mentioned. And while prompting is more work afterwards, they are undeniably accurate, and ignore less keywords, while also hallucinating less.

For all of my big datasets, I tag every damn feature. Due to that, they are consistently better at following prompts, and have significantly less token bleed in general.

For super small datasets, the most efficient way is to have a trigger word, and write a short sentence description of everything visible.

There are models where training becomes different, harder, or easier.
But for all base sdxl derivatives, this holds true

#

btw, training via 'whats unique' is a pretty easy and good way of training. But there's a definite "quality ceiling" you'll hit with it.
So it's not wrong to train like that, but it is important that once you want to achieve certain details, or to train multiple concepts into a single lora, you change your method of captioning to one which has a higher quality ceiling

regal harbor
regal harbor
regal harbor
#

I suppose after training the model/lora, I could also train a TI, in order to get all those Homer details without using so many tokens

primal pawn
mighty magnet
#

I have now trained a couple of loras, but I always have the issue that I can't continue the training. What do I need to do to make sure I can continue training a lora?

#

I want to do the first few epochs on very basic captioning and then continue training with a set of very detailed captions

dusky urchin
gentle flame
#

I saw it in another discord

hollow spruce
# mighty magnet I have trained now a couple of loras, but always ahve the issue, that I can't co...

kohya & onetrainer both have the option to "continue" from an existing lora.
you do need to keep the core parameters like net rank the same though. but just changing captions shouldn't be an issue (remember to not save captions to disk! as you'll want them to be generated new, when you switch them up later)
(if you have a different folder, with the images again, just different captions, then saving captions to disk is fine though)

#

here is what it looks like in onetrainer

dusky urchin
dusky urchin
#

so if it's rank 32 base, will it resize to rank 128 in that field?

#

or does it error out

shut pike
gentle flame
mighty magnet
stone garden
#

good morning.
I tried to create a LORA from a photo, but I can't.
I can't understand the meaning even if I translate the command prompt text, so could you please explain it to me in a simple way?

stone garden
exotic cipher
jade hornet
#

and you can cheat, using controlnet to generate photos to train a lora too

stone garden
# exotic cipher Which trainer are you using? Kohya, one trainer or some other trainer?

Good morning, thank you for your reply.
I use this method when creating LORA.
https://www.youtube.com/watch?v=N1tXVR9lplM

🐸This video explains how to create your own training data for use with the image-generation AI "Stable Diffusion webUI AUTOMATIC1111" (local version). It is based on the specifications as of April 2023.
🐸This video was made for technical research purposes. It only shows an example of how to use the software and is NOT intended as a recommendation to use it.
...

stone garden
exotic cipher
stone garden
exotic cipher
stone garden
#

Sorry, I have an additional question, but when using photos of real people, does it affect learning if the image size is too large? Please let me know if you have a suitable image size.

exotic cipher
#

i'm sorry, i can't get it to work, all the python files just die on me the second i open them
though just try kohya ss gui, it's slightly better to use since it does have an interface instead of being a command-line tool
my apologies for not being able to help with your issue
the installation process should be similar enough, with the key difference that you launch the setup.bat file and afterwards the gui.bat file

stone garden
exotic cipher
#

it is another application
I would suggest making a new empty folder on your home screen
and follow the instructions described on the github page

stone garden
exotic cipher
stone garden
exotic cipher
#

yes uninstall python and reinstall it.
the changes should all apply to each drive / disk on your computer
when you are reinstalling python make sure these boxes have been checked

stone garden
stone garden
#

(Sorry I didn't notice your reply)

exotic cipher
#

I had to migrate from 3.10.6 to 3.10.9 and I had no issues with stable diffusion or any other program that uses python
What matters is that python is added to PATH so you can run the venv and the pip install lines

celest pier
#

hi

#

anyone has experience

#

with finetuning stable diffusion

#

and is willing to help me out with a project?

exotic cipher
#

I only have experience with kohya ss gui doing lora training
https://github.com/bmaltais/kohya_ss
finetuning is similar to training loras but takes a lot longer and needs a lot more data (about 10-100x that of a lora dataset) from what I have heard
I suggest you start with training a few lora's before you move on to full finetuning

#

@celest pier

stone garden
upper smelt
#

Hello there! I get a strange result when training a LoRA on top of SDXL: when I use the LoRA, the images are oversaturated, with banding artifacts

#

I don't have this issue with SD 1.5

#

I tried playing with the lora's weight, with no luck

#

Is there anyone with an idea?

unique belfry
#

i have a question about this https://civitai.com/models/4201?modelVersionId=130072: are the "no vae" checkpoints ones that require no additional vae, or ones that contain no real vae and require the user to use a separate vae? and are there any vae files on the page? if so, which ones are the vaes and which ones are the checkpoints?

jade hornet
upper smelt
jade hornet
#

sdxl does not generate very well at resolutions less than 1024x1024

amber shard
jade hornet
jade hornet
#

with subject training you describe the scene, the pose, etc. with style training it's normally the opposite, so maybe try no captions except "ohwx style" or some custom keyword you choose

amber shard
latent charm
#

Is 1 epoch with 30 repeats equal to 30 epochs with 1 repeat?

silver dawn
#

Hello,
I have a quick question: any tips on training a model on reflective objects, such as stainless steel bottles? I've captured the dataset myself, but the issue is that reflections on the bottle are visible in all the data, and the model seems to be generating them after training

jade hornet
latent charm
coral canopy
latent charm
stone garden
#

22、

dusky urchin
#

does anyone have any opinions on how to render dice? this seems exceptionally hard

dense sequoia
neon quail
#

hello, please help me not mess this up, because i'm in the middle of captioning 11k images of people

#

do i have to caption and describe only the subject and the clothing, with the background or without the background?

#

what about tagging? do i add the tags after describing the image, like: (man wearing cowboy hat in farm with big mustache, man, mustache, cowboy, 4k, sunset)

#

i want the model to be flexible and focus on the subject, not the background, but with 11k images maybe it's ok to caption the background?

dusky urchin
#

oh yeah

#

i remember now

#

sophisticated users of diffusion models describe everything that is concretely visible in the scene, in unambiguous terms.

#

let's say you have this image:

#

worst: spiderman

bad: spiderman on top of a skyscraper in london

bad-okay: spiderman, squatting, high in the air. behind him is the big ben in london on a sunny day.

#

okay: spiderman, a man dressed in a red and black spiderman costume, squatting with legs outstretched on top of girder high in the air, medium closeup level angle, backgrounded by an in focus shot of the big ben in london far away down below

#

best:

spiderman: a fit young adult male wearing a skintight nylon costume fully covering his body. the costume is red all over with black sections: black from the waist to below the knees, from the bicep to the elbows on both arms, from the tricep to the base of the fingers on his lower arms, with a widely spaced black grid on the red areas evocative of a spider web; a black arachnid symbol in the center of the chest about 2 inches in diameter; and graphic eye paint about three times as large as ordinary human eyes that resemble upside down almond shapes, with a white fill and a thick black stroke and curved accents of the stroke at the distant corner of the eye evocative of spider compound eyes. he is posed with his right knee bent and his left leg bent and outstretched in a low squat, right arm forward, three quarters profile, and left arm out in the [insert the tai chi stance that they are using here, i don't know what it's called]. he is squatting on an architectural, cropped triangular element that resembles 14 inch steel pipes welded together at the top of skyscrapers. he and the architecture are composed on the right side of the frame. behind him deep in the distance is a shot of [the london neighborhood with the big ben], showing the big ben near the bottom of the frame, westminster, then [more description of the concrete visible city elements of london]. the sky is an overexposed bright haze on the left and some sunset illuminated clouds on the right, it appears to be near dusk on a cloudy day.

#

@neon quail do you see

#

this isn't practicable

#

but this is how you get Ideogram quality results

neon quail
#

thank you, this is really helpful. I spent six days, 12 hours every day, separating the images into folders and sorting them by men, women, clothes, then captioning them and manually editing the captions. now I'm on the final 1000 images and I got scared and confused by other captioning tutorials 🌹 🌹

dusky urchin
#

yeah

#

focus on stuff that matters to you

#

ultimately if you have a bad learning rate for your text encoder, you're not even going to use the captions correctly

#

you should test on 5 images and see what kind of results you get first

#

and try to understand what all the parameters do

#

if you plan to use prodigy, you are wasting your time captioning

#

are you doing a full fine tune or lora fine tune?

neon quail
#

i tested on 500 images, learning rate 0.00001, 40 repeats, 1 epoch, and the result is good

#

full fine tune

dusky urchin
#

okay

#

then i don't think your captions matter

neon quail
#

captions don't matter for full fine tune ?

#

is 10 epochs with 1 repeat good for 11k images? or should I go higher with epochs 🤖

real citrus
real citrus
slim gyro
#

does kohya delete image files for the fun of it

#

where are the regularization images?

dusky urchin
#

there's no generalizable advice

#

i get that there exists generalized advice but it doesn't mean it's correct, in even any case

#

i don't know if for @neon quail 's particular problem, if he "just" prompted better, or used "just" clipvision, he could achieve most of what he wants

real citrus
twilit cradle
#

Hey everyone! so.. did kohya newest updates break lora training? all of a sudden now when I make new Loras, they do not work (same settings as before)

full remnant
#

Anyone here who is experienced at LoRA training?
I am looking for anybody who can train and upload a Ponydiffusion LoRA with best parameters, using a high-quality and diverse synthetic dataset I will provide. (No need to credit me)

jade hornet
#

I know that's not specific, basically it burns out too fast

woven sequoia
#

Hello i'm trying to train a model with my mother's artworks, I have a small dataset (to start) of 52 artworks
I would like to let the model hallucinate without any prompts and let it generate new artworks based on the one it has been trained on
Could you help me?

woven sequoia
#

🙏

coral canopy
woven sequoia
#

It's abstract yes

woven sequoia
coral canopy
woven sequoia
#

so the prompt would be empty or "art by name"?

coral canopy
#

Art by name

woven sequoia
#

you're talking about the caption for the embedding right?

coral canopy
#

Caption for each image for training the Lora, yes

woven sequoia
#

okay i'll try that

#

how long is it to train a LORA generally? I have a dataset of 52 pictures to start from

coral canopy
#

What graphics card?

woven sequoia
#

3080

#

10gigs vram

coral canopy
#

I would say not more than 20 min

woven sequoia
#

okay nice!

#

thanks for the help i'll try that

coral canopy
#

If you have problems you can DM me.

woven sequoia
#

thanks FoxHeart

worldly ledge
#

Does someone know the best way to fine-tune on an art style? Is it indeed DreamBooth?

rugged mountain
#

Hi everyone

#

Hello everyone

swift urchin
#

what kind of regularization images should be used for a style?

jade hornet
#

you dont need them, imo

ebon sable
#

so is it just me or do LoRAs seem to overfit really easily on generated images?

jade hornet
#

well, overfit is not necessarily bad, unless you mean like artifacts. when training a lora, sometimes it should be overfit, meaning that you don't want a bunch of variation, you want exactly one specific output

#

the other thing to consider is that some non-base checkpoints don't work well with certain loras. the reason everyone loves juggernaut so much is that in general, it's extremely forgiving to use with loras

#

having said that I downloaded a likeness lora from civit recently, and it had artifacts even at like .5 strength, so not sure what that creator was thinking when they uploaded it

tall condor
#

hi guys, i spent almost the last 2 weeks trying to get kohya_ss up and running
i tried windows 10 as well as ubuntu 22, however i can't seem to get it fully working with the GPUs
does anyone have proper instructions, especially on driver versions, os versions and cuda versions?
i basically tried cuda 11.8, which apparently is almost uninstallable on ubuntu 22 by now because the 520 drivers won't build
with 535 the gpu is utilized, but i can't get any real speed
what Ubuntu version do you recommend?
and what cuda and driver version?

jade hornet
ebon sable
#

I mean that's natural for training yes, and normally captioning everything besides the main concept works fine with non-generated images. but every time I've tried training on my own generated images, captioned or not it always learns minute details to the extent that every resulting LoRA looks like an overcrowded mess. the concept is usually learned perfectly fine, but... with everything else along with it

#

in the past I could always skirt around it by simply training with the model used to generate the dataset, so it only learned the difference. but right now I'm in a conundrum where my generated image dataset was made with a combo of multiple base models and LoRAs per each image, so that.. doesn't work too well

tall condor
#

@jade hornet im on RTX4090

tall condor
#

has anyone recently installed kohya_ss on ubuntu 22 or ubuntu 20?

#

it appears its impossible

#

my next install with cuda11.8 and 520 drivers failed with blank display

#

this is getting really ridiculous

tall condor
#

anyone can help with this

restive sun
#

Hey, I am trying to fine-tune SDXL with 30 sets of images, each set having a specific building, I am thinking of using Dreambooth for this. If I am not wrong I think for each set I have to fine-tune the model separately. So, I am thinking of fine-tuning the model with one set of building images and then saving the model and using the saved model to do the next set of images and so on. Do you think doing so will cause issues with the model? Is there a better way for doing this?

tough bison
#

if i wanted to do a batch of images, can i set it up so that i can choose which images out of the batch to upscale (in the same workflow)?
for example, a batch of 5 images takes 20 seconds on the first step, but the next step of upscaling takes about 2 minutes per image.
i dont want to upscale all 5 images, but choose images before proceeding to the next step. is it possible to set the workflow to "pause" and wait for me to choose images for the next step?

#

i guess the alternative is to just do the batch, save the files you want, and upscale them in a different workflow?

stable coyote
tall condor
#

i think there is a general problem in kohya with multi gpu
i tested 4 versions: 24.x.x, 22.x.x and 21.x.x
same machine, same cuda, same driver, same gpus (4x4090)
kohya 21 takes 1:50, kohya 22 and 24 take ~28:00
it's like 15 times slower
any ideas?
i tested under ubuntu 22.04, ubuntu 20.04 and windows 10. no matter what i do, i cannot get the speeds back to the speeds of 21
i also tested with cuda 12 and cuda 11
anyone got any ideas?
i even tested on 2 systems, one with 4 gpus and one with 2 gpus, one intel, one amd
so i'm quite sure that everybody will run into this issue
tested gloo and nccl

jade hornet
#

I had no idea multi gpu was supported

oblique jay
#

I am training a LoRA in OneTrainer with 450 images and 150 epochs at batch size 2, and it uses 10.4 GB of the 12 GB of VRAM on my RTX 3060 12GB. The problem is that it is very slow. Is there any way to increase the batch size while using the same amount of VRAM, or at most the full 12 GB, to make the training faster?

jade hornet
#

is it working? if so rejoice. why stress about the speed, you got a hot date coming up?

oblique jay
jade hornet
stiff dust
#

no, that's normal. You train way too many epochs. In total you have 33750 steps that could take between 20-40 hours depending on your gpu
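
The 33750 figure can be reproduced with a quick calculation. This is a minimal sketch assuming 1 repeat per image and no gradient accumulation (neither is stated in the question); `total_steps` is a hypothetical helper, not a function from OneTrainer or kohya.

```python
import math


def total_steps(num_images: int, repeats: int, epochs: int, batch_size: int) -> int:
    """Optimizer steps for a run, assuming no gradient accumulation."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


# 450 images, 150 epochs, batch size 2, assuming 1 repeat
print(total_steps(450, 1, 150, 2))  # 33750
```

Under the same formula, 1 epoch with 30 repeats and 30 epochs with 1 repeat give the same step count whenever the batch size divides the image count evenly; what differs is anything tied to epoch boundaries, such as when checkpoints and samples are written.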

dusky urchin
dusky urchin
dusky urchin
#

don't waste your time on updates in kohya

oblique jay
dusky urchin
oblique jay
dusky urchin
#

do you like the results you are getting?

oblique jay
dusky urchin
oblique jay
#

images similar to this and sometimes people appear, they are not just landscapes

tall condor
#

geohot's p2p patch seems interesting but im currently on windows

#

however this with nccl could give quite a speed bump.

#

can anyone share his settings for dreambooth finetune sdxl from scratch

#

especially with ppl in it

arctic venture
dusky urchin
#

can you remove the text? then, do unet only training.

oblique jay
covert pagoda
digital glade
#

Hi there

#

Anyone comfortable with finetuning with kohya ss? (full finetune)

charred sleet
#

Does anyone know if 225 is a hard limit for max token length for lora training? I've been wanting to try out natural language captions for lora training and some of mine are a fair bit over that limit. kohya_ss seems to cap it at 225, but I don't know if that's just a UI limit

latent charm
#

75 is the clip limit
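
The numbers in this exchange fit together if you assume the usual chunking scheme: CLIP's context is 77 tokens, of which 75 are caption tokens plus a start and an end token, and 225 is three chunks of 75. The sketch below shows that scheme on plain token-id lists. `chunk_token_ids` is a hypothetical helper, padding with the end token is an assumption, and the actual trainer code may differ in details.

```python
from typing import List


def chunk_token_ids(
    token_ids: List[int],
    bos_id: int,
    eos_id: int,
    chunk_size: int = 75,
    max_chunks: int = 3,
) -> List[List[int]]:
    """Split caption tokens into CLIP-sized chunks of chunk_size + 2 tokens each."""
    # Anything beyond chunk_size * max_chunks (225 by default) is dropped.
    token_ids = token_ids[: chunk_size * max_chunks]
    chunks = []
    for start in range(0, max(len(token_ids), 1), chunk_size):
        body = token_ids[start:start + chunk_size]
        # Assumption: short chunks are padded with the end-of-text token.
        padded = body + [eos_id] * (chunk_size - len(body))
        chunks.append([bos_id] + padded + [eos_id])
    return chunks
```

With the CLIP tokenizer's start and end ids (49406 and 49407), a 100-token caption becomes two 77-token chunks, and a 300-token caption is cut to three.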

viscid mica
#

I have installed the following extensions in Stable Diffusion, but Lora is not generated.

https://github.com/liasece/sd-webui-train-tools

It is not generated as attached and [Ending job] is displayed on the command, but
On train-tools
There is no change even after an hour has passed since "A: 4.34 GB, R: 4.48 GB, Sys: 6.1/23.9883 GB (25.4%)" is displayed.

What I confirmed
・Stable Diffusion version change
(1.8.0→1.6.0→1.5.1→1.7.0)

・Can the Train base model be used to generate Lora unless it is adapted?

copper tangle
#

hi all. i've lost my patience with Kohya. in my images folder i have pngs with corresponding .txt files that match filenames (UTF-8, permissions are fine). for whatever reason when i begin training, the caption files show up as missing. anyone else experience this? any solutions? training goes fine otherwise.

dusky urchin
copper tangle
# dusky urchin can you be more specific? what error do you see

In Terminal I'm getting this message: "WARNING No caption file found for 140 images. Training will continue without captions for these images. If class token exists, it will be used." (train_util.py:1459), even though there are indeed caption files for each of the images living in that same folder with the same filenames.

dusky urchin
charred sleet
copper tangle
copper tangle
copper tangle
copper tangle
#

Updated this file too: "merge_captions_to_metadata.py" but still didn't fix the problem :/

charred sleet
#

it's also a command line arg if you're using sd-scripts I think

#

I believe caption.py is used for kohya_ss's caption-file generating tool, so it wouldn't affect training for imagesets with existing captions. I'd have to take a closer look to see what merge_captions_to_metadata does, but from a skim, I think it is used to help generate the output lora metadata, so not directly involved in training

#

it's best not to modify the repo files in general. Changes will make your local copy out of sync with the repo, possibly introducing bugs that are hard to diagnose

#

(the caption extension parameter is under the "Parameters" section with the name "Caption file extension". The command line arg is --caption_extension)
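
A quick way to check whether the warning comes from an extension mismatch is to list the images that have no caption file with the extension the trainer is told to look for. This is a standalone sketch, not part of kohya_ss; `images_missing_captions` and the list of image extensions are assumptions for illustration.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}


def images_missing_captions(folder, caption_extension=".txt"):
    """Return names of images with no same-named caption file in the folder."""
    missing = []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        # e.g. "photo_01.png" is expected to pair with "photo_01.txt"
        if not image.with_suffix(caption_extension).is_file():
            missing.append(image.name)
    return missing
```

If the list is empty for `.txt` but the trainer still reports missing captions, the trainer is probably being run with a different `--caption_extension` value than the one the files use.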

dusky urchin
#

and a corresponding caption

#

have you tried an online service for this? a LoRA training costs like $0.50

viscid mica
# dusky urchin what is your goal?

I would like to be able to generate Lora with train-tools.
But if it doesn't seem to work now, I'll consider another Lora generation tool.

copper tangle
copper tangle
# dusky urchin can you share one?

Solved! Can't share one bc it's for a work project. Trying to learn to do everything on my own for learning's sake. Otherwise I'd give up and use something online 🙂

drifting portal
#

Hey everyone 😊. Does anyone know if I can use 300dpi image to train LoRa? Or can point to some documentation that goes over size and resolution? Thanks

charred sleet
#

what matters for lora training is resolution and quality. sd1.5 loras should be trained on images with at least 512x512 resolution, and 768x768 is often recommended. sdxl should be trained on 1024x1024. You can train on higher res images and it'll be fine (they'll be bucketed by aspect ratio and scaled down to hover around the training resolution), but lower res images can be bad for the output lora quality.
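
As a rough illustration of "bucketed by aspect ratio and scaled down", the sketch below picks a bucket size that keeps the image's aspect ratio, has roughly the training area, and is snapped to multiples of 64. It is a simplified stand-in under those assumptions, not the bucketing code any particular trainer uses; `bucket_resolution` is a hypothetical name.

```python
def bucket_resolution(width: int, height: int, target: int = 1024, step: int = 64):
    """Pick a (width, height) bucket near target*target pixels for this aspect ratio."""
    aspect = width / height
    target_area = target * target
    # Solve w * h = target_area with w / h = aspect.
    ideal_w = (target_area * aspect) ** 0.5
    ideal_h = ideal_w / aspect

    def snap(value: float) -> int:
        # Round to the nearest multiple of `step`, never below one step.
        return max(step, round(value / step) * step)

    return snap(ideal_w), snap(ideal_h)
```

A 2048x2048 source lands in the 1024x1024 bucket, while a 1280x720 source lands in 1344x768, so a high-resolution image is downscaled rather than rejected.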

#

for documentation, I dunno if there's any one guide that's been helpful for everything, but there're a lot of videos and articles online covering some basics. https://www.reddit.com/r/StableDiffusion/comments/11vw5k3/lora_training_guide_version_3_i_go_more_indepth/ is a popular one, and https://rentry.co/59xed3 for a bit more in-depth details. https://civitai.com/articles/3522/valstrixs-crash-course-guide-to-lora-and-lycoris-training seems nice too

covert pagoda
#

does anyone have experience training with lion optimiser for style models?

charred sleet
#

anyone know if there's a trainer that allows multiple text captions for the same image? I'm experimenting with natural language captioning and the regular "keep tokens" and "shuffle caption" functionality won't work for that. I know you can just copy the image and write new captions for each image, but that messes with the repeats and adds more images to be cached

EDIT: nvm, figured out the option: sd-scripts has --enable_wildcard which looks like it can do this.
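
For anyone wondering what "multiple captions for the same image" amounts to, here is a sketch of the general idea: keep several caption lines in one file and pick one at random each time the image is used, optionally expanding `{a|b}` alternatives. This is an illustration of the concept only. `pick_caption` is a hypothetical helper, and the exact syntax sd-scripts accepts should be checked against its documentation.

```python
import random
import re


def pick_caption(caption_text: str, rng=random) -> str:
    """Pick one non-empty line at random, then expand {a|b|c} alternatives."""
    lines = [line.strip() for line in caption_text.splitlines() if line.strip()]
    line = rng.choice(lines)
    # Replace each {a|b|c} group with one randomly chosen alternative.
    return re.sub(
        r"\{([^{}]*)\}",
        lambda match: rng.choice(match.group(1).split("|")),
        line,
    )
```

Because the choice happens at load time, the image is stored and cached once, which avoids the duplicated-image workaround mentioned above.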

copper tangle
#

stupid question when someone has a moment. is it possible to train simple 2D vector icons (after converting them to png)? i'm talking like, very simple as in graphic of a globe/smartphone/flower/envelope, on white/black backgrounds, etc. All in the same color palette and not very detailed. they're not truly "images" so i assume not...

charred sleet
# copper tangle stupid question when someone has a moment. is it possible to train simple 2D vec...

I'm not sure what you mean by "not truly images". icon loras are out there though: https://civitai.com/models/49021/minimalist-icons
https://civitai.com/models/141066/game-icon

wet wadi
#

I'm using Kohya, trying various types of LoRAs. So usually, the sample images are at best, a rough indication of how the training is going and a means to tell when the model is over fitted. We expect them to be pretty bad.

What do you do when the training samples look more like the target than anything you can generate in A1111 or Comfy?
Just to be clear, the training samples look rough, but they get the face right. She has an uncommon face, but the sample images look like bad pictures of her, while in A1111 and Comfy, the facial structure, lips, nose just isn't right. Like it's been overly normalized by the rest of the model.

wet wadi
charred sleet
copper tangle
jade hornet
dusky urchin
#

i don't think arbitrarily stopping the training is a good idea in general and it's been limiting the capability of fine tuning for a long time

hoary ember
#

I am trying to create a fine-tuned model based on SDXL (using either KohyaSS or Huggingface libraries). I have a captioned set of ~500k images in ~400 categories that I want to train on to create an initial checkpoint.

I have a couple of questions regarding how to prepare images as far as cropping/resizing the image dataset to prepare it for training:

All of my images are available in full 1MP resolution but in a variety of different aspect ratios (e.g. 1024x1024, 1280x720). Are square images generally best? Should I train on both the uncropped full version and a version that is cropped to square? For cropping to square, is it better to use random cropping, center cropping, or to use object segmentation and try to crop around important subjects?

#

Also, is it beneficial to train on smaller copies of images, so that the model is getting trained on generating the subject at all resolutions? For instance, if I have a 1024x1024 image, is there any benefit to also creating a 512x512 and 256x256 copy of this image and training on those too? I was thinking this might improve generation of the subject at lower resolutions, but was worried about it overfitting with repeated use of same image.

stone garden
#

Stop writing bibles here

hoary ember
#

lol, it's you that needs to chill my dude. You're spending too much time on twitter if you think writing a couple short paragraphs about a complex issue is a "bible"

stone garden
hoary ember
#

I don't need to do anything, you're just an angry little yapper with nothing better to do than be an annoyance to someone seeking info on a technical question. Get a life.

stone garden
#

let's fight

#

you can't do anything exactly, shhhh now

#

🤫

latent charm
latent charm
#

I have an idea. I want to transfer the pose from an anime-style image to a realistic-style one. Both images are created by the same model with the same prompt, just in a different style. Does anyone have an idea how to achieve that via training?

humble holly
#

Hi, I would like to learn how to fine-tune a model to modify an existing image. Where can I start?

jade hornet
latent charm
pine night
#

Hello. Are the stable diffusion models not available for fine-tuning through the API?

jade hornet
#

they should be, if you're into that it should be on modelslab or whatever. but if there's an issue, you dont need the api

pine night
#

Thanks for the reply. Do you mean that we don't need the api to fine tune the diffusion models?

jade hornet
#

correct

pine night
#

I found that we could fine tune the models through HuggingFace. But I am supposed to do it on a huge dataset. That is why I was looking for a solution where I don't have to write the code to do the distributed training and manage the necessary compute myself.

dusky urchin
#

what is your goal?

stable cloud
#

Hi guys, I am new here. Can anybody help me prevent Stable Diffusion from generating text on the image?

Can we fine-tune SDXL so it does not produce text on any image? Stable Diffusion is very bad at producing text, so can we somehow stop SDXL from producing any kind of text on the generated image? Currently I am using a negative prompt:

NEGATIVE PROMPT = "text, fonts,words,3d, cartoon, anime, (deformed eyes, nose, ears, nose), bad quality,bad anatomy, ugly"

but it does not follow the negative prompt very well
I want to generate Canva templates for Christmas, Halloween, etc. with no text on them, but it always puts misspelled text on them

empty horizon
#

Hello everyone. Can anyone suggest me a feature in which I can create layers similar like photoshop through AI

stone garden
#

I think I was able to install kohya_ss, but it seems like the version is wrong and the installation URL is displayed.
Which one should I download?

stone garden
#

Please let me ask you an additional question.
After updating to Python 3.10.11, Stable Diffusion no longer starts...
I was prompted to "Press any key" on the command prompt screen, so I pressed the enter key, but the screen just closed and the SD did not start.

stone garden
#

I will write down my machine specs.

Model number: ILeDEs-M07M-A134-SASXB
CPU: Intel Core i5-13400
Memory: 16GB
HDD: 8TB
Graphics card: GeForce RTX 3060 Ti 8GB

If you receive a reply with a quote, you will receive a notification and it will be easier to notice.

empty horizon
#

Hello can anyone suggest same feature that Runway uses for Erase and replace (ai-tools/erase-and-replace) in Stable Diffusion sdxl? I have used inpainting but i cannot replicate the same through prompt which runway does.

jade hornet
#

Inpainting is the answer, despite it working differently, that's the workflow you would use

ocean dune
stable cloud
#

Hi guys, can anyone help me? I want to fine-tune SDXL so that it generates every image with a "solid color empty background" where I can put text later. I want Stable Diffusion to give me a result like this:

#

I generated this image with the prompt "solid color background, christmas sales template,soft lightning,8k", but this type of prompt does not work if I want to make a Father's Day template or a Halloween template, and I need this kind of layout for every image generation so I have room for the text on the image

#

Is it possible to fine-tune SDXL to get this type of result? One more thing: there should not be any text on any image.

#

Would fine-tuning a LoRA be better, or a full fine-tune?

dusky urchin
#

is there a captioning tool or web based UI that anyone likes?

dusky urchin
stable cloud
#

can you tell me what is layered diffusion ?

#

is it the type of finetuning method

dusky urchin
dusky urchin
stable cloud
#

seriously, i don't even know ideogram, i am a new guy, i only know lora finetuning

dusky urchin
#

i mean why do you want to generate greeting cards?

stable cloud
#

@dusky urchin i want to generate greeting cards and put text on these types of templates so i can use them anywhere, like in my shop to display a christmas discount offer or anything

pliant drift
#

this ic_light model might change the game for synthetic dataset creation

ruby moth
stone garden
#

I installed Forge, but the following problem occurs.
・“Error Connection errored out.” occurs frequently.
・I installed sd_xl_base_1.0.safetensors and Pony Diffusion V6 XL, but LORA does not appear (F:\webui_forge\webui\models\Lora)

I've been struggling for about a week now.
help me! (It will be easier to understand if you reply with a quote)

livid rapids
#

Who do I have to beg to try an implementation of this? https://twitter.com/rasbt/status/1758502685995589698

I noticed it months ago but haven't seen any support for it in training repos and don't have the skill to implement it myself

While everyone is talking about Sora, there's a potential successor to LoRA (low-rank adaptation) called DoRA. Here's a closer look at the "DoRA: Weight-Decomposed Low-Rank Adaptation" paper: https://t.co/Mmjhy3xTpd

LoRA is probably the most widely used parameter-efficient…

livid rapids
#

nvm it's actually been supported for a while I just missed it

drifting mirage
#

Hi! Is it possible to pause and resume training in OneTrainer? I would like to pause the training and test the model in a real workflow, and if it is not trained enough, resume training from the same place.

fallen halo
#

Hi everyone! I need some help fine-tuning Stable Diffusion for a product. Is there anyone who can help? Appreciate you all!

dapper prism
#

Note that the greedy search caption failure issue is present in all automatic captioning tools to varying degrees, and it can impact up to 3% or more of your total dataset

dapper prism
#

For those who don't know what greedy search is: it is a decoding strategy that always picks the single most likely next token. The greedy search caption failure is what you are seeing when you come across captions that endlessly repeat letters, characters, phrases, or sentences. Greedy search is used in all VLMs and captioning models currently available
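
If you want to scan a dataset for this, here is a minimal sketch of one possible check. It is a heuristic I am assuming for illustration (the function name and the 0.5 threshold are made up, not taken from any captioning tool): it flags a caption when most of its word trigrams are repeats. It works on words only, so runs of a single repeated letter without spaces would need a separate check.

```python
def looks_degenerate(caption: str, n: int = 3, min_unique_ratio: float = 0.5) -> bool:
    """Flag a caption whose word n-grams are mostly repeats of each other."""
    words = caption.lower().split()
    if len(words) < 2 * n:
        return False  # too short to judge either way
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    # A healthy caption has mostly distinct n-grams; a looping one reuses a few.
    return len(set(ngrams)) / len(ngrams) < min_unique_ratio
```

Running this over every .txt caption and reviewing only the flagged files by hand is probably cheaper than reading the whole dataset; tune the threshold on your own captions.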

fallen halo
#

Hey! 🆘 I'm working on a project with stable diffusion Finetuning and ControlNet and need some HELP. If you're experienced with these, I’d appreciate your input. Thanks!

gentle flame
#

GPT4O is a lot less restrictive with what it's willing to caption

torpid basalt
#

Does anyone know a guide on dreambooth for artstyle? I want to do the fine-tuning with some real paintings and the trained model must match the artist's painting technique!

knotty pivot
#

Does anyone know how to create exact variants of a particular image. For example, if I want to create exact variants of a shirt in this format:

#

When remixing, I have not been able to get the exact orientation of the template

#

Please PM me!

hybrid shard
dapper prism
hollow spruce
#

@naive slate need a mod for this ^^

#

thank you to whichever mod removed it ❤️

pliant charm
#

give me a office task for employee working and decrease heat

rocky yew
#

I am curious, are there any good tutorials on how to fine-tune a lora? I am in the process of redoing an old one and I want to fix it up. How do I go about this? It was trained on Kohya at basically the defaults: 22 images, no regularization images, and 4400 steps. I would like to know how to fix it up so it looks cleaner (at the moment it looks quite whack). The images used were large and clear.

jade hornet
#

Many things could have caused the training to not produce what you intended. Maybe a different training model would help. Maybe it just needed more steps. Maybe some of the photos are too far apart in concept and the AI was confused about what it was supposed to learn. Maybe your captions need to be tweaked so there is more guidance on the particular scenarios that you wish to produce later. You probably won't find a guide that tells you exactly how to improve your specific scenario.

polar nova
#

Hey, I was wondering if it is possible to train a lora on a person (full body, close-ups, etc.) and if so, how to go about it and which checkpoint is better at realistic photography for this job.

jade hornet
torpid basalt
#

What is the max number of images I can use to train a style using dreambooth?

stiff dust
hot breach
torpid basalt
empty horizon
#

Hello everyone, can you suggest a good model for OpenPose?

white ocean
full trellis
#

Hello, I have a problem with kohya_ss. I started making my first model today and did everything as in the guide, but my model was ready in 3 seconds, while for others it takes up to 9 hours.

My monitor is small, so sorry for these logs :(

torpid basalt
valid sentinel
#

Hi, I'm using an ID-preserving pipeline with ControlNet (canny|pose) and SDXL for a use case that involves only one reference image for the pose and a face image to interpolate that face onto the pose. What is a good approach to fine-tune my SDXL or any other components in the pipeline to achieve better throughput? We are inferring on an image with the same prompt, so the model should generate only one use case but with a consistent pose.

tulip walrus
#

i'm having trouble opening stable diffusion I keep getting this message

stable coyote
#

I was able to train an SD3 LoRA using the example code train_dreambooth_lora_sd3.py, but the generated file seems to be in diffusers format, not webui/kohya. In ComfyUI I get errors like:

lora key not loaded: transformer_transformer_blocks_6_attn_to_v.alpha
lora key not loaded: transformer_transformer_blocks_6_attn_to_v.lora_down.weight
lora key not loaded: transformer_transformer_blocks_6_attn_to_v.lora_up.weight
lora key not loaded: transformer_transformer_blocks_7_attn_to_k.alpha
lora key not loaded: transformer_transformer_blocks_7_attn_to_k.lora_down.weight
lora key not loaded: transformer_transformer_blocks_7_attn_to_k.lora_up.weight
So i tried convert_diffusers_sdxl_lora_to_webui.py but it seems it doesn't convert it properly. Any ideas or tips how to progress with that?

This model: https://civitai.com/models/512239/pixel-art-medium-128 seems to have the same issue: its file header looks similar to my generated file. I bet it was trained with the same method.

earnest barn
#

i started using kohya to finetune sdxl on a dataset of 343 images of my art. the resulting checkpoint kinda had the style i was looking for but lacked the details i'm hoping for, it seemed like it might require more training. since i only did 10 epochs, i'd like to train it with another 10 to see if that helps. to pick up where i left off, would i choose the first version of the model as the source model instead of sdxl base? if not, how would i go about training more on top of what i already did?

small eagle
#

Are there any newer tools for training SDXL LoRAs that are simpler than Kohya or Invoke-Training?

small eagle
#

thanks!

river lake
#

How are people training SD3 LoRAs already? Can anyone tell me what they are training them with?

cold wyvern
#

You will need a 3090/4090 to make a lora, and an A100 or H100 to do a full finetune

rigid matrix
#

Apart from those dog and other small ones. I would like to see a full scale one.

white narwhal
#

Bro, is there a way to train a LoRA for SD3?

#

Waiting online, it's urgent

#

Or a tutorial document

stone garden
white narwhal
#

Look at me

regal shore
#

Hello everyone, is it possible to locally finetune an SDXL model on an RTX 4060 ti 8 gb GPU if the dataset has 300 images? If this is possible, then approximately how long will it take?

cold wyvern
tulip plover
# white narwhal Look at me

(I'm replying a day late 😅 )

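About SD3 LoRA: right now it is really only something to try out for the novelty. So far the only openly available way to train is with diffusers, and that method has a lot of limitations, including only supporting training a single concept. That said, the SimpleTuner folks now support SD3 LoRA training. The prerequisite is that you have a 3090 or a card with even more VRAM; the diffusers training method is still pretty weak.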

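There isn't much good documentation about SD3 in the Chinese-speaking community either. But if you really do have a 3090, you can run this document (official Hugging Face) through Google Translate, or go to SimpleTuner ( https://github.com/bghira/SimpleTuner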
https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_sd3.md

hollow spruce
hollow spruce
#

A LoRA would be easier to train within your limits, and will take roughly 1~7h to train depending on your settings. It will still be a lot of effort to get it working the first time, though (but after the first time, settings barely change, so if it works, it works).

river lake
#

is it possible to use sdxl lora with sd3?

cold wyvern
river lake
#

Also, can you give me an article if there's one, or a quick guide on how to install it?

cold wyvern
river lake
#

so I use it inside a terminal right?

river lake
white ocean
river lake
viral swan
#

How much do captions help with training lora? Are they important? Mostly for style not learning a new concept

stiff dust
#

always use captions describing the image. You can/should still use a small caption dropout
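
A minimal sketch of what caption dropout means, assuming the usual idea that a caption is occasionally replaced by an empty string so the model also sees the image unconditioned. The 0.1 rate and the function name are illustrative assumptions, not a recommended value or any trainer's actual code.

```python
import random


def maybe_drop_caption(caption: str, dropout_rate: float, rng: random.Random) -> str:
    """Return an empty caption with probability dropout_rate, else the caption unchanged."""
    if rng.random() < dropout_rate:
        return ""
    return caption


rng = random.Random(0)  # fixed seed so the sketch is reproducible
captions = ["oil painting of a harbor at dusk"] * 1000
seen = [maybe_drop_caption(c, dropout_rate=0.1, rng=rng) for c in captions]
dropped = sum(1 for c in seen if c == "")
print(f"dropped {dropped} of {len(seen)} captions")
```

Trainers that support this usually expose it as a single setting, so you would normally set the rate there rather than write this yourself.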

gentle flame
regal shore
#

Hey guys. Is it normal for the style from the dataset not to be picked up in fine-tuning at low epoch counts (1-2 epochs)?

empty horizon
#

Hello everyone, can anyone suggest why my generated mask is not being applied on the left side of the image? I am keeping the dimensions at 1016 x 504, but this black patch on the left keeps coming back again and again.

thin mantle
#

Like super descriptive. Will this help, or is a simple obvious description all you need?

stiff dust
river lake
#

How do I extract the parameters from a lora? like comfyui images have workflow
I trained a lora once and I can't remember its settings now
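
One thing that can be checked is the file header. A .safetensors file starts with an 8-byte little-endian length followed by a JSON header, and that header can carry an optional `__metadata__` dictionary. As far as I know, kohya's sd-scripts writes its training settings there as `ss_`-prefixed keys, but other trainers may store nothing, so treat this as a sketch under that assumption. The function name is made up for illustration.

```python
import json
import struct


def read_safetensors_metadata(path: str) -> dict:
    """Return the __metadata__ dict from a .safetensors header, or {} if absent."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        # Next header_len bytes: the JSON header itself (UTF-8).
        header = json.loads(f.read(header_len).decode("utf-8"))
    return header.get("__metadata__", {})
```

Calling it on the lora file and printing the sorted items should show whatever the trainer saved; if the result is empty, the settings were not embedded and cannot be recovered this way.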

white ocean
thin mantle
thin mantle
#

For example, which would be a better caption, and why? Afterwards I'll provide more images to show the kinds of images I'm wanting to train for.

#

Llava-13b:
The image is a colorful illustration featuring a woman with pink hair, wearing a yellow raincoat and a frog hat. She appears to be staring at the viewer with a somewhat angry expression. The woman is also wearing a nose ring, adding to her unique appearance. The illustration is likely created using a digital medium, as it has a vibrant and detailed style. The combination of the frog hat, pink hair, and yellow raincoat gives the image a whimsical and quirky vibe.

#

Gpt4o:

#

The image depicts a person with vibrant, neon-pink hair, accentuating a striking and bold fashion statement. They wear a distinctive frog hat with large eyes, combined with pink-tinted glasses and a red clown-like nose, enhancing the unique and eclectic style. The individual dons a yellow and black jacket with a high collar, further contributing to the bold and modern aesthetic. The background is bright yellow with white Japanese characters, creating a vivid and eye-catching contrast. The overall style is reminiscent of modern digital illustration and graphic art, heavily influenced by Japanese pop culture and cyberpunk elements. The neon colors and street fashion sensibilities give the portrait a contemporary and edgy feel.

#

These are the type of images i want to tune for:

spiral kettle
#

hey there is there any actual way right now to fine tune stable diffusion with 2 input images to give out one output image??

#

I want to combine some aspects of each input image to give an output if that makes sense

mossy oriole
# thin mantle For example which would be a better caption and why. After ill provide you with ...

From personal experience, SD/SDXL doesn't understand the majority of the LLM caption junk.
Here is what I would take from Llava, then from Gpt4o:

colorful illustration, colorful, illustration, woman, pink hair, yellow raincoat, frog hat, looking at the viewer, angry expression, nose ring, vibrant, whimsical and quirky vibe.
-> whimsical and quirky vibe might be a terrible choice, as these prompts already have a pre-existing function.

vibrant hair, neon-pink hair, frog hat, pink-tinted glasses, pink glasses, yellow jacket, black jacket, black and yellow jacket, bright yellow background, digital illustration, cyberpunk, neon colors, street fashion, portrait
-> You can consider removing the "bright yellow background"

mossy oriole
white ocean
white ocean
# thin mantle Do these captioning models beat out llm vision models like gpt4o and claude 3.5 ...

gpt4o and claude 3.5 sonnet are the latest hotness ... but blip2 is pretty good at describing what is in the image, that's what it was designed for and it's optimized for doing that, so it may be faster ... whereas gpt4o and claude are general purpose. Also blip2 is free and already integrated into Kohya. I haven't compared them head to head, but I'm guessing they are similar ... i'm not sure if Claude or gpt4o use a separate vision model to ingest images

thin mantle
#

Thanks guys, I receive my training pc tmrw so I'm excited to hit the ground running with your tips

white ocean
# thin mantle Thanks guys i recieve my training pc tmrw so im excited to hit the ground runnin...

My process is: 1) use blip2 in Kohya_ss, then 2) run exiftool from the command line to extract the EXIF tags that I put in the images via Lightroom; these are appended to the .txt caption file. Then 3) back in kohya_ss, run the wd_14 tagger with the append option, which puts the wd_14 tags on the end. This last one tends to produce some nsfw images, but also seems to pick up a lot of creative and artwork anime characters as well.
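
The appending step in a pipeline like this can be sketched in a few lines. This is not what kohya_ss or exiftool do internally; it is a hypothetical helper (the name `append_tags` is made up) that adds comma-separated tags to an existing caption and skips tags that are already there, so running several taggers does not pile up duplicates.

```python
def append_tags(caption: str, new_tags: list[str]) -> str:
    """Append tags to a comma-separated caption, skipping ones already present."""
    existing = [t.strip() for t in caption.split(",") if t.strip()]
    seen = {t.lower() for t in existing}  # compare case-insensitively
    for tag in new_tags:
        tag = tag.strip()
        if tag and tag.lower() not in seen:
            existing.append(tag)
            seen.add(tag.lower())
    return ", ".join(existing)
```

To use it, read each image's .txt caption, pass it through with the extra tags, and write the result back to the same file.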

#

Also really helpful ... if you are training a lora of a single person, it helps to use the "starbyface" website ... this will give you a random celebrity that looks a lot like your person. Include this celebrity doppelganger's name in the caption file ... it will put your subject in more scenes and give more character.

white ocean
# thin mantle

I missed this post, but these captions seem far better than blip2, although it is only set to write a short caption by default; I haven't set it to write more than just one or two sentences. Keep an eye on your caption size limit for training.

thin mantle
#

So pretty much the standard consumer build

cunning zinc
#

Hi guys, i want to try out finetuning and i am following the following:
https://next.platform.stability.ai/docs/features/fine-tuning

however I get a ModuleNotFoundError: No module named 'stability_sdk.finetune' even after installing stability-sdk.
Is there another package that needs to be installed?

#

stability-sdk version is 0.8.6

river lake
#

If I increase my batch size in kohya_ss lora training from 1 to 4 on a 3090, will it affect the output quality of the lora?

jade hornet
#

try it? I tend to run with a batch size of 4

#

In theory no, it's just how many you're doing at once. Some swear it makes it better. Maybe there's more blending when you're doing them at the same time, hard to say.
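
One concrete thing batch size does change is how many weight updates you get for the same number of epochs, which is a common reason results differ unless the learning rate or epoch count is adjusted. A small sketch of the arithmetic; the image count, repeats, and epochs are hypothetical numbers picked for illustration.

```python
import math


def optimizer_steps(num_images: int, repeats: int, epochs: int, batch_size: int) -> int:
    """Weight updates in a run: each epoch sees num_images * repeats samples."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


# Same dataset and epochs, only the batch size differs.
for bs in (1, 4):
    print(bs, optimizer_steps(num_images=22, repeats=10, epochs=20, batch_size=bs))
```

With these numbers, batch size 1 gives 4400 updates and batch size 4 gives 1100, each of the latter averaged over four images.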

scarlet rune
#

Hello everyone! I have a question and wonder if anybody knows how to do this. I am an architect and want to fine-tune a model for a specific case study: a tower ruin where the designer is required to create a new design within it, like adding new structures to it (adaptive reuse). I have photos of the tower ruin and 300 renders of design alternatives for the tower (there was a competition and I took the images from there). So the question is: can I fine-tune Stable Diffusion so that it can generate new design solutions that keep the site and ruins but add new structures to them?

#

Which model would be best for it

fading sail
tired wind
#

@fading sail Hi, thank you. All credit goes to the original authors. My node is just a wrapper around their implementation.
https://openaccess.thecvf.com/content/CVPR2023/papers/Yan_Linking_Garment_With_Person_via_Semantically_Associated_Landmarks_for_Virtual_CVPR_2023_paper.pdf
Their paper only discusses finding landmarks on upper body garments. The dataset VTON HD (https://www.kaggle.com/datasets/marquis03/high-resolution-viton-zalando-dataset/data) is upper body only as well. So the current implementation & model won't work for lower body garments.
And, it should work for both man and woman, as long as the model can detect the landmarks appropriately.

fading sail
#

ah thanks for replying, I'll check it out. thx 🙂

#

I'm mainly looking for something that does all the body parts, so like upper, lower or full dress. I think IDM VTON does it, and the magicclothing one, but they seem to require a lot of VRAM, so I'll keep searching

thin mantle
#

What's your workflow for captioning many images and making sure the captions are correct? What app do you guys use? And what features are your favorites?

#

Specifically to train loras and checkpoints

#

Cuz vision models often get things wrong

hoary geyser
#

hey all, i'm attempting to train a sd(1.5) u-net from scratch, on a small (~2000 images) dataset that's not varied (specific subject).
my theory is i can use kohya_ss modified to re-initialize the weights at the start of the training loop to effectively reset the u-net.

technically this is working, but the output images aren't sensical yet. Wondering if there's anyone here who I can talk to, to explore this further.

jade hornet
#

There are lots of attempts to make something completely automated, but forget that, dataset prep is where you should be spending the most time

stiff dust
neat rover
#

How hard is it for a beginner to create a pixel art model that will output results similar to this? I can get thousands of graphics in a similar style and the exact same resolution and format to train it.

nocturne sierra
#

Hello everyone,

Does anybody know any good SDXL checkpoint training guide?

We have a nice dataset of 5,000 images in a specific style, but we just can't find any tutorials and articles about checkpoint training (not LORA)

blissful quest
#

/imagane Real style, exterior, day, school building in long shot, a group of four 11-year-old children are standing on the far balcony, on the second floor by a railing, the children are talking excitedly. The first child is fair skinned with curly black hair and is wearing a red t-shirt and gray pants And talking to Solo, the second boy is tanned with straight blond hair he is wearing a green button up shirt and gray pants he is talking to the first boy, the third boy is light skinned with straight black hair he is wearing a light orange t-shirt and gray pants he is looking at Solo
The boy in the center: solo, tanned skin tone devilish blond wavy hair, blue t-shirt, short jeans and a gray school bag on his back
.

thin mantle
#

I just finished training my first lora. I am now left with 800 safetensor files. How do i test them all to see which is most efficient?

latent charm
#

just test the last. If it's ok, boom.
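
If the last one is not ok, a middle ground is to compare a handful of saves spread across the run on the same prompt and seed, rather than all 800. A minimal sketch, assuming filenames end in `-<number>.safetensors` and that a file without a number is the final save; both are assumptions about naming, so adjust the pattern to your trainer's output.

```python
import re


def pick_checkpoints(filenames: list[str], how_many: int) -> list[str]:
    """Pick roughly evenly spaced checkpoints, always including the last one."""
    def step_of(name: str) -> float:
        match = re.search(r"-(\d+)\.safetensors$", name)
        # Assumption: a file without a step suffix is the final save.
        return int(match.group(1)) if match else float("inf")

    ordered = sorted(filenames, key=step_of)
    if how_many >= len(ordered):
        return ordered
    if how_many <= 1:
        return ordered[-1:]
    last = len(ordered) - 1
    indices = sorted({round(i * last / (how_many - 1)) for i in range(how_many)})
    return [ordered[i] for i in indices]
```

Five or six checkpoints usually make it obvious where the lora goes from undertrained to overcooked, and you can then look more closely around that point.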

thin mantle
#

Ok lol

empty horizon
#

Hello everyone, what should be the input prompt to extend the below images. I am trying to automate the extend feature so I require one specific prompt for the extension of both the images.
for patterned images, it's working absolutely fine. But for plain background images, it's adding some background which is not matching with the original image bg.

#

Prompt used: Generate creative background scene matching original image. Environmental scene, city life, dressing style, nature, building, non-living objects.
Neg prompt used: Blurry, bordered, zoomed, solid color, monotonic background, disfigured, human figure, living objects, gore, dead, hazy, dull.

thin mantle
#

Ty to everyone who helped answer my questions. My first lora came out great, I just posted it in #✨|sdxl. Now I'm wanting to mix 2 models together, because I find myself having to make an image and then denoise that image with another model to get the result I want. Would merging the models save me from having to do this?

deft solstice
#

Imagine a sleek, modern laptop displaying a vibrant and futuristic website interface. The screen showcases innovative design elements like smooth animations, interactive features, and bold typography. In the background, there are creative tools scattered around, symbolizing the process of crafting cutting-edge digital experiences. The scene conveys a sense of forward-thinking design, blending creativity with technology to shape the future of web development.

cursive condor
#

anyone here has information about fine-tuning? I have no idea how much more/less data it needs. the example I had from dreambooth dataset was like 10 images per object/class.

thin mantle
gentle flame
valid sentinel
#

https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py

Hi, I'm fine-tuning my own style (with a fixed pose and background) using about 30K images. These images were generated using the SDXL model. Now, I want to fine-tune the SD15 model to adapt to that style for better performance.

All of my samples in this dataset use the same prompts. However, the results after one epoch are bad. The model seems too vibrant. I don't know if this is due to my dataset preparation (one prompt for all) or something else. Has any developer struggled with the same issue?

woven viper
#

I need help setting up a workflow to train a Lora. I can get the text description files to generate but It will not generate the actual Lora file at the end. Is anyone familiar with doing this?

graceful nova
#

Hi everyone
I'm trying to train a Canny ControlNet to generate clothing images. My training set (100K images) contains only clothing items, but at inference SD 1.5 draws the clothes along with humans. I'm using HF diffusers to train my ControlNet.

  1. Is it normal for SD 1.5 to generate humans even if there are none in the training set?
  2. Currently I am not using any negative embedding. Can a negative embedding help remove the unwanted humans?
elfin valve
#

Hey friends! I'm looking for paid lessons from someone with experience in training masked multi-resolution SDXL fine-tuning using OneTrainer.

I've been attempting to create a photo-realistic fine-tune to generate images of Miranda Kerr, the Australian model. For my base model, I used epicrealismXL_v7FinalDestination. This model can produce good pictures at both 1024x1024 and 1024x1280 resolutions. I thought that training on a mix of images in these resolutions might improve the fine-tune, but the results have been disappointing, indicating I might be missing something crucial.

Here's what I did:

  • I prepared 20 images at 1024x1024 resolution (half face close-ups and half-body shots) and 38 images at 1024x1280 resolution (a mix of half-body and full-body shots).
  • All images were masked with SAM masking, and captions were generated using "WD14 VIT v2," with "mirandakerr" as the first word.
  • For the first training concept, which used only 1024x1024 images, I set "Resolution Override" to "1024x1024". For the second concept, which used only 1024x1280 images, I set "Resolution Override" to "1024x1280".
  • I enabled "Aspect Ratio Bucketing" during training, hoping for at least decent results.
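
For readers unfamiliar with the term, aspect ratio bucketing roughly means that each image is assigned to a training resolution with a similar shape, and batches are drawn from one bucket at a time. A minimal sketch of the assignment step only; this is not OneTrainer's implementation, and the bucket list and function name are hypothetical.

```python
def nearest_bucket(width: int, height: int, buckets: list[tuple[int, int]]) -> tuple[int, int]:
    """Return the (width, height) bucket whose aspect ratio is closest to the image's."""
    ratio = width / height
    return min(buckets, key=lambda wh: abs(wh[0] / wh[1] - ratio))


# Hypothetical bucket list; real trainers generate their own set.
BUCKETS = [(1024, 1024), (1024, 1280), (1280, 1024)]
print(nearest_bucket(800, 1000, BUCKETS))
```

An image that already matches a bucket's shape only gets resized, while one that falls between buckets also has to be cropped or padded to fit.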

Despite this, the output quality is poor. Starting from epoch 1, I see degraded quality, and by epoch 5, the quality resembles a painting style (really bad).

The dataset used:

In total, I trained with 58 masked images per epoch.

I tried training for 100 epochs, and it can be seen that after the first 5 epochs, the quality degrades and stays at the degraded level (paint-like style). I'm attaching 4 samples of 1024x1024 resolution (epochs: 0, 5, 50, 100) and 4 samples of 1024x1280 resolution (same epochs: 0, 5, 50, 100). It is clear that the image quality at epoch 0 is good, but anything after the first epoch is unusable due to the really bad quality. Learning rate used "3e-6". Learning rate "2e-6" gives similar results (but slower training).

I'm also attaching the training preset "Tuned SDXL FineTune BFloat16 3.json." I have 16GB VRAM and used ADAFACTOR with the bfloat16 data type. However, I can rent a better machine if you believe the results will be better on a bigger machine.

Note, "epicrealismXL_v7FinalDestination" can already produce "Miranda Kerr". However, I'm trying to teach the model to recognize her using the unique trigger word "mirandakerr".
In fact, I'm not interested in training this specific model and I'm not interested in the final checkpoint. My goal is to learn how to train photo-realistic images of real people.
Currently, I'm concerned about the poor quality (fried skin).
The reason for using masked training is that I need to train the model on photo studio face shoots where the background is always white. In such cases, I don't want the model to learn to reproduce a white background in all photos.

In the training dataset, you may find black bars on the sides - these were created to match the training resolution. However, these black bars are masked and thus not included in the training.
I have never used Kohya_ss, so I can't say if it gives better results, but I think I will experiment with Kohya_ss as well because currently, I'm a bit stuck with OneTrainer.
I would really appreciate any help or advice on effectively training a photo-realistic model using OneTrainer.

Disclaimer: "Voplica" is the project where I'm conducting my experiments. The final goal is to create a service for photo-realistic model training. I am personally a software developer. All content at Voplica is AI-generated.

Looking forward to finding a teacher and potential partner.

Thank you!

frail moth
#

Does anyone have advice for Logo prompts and how high the CFG should be?
Been using Redmonds Logo Lora

#

with prompts like this

#

logo, claw reaching out of the screen (animal paw with claws), neon claw, 3d realistic <lora:LogoRedmondV2-Logo-LogoRedmAF:1>

#

but i get mostly shit

#

using fp8 to not run out of vram idk if thats an issue

#

and sdxl

latent tiger
#

Hey all! It's been a while since I did any fine-tuning, so I was wondering if you know any great guides on fine-tuning and dataset building (mostly how to make a quality dataset). Thanks for your help!

tame otter
#

are there any good guides, like not tutorials, just in depth information regarding the building and structuring of a large dataset of thousands or 100s of thousands of images

minor island
#

Hey all, asked this question over in tech support and we could not land on an answer... Maybe you guys can figure it out..

I am training a lora and saving samples during the process. The samples show a very clear influence of the training data, so the lora is for sure learning from the training. Fast forward to using the lora, and no matter what I do, I can't get the output with the lora to differ at all from the image generated with the same prompt/seed without the lora.

Using other loras, this is not the case, it only seems to impact the ones I have trained. I see the lora listed in the prompt details so A1111 is picking up the lora is there but it just seems to have zero effect.

Not sure where else to look at this point. Any thoughts or advice is greatly appreciated.

elfin valve
jade hornet
minor island
minor island
elfin valve
#

Update: The position is now closed. Thank you everyone who applied ❤️

Leaving the below information for reference. This position is now filled, but we will notify when we have new positions. Moreover, we are always open for collaboration, partnerships, or just meet other tech enthusiasts in this field.

Voplica is a startup specializing in AI model training, image generation, and image enhancement. We are seeking a Python Engineer with strong expertise in Stable Diffusion (SD) models for a full-time position.

Job Responsibilities:

  • Develop automated model training processes that require minimal human oversight.
  • Develop pipelines for dataset preparation, including cropping, masking, and image analysis using various models.
  • Create inference processes utilizing fine-tuned Stable Diffusion models.
  • Build pipelines for image restoration and enhancement, such as upscaling and fixing details in hands, feet, and faces.
  • Contribute to the kohya_ss/sd_scripts project by automating parameter tuning and improving model training processes. This includes enhancing training on non-diverse datasets, such as masked training for photo studio datasets with consistent backgrounds.
  • Quickly learn and integrate new AI technologies into our projects.

Qualifications:

  • Strong expertise in Python programming.
  • Extensive experience with Stable Diffusion models.
  • Knowledge of data structures and algorithms (Big O notation, space and time optimization, hash maps, trees, heaps).
  • Preferred: Experience with SDXL hyperparameters tuning, learning rate loss analysis using TensorBoard.
  • Preferred: Knowledge of microservices architecture, developing Python workers for job queues (RabbitMQ, Kafka), experience with S3 storage or other object storages (OpenStack Swift, Ceph, etc.), gRPC.
  • Preferred: Experience with context switching optimizations for LLMs.

If you are passionate about advancing AI technology and improving image processing capabilities, we would love to hear from you.

Best regards,
Alex

elfin valve
#

Hey friends,

Some of you have sent me DMs here, but your privacy settings don't allow me to reply or send you a friend request. Please ensure your privacy settings are adjusted, or send me an email instead.

Best regards,
Alex

hollow spruce
#

either you messed up a setting, in how it is saved in kohya/onetrainer/diffusers
or your venv/torch got messed up.

I'd recommend for you to check the settings for how your lora is saved, and if you dont see any obvious issues, then wipe your current venv (virtual python environment), then let it install fresh, then try again.

hollow spruce
# tame otter are there any good guides, like not tutorials, just in depth information regardi...

nop 😦

but based on the large scale finetuners (not companies) I've met, they all fall into 3 camps:

  • Amazon S3 Buckets. Example: https://huggingface.co/datasets/ptx0/photo-concept-bucket which is then precomputed and hosted on S3 and loaded via a json file
  • JSON only Datasets: https://huggingface.co/datasets/CaptionEmporium/furry-e621-sfw-7m-hq which include multiple captions for each image, enabling multi-caption style training (very effective, but also costly to train)
  • For complete local storage, but datasets of above 100k images, you either do the lots of folders + json solution, or go the hydrus network route. Both are equally painful.

#

If you were talking about the balancing act of said dataset, then there's just no agreed upon solution.
and most companies are just winging it (terribly) - which is why we have these extreme biases to begin with.

hollow spruce
# elfin valve Hey friends! I'm looking for paid lessons from someone with experience in traini...

another victim of SECourses/Furkan? 🥲

You've prob already found someone to help you.
But just wanted to point out key things:

• Training photorealistic is very very different from general training. You actually wanna avoid training things like details, but instead the concept of how a person looks.
Example I trained using 9 images, to make a point of this: https://civitai.com/models/349773/oc-aline
• a "full finetune" opens up more possibilities for training, but that's not always a good thing. Unless you at least roughly know what you're doing, a full finetune will often do more harm than good. A simple lora training in 10~30 minutes will often do the trick just fine.
• Experiences cannot be transferred 1:1 for training models. Training a lora of a white woman, will usually have a similar pipeline. But training one of an indian or chinese woman or man will have significant differences. (Hence why fully automated solutions don't work, unless you have a genuinely balanced base model)

to recognize her using the unique trigger word "mirandakerr"
Trigger words, especially ones using real names, are a very complex topic which I've literally witnessed kill an AI startup.
Tokens already have meanings assigned to them, meaning there is no "one size fits all" fully automated solution.
For the sake of simplicity: you need to train both the unet and the CLIP models correctly, without overfitting either, if you want custom trigger words to work.
(It's not that custom trigger words don't work. THEY DO! It's just that they sometimes work better and sometimes worse, and that's something you need to be aware of. Sometimes a LoRA fails completely simply due to the chosen trigger word, so you have to pick a different one.)

Currently, I'm concerned about the poor quality (fried skin)
Your training is picking up the fine details before picking up the general concept. Min SNR 5 is your friend here. Also, doing only a net rank 32 LoRA will help as well, since you affect fewer parameters (meaning fewer details can be messed up).
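
For reference, a minimal sketch of what "Min SNR 5" does to the loss, assuming the usual Min-SNR weighting for noise-prediction training: each timestep's loss is scaled by min(SNR, gamma) / SNR with gamma = 5. The function name is made up for illustration; trainers compute the SNR from their own noise schedule internally.

```python
def min_snr_weight(snr: float, gamma: float = 5.0) -> float:
    """Loss weight for one timestep under Min-SNR weighting (noise prediction).

    High-SNR (low-noise) timesteps, where fine detail is learned, are
    down-weighted; timesteps with SNR <= gamma keep a weight of 1.
    """
    if snr <= 0:
        raise ValueError("snr must be positive")
    return min(snr, gamma) / snr


if __name__ == "__main__":
    # Illustrative SNR values only, not taken from a real schedule.
    for snr in (0.5, 5.0, 20.0, 100.0):
        print(f"SNR {snr:>6}: weight {min_snr_weight(snr):.3f}")
```

So with gamma = 5, a nearly clean timestep with SNR 100 only contributes 5% of its raw loss, which is roughly why the fine details stop dominating early training.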

Kohya and OneTrainer give fairly similar results if you've got a grasp on the parameters. Masked training, while useful, also has negative side effects, so relying on it completely isn't a good idea either.

The final goal is to create a service for photo-realistic model training
If you wanna match the currently existing services (phone apps and a few online sites), that's not too hard. They rely mostly on overfitting a fair bit, and thus don't bother with the typical issues that might pop up. This can be automated fairly simply.
If you wanna beat them in quality, then that's genuinely hard, due to the issues I mentioned earlier, which will all occur if you don't cheat while training.

The Pareto principle really applies here. You can get 80% of the way, with 20% of the effort. But the closer you wanna get to 100%, the amount of time and effort you invest will rise exponentially.

elfin valve
# hollow spruce another victim of SECourses/Furkan? 🥲 You've prob already found someone to he...

Wow. So much useful information you pointed here. Thank you so much!
I haven’t checked SNR yet, but will definitely take a look at it.
Regarding masked training: many datasets I’m working with may be photo studio shots with a constant background (white background), and I’m afraid the model will learn white background during the training (which I don’t want it to do).
I noticed you trained DoRa models as well. Have you noticed differences in quality between LoRa, DoRa and full fine tunes by any chance?

hollow spruce
#

and I’m afraid the model will learn white background during the training
• With a very simple Photoshop script, you can automatically replace the background. <- good enough to add random backgrounds to images, so that none of it gets picked up by training
• "simple background, white background" can be added to the captions of the whole dataset, then used as a negative during inference. That also gets rid of the background. <- change white to whatever background color you actually have
• Masks can be used, but doing so means the model loses the context of "where" to generate your new data. <- also works, but isn't less work than the other 2 options. Best to try all 3 for your unique scenario
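
The caption option can be scripted. A minimal sketch, assuming kohya-style sidecar captions (one comma-separated .txt file per image); the function name and the folder layout are assumptions for illustration, not something from a specific tool:

```python
from pathlib import Path


def append_tags(caption_dir, tags):
    """Append tags to every .txt caption in caption_dir.

    Tags that a caption already contains are skipped, so running the
    function twice does not duplicate them. Returns the number of files
    that were modified.
    """
    changed = 0
    for path in sorted(Path(caption_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        existing = [t.strip() for t in text.split(",") if t.strip()]
        missing = [t for t in tags if t not in existing]
        if missing:
            path.write_text(", ".join(existing + missing), encoding="utf-8")
            changed += 1
    return changed


# Example call (hypothetical folder name):
# append_tags("dataset/aline", ["simple background", "white background"])
```

Work on a copy of the dataset first, since this rewrites the caption files in place.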

Have you noticed differences in quality between LoRa, DoRa and full fine tunes by any chance
Yeah. A lot.
DoRA is a massive improvement to everything, on the same scale as full finetuning and then extracting a LoRA. But it does come with the same possible pitfalls as full finetuning, where more can get messed up. But once you have experience with captioning and a basic preset that works, DoRA is basically as easy to use as a LoRA, just better in a lot of ways.

regal shore
#

Hi everyone, can you advise: for 370 images, what is the minimum number of epochs and steps needed for average quality?

hollow spruce
#

using kohya?

hollow spruce
#

names are pretty much identical in onetrainer, so should be fairly easy

#

just remember to turn on min snr 5 in onetrainer, since that one's hidden away in a dropdown

#

epoch 100 will be your target

#

I usually let it run to 150, in case I prefer a slight overfitting
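
To turn those epoch targets into step counts, here is a rough sketch. It assumes 1 repeat per image, no gradient accumulation, a partial last batch that still counts as a step, and the batch size of 8 from the preset discussed in this thread; the function name is made up.

```python
import math


def total_steps(num_images, batch_size, epochs, repeats=1):
    """Optimizer steps for a run, counting a partial last batch as one step."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


print(total_steps(370, 8, 100))  # 4700
print(total_steps(370, 8, 150))  # 7050
```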

#

https://civitai.com/models/597892/juno-overwatch-for-pony-properly-trained
and
https://civitai.com/models/596487/sonoshee-mclaren-for-pony-redline

were both trained with those settings, on pony sdxl. so you can look at those for what quality to expect from that preset

regal shore
#

Thank you

hollow spruce
# regal shore Thank you

you're welcome. and remember to adjust the learning rate in case you change the batch size
learning rate / batch size = 0.0001
that's why the preset has 0.0008 with a batch size of 8
works if you have a 3090 or 4090.

if you have a smaller gpu, then just adjust it using that simple formula
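
That rule as a tiny helper, in case it helps. This is only the linear scaling described above (0.0001 per sample in the batch); the names are made up for illustration.

```python
LR_PER_SAMPLE = 0.0001  # the "learning rate / batch size" constant from the preset


def scaled_learning_rate(batch_size, lr_per_sample=LR_PER_SAMPLE):
    """Learning rate that keeps learning_rate / batch_size constant."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return lr_per_sample * batch_size


for bs in (1, 2, 4, 8):
    print(f"batch size {bs}: learning rate {scaled_learning_rate(bs):.4f}")
```

For batch size 8 this gives back the 0.0008 from the preset, and 0.0004 for batch size 4.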

hollow spruce
#

@unborn wind

#

I think you got the wrong channel

unborn wind
#

Lol I actually didn't. thank you!

remote creek
#

Hi, could you please advise me on how to format/partition? I got a new 4 TB SSD. I want to install Linux and Automatic1111 on it and use it to store all my big files (models, LoRAs, etc.), AND sometimes use it for webuis in Windows as well (SD.Next, Forge). So I want to reuse the same directory for LoRAs and models across a Windows and Linux dual boot. The issue is that sharing requires NTFS, and I understand that having the models on an NTFS partition will slow down the Linux performance of Automatic1111. Is that correct?

lime pawn
#

Little chubby guy eats fruits

hollow spruce
#

under models is where I have all checkpoints, loras and stuff saved. it's also where I automatically store things I train myself

#

And every application has some way of being able to point at a different directory for models.
For A1111, you edit the webui-user.sh
export COMMANDLINE_ARGS="--data-dir '/mnt/md/0/AI/A1111/stable-diffusion-webui' --ckpt-dir '/mnt/md/0/AI/MODELS/checkpoints' --lora-dir '/mnt/md/0/AI/MODELS/lora' --embeddings-dir '/mnt/md/0/AI/MODELS/embeddings' --listen --port 16999"

comfy, swarmUI, and all the other generators can also have different locations set up - you just point them at that models folder

#

as for windows and linux sharing a drive - there's info on that to be found online, and is not specific to stable diffusion. You can basically just "mount" the shared partition in linux. pop!os makes that fairly simple - so read up on their site or ask on their discord

hollow vortex
#

Hi, I have a question about finetuning stable diffusion for inpainting.

I came across this paper, which is similar to the work we are doing: https://arxiv.org/abs/2312.03606. They finetuned the Stable Diffusion model to create synthetic satellite images. After reading this paper I was wondering if it would be possible to finetune the Stable Diffusion Inpainting model as well. For instance, we could mask out an area in an image and then our finetuned model could fill in that area with a road, river, or something of that nature.

I currently have input images and masks of land cover features on hand. I do not have any prompts but I am under the assumption I could possibly create some myself if needed. What would be the best way to go about finetuning the stable diffusion for inpainting model on my dataset (~10000 images)?

For reference, I have trained a custom UNet2DConditionModel for inpainting from scratch with simple labels such as grass, trees, water, etc., using my dataset that I posted about here: https://discuss.huggingface.co/t/custom-pipeline-inference-speed-extremely-slow/89642 and was able to get some decent results (I posted this because the inference speed was slow, but that has now been fixed). I pulled the off-the-shelf Stable Diffusion inpainting model and attempted to inpaint certain features into the images, and it did pretty well but could definitely use some work. With that being said, I was hoping that finetuning the Stable Diffusion inpainting model could outperform my current model.

pseudo sail
#

Hi all, looking to make my first dataset. Are there any web-based or mobile-friendly tagging tools? had a look at taggui but it's desktop only from what i can tell

stone garden
#

any 4090 guys here trained flux yet?

stiff dust
#

still trying. It works but results are not really good yet

golden quail
#

Has anyone tried applying ReFT to any diffusion model yet? Seems to beat DoRA and with much fewer parameters for LLM benchmarks https://arxiv.org/abs/2404.03592

torpid basalt
#

which method is the best for training an artistic style with a lot of images? (LoRA, Dreambooth, Custom Diffusion, etc.)

cloud ice
#

hey, not sure if this is the right place to ask such a question, but i am currently trying to finetune sd1.5 with python. i don't get any errors, but my output test images are exactly the same as before training (pixel comparison). i also compared the unet .safetensors files with the cmd "fc /b file1 file2" and no changes were detected.
my idea for the dummy data was to just use the same image several times with the same description as a proof of concept. I then use exactly the same prompt to generate test images, hoping to see some resemblance to my dummy input image
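
For anyone wanting to repeat that check outside Windows, a minimal sketch of the same byte-level comparison using only the Python standard library. The function names and the example paths are made up. If the two digests match, the saved unet is byte-for-byte the starting checkpoint, which points at the weights never being updated or the untrained copy being the one that got saved, rather than at the learning rate being too low.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large checkpoints fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def weights_changed(before, after):
    """True if the two files differ anywhere (same idea as `fc /b`)."""
    return file_digest(before) != file_digest(after)


# Example call (hypothetical paths):
# weights_changed("unet_before.safetensors", "unet_after.safetensors")
```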

any help would be highly appreciated ❤️

cloud ice
#

also tips on where to ask would help 😄

jade hornet
#

why 1.5? Old model. Anyway maybe use kohya vs trying to write it yourself

cedar birch
#

Hello!

Has anyone tried training on this version of Juggernaut XL: https://civitai.com/models/133005?modelVersionId=456194

I did, but the results were not good enough. The dataset was captioned with the same tool as Juggernaut and has around 300 images.

Can you give me some advice on how to improve it?

Some examples of captions:

A cocktail, amber fluid color, old-fashioned glass, ice cubes 75% ice, lemon slice garnish in glass, reflective surface, vibrant surroundings, illuminated counter, glistening drink

A cocktail, vibrant magenta fluid color, coupe glass, lime wheel garnish on the rim, rose petals around glass, blurred background with bokeh lights, indoor setting

stone garden
#

it used to be, I trained a lot of my first SDXL loras with jugg and copax

#

back before I was training checkpoints

stone garden
#

ok, but Lightning models are the best to train with, as you need a lot of images, and Lightning runs at CFG 1 and 2 steps

stone garden
#

there is a render with those settings

#

even more reason to use it if you're renting

cedar birch
#

Try to render a cocktail; SDXL knows fewer concepts for cocktails.

stone garden
#

what a cocktail drink?

cedar birch
# stone garden what a cocktail drink?

A close-up of a champagne flute on a marble surface, filled with sparkling champagne. A lemon twist spirals around the inner rim of the glass. The background is dark, emphasizing the clarity and bubbles in the drink, creating an elegant and sophisticated atmosphere.

#

Something like that

#

I'm aiming for something like that, but without the bottle

stone garden
#

i have just written this and it's rendering now: on the beach in the sand is a cocktail glass with a colourful drink inside, with ice cubes and a light condensation on the glass, in the back ground is a stunning sunset on the ocean horizon,

cedar birch
#

The main problem is a different one. I need an image for each recipe, so I need full control over glass type, fluid color, soft drink or not, with/without ice, garnish, etc.

stone garden
#

tell me what glass and colour drink

cedar birch
#

glass: champagne flute
fluid color: transparent golden

extras:
without ice
with lemon twist garnish on the glass rim

stone garden
#

if you're training, you need every type of colour, not necessarily the brand of drink, as the checkpoint will know that. you just need primary colours and different glasses

#

ice, garnish is all handled by the checkpoint

cedar birch
#

Some of my captions

A cocktail, amber fluid color, old-fashioned glass, ice cubes 75% ice, lemon slice garnish in glass, reflective surface, vibrant surroundings, illuminated counter, glistening drink

A cocktail, pink fluid color, lowball glass, crushed ice 90%, cucumber slices in glass, grapefruit slice on rim, mint sprig on rim, tray surface, daytime

A cocktail, white fluid color with foam on top, coupe glass, cinnamon stick garnish on rim, pink and yellow background

#

334 images total

stone garden
#

the lora is a very basic guide for more custom characters for the newer models, as the new models are so well trained they don't really need a lora

#

i'll put those prompts into flux and you will see no need for a lora

#

ok, used this one: A cocktail, pink fluid color, lowball glass, crushed ice 90%, cucumber slices in glass, grapefruit slice on rim, mint sprig on rim, tray surface, daytime

#

spot on... no need for a lora or neg prompt

cedar birch
#

The cucumber is not real.

#

Real photo

stone garden
#

could make it more real with 1 word added to your prompt

#

that's not what your prompt asked for tho lol

cedar birch
#

Flux 😄

stone garden
#

that can't be the same prompt i used

#

they are too different

stone garden
#

oh

cedar birch
#

Flux works better, it's true.

stone garden
#

rendering for your red drink

stone garden
#

my flux hyper on the left your google image to the right

#

A cocktail, red fluid color, tumbler glass, medium ice cubes 90%, lime slices in glass, rosemary, pommegranite, white marble table, daytime

cedar birch
#

What did it do with the pomegranate xD

stone garden
#

omg i just noticed that lol

#

😆

cedar birch
#

Flux is better, but you see :)

stone garden
#

it's so well trained it even knew what i meant from that shambles

#

running it correctly now

#

voilà, better

#

lol

cedar birch
#

the next problem is the layers (real photo below)

#

A cocktail, amber and green gradient fluid color, highball glass, crushed ice, 80% ice, mint sprig garnish on top, tequila bottle in background, ice on plate, bar setting

stone garden
#

ok here goes

#

A cocktail, amber and green gradient fluid color, highball glass, crushed ice, mint sprig garnish on top, tequila bottle in background, ice on plate, bar setting, UHD, microscopic photography, magnified, molecular, unseen worlds revealed, scientific exploration, capturing molecular details, professional imaging techniques, precise focusing, revealing hidden beauty, scientific discovery, artistic interpretation, nano scale, revealing the wonders of the unseen

stone garden
#

ip adapter

dull wigeon
#

Mia Khalifa serving food in a restaurant wearing a protective revealing leather outfit

granite rover
#

Does someone know if I can caption my images with Llama 3.1 to create some LoRAs (in ComfyUI)? I will just use 20-30 images.

sharp basalt
#

Commercial photography, powerful yellow powder explosion, fried chicken, black background, bright environment, white lighting, studio lighting, OC rendering, super detail, solid color isolation platform, professional photography, color gradinging About Midjourney Parameters --ar 9:16 --v 5.2 --s 750 --c 0 --q 1

stark veldt
#

hi guys, i learned recently a bit about finetuning and i have some questions about how dynamic it is.

im building an app where one could input their biz/startup/game idea and go through steps where they will be generating context about it (objective, target audience, biz model, etc)

every time you generate something, it is used as context for next time you interact with an AI

rn im also working on a step to generate a logo for that brand.

what im doing currently is im dumping the brand context into GPT and asking it to create a prompt for SDXL, which i then insert with some other prompt keywords to make sure it looks like a proper logo

the issue is GPT's prompts are kind of trash.

i was wondering how well fine tuning would work if i:

  • Got a bunch of pairs of brand context dump -> ideal logo image
  • Trained an SDXL model on it

i.e. would it work well if the "prompt" is a huge context about a brand, rather than some small trigger keyword?

or is that not how it works?

compact mango
#

I fine-tuned a model with images of an actor from an old movie. All of them have a characteristic look due to the camera quality. Now the model generates images in that particular style, but not with the face that I wanted to achieve. What did I do wrong? Should I improve captioning or prompting? I don't want to replicate the style, just the face.

rare fern
#

Hey everyone
i want to train an LLM SDXL fine-tune model with about 100k images. i have trained a model on 84k images so far but haven't gotten any better results. can anyone tell me how to get started?

cloud matrix
#

don't use an LLM with vision capabilities. they can be good but they're overkill for flux training. This is out. Florence 2 is a Vision Language Model that's more lightweight and specialized. Look into that.

full trellis
#

why didn't i see a ControlNet tab after i installed ControlNet in stable diffusion?

stone garden
#

yo i got a question about dataset preparation.
i am currently distilling a lora by making a big amount of wildcard prompt based images and simply discarding low quality and bad character features.
the idea i'm having to make the lora less biased towards one style is to try and generate multiple different styles using the already existing lora.
so i'll make an equal amount of different style images of the character using the old lora and sdxl, take the best images and then train a flux lora on those images

#

how many images does a flux lora usually require, and how can i make it learn the character's features abstractly and not directly associate them with one particular style?

jade hornet
#

You mention style and you mention character. Normally a character training would use subject captioning, where you describe as much in the image as possible except the character, because you want the AI to learn the character. Style training is different: you want to describe as little as possible, because you want it to attempt to learn the style, which is a more abstract concept.

stone garden
#

i used joycaption to caption the dataset and then removed things that describe the recurring features of the character

jade hornet
# stone garden i used joycaption to caption the dataset and then removed things that describe t...

nevertheless, I feel like your goals are diametrically opposed. style training and character training are different methods. Either try to train with no captions and just see what happens, or do them separately, or use a multiple-concept lora with one concept trying to capture the style and one trying to capture the character, with different triggers for each. or you could do 2 separate loras, worst case. remember that flux is very new to all of us still, so any advice is to be taken with a grain of salt

stone garden
#

huh i just want to capture the character and nothing else, no style

#

i want a token to be associated to the character

#

i dont want to have to describe my character in detail just to get him

#

as it is with the pure joy caption loras where character features havent been purged

#

i did a new dataset and new captions and this time i'm just referring to the character as if the model already knew it

#

using the token i chose

#

the description that has the most content is the background and style of the image

jade hornet
#

I see, well character training is something flux should handle easily, even with few images

stone garden
#

because i want a lora that can portray the character in different styles

#

not just in the original one it was made in

cloud matrix
# stone garden not just in the original one it was made in

there's lots of strategies here. If your dataset is consistent enough, you can get away with just using one trigger word to train. If the dataset is varied, you'll want to use captioning. This guy has been writing tons of training diaries. this and his other writings about flux are insightful

https://civitai.com/articles/6868/flux-character-caption-differences-training-diary

gentle flame
#

I'm not technical enough to explain it properly, but the author described it as rmsprop with a low pass filter on top.

amber cypress
#

Hi:) I just finished a dataset of around 40gb of architectural images. Tagged by scraping the original captions, extracting keywords + florence2 descriptions.

#

The goal is ideally to make a full fine tune of Flux, since with so many images a lora might not make sense. Any tips, guides?

#

Around 75k images

jade hornet
#

you would typically be right with that approach. a 75k image dataset is probably not suitable for dreambooth/lora. The problem is flux is so new I don't know anyone that has done a full finetune on it yet. basically, you're treading into new territory here

twilit cradle
#

hey hey quick question if anyone knows! kinda new to this. I am using kohya to train loras, but want to queue up the trainings to test different parameters (when one training ends, the next starts up without me having to manually start it). All i have found so far is to print the training command for each configuration, and then paste them into a script or .bat file? that's my first question. 🙂
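
A minimal sketch of that "paste the commands into a script" idea in Python, so it works the same on Windows and Linux. The function name is made up, and the command strings are placeholders: paste in whatever commands kohya prints for each configuration.

```python
import subprocess


def run_queue(commands, stop_on_error=True):
    """Run shell command strings one after another.

    Each command starts only after the previous one has exited.
    Returns a list of (command, exit code) pairs for the commands that ran.
    """
    results = []
    for command in commands:
        print(f"starting: {command}")
        completed = subprocess.run(command, shell=True)
        results.append((command, completed.returncode))
        if completed.returncode != 0 and stop_on_error:
            break  # leave the rest of the queue untouched after a failure
    return results


# Placeholder queue; replace each string with a printed training command:
# queue = [
#     "<training command printed for configuration A>",
#     "<training command printed for configuration B>",
# ]
# run_queue(queue, stop_on_error=False)
```

Passing stop_on_error=False lets the remaining trainings run overnight even if one configuration crashes.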

pine relic
#

hi all, I am a complete beginner to finetuning SD models. I have tried to set up the Automatic1111 web ui on my mac but failed badly, as I was getting some error related to the device type being mps which I was unable to fix.

do you have any advice on where to start if I am a complete beginner with not much knowledge of software development and need an easy way to finetune the sd model? The use case is teaching it to generate images of a specific product. Thank you in advance. And sorry if this is written anywhere, I was unable to find it

jade hornet
# pine relic hi all, I am a complete beginner to finetuning SD models. I have tried to set up...
  1. don't do it on mac if you can avoid it. what matters most is gpu, and unfortunately apple does not shine here.
  2. automatic1111 won't help with finetuning, it's only for doing image generation locally.
  3. check out this link for a good training app. maybe you should look into doing it on a cloud service such as vast.ai, if you have limited compute choices locally. https://github.com/bmaltais/kohya_ss

jade hornet
#

I stopped using colab when they changed their EULA to forbid it on their free tier. you can pay for colab pro, but honestly vast is better imo

mighty sedge
#

first time i manage to make my loss graph look this beautiful lmao

jade hornet
#

.08? wow...hopefully the actual results match that

mighty sedge
#

nah, it overfit a long while ago lmao

#

this is actual

jade hornet
#

umm, well depending on what you're going for... 😄

mighty sedge
#

I think best epoch was this one:

#

I'm trying to revive a dead artist style, but sdxl can't keep up with original, I've been trying for weeks

#

this is the source, look at those patterns.. wish sdxl could do it lmao

jade hornet
#

that's a lot of detail

mighty sedge
#

yah.. might have better luck with flux, but I don't got the hardware/patience for it

jade hornet
#

hard to say, I'm still working on flux loras, and kohya only just recently added the ability to train the text encoder.

mighty sedge
#

any luck with text encoder training? I'm used to training the unet only, not sure if the text encoder benefits style use cases like this

jade hornet
#

well it converges quicker I can say that, but my current training is still in progress, so I'm not satisfied with it yet we'll say

mighty sedge
#

I see, pretty nice to share insights into this black box that is training lmao

jade hornet
#

I'm doing one with multiple concepts just to see if it works...when I get strange defects I cant tell if it's going off course or if I just need more steps, so onward

mighty sedge
#

yah, too many params, overfitting can take many forms. I tried the 2 layer flux training on civitai and it created carbon copies of source material, tried again burning a heap of buzz, 32dim,16 alpha, and got decent results
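
On the 32 dim / 16 alpha part: as far as I understand the common kohya-style LoRA setup, the learned update gets multiplied by alpha / dim before it is added to the base weights, so alpha acts like a built-in damping factor. A tiny sketch of that arithmetic (the function name is made up):

```python
def lora_scale(alpha: float, dim: int) -> float:
    """Multiplier applied to the LoRA update: alpha / dim."""
    if dim <= 0:
        raise ValueError("dim must be positive")
    return alpha / dim


print(lora_scale(16, 32))  # 0.5
print(lora_scale(32, 32))  # 1.0
```

So 32 dim with 16 alpha applies the update at half strength compared to alpha = dim, which behaves a bit like halving the learning rate for the LoRA weights.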

#

can't beat good ol adamw with low learning rate

mighty sedge
#

I kinda like it, but not good enough for me..

#

Maybe lowering learning rate could help with the patterns and details?

#

already running Loha at 32rank so pretty high

mighty sedge
#

Yah.. gave a shot on flux..

jade hornet
#

I'll say that's definitely an interesting and complex style to try to mimic, it's not bad even if it misses the mark

mighty sedge
#

yah.. I got some interesting results after training.. much closer to source

#

just released it on civitai, "ayahuasca dreams"

#

but ofc there's no soul.... the original artist had intention behind each detail

charred ferry
#

Hey there everyone!

I want to develop an application which converts hand drawn sketches to images of clothes/garment etc

Can i achieve this by finetuning StableDiffusion? If yes, how so?
I would appreciate resources on this

Moreover, my model will have image+text input

Is there any other better approach than stable diffusion?

Open to all suggestions, thankyou!

mighty sedge
#

SD can already do this using img2img. Take a look into controlnets, "canny" and "lineart"

#

by using prompt + img2img + lower-strength controlnets you can achieve this

#

Finetuning or LoRAs are used in case you want to teach the models new concepts, but there are already a bunch of photorealistic and fantasy models, so unless you need something very specific you wouldn't use it.

#

Something of note is that there are "inpaint" models, which work better in some img2img scenarios. from personal experience "pony" models work best too.

#

this is a pretty bad example using canny only, not even img2img with a paint sketch