#🔧|finetune
do a new fine tuning on SDXL.
do whatever you have RAM and data for. you can start by trying a LoRA fine tuning on SDXL with the same data you used for your other fine tunings.
Rockin
What happens if I use a LoRA trained with SDXL alongside a checkpoint based in 1.5?
they are not cross compatible
you can always turn 1.5 latents into pixels, then into XL latents
and run img2img
there isn't really a point in that though. simpler workflows, in my experience, have always been better, whereas improving prompts yields better results
yes that works. But I would do that only if you don't have proper training data for sdxl.
It isn't that I don't have proper training data
It's that the only checkpoint (for furry art) I've found which produces anything other than crap has 1.5 as a base model
Hey finetuning people
I'm back with 1000 questions
I'm trying to understand the machinery. That's how I've always done best. So okay...
Y'all have made pretty clear that if I'm using training upon a checkpoint based on SD 1.5, which has native resolution 512x512, then all my images should have that resolution. I can see in my mind's eye (I think) why - the base library has x output neurons, where x is 512x512
Assuming that's correct - what happens when I ask a 1.5 based model to give me something with a different resolution? Does it distort? How does "native resolution" become "arbitrary resolution"?
Hi everyone. I am looking for someone who can train a LoRA model to convert a selfie into an AI grillz image. Please DM if someone is available
i think the single most important thing to realize is that you don't need high resolution generations; that you can get very far with a square aspect ratio; and if a square aspect ratio isn't enough, then a 16:9 ratio is good, and then maybe a 9:16 one, and that's it
Distortion yeah
Ask for much beyond the standard resolutions and aspect ratios and you get mutated limbs and duplicated objects
Generally 768x512 is fine, 768x768 is usually just over the line
Hi everyone. I finetuned SD1.5 on a dataset with mostly half-body images. The clothes trained well, but the face is kinda distorted. Can I improve it by further training on a face-only or close-up dataset (like a regularization)? Or is there any other way to improve the face quality?
Hi, I'm trynna train a model on art from Yabujin. In total I wanna do training on drain gang too, stuff like my profile picture basically, but I wanted to take things slowly first and see how it goes. But I'm already having problems obviously, so if anybody could help me out to get some good results, any help would be appreciated. :)
this is my code
and this the current training data
I used to have a way larger one, like 200 images basically, but I didn't caption them all so well. I thought the ai would figure out the style itself, without much captioning from me, due to the sheer size Lol
Hi all!
I am learning to finetune a model on dreambooth. I just need a small dataset for getting through basics. If anyone has any resource where I can find small image datasets of the same object, pls share them
Ex: 25 images of the same animal/thing/place
why would you be doing finetuning without knowing what you are finetuning? figure out what you want to do, then solve the problem
you are not talking to me right
no
it's not working that way.
In general, SD supports any resolution. But the way objects are composed and arranged to each other (probably by the convolution layers) is trained on a specific resolution. The model knows how to place a face in 512x512 . if you give it a 1024x1024 it will get confused and create multiple faces. That's why you should rather stick to the resolution it was trained for.
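A rough way to picture this: the UNet never sees pixels, only a latent grid that the VAE produces. The sketch below assumes the usual SD/SDXL VAE downsampling factor of 8 and 4 latent channels; the function names are made up for illustration.

```python
# Sketch: size of the latent grid the UNet denoises for a given pixel resolution.
# Assumes the usual SD/SDXL VAE downsampling factor of 8 and 4 latent channels.
VAE_SCALE = 8
LATENT_CHANNELS = 4


def latent_shape(width: int, height: int) -> tuple[int, int, int]:
    """Return (channels, latent_height, latent_width) for a pixel resolution."""
    if width % VAE_SCALE or height % VAE_SCALE:
        raise ValueError("width and height should be multiples of 8")
    return (LATENT_CHANNELS, height // VAE_SCALE, width // VAE_SCALE)


def area_ratio(width: int, height: int, native: int = 512) -> float:
    """How many times more pixels than the native training resolution."""
    return (width * height) / (native * native)


if __name__ == "__main__":
    for w, h in [(512, 512), (768, 512), (768, 768), (1024, 1024)]:
        print(w, h, latent_shape(w, h), round(area_ratio(w, h), 2))
```

At 1024x1024 a 1.5-based model is asked to fill four times the area it was trained to compose, which fits the duplicated faces and limbs described above.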
there are ways to be clever and maybe outpaint the image, combined with controlnet or other tools to achieve the original intended result...or just use XL which was intended for higher resolutions
Is there a discord community for training loras and such?
I've found a couple others, but frankly they're pretty low volume
the r/stable diffusion reddit has a discord, for example
My dataset has images like 1.jpg, 1.txt, 2.jpg, 2.txt, and so on. The .txt files contain tags for each image. Do I need to set this setting to .txt, or are .caption files something else?
put .txt in that box
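Before training, it can be worth checking that every image actually has a caption file with the extension you entered. A minimal sketch, assuming the 1.jpg / 1.txt layout described above; the helper name and the list of image extensions are my own choices, not something the trainer defines.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to whatever your dataset contains.
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".webp"}


def find_missing_captions(image_dir, caption_extension=".txt"):
    """Return names of images that have no sibling caption file."""
    missing = []
    for path in sorted(Path(image_dir).iterdir()):
        if path.suffix.lower() in IMAGE_EXTS:
            # 1.jpg -> 1.txt (or 1.caption, depending on the setting)
            if not path.with_suffix(caption_extension).exists():
                missing.append(path.name)
    return missing
```

Calling `find_missing_captions("path/to/dataset")` returns an empty list when every image is paired with a caption.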
alright, thanks!
having a lot of trouble getting cascade to train
me, too. My feeling is that Cascade is really bad for training.
same :/
like it works with high effort. but it doesn't ever work perfectly. Some things are easy to train, but most are hard.
really depends on what you're aiming for. Just dont ever aim for face/person loras XD
We created an index for the datacomp-12.8M dataset using Fondant and published it on the huggingface hub. You can find more details and info on how to use it in this short post.
You could use the dataset to fine-tune your own controlnet models.
Hello Everyone,
I need some help getting started with training my own Dreambooth or Lora.
I have a good local system with 24GB GPU Vram and know how to use ComfyUI (& automatic).
I want to train on a very old comic book style and I have a couple of those comic books with me.
This is what I found on the resources section on discord https://huggingface.co/docs/diffusers/training/lora
Is there a video you'd recommend I follow to get started on this journey.
It would help me a lot. Much appreciated.
I've had good luck with OneTrainer. It's pretty easy to use and a little lighter on vram than kohya. Also faster
Hello -- hope this is the right place for this question: I am attempting to run a py script that is calling on tokenizer/tokenizer_config from https://huggingface.co/base_model/resolve/main/tokenizer/tokenizer_config.json but it is returning a 404 error, and checking the link leads to a "Repository not found" page. Is this likely a temporary outage with hf, or did I do something wrong / overlook something else? Log snippet if helpful:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_errors.py", line 304, in hf_raise_for_status
    response.raise_for_status()
  File "/usr/local/lib/python3.10/dist-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/base_model/resolve/main/tokenizer/tokenizer_config.json
thanks for the heads up 👍🏼
if you need help getting started lemme know
i'm far from an expert but i've gotten a few pretty decent ones in my limited experience
do you have a shot from the comic book that you are talking about?
If anyone can share their lora settings for onetrainer I'd love to see, especially for training a style. The UI is so much nicer than kohyas mess of nested tabs but there's almost no resources for it
I tried to replicate prodigy settings I've seen around for kohya but it didn't do too well in OT
i haven't done a style, but i could share ones that have worked reasonably well for a character
That'd help, I think the biggest difference in character vs style is in captioning anyway. I've actually had okish results just using the XL default settings in OT, but I don't think they're entirely appropriate for pony models which is my problem
how much vram you got? if you're at 16 or 24 I can help
For everyone who has captioning issues:
https://github.com/jhc13/taggui
added moondream1 as a model for auto captioning.
Definitely not the best model, but it runs on a toaster
so as long as you have 6gb vram or more, you can do (relatively good) auto tagging
(if you're at 14 or more... use cogvlm!)
do you author this?
nop. just use it a lot.
24 here
[[subsets]]
num_repeats = 1
caption_extension = ".txt"
shuffle_caption = false
flip_aug = false
is_reg = false
image_dir = "A:/Datasets/npcp/source"
keep_tokens = 0
[noise_args]
[sample_args]
[logging_args]
[general_args.args]
pretrained_model_name_or_path = "B:/SD_models/checkpoints/sdxl/sd_xl_base_1.0_0.9vae.safetensors"
mixed_precision = "bf16"
seed = 23
max_data_loader_n_workers = 1
persistent_data_loader_workers = true
max_token_length = 225
prior_loss_weight = 1.0
sdxl = true
xformers = true
cache_latents = true
cache_latents_to_disk = true
no_half_vae = true
gradient_checkpointing = true
max_train_epochs = 60
[general_args.dataset_args]
resolution = 1024
batch_size = 7
[network_args.args]
network_dim = 64
network_alpha = 1.0
min_timestep = 0
max_timestep = 1000
[optimizer_args.args]
optimizer_type = "AdamW"
lr_scheduler = "constant_with_warmup"
learning_rate = 0.0001
max_grad_norm = 1.0
text_encoder_lr = 5e-5
warmup_ratio = 0.05
min_snr_gamma = 5
[saving_args.args]
output_dir = "A:/Datasets/npcp/output"
save_precision = "bf16"
save_model_as = "safetensors"
output_name = "npcportrait_v2"
save_every_n_epochs = 5
save_last_n_epochs_state = 1
save_state = true
save_toml = true
[bucket_args.dataset_args]
enable_bucket = false
min_bucket_reso = 512
max_bucket_reso = 2048
bucket_reso_steps = 64
[optimizer_args.args.optimizer_args]
weight_decay = "0.1"
betas = "0.9,0.99"
my settings for derrian distro. but they translate 1:1 into onetrainer
Was my settings for -> https://civitai.com/models/336145/npc-portrait-xl-for-basedreamshaperlightning
which was also a pretty high effort style lora. so I can vouch that these settings work for 3090/4090 users
This is a pen & paper npc character portrait generator lora. Its mainly optimized for DND, Pathfinder, and equivalent fantasy games, but it can...
I had 1.2k images for my dataset. (but the settings barely/dont change unless you have a sub 100 dataset)
npcp, simple background, round ears, human, caucasian, girl, a human with a contemplative expression draped in a blue scarf gazing into the distance. <- had captions like this, to reinforce the style (basically do a trigger word, then tag everything that is happening in the image, and nothing about the style)
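For a sense of how long a run with the settings above is, here is a back-of-the-envelope step count. It is only a sketch: it treats "1.2k" as exactly 1200 images and assumes no gradient accumulation and that the last partial batch of each epoch still counts as a step.

```python
import math


def training_steps(num_images, num_repeats, batch_size, epochs, warmup_ratio):
    """Return (steps_per_epoch, total_steps, warmup_steps) for a simple run."""
    steps_per_epoch = math.ceil(num_images * num_repeats / batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = round(total_steps * warmup_ratio)
    return steps_per_epoch, total_steps, warmup_steps


if __name__ == "__main__":
    # Values taken from the config above: num_repeats = 1, batch_size = 7,
    # max_train_epochs = 60, warmup_ratio = 0.05; 1200 images is an approximation.
    print(training_steps(1200, 1, 7, 60, 0.05))  # (172, 10320, 516)
```

With a 50-100 image dataset the same settings give far fewer steps per epoch, which is one reason the epoch count usually gets revisited for small sets.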
Thanks I'll test it out on the next run, probably be more like a 50-100 dataset though so I'll fiddle
For style captions in kohya I had pretty good results having only "by artistname" as the caption, but I'll try adding some non style caps like you did as well
100/400/3k are the big "breakpoints" in dataset size where quality significantly changes, and where you can do captioning things that dont work on smaller sets. keep those numbers in mind
Llava 1.6 is pretty amazing for describing images
Looks like taggui doesn't have 1.6 support yet but I was really impressed testing it out in comfy, I'll def try tagging a dataset with that soon
ha
Hi guys, I want to use the openai clip model and a classification model to format a simple image description (for example a user who wants to create an image) into an sd-prompt.
My thought was to use clip to capture the text features and then train the classification model on the features and labels (which would be the sd-prompts).
Could anybody with more experience tell me if this is a viable way to achieve my goal?
@polar smelt just use CogVLM or ShareGPT4V, or check out some other models available at vision arena (https://huggingface.co/spaces/WildVision/vision-arena) - some of them are pretty good, and you'll probably get better results if you just use one (or more) of those and figure out best prompts for them to get what you want, instead of trying to create new model from scratch
Morning folks. Does anyone pls know any good tutorial on how to set the right parameters for Dreambooth SDXL on Diffusers "train_dreambooth_lora_sdxl.py"? My training is working but the results are not great 🙂
does anyone know how I can do Aesthetic score finetuning?
Set time machine to 1992 and refuel the Delorean with bananas.
probably some dumb mistake like using .caption instead of .txt for captions... or some pretty basic setting that always needs to be enabled... but wasnt
any reason you want to use dreambooth specifically? rather than lora?
if you're determined to use dreambooth original or kohya implementation, you're in for a bit of troubleshooting ^^'
ah yeah. then it checks out.
there's a million things that can go wrong if you're doing finetuning.
best start with an existing preset, to test your dataset if its working at all. then adjust from there
(Onetrainer has a basic finetune preset for each major sd version, to get you started)
if you dont wanna switch trainers, then just take inspiration from the preset, and redo it in your trainer
oh. ooooohhhh.
if you're dealing with datasets under 10k images, then best stick with LoRA
if anything, then you should just take a simple lora preset that people recommend, and spend 80% of your time on making your captions better and better
a lora would work better for u
not enough images for you to train a whole helicopter checkpoint
with people is easier because the model already knows how ppl look
but for machinery you would probably need 10k to 50k
since it already knows the basics of helicopters a lora would be better,u could get like 50 imgs of each heli model and train a lora for each one of them
id say give lora a try,it will save u time and energy if u dont like it u can always train the checkpoint with lots of imgs
for 1.5 i used to do it on kohya but idk whats changed its been a while
https://www.youtube.com/watch?v=gt_E-ye2irQ. Haven't tried it yet but I thought it might help a bit
We show you how to train Loras exclusively in ComfyUI
Github
https://github.com/LarryJane491/Lora-Training-in-Comfy
Got an error when training a lora
that output looks normal, some of it was missing though
wdym
at the end it says returned non-zero exit status 1.
Nevermind, I see it now. It threw an attribute error, which means one of your package versions is wrong. I'd compare your installed packages vs the ones it needs
I can't tell from that which one
I'm a Linux guy, try pip list? You might have to go to the tech support channel. The requirements text files have the versions you need
thanks
im on macos but it's pretty similar to linux
based on unix kernel
There's a pip command to force reinstall and you can point it at those version files, it'd be easier... Something like pip install --force-reinstall -r file.txt... Google it though, that was from memory
(and you'd need to use the venv's pip, not your system's one, assuming whatever you're using has a venv)
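One way to make sure the venv's pip is the one being used is to go through the Python interpreter that is currently running. This is a sketch: `requirements.txt` is a placeholder for whichever requirements file the trainer ships, and the helper name is made up.

```python
import sys
from pathlib import Path


def reinstall_command(requirements_file, python_executable=sys.executable):
    """Build a pip command that reinstalls the pinned versions for one interpreter."""
    return [
        str(python_executable), "-m", "pip", "install",
        "--force-reinstall", "-r", str(Path(requirements_file)),
    ]


if __name__ == "__main__":
    cmd = reinstall_command("requirements.txt")  # placeholder file name
    print(" ".join(cmd))
    # To actually run it from inside the activated venv:
    # import subprocess; subprocess.run(cmd, check=True)
```

Running the script with the venv activated means `sys.executable` points at the venv's Python, so the packages land in the right place.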
i think you've been at it for a while. didn't we discuss that you can already achieve this pretty straightforwardly?
you don't have to fine tune at all
for stuff like this: it isn't meaningful. think about what is visible, not what it's called
if it can't be seen it's not going to fine tune either
if the differences are subtle it's not going to show up in fine tuning unless it is similar to something that has already existed in sdxl.
I have about 2.5 million adult photos (mostly 1280x720 resolution) that I scraped from a large pic gallery site. All of these images include metadata such as a one-line text description of the scene and categories/tags. What would be the best software for making a fine-tune with this scale of dataset?
(I would ideally like to train something based on SDXL or a similarly fast model, because part of what I want to do is video generation, but I'm open to alternatives)
SVD takes an input image that can be created from any model, so it isn't specific to sdxl
Do I need video samples for finetuning, or can it work with just images? I could also easily scrape a video site, but the dataset size would be massively larger (my image set takes up nearly 500GB). I could always compress/convert the video to lower-res, but wonder how this would affect training/output quality
Also, does SVD let me use large image sets (millions) or does it just take a single image?
A.) invest hundreds of hours, to learn the skills to take on a project of this magnitude
B.) invest thousands of dollars, to pay someone/or a group/ who already has the skills, to make a finetune of this level for you, using this dataset
what you're asking is the equivalent to "I got a hold of lots of spare car pieces and raw metal. How do I use this to tune my car to look better and go faster. Preferably I want to use less gas as well, since I want to compete with supercars"
its not that there's an issue with the ambition, but you're best off starting small, and learning how to make a small lora based on 400 images. then on 3k images. then on 10k images. then take what you've learned, and start from new again, since genuine full finetuning is even more destructive/hard to do right, and work on making 10k finetunes until you get good enough, to slowly scale up.
Expect your final final finetune on 2.5M images to cost several thousand dollars and to need dedicated rented cloud hardware (batch size matters, so you'll need an A100 cluster to do it in a reasonable amount of time, or else you'll wait about 3 months for a single A100 to get through it), and you only get one chance unless you're willing to invest that kind of money again and again
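To get a feel for numbers like these, a quick wall-clock estimate helps. The throughput and epoch values below are placeholders I picked for illustration, not measurements; plug in whatever your own small test runs actually achieve.

```python
SECONDS_PER_DAY = 86_400


def estimate_training_days(num_images, epochs, images_per_second):
    """Rough wall-clock estimate, ignoring data loading, sampling and restarts."""
    return num_images * epochs / images_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumed throughputs (images per second); replace with measured values.
    for label, ips in [("one GPU (assumed 1 img/s)", 1.0),
                       ("eight GPUs (assumed 8 img/s)", 8.0)]:
        days = estimate_training_days(2_500_000, 3, ips)
        print(f"{label}: about {days:.1f} days")
```

Measuring images per second on a 400-image LoRA run first gives a much better input than any guess.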
I understand that this is a big project, and that I will need to learn skills, and that this will take time. I'm fine with that.
I'm just trying to determine which skills I need to learn. I am trying to determine specifically which tools would be used for this type of project, so that I can learn how to use them. I am planning on starting out small as you said, and working towards the larger goal (I wasn't planning on just immediately trying it with 2.5m images without trying small batches first lol)
Also, as far as hardware, is there really no way that this could be done on a local machine (AMD 7970X CPU, 512 GB DDR5, and 2 x RTX 4090)? Given this hardware, what scale of training would be possible / in what kind of time period?
well you're in a bit of trouble.
hardware wise you're good to go including standard full finetuning (unet only). so thats about 80% of the way.
But your main issue will be:
A.) Guides <- lots of guides say different things, while saying "this is the best way". 95% of them, are in fact, geared towards ultra small datasets. Things change significantly once you hit the 400 & 3k dataset size mark.
B.) Captioning <- while there are tools that help automate this, you also gain their bias. (Example: cogvlm always saying "serene", or LLava often mixing up arm locations, which results in wrong arm anatomy if you train for too long on many images)
C.) Dataset management. Anything beyond 1k images needs a dedicated tool for dataset management. Currently there exists no true free or even paid dataset management software. You'll definitely find many if you google, but you'll also despair once you get more and more images...
basically, there exists "Dataset architecture" which is genuinely complicated. And this becomes unavoidable once you hit 100k
and you might think... well do I really need that? I can just autocaption everything and accept the bias from x or y tool.
Which would then result in your training not actually improving in quality beyond a certain level. Meaning you'd benefit more from a trimmed dataset that is well managed. <- downloading huge datasets of millions of images is easy... but this is the core reason why no one actually trains that large. its for the simple reason that improperly managed datasets only cost more to train, but dont infinitely improve quality nor adaptability
if you just want to start training, while being ok that this might be too big of a goal, I can heartily recommend:
• onetrainer <- for the easiest training of sdxl
• taggui <- for tagging of your images
• hydrus network <- for real (but very very painful) dataset management
Thanks so much for the recommendations! That's a great starting point 🙂 ... as far as dataset management, I'm very comfortable with Python and wrote the tools to scrape the images myself, and made sure to generate JSON metadata for all of the images (title, descriptions, keywords, names, etc) that makes it pretty easy to work with. Basically, the entire dataset is already tagged / organized quite well.
... I'm glad you brought up CogVLM because that is something I had actually been looking into. I have one line image descriptions (~100 chars / 10-15 words) and 5-10 tag words that I scraped along with each image, but I was considering using CogVLM to expand on these descriptions even more. But I am hearing what you're saying re: biasing the dataset. ... Maybe I could work on making a fine-tune of CogVLM first, and then work with it?
keep in mind the 77 token limit. its one of the core issues why we cant describe images better during training
Oh damn, I didn't realize the limit was that short ... that suuucks ... welp, I guess I won't bother spending the time with all the CogVLM stuff then, because my idea had been to try to generate a detailed paragraph describing each image to append to the end of the original one line description
one thing that works well for me, since I have complex custom tags for all my datasets, is to make a custom prompt for each image for cogagent. (or in your case, for llava 1.6 in order to support anatomy knowledge, as cogvlm is 100% sfw)
basically I retrieve the tags for the image, then use them to build a custom prompt like:
"This is an <artwork|photo|drawing|render> of <tracer|ahri|batman> from the show <gotham>. Caption the woman|man|animal, pose and background" while also limiting it to 77 tokens
obviously you'd extend this to make the most use of your existing information
this also helps you avoid most vlm issues, as you reduce hallucinations to an absolute minimum
this will give you natural language captions which dont work well for small dataset lora training, but really shines if you do 4k images or more
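The tag-to-prompt idea above could look roughly like this. It is a sketch, not the poster's actual tooling: the function names and parameters are hypothetical, and checking the 77-token limit properly would need the real CLIP tokenizer, which is left out here.

```python
def article_for(word):
    """Pick 'an' before a vowel sound (approximated by the first letter)."""
    return "an" if word[:1].lower() in "aeiou" else "a"


def join_focus(items):
    """Join items as 'a, b and c'."""
    items = list(items)
    if len(items) == 1:
        return items[0]
    return ", ".join(items[:-1]) + " and " + items[-1]


def build_caption_prompt(medium, subject, show=None,
                         focus=("woman", "pose", "background")):
    """Turn existing tags into a targeted instruction for a captioning VLM."""
    prompt = f"This is {article_for(medium)} {medium} of {subject}"
    if show:
        prompt += f" from the show {show}"
    return prompt + ". Caption the " + join_focus(focus)


if __name__ == "__main__":
    print(build_caption_prompt("artwork", "tracer"))
    print(build_caption_prompt("photo", "batman", show="gotham",
                               focus=("man", "pose", "background")))
```

Feeding the model facts you already know (medium, character, source) leaves it less room to guess, which is where the reduction in hallucinations comes from.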
Thanks so much for taking the time to explain all of that! ✨ I'm gonna go do some research into what you've told me so far, and mess around with some test batches and see what I can come up with.
what do you recommend for small datasets? Ive typically done wd tagger
<50 images
are we talking SD1.5, any of NAI (anime) based checkpoints?
or are we talking sdxl base, or sdxl ponyXL?
different answers depending on which one you're training
for me it would be XL, realistic
option A.) <trigger word>, ask cog to generate a short caption of everything you dont want your lora to learn.
option B.) since its under 50, <trigger word>, then manually keyword tag everything you dont want the model to learn. do like 1 ~2 word descriptions. <- then enable shuffle captions + keep 1 token. enable TE training.
never mention any word related to anatomy, like "neck", "boobs", "stomach", "hands", "arm", "feet" etc... <- these will always make your lora worse since you dont have enough examples for it to learn an actual improvement, meaning you will most likely cause a senseless offset, unless it is the actual concept you're training. (and even then, you'll probably have to rely on pure overfitting... as sub 400 images isnt enough to make an actual positive contribution to anatomy knowledge)
depending on situation make use of the mask feature in onetrainer
option b will usually give better results, since its more targeted
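For anyone wondering what "shuffle captions + keep 1 token" does to a caption in option B, here is a small illustration. It reflects my understanding that the "tokens" being kept are comma-separated tags rather than CLIP tokens; the function is a stand-in, not the trainer's own code.

```python
import random


def shuffle_caption(caption, keep_tokens=1, rng=None):
    """Shuffle comma-separated tags, leaving the first keep_tokens in place."""
    rng = rng or random.Random()
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    fixed, rest = tags[:keep_tokens], tags[keep_tokens:]
    rng.shuffle(rest)  # only the tags after the trigger word move around
    return ", ".join(fixed + rest)


if __name__ == "__main__":
    caption = "npcp, simple background, round ears, human, girl"
    for _ in range(3):
        print(shuffle_caption(caption, keep_tokens=1))
```

The trigger word stays at the front every step while the remaining tags change order, so the model does not tie the concept to any fixed tag position.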
cool, regarding anatomy knowledge, I've found some derived models that already have that trained in work somewhat better for nsfw type training...but of course those are easier to overfit or already are in some cases
true. but if done right, then you'll end up with a lora trained on base, which will work flawlessly on every non-foundational XL model out there
my lora (of a specific dnd character in my campaign, that I generated in dalle3) + base | dreamshaper turbo. taught just the face without messing up the hands XD (and no background bias or weird skin color offset)
trained on 8 images. works on every model except pony & other foundational ones
one of the 8 training images + mask I added
I have a question about my first attempt at full fine-tuning SDXL 1.0. Here's what I did:
Used 370 high-quality, advertisement-style text-image pairs with the kohya sd-script.
Set the batch size to 16 and the learning rate to 3e-4, leaving other parameters at default.
Observation during training:
Instead of gradually adapting the existing SDXL 1.0 outputs to fit the custom dataset style, the image generation process seemed to start from scratch. The images began in a distorted state and slowly formed over time.
Generated Images:
Below, you'll find the image generation results for a given prompt, captured every 2000 steps from 0 to 18,000 steps.
prompts:
A bottle of Paul Medison White Musk shampoo is prominently featured against a soft purple backdrop, complemented by an elegantly draped white chiffon fabric. The vibrant red bottle with white and black text stands out, highlighting the product's sophisticated appearance and suggesting a luxurious hair cleansing experience.
I'm new to the Discord community culture, and I want to ensure I'm respecting the rules and norms here. If my question is not appropriate for this community, please let me know, and I'll promptly delete it. Thank you.
the leftmost image is the ground truth image
the question is appropriate, though I must have missed what the question actually was. I saw your method and results
I'm curious about the typical process of full fine-tuning for SDXL. Is it normal for it not to gradually modify existing SDXL outputs to match the desired custom dataset style, but instead to start from a distorted state as shown in the attachment?
From other research, I've noticed that fine-tuning and quality-tuning often stop around 15K~30K steps. I'm wondering if it's okay to continue training beyond this point.
with the caveat that my understanding is very basic and high level, full finetuning is retraining all the parameters, and typically works better with thousands of images. dreambooth is more well suited for a smaller dataset. as for how many steps in general, it'll start to overtrain at some point and the results will degrade. It's difficult to say in advance where that will happen
Got it, thanks for the response.
How long would it take to finetune SD3 on one P100?
A.) we dont have access to SD3, so we know close to nothing, other than what the paper has told us
B.) that's not how that works. You can have all the compute in the world... what you need is a dataset and a good understanding of dataset architecture & captioning
C.) due to this being the official SAI server, talking about nsfw topics or how to circumvent censoring isnt allowed on this server.
Oh, sorry then
your goal cannot be achieved
Do you know anything about this? Can you point out if there's anything wrong? Are there only methods like LoRA or DreamBooth for fine-tuning the SDXL model?
sdxl isn't capable of generating fine typography like this. it will be a bajillion times simpler to composite the label on
it doesn't really even make sense as an application. you are only going to use like 5 creatives for an ad campaign, which would take less than an hour to make.
Thank you for the insightful comments. However, what I’m curious about is, I understand that models like LoRA or ControlNet, which freeze the backbone model and only train the adapters, maintain the capabilities of the backbone model from the beginning and gradually proceed with the generation process towards the style of the custom dataset. On the other hand, I’m wondering if, in the case of full fine-tuning, the original model’s capabilities are lost and the image generation starts off in a disrupted state from the beginning.
both reduce capabilities in the sense of text to image prompting generally.
thanks for the reply
Which is best finetuning SDXL vs Cascade when it comes to realism?
Hi,
you have to use different learning rates for Lora and Fine-Tuning. Your learning rate of 3e-4 is totally fine for Lora training, but WAY too high for full fine tune. That's why the model breaks in the beginning. Take the square of your Lora learning rate to obtain a proper full finetuning learning rate. In your case, instead of 3e-4 use its square 9e-8 (or simply 1e-7).
also you should always use a few warmup iterations. The AdamW optimizer is in an unstable state in the beginning and needs some time to adapt to the data. With a warmup of, say, 50 steps, you let the learning rate gradually increase to your desired value over the first 50 steps, giving the optimizer time to collect statistics
beyond that there is no huge difference between Lora and full finetuning (beyond Lora being more parameter efficient).
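The warmup described above can be sketched as a simple linear ramp. This is an illustration of the idea rather than any trainer's exact scheduler; the 50 steps and 1e-7 are the values from the messages above.

```python
def warmup_lr(step, target_lr, warmup_steps):
    """Linear warmup towards target_lr, then constant."""
    if warmup_steps <= 0:
        return target_lr
    return target_lr * min(1.0, (step + 1) / warmup_steps)


if __name__ == "__main__":
    for step in (0, 24, 49, 50, 1000):
        print(step, warmup_lr(step, target_lr=1e-7, warmup_steps=50))
```

The learning rate starts at a fiftieth of the target and reaches the full 1e-7 on the 50th step, after which it stays flat.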
Thank you so much for the answer. I was curious because the capability of the existing SDXL was so distorted that it was unrecognizable even before many steps of fine-tuning had passed, and I now understand that 3e-4 was a large value. As you suggested, I will try again with 1e-7. Thank you very much.
Nice to meet you, I'm Japanese.
Let me ask you a question.
I decided to give style learning a try, and I did so while referring to this site.
https://romptn.com/article/22757
The SD Ver is local environment WebUI 1.5, Windows 11-64, memory 16GB, GPU: Ge Force RTX 3060Ti 8GB, but even if I press the learning button, it stays in "standby" and no images are created. (This is the "Train 'embedding'" part of the above site)
I restarted my PC and closed unnecessary programs, but if there is any solution, please let me know.
*If possible, it would be easier to notice if you write in a reply.
Hello all. I'm working on training a ControlNet to remove furniture from furnished room photos. Wondering if anyone has done anything similar - but after training for about 5 days, it seems to have plateaued. I've posted details here in case anyone can help: https://github.com/lllyasviel/ControlNet/issues/659
it would help if you provide the console log.
So SD3 got captioned with CogVLM - is there a source for good captioning prompts (that detail the image, subjects, their clothes, pose etc. & also judges the image quality) ?
what you're trying to do is pretty complex. I'm actually impressed with the results you've achieved. No idea how to help ,but nice work
I wouldnt have thought SD would be a suitable application for that
somehow the AI needs to understand the difference between what constitutes the empty room and the "stuff"
Thank you for answering.
What is the console screen? Maybe you mean this black window?
yes
but the whole log please.
the whole text
copy paste it in a .txt. Then drop that file in here
Thank you, I understand.
(If you reply with a quote, you will receive a notification, so it will be easier to notice, so please help us)
Next time I teach the style, I will paste the log.
I have copied the text from the console screen, so please check it.
I think the last one, "AssertionError: Training models with lowvram not possible" is probably the cause, but even after searching, I don't know what to do.
| AssertionError: Training models with lowvram not possible
The final line here describes the issue.
Here's an explanation for why this is not working, on the PC you are currently using:
You are using a GeForce RTX 3060Ti 8GB.
8GB VRAM is not a lot when it comes to AI Image generation. It is basically the bare minimum to generate.
While generating, the program uses tricks to reduce the amount of VRAM that it needs, which allows you to generate.
For training, it needs all parts of the model loaded at the same time, in order to train it.
8GB is not enough to do this while using stable-diffusion-webui
There are ways to still do it on your pc, by using the program kohya or onetrainer. But it wont be easy nor very fun due to the issues you will face on an 8GB vram card.
16GB VRAM will let you do all kinds of trainings efficiently.
Otherwise, you can also use an online service to train on your dataset for you. (Civitai has a free version of this. Some other sites also provide such services.)
this needs a few more examples, using new pictures which weren't in your training dataset, to see what your model has learned so far.
There's a chance it's working. But there's also a chance that it learned to work only on your training dataset... which is another way to say, you've overfit it to hell and back.
(If however it is working as intended, then there are a lot of things which can be done to improve from your current situation.)
no one-fits-all solution sadly.
there are a few basic prompts, which will work on all images, at the cost of worse captions.
but you'll be much better off if you can segment your datasets into categories.
For example this is the prompt I used to generate captions of all images that have one woman in them: Caption the woman, pose and background
its given me the best results.
(Protip: use cogagent! better results than with cogvlm, barely more vram used)
Thank you for answering.
There just wasn't enough memory...😭
onetrainer
Is this it?
https://github.com/Nerogar/OneTrainer
It would be helpful if you could paste the URL of the free online service.
yes!
Just be careful with tutorials, as most of them assume that you have 12 or 16gb vram
thank you.
Either way, my computer doesn't have enough memory.
Does the "free version" you mentioned earlier use the same amount of memory?
https://rentry.org/59xed3
this tutorial is pretty accurate on most things.
use "AdaFactor" as that will work well on 8gb vram
and good luck!
Thank you, I'll try studying!
Interesting. I tend to use more complicated prompts than that... and for me CogVLM always performed better than CogAgent (which just got fine-tuned on Screenshots of UI's).
LOL, love the way you put it. for sure, i will try to post some samples from validation set.
in the case it hasn’t been overfit to hell and back, what are some things you think are worth trying?
Hi! is there a place like Civit where I could download captioned datasets for Dreambooth LoRA? How to create regularisation datasets?
Datasets are just images of whatever you want to train, which is anything under the sun. I know of no such place, and even if it existed, what are the odds that they would have exactly what you need? As for reg images, you generate them from the model you want to train on, using a class prompt (ie "a photo of a dog","a photo of a man", etc)
yes. kinda. but emphasis on "dataset" meaning all kinds of datasets... (like text, audio, images, video, etc...)
https://huggingface.co/datasets
but use the filter and find just what you need. 120000 datasets currently exist. common topics will be easy to find. niche topics require luck, or you can be the difference you want to see!
https://huggingface.co/datasets/ptx0/photo-concept-bucket/viewer/default/train?row=1
this one specifically, is good to use to build regularization sets. but its 500000 images, so you need to filter it down to the categories you want to create reg datasets for
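A minimal sketch of "filter it down to the categories you want", assuming the rows have already been loaded as plain dicts. The column name `caption` and the sample rows are hypothetical, so check the dataset viewer for the real field names before using anything like this.

```python
def filter_by_keywords(rows, keywords, caption_key="caption"):
    """Keep rows whose caption mentions at least one keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    kept = []
    for row in rows:
        caption = str(row.get(caption_key, "")).lower()
        if any(k in caption for k in wanted):
            kept.append(row)
    return kept


# Made-up sample rows, only to show the shape of the data.
rows = [
    {"caption": "a photo of a dog running on a beach"},
    {"caption": "a man in a suit standing in an office"},
    {"caption": "city skyline at night"},
]

dog_rows = filter_by_keywords(rows, ["dog"])
man_rows = filter_by_keywords(rows, ["man", "men"])
print(len(dog_rows), len(man_rows))
```

Plain substring matching is crude ("man" also matches "woman"), so a real filter would probably want word boundaries or a manual pass afterwards.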
I would like to try and replicate the green screen LoRA https://civitai.com/models/240019/green-screen, for gaining training experience, further model generalization, and own use cases.
The end goal would be to have a green screen LoRA to generate multiple characters, in any pose, on a green screen background.
For training I would rent GPU VMs on Google Cloud, perhaps with a budget of ~ $100 per month. I have for example a 16 GB VRAM instance (stopped atm), just for occasional testing generation/ port forwarding for local Web UIs etc. I do know a bit of Python.
From all the super cool discussions in this channel, I understand that I need to start from curating a good dataset, with good captioning. It should be feasible to collect enough green screen images from Shutterstock etc, in addition to generating from the mentioned LoRA. Based on a minimum of 400 (?) images, I would try and generate prompts with Cog-something. Then I should caption the images like grScr anime old man in coat standing, basically describing all features I don't want the model to learn.
Am I somehow on the right track? Based on my budget, how many images should I realistically aim for in my training dataset?
EDIT: most green screen stock photo is actually video. Could I split each video into its still images, with each image from the video given identical caption describing everything but the background, and use all of those images in the training data set? The background is "all" the model needs to learn, right?
Example: 2 seconds of 25 fps video of woman dancing in front of green screen. Split video to 50 images with caption gr_scr a woman in casual clothing dancing.
Too easy?
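A small sketch of the "same caption for every frame" idea, assuming the frames were already extracted into a folder as image files (the extraction itself is not shown). It writes one `.txt` caption next to each frame, which is the sidecar layout kohya-style trainers are commonly configured to read. The `every_nth` stride is my own addition, on the assumption that 50 near-identical frames may not teach much more than a handful. `write_frame_captions` is a made-up name.

```python
from pathlib import Path


def write_frame_captions(frame_dir, caption, every_nth=1, image_ext=".png"):
    """Write the same caption as a .txt sidecar for every n-th frame in frame_dir."""
    frames = sorted(Path(frame_dir).glob(f"*{image_ext}"))
    written = []
    for frame in frames[::every_nth]:
        caption_path = frame.with_suffix(".txt")
        caption_path.write_text(caption, encoding="utf-8")
        written.append(caption_path)
    return written


# Hypothetical usage:
# write_frame_captions("frames/dance_clip", "gr_scr a woman in casual clothing dancing", every_nth=5)
```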
what is the use case?
I think I'll go crazy for SD3 and use CogVLM to caption my images with several natural language tags for different areas (like subject, composition/lighting, etc.) - what do you think? Too much or a good idea for a Model with a 512 token limit?
@shut pike you might want to try out MoAI, too - it got pretty nice results in benchmarks (I posted info about it few days ago, on #1003207327203209236)
Thanks. I'll consider it... though at the moment I'm learning how to prompt CogVLM and getting better at it. (might be better than to swap Models every day)
What do you think about having several long natural language "tokens" (and even some descriptive old-school tokens) for SD3 fine-tuning?
yeah, figuring out good prompt might be really important - that will result in accurate description of both the content, as well as all the important visual aspects
just mind even the best of currently available VLMs can struggle with some cases, even simple ones:
Oh I am absolutely aware. I figured I have "all the time in the world" until Sd3 goes public and am using cogVLM as a starting point and iterating / manually curating from there.
I would never trust auto-tags.
so might be worth to figure out some safety system for that - probably best automated (for example ask both CogVLM and MoAI for the description, and then GPT-4 or Claude if output from both models describes the same image)
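A very cheap stand-in for that kind of cross-check: instead of asking GPT-4 or Claude to judge, this sketch only compares the word overlap (Jaccard similarity) of two captions and flags the image for manual review when they disagree too much. The threshold is an arbitrary assumption, and word overlap is a much weaker signal than an LLM judge.

```python
import re


def word_set(text):
    """Lowercased set of alphanumeric words in a caption."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def captions_agree(caption_a, caption_b, threshold=0.3):
    """True if the two captions share enough words to plausibly describe the same image."""
    a, b = word_set(caption_a), word_set(caption_b)
    if not a or not b:
        return False
    jaccard = len(a & b) / len(a | b)
    return jaccard >= threshold


same = captions_agree(
    "A black and white drawing of the number 3.",
    "a black and white drawing of the number 3",
)
different = captions_agree("a dog on a beach", "city skyline at night")
print(same, different)
```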
and if you want to do good finetuning don't forget about regularization - there are multiple models on CivitAI that suck terribly at that (turning males into females, making all faces look the same, etc. crap)
ask for a detailed description of the image. that's it.
you also need a 24GB GPU if you want to run CogVLM at 4bit, or get an 80GB GPU. it doesn't perform as well quantized.
the lowest tier VLM I currently use, is moondream. which runs on a toaster.
it answered:
A black and white drawing of the number 3.
cogvlm, with the prompt: caption this image
4bit + 1 beam goes down to around 14ish gb vram.
so 16gb vram cards can run it with minimal settings.
for a 24gb,48,or 80gb vram gpu, it still makes more sense to instead increase beams, rather than load it unquantized. (At least according to our testing on around 50k images)
noteworthy mention. this is specifically in regards to using cogvlm for captioning datasets. If you wanna talk with it, or iterate on a conversation... then yeah. 4bit is terrible
My use case is to generate characters and backgrounds separately, and blend them together in a video editing program like DaVinci Resolve. I believe this process is called chroma keying? I just got a hunch that it should work, but how good I don't know.
@hollow spruce I just tested moondream2 (https://huggingface.co/spaces/vikhyatk/moondream2) and indeed it handled digits test nicely. I had to change your prompt a bit to produce better captions for more complex images, but I like this model - it seems it might be ready for SD3 era (at least for simple use cases) - thx for sharing it 🙂
(phi-1.5 might be the limiting factor for more complex scenes. something similar to it, but based on phi-2 or T5-XXL + CLIP + OpenCLIP could increase compatibility with SD3 style prompting)
Hey everyone, I'm curious about something. If I use specific keywords and elements to train an SD Lora for creating images, and then later change up these keywords to design clothes, do you think the designs and elements on the clothes will come out consistently? Has anyone experimented with this kind of thing before?
Hi! It's my first time preparing a dataset for Kohya SS. This will be a photo style, for realistic portraits. The set includes photos from 750 x 1000 to 2500 x 3700. Please help me understand a few points:
- What resolution is optimal for Lora SDXL? Maximum quality is important to me. Obviously, for 750 x 1000 there needs to be an upscale, and 2500 x 3700 needs to be downscaled, but to what extent?
- Should the dimensions be multiples of 16/32/64? Does it matter?
- Is it worth even at x1 to increase the detail, clarity, and remove jpeg artifacts, for example, using SUPIR?
- Is it worth compressing with 100% quality using some jpeg optimizer? The dataset is large, 200+ photos, this may affect the speed of training.
you want to use layer diffusion in comfyui
and an sd 1.5 model. use the "joint" workflow
you have to use the workflows specific to clothing if you want to change outfits on models. everything else is kind of a waste of time
this is SOTA: https://github.com/levihsu/OOTDiffusion
fine tuning will generally cause outputs to be less creative, not more.
the underlying issue is that it's very expensive to produce captioning datasets that only and correctly describe what is visible in the scene and where, and nothing else
you cannot "see" unease and mystery
looking critically at how many words it uses that are not visible things, and how few actually do, it is not good at all at the task you need it to do
"but i don't know though"
Some models can see "unease" - here you have sample caption from llava-v1.6-34b (last paragraph):
In the image, there's a character with a futuristic appearance, seated in a contemplative pose on a rocky outcropping. The character is wearing a black body armor with pink lights that suggest technological functions. The armor's design is sleek and polished, with a headpiece that includes a visor and what appears to be a communication device.
The character is facing towards the right side of the image, where a large, towering structure looms in the distance. This structure is complex and appears to be a fusion of organic and mechanical elements, with tentacles extending outward. It stands against a backdrop of a turbulent sky, where dark clouds or perhaps an otherworldly atmosphere gathers.
The ground is rugged and strewn with debris, suggesting a place that has been through significant events or perhaps a battle. The entire scene is awash in a palette of dark, moody colors with accents of pink and purple, contributing to a somber and mysterious atmosphere.
The art style is detailed and realistic with a touch of surrealism, given the fantastical elements present. The image quality is high, with a texture that suggests a digital painting. The level of detail is impressive, from the individual strands of the character's hair to the intricate patterns on the armor.
The overall atmosphere of the image is one of solitude and introspection, with a sense of anticipation or unease. The juxtaposition of the character's calm demeanor with the chaotic and threatening environment creates a powerful visual narrative.
Prompt was
Please describe this image, so another person could imagine the same picture. Include all the relevant information about the content, artistic style, image quality, interesting visual aspects, and the general atmosphere of the image. Be accurate, and concise.
hm, but even the tiny moondream2 noticed the sense of unease in this image - I am not sure why you've said it cannot be seen
you can use words to describe the concrete things that are visible
and you can call that unease
and i'm trying to tell you that using the concrete visible words makes a useful caption
for these purposes, but also for the purposes of making visual art, and prompting
in my opinion this performs even worse from the point of view of the answer to the prompt
it gave you the opposite of concise
not really - it was as concise as it could to describe the things I've asked for
and the T5-XXL in SD3 should be able to comprehend long prompts like that
and with SD3 model trained on longer and more detailed captions from CogVLM I would expect SD3 to be able to generate visualizations of more complex and abstract ideas pretty well, too - we'll probably see in a month or so, when SAI will publish the weights and we'll be able to experiment with various parts of the model
if you wanted to use this for training, maybe a caption would be
a digital airbrush illustration of a rear wide angle shot of a skinny brunette woman wearing black spandex, black science fiction futuristic skinny body armor with pink energy coming from a window on the armor in the back; pink fringed neon lights on her armor; illuminated pink headphones with a pink band behind instead of on top of her head with short antenna; she is in a "thunderbolt" pose seated on her knees with her legs more apart than the traditional pose, and her legs appear fused into black webbing and tissue with an octopus tentacle tip going from her butt to the foreground bottom edge of the frame; the black tissue webbing is fused with red grass and branches of a large black tentacle form evocative of a tree, with black and purple branches above her at the top of the frame; a pink waxlike substance is melting from these branches like cables hanging. in the midground is a series of low mountaintop or wave crest forms behind and to the right of the woman; and in the upper right corner deep in the background is a platform superstructure building high in the sky, with curved cable forms as piers into the mountaintop/ocean elements. it is illuminated by small red lights; the structure appears to be a few large boxy platforms with darker greebles at the top, using a pink-purple-black palette. the sky is gesturally rendered clouds illuminated by a mostly set sun in the background, the atmosphere is light orange.
i think science fiction futuristic skinny body armor is not good
i would have to think of a better way to describe what she is wearing but honestly it's kind of vague in the image too
@foggy inlet do you see my point?
these captions are helpful because if you want the model to generalize from conditioning correctly, it has to recognize when "the same thing" appears in different contexts
so if you wanted to generate an image of something out of sample - technically you always want to do this
okay, let's say the model has never seen "spiderman" before
like that word has never been used
it would be nice if it had seen a lot of examples of different costumed people correctly captioned with the elements of the costume
merely saying "batman" does not help you at all tackle the problem of rendering "spiderman"
am i making sense?
you can't "see" batman either, in the same way you can't "see" a person who isn't a celebrity
batman is a name for a collection of real visual artifacts.
so listen, you can "see" unease, but it's not helpful for training or generating out of sample images, which is like, the problem you are trying to solve
take it or leave it, this is my professional opinion
you could ask the LLM to "describe the collection of concrete visual elements used to express the emotions in this text"
i don't know how llava will work with that, but that's what the goal would be
well I agree model should see "the same thing" in a lot of different context to understand it, but I am not sure if we'd need that kind of description like you provided for that.
imagine you'd be talking to an artist - would you describe him what you want with every tiny detail, or rather describe high level concept and composition, atmosphere style etc. and then let him do his job as an domain expert and figure out the details based on his vast experience?
i do this a lot professionally too and you and i both know that's a pretty complex question lol
i hear what you are saying
i don't think the model is creative, in the same way ChatGPT struggles to be creative
SD3 is getting there
they can only go so far with the resources they have. for all the things they set up in unreal engine, such as millions of images of different three object juxtapositions and placements, it doesn't help with five objects. the model is totally capable of correctly generating images with five objects, but the state of the art approach to this stuff is limited by the generalizability of the conditioning
i think this is also why dall-e3 has such a "look" whereas SDXL does not
yeah, but I've seen MJ responding pretty well to longer description even long time ago in v4 times - so it's pretty sure possible
it is "undertrained" but also less conditioned
and current alpha SD3 revisions can respond pretty well to poetry, too:
https://twitter.com/thibaudz/status/1768009402970263667
i think we have a similar experience
probably the multi-modal models will have the best chance of being a "pretrained" object used for conditioning in "SD4"
they're just so hard to make and train
someone will have to publish it first
lots of work
dall-e3 is going to be SOTA for a while. they have telemetry. they know what people prompted for dall-e2 and 1, and hence could generate training data in whatever to improve results for those prompts
sd3 does not
stability has no meaningful telemetry
true, but there are also new methods to increase speed of training, like BTX - stuff like that will help
time will tell, but midjourney and dall-e4 have better odds
i wonder why they didn't release a pixart based model
they also need to transition to pixel diffusion like IF
then BitNet 1.58 - but that needs quantization during training to work, if I recall correctly
lots of things that need to happen
yeah
i don't know if midjourney published on their model
i figure they use pixel diffusion because of how well it recreates scenes from movies
the movies it is trained on lol
you know that meme where it's like a woman sending an email "look it wrote a whole email from a bullet point" and the recipient is saying "look it turned a whole email into a bullet point"?
that's midjourney
progressive training is done in the realm of pixels, 256x256, then 512x512 etc. - why can't we do the same in the realm of complexity (simple tasks, medium difficulty tasks, hard tasks), like we train our own brains (no one asks toddlers to learn quantum physics, right?)
it's people saying something vague, and then they're really happy that midjourney is remixing scenes from famous movies
yeah i think all imagen family models are this idea in a nutshell
and wuerstchen too, just not pixels
there are also a lot of interesting papers flying around every day, more than I can read - I guess we'll soon need AIs to read all of this with comprehension (at least at a high level), and then help us pick the most promising ideas, the ones potentially giving the biggest improvements - maybe not GPT-4, but slightly smarter models like Claude 3 Opus or GPT-5, with properly working bigger contexts - those might be able to help us with such tasks, too 🙂
(or maybe something like that could do? model designed to support research papers analysis, published 4 days ago: https://arxiv.org/abs/2403.10301)
Optimal resolution is as close to 1 Megapixel as possible.
That could be 1024x1024, or 1153x768, or similar.
Kohya and onetrainer have this built in natively if you enable resizing + buckets. Then you don't need to worry about cropping or sizes in any way, as long as its over 1 Megapixel, as it will just be scaled down to optimal size
If you wanna go absolute quality, then do the resizing yourself via Photoshop automated process, and save as a png file, so that it's lossless. (that way you avoid jpg artifacts. It's a complicated topic... But the quality increase is incredibly small if you do all this extra work manually)
Better captioning and datasets are always more effective at increasing quality
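To make "as close to 1 Megapixel as possible" concrete, here is a small sketch that keeps the aspect ratio, targets roughly 1024x1024 pixels of area, and rounds each side down to a multiple of 64. This is not the exact bucketing code from Kohya or onetrainer, just the arithmetic behind the idea. The step of 64 and the function name `bucket_size` are assumptions.

```python
import math


def bucket_size(width, height, target_pixels=1024 * 1024, step=64):
    """Resolution near target_pixels with the source aspect ratio, sides rounded down to step."""
    aspect = width / height
    ideal_h = math.sqrt(target_pixels / aspect)
    ideal_w = ideal_h * aspect
    bucket_w = max(step, int(ideal_w // step) * step)
    bucket_h = max(step, int(ideal_h // step) * step)
    return bucket_w, bucket_h


for w, h in [(1024, 1024), (750, 1000), (2500, 3700)]:
    bw, bh = bucket_size(w, h)
    print(f"{w}x{h} -> {bw}x{bh} ({bw * bh / 1e6:.2f} MP)")
```

Both the small 750x1000 photos and the large 2500x3700 ones land on portrait sizes just under 1 MP, so the small ones get upscaled and the large ones downscaled to about the same working resolution.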
Interesting. Thank you! It turns out that it's not just a matter of resizing one side, but then I need to crop the other side. Pre-sort all the photos by their aspect ratio. Considering I have over 250 photos in my dataset right now, that sounds a bit complicated. I could try to do an automatic workflow in comfy though.
can you show me an example of the best work you've made so far, in a comfyui workflow for example, and then that made you say "I need a better photographic style"
heh, yah more photo realistic seems to be like an endless journey, doesnt it
anyone have a current up-to-date version of this but for local (instead of google collab)
have a pair of 4090 24G cards with 64GRam, i wanna give training an sdxl lora a try
got kohya setup, same machine i have comfy installed, seems like the cuda version is different
can kohya force the version comfy is using? or am i gonna need to fully upgrade everything to get kohya to run?
The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/cuda/lib64')}
DEBUG: Possible options found for libcudart.so: set()
CUDA SETUP: PyTorch settings found: CUDA_VERSION=118, Highest Compute Capability: 8.9.
CUDA SETUP: To manually override the PyTorch CUDA version please see:https://github.com/TimDettmers/bitsandbytes/blob/main/how_to_use_nonpytorch_cuda.md
CUDA SETUP: Loading binary /home/vender3d/kohya_ss/venv/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
libcusparse.so.11: cannot open shared object file: No such file or directory
CUDA SETUP: Problem: The main issue seems to be that the main CUDA runtime library was not detected.
CUDA SETUP: Solution 1: To solve the issue the libcudart.so location needs to be added to the LD_LIBRARY_PATH variable
CUDA SETUP: Solution 1a): Find the cuda runtime library via: find / -name
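For anyone hitting the same log: a small Python sketch of the "find the library, then add its folder to LD_LIBRARY_PATH" step that the message describes. The search roots in the usage comment are guesses and will differ per machine, and `find_libs` is a made-up helper, not part of bitsandbytes or kohya.

```python
from pathlib import Path


def find_libs(pattern, roots):
    """Recursively search the given root folders for files matching pattern."""
    hits = []
    for root in roots:
        root = Path(root)
        if root.is_dir():
            hits.extend(sorted(root.rglob(pattern)))
    return hits


# Hypothetical usage (roots are assumptions, adjust to your system):
# for lib in find_libs("libcudart.so*", ["/usr/local", "/usr/lib", "/opt"]):
#     print(f"export LD_LIBRARY_PATH={lib.parent}:$LD_LIBRARY_PATH")
```

The log above also complains about `libcusparse.so.11`, so the same search with that name would show whether any CUDA 11 libraries are installed at all.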
What do you mean? Just because XL doesn't need a portrait style and you can do a cool style just with a prompt? I agree. But only if we are talking about one portrait. I want to make 100+ for my project, of real people, with likeness, so I want to fix the style with a LoRA so that the whole hundred are consistent.
have you tried creating a diffused image with the likeness of a non-celebrity without using fine tuning?
you can simply follow train network from kohya with sdxl and prodigy. that's it. fill out your dataset
in terms of interacting with python and completing basic programming and IT tasks, it's better to ask chatgpt
Yes. InstantID with 1-3 photos. Not perfect but works. Its doable but takes effort, sometimes a lot of effort
okay so what specifically do you imagine will be improved?
you need a lot of photographs and processing per person if you want to recreate a non-celebrity likeness in SDXL flexibly. are these going to be photos of real people that you take on a stage? do you have hundreds of photos per person? did you caption it?
Hi guys, sdxl finetune question: I have 5000+ large high resolution images and I don't want to crop and lose part of the image, so I resized them to fit 1024×1024 and now I have a black border frame. Will this frame be in the final result, or will it be ignored in training?
Example
what is your goal?
the black border frame, all else being equal, will 100% appear in anything created by your fine tuning
My goal is training on the full image without losing any part, because the images are mostly tall and not square, and I will lose the body or the clothes if I crop them to 1024×1024. For now I resized all the images to fit in 1024×1024 but with black borders. Will buckets work if I resize only the height?
You don't need to make the images square. If you enable bucket ratios in kohya it will crop the images to the 1MP resolution, but they don't have to be square. I use xnviewmp to resize my images, setting the resolution to 1.05MP
Thank you 🙏 🌹
i meant what is the application
Kohya ss
what are you trying to train?
People in traditional clothes, Asians, Arabs, culture, buildings, cities. One of the images has dimensions 4014×5017
So i will try to resize the height to 1024 and let the buckets deal with width 🤔
👆
Hey guys, need some help
I'm training my stable diffusion model with Dreambooth. I have 100 images and I go for 20 steps per image, so I have 20600 steps in total and it tells me that it will take 50 hours. How can I decrease the time that the training will take? Is it better to get rid of some images or decrease the amount of steps? Thanks in advance!
I'm creating my own architecture model
what does that mean?
you mean architecture as in Architecture
designing buildings
share the output of nvidia-smi
8s/it? I'd think a 4090 would get like 1it/s or less and be able to chew through that in like 20 mins, but having said that, you have 10 epochs there, you can spit out a checkpoint every epoch and stop when it's done, ie. test each checkpoint as it's produced and if you're happy, you can stop the training
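The numbers above can be sanity-checked with a bit of arithmetic. This sketch assumes the usual "images x repeats x epochs / batch size" step count and a constant speed. The 8.74 s/it figure is simply 50 hours divided by the 20600 steps reported, and the 100 / 20 / 10 example uses the rounded numbers from the chat, so it does not reproduce 20600 exactly. Both function names are made up.

```python
import math


def total_steps(num_images, repeats, epochs, batch_size=1):
    """Optimizer steps for a run: steps per epoch (rounded up) times epochs."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


def estimated_hours(steps, seconds_per_it):
    """Wall-clock estimate assuming a constant speed."""
    return steps * seconds_per_it / 3600


print(total_steps(100, 20, 10))                   # 100 images, 20 repeats, 10 epochs
print(f"{estimated_hours(20600, 8.74):.1f} h")    # the reported run
print(f"{estimated_hours(20600 // 10, 8.74):.1f} h per epoch")
```

So cutting images, repeats or epochs all shrink the step count linearly, while fixing the s/it figure attacks the other factor.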
Im training a style lora for sdxl, which preset should i use Im a beginner T-T
Ram overflow. You're not using any optimization techniques, which is why you're out of vram, which is why it's taking 50 hours instead of 20 minutes
Try using kohya or onetrainer. They built in a lot of optimizations, so it will run a lot faster.
adafactor or adamw should be fine. and remember to follow best practices for style training; maybe don't even bother with captions until you know what you're doing
Hi everyone! I want to finetune stable diffusion but after loading its components separately from huggingface. I have loaded them separately and am using them for inference but I am struggling with how to finetune it now
as far as my intuition goes, i am supposed to freeze the vae and text encoder and train the unet?
This is the inference pipeline from their documentation
but how do i finetune this now
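A minimal sketch of that intuition (freeze the VAE and text encoder, train only the UNet), assuming PyTorch. Tiny `nn.Linear` layers stand in for the real components so the snippet runs without downloading anything. With components loaded separately from Hugging Face the same `requires_grad_(False)` calls should apply, but the noise-prediction loss and the data loop are left out here. `freeze_all_but_unet` is a made-up helper name and the learning rate is only a placeholder.

```python
import torch
from torch import nn


def freeze_all_but_unet(vae, text_encoder, unet):
    """Freeze VAE and text encoder; return the UNet parameters that will be trained."""
    vae.requires_grad_(False)
    text_encoder.requires_grad_(False)
    vae.eval()
    text_encoder.eval()
    unet.requires_grad_(True)
    unet.train()
    return [p for p in unet.parameters() if p.requires_grad]


# Stand-ins for the real modules (assumption: any nn.Module behaves the same way here).
vae = nn.Linear(4, 4)
text_encoder = nn.Linear(4, 4)
unet = nn.Linear(4, 4)

trainable = freeze_all_but_unet(vae, text_encoder, unet)
optimizer = torch.optim.AdamW(trainable, lr=1e-5)  # only UNet weights get updated
print(sum(p.numel() for p in trainable), "trainable parameters")
```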
google kohya or onetrainer.
thats the easiest way to get into it
Does that work for stable video diffusion?
community sentiment goes more towards animatediff, since its compatible with most existing workflows and app pipelines
check that out if you wanna help improve video generation.
if your goal is actually SVD specifically, then there are a few groups and companies that actively try it. It's not easy, nor plug & play. You probably won't find much help online, as it's too complicated to casually walk into, and it also requires significant hardware & compute to even make sense
I have the compute. I'm unable to find any guides, so I will post one as soon as I get it running. One thing I'm confused about is how different the input is to regular stable diffusion. I can't get the same results just calling SD for the 30 frames in the future I want, correct?
your best bet is to contact the few people that actively pursue SVD, and see if they can get you into any research groups
google "DragNUWA - svd". it's the only finetune I'm currently aware of, other than some github contributors
I mean I can just load it in huggingface and run a pipeline from stability's gen-AI, no? @hollow spruce
Any tips for epoch number / batch size / repeats? Could any of these settings be the reason I'm getting a json file instead of checkpoints or s_ ? I am training a small model on photos from a few projects, around 20. Any link with settings? I have plenty of RAM, 128 GB, latest M3 chip. Base model LoRA?
the json file is normal, every time you run training it'll output the config you used for that specific batch
if you want a beginner level kohya_ss tutorial I can link a decent one
try this: it's not super new, but it's very thorough https://youtu.be/xXNr9mrdV7s?si=xAfvABoxq1VzWndY
forgot to tag you
Thanks, I'll check… I mean I'm not getting any other output besides the json… no model found
Hi! I am trying to train Lora for the first time. I seem to have done everything according to the Aitrepreneur tutorial, but Kohya stops right at the beginning. Honestly, I don't even understand what line exactly is error here. Please help me understand what I'm doing wrong (there's a lot of repetition between the 2 and 3 screenshot).
you did enable the sdxl checkbox, right?
next to where you set the path to the base model
That was my guess too.
Yes, I already found the checkbox, but Kohya still stops
On my win11 I have CUDA 12.4 installed over 11.8 does this affect Kohya? Kohya was installed with no modifications clean latest version a week ago. It has its own CUDA inside and is not connected to the global or?
it does look like a cuda error tbh x_x
I'd first try onetrainer, see if that works. so you dont have to mess with your installs
assuming you're just getting started, use the included preset for sdxl lora. that one is nearly flawless for starting out
I ran options 2 and 4 in setup.bat one more time and now it works
so I wonder how well SDXL understands relativity. Say I'm training an IRL Homer, so I'm finding various men who look like Homer might if he were real.
In one photo, he looks like homer, but not as fat. Should I tag it "skinny" even though he's not really skinny? But he's skinny compared to Homer
you have to define what is essential about "an in real life homer," which is full of valid subjective judgements about the art that you have to express formally.
adding the skinny caption will improve the performance of the fine tuning. imagine there are two ways for the training process to "learn" how to generate your image:
- the caption closely resembles your image. this means that the image it generates "starts out" looking like your image early on. from then on, only small changes to the parameters are needed.
- there is no caption. the training process will make a lot of changes to a lot of parameters. if you don't have other training data that depends on those parameters, those changes will stick, along with whatever changes actually improve its ability to generate your particular image.
so under the theory that only a small subset of parameters are needed to improve generation of your images (true), poorly described captions will cause a "loss of generalization" aka lots of other spurious parameter changes are "kept" along with the small number of changes that actually improve performance.
as long as you are using captions that describe what you can concretely see, you will get good results.
when training styles, people omit the captions because they want a lot of changes to a lot of parameters.
a full fine tune versus lora fine tune also helps. a lora fine tuning has so many fewer parameters that the effect of having bad captions or no captions is diminished.
so the punchline is that there are concrete, scientific explanations for the behavior the community observes. "relativity" is more like, well can someone see skinny versus fat? i think so, so the conditioning that ships with sdxl (aka "CLIP" and the "conditional UNET") will correctly speed up training when you use those keywords
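To put a number on "so many fewer parameters": for a single weight matrix of shape d_out x d_in, a LoRA of rank r trains two small matrices (r x d_in and d_out x r) instead of the full matrix. The 768x768 layer and rank 8 below are only example values, not taken from any specific SDXL layer.

```python
def full_param_count(d_in, d_out):
    """Trainable weights when the whole matrix is fine-tuned."""
    return d_in * d_out


def lora_param_count(d_in, d_out, rank):
    """Trainable weights for a rank-r LoRA on the same matrix (A: r x d_in, B: d_out x r)."""
    return rank * (d_in + d_out)


d_in, d_out, rank = 768, 768, 8  # example values only
full = full_param_count(d_in, d_out)
lora = lora_param_count(d_in, d_out, rank)
print(f"full: {full}, lora: {lora}, ratio: {lora / full:.2%}")
```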
Hey guys! What is your Kohya training speed on a 4090? On Windows. Yesterday I was able to run Kohya and trained a couple of models for the first time, everything works ok but the speed... 2.30-2.50s/it on XL training, xformers, batch size 5. It's not okay, right?
sounds about right
I get about 5.5s/it using batch 8 + adamW (no normalizing, on windows with overhead)
there are a bunch of settings that make it go slightly faster or slower. but they're marginal. so your speed looks pretty normal
I was thinking of e.g.
"Homer sitting in Moe's_Tavern drinking beer wearing a pink shirt".
I don't tag "bald, beard, fat" because these elements are essential to Homer. However, if he had thick hair in that image, I might add "brown hair".
Are you saying tagging like that will screw the model?
tagging like what?
you are asking for a "brother, i "just" need an answer" answer, which is impossible
if your captions accurately describe what can be seen in the image, your model training will occur "faster" in the sense that the fine tuning will be able to create the images in your dataset with fewer iterations of backpropagation aka in less time
homer simpson is already in the sdxl training dataset
you are not teaching it a new concept
sdxl already knows what homer simpson specifically is. it might not know "the complete collection of concrete visual elements that make up homer simpson" are related to "homer simpson" 100%, but it might
I don't tag "bald, beard, fat" because these elements are essential to Homer.
CLIP already knows what "homer simpson" is. however the distance between bald, beard and fat to homer simpson is probably larger than you would assume. you could probably improve the speed at which the model trains by including bald beard and fat. but since you are using the word essential, i think you are still dancing around the hard task of deciding what subjectively defines "an in real life depiction of homer simpson"
if you spent 5 minutes writing down the concrete visual elements that describe an in real life homer simpson, you will be able to make much better captions
Hi !
Hope everyones fine.
So, I'm developing a diffusion model for a project that converts text inputs into image outputs (Text to layouts). The stable diffusion model seems to be the most suitable option for this task. My datasets consist of 4003, 256x256 images, each accompanied by detailed captions (Roughly 250 words) in text format. These datasets are hosted on Hugging Face : https://huggingface.co/datasets/jkanishkha0305/text-based-layout-generation-dataset.
However, during training (using the Keras implementation: https://keras.io/examples/generative/finetune_stable_diffusion/), the model hits an issue related to the CLIP embedding, specifically a "ValueError" due to a shape mismatch. The error message states: "Cannot assign value to variable 'clip_embedding_1/embedding_3/embeddings:0': Shape mismatch. The variable shape (1000, 768), and the assigned value shape (77, 768) are incompatible." I guess this problem arises because my captions are very detailed, containing roughly 250 words each.
Additionally, when attempting to train the model with a simpler dataset on platforms like Colab or Kaggle, I encounter "OOM" (Out Of Memory) issues, likely due to limited GPU memory (15GB).
I have additionally tried the method specified here: https://github.com/huggingface/diffusers/tree/main/examples/text_to_image. But it runs into "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 114.00 MiB. GPU 0 has a total capacity of 14.75 GiB of which 101.06 MiB is free." So can someone suggest better ways, please? Maybe I should go with a TPU v3?
I need assistance in resolving these issues. So any help or guidance related to fine tuning of the stable diffusion model using custom text (captions) with an image dataset would be greatly appreciated.
Thank you.
Should i go for SDXL Model instead or SD v1.5 good enough for my task ?
that's a CLIP issue, which can only deal with 77 tokens at max
also, I don't think that CLIP can deal with your images at all...
I would say a general diffusion model is just not the right tool for your task
your problem has nothing to do with image generation at all. You want to extract room geometries from a text prompt. So instead of making images, just make geometries like <roomtype, x, y, width, height>. So either train a custom model on top of a text foundation model or train an instruct model that generates these geometries from a text
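a minimal sketch of what that kind of target format could look like, assuming integer coordinates and a plain text serialization. the `Room` class, the function names and the example values are all hypothetical, made up for illustration; the point is only that a layout becomes a short structured string that a text model can be trained to emit and that can be parsed back.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Room:
    """One room of a layout (hypothetical schema: type plus an axis aligned box)."""
    roomtype: str
    x: int
    y: int
    width: int
    height: int


def rooms_to_text(rooms):
    """Serialize rooms as '<roomtype, x, y, width, height>' tokens, space separated."""
    return " ".join(
        f"<{r.roomtype}, {r.x}, {r.y}, {r.width}, {r.height}>" for r in rooms
    )


_ROOM = re.compile(
    r"<\s*([^,<>]+?)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*>"
)


def text_to_rooms(text):
    """Parse the serialized form back into Room objects, skipping malformed parts."""
    return [
        Room(m[1], int(m[2]), int(m[3]), int(m[4]), int(m[5]))
        for m in _ROOM.finditer(text)
    ]


# Made-up example layout.
layout = [Room("bedroom", 0, 0, 120, 100), Room("kitchen", 120, 0, 80, 100)]
print(rooms_to_text(layout))
```

the parsed rooms can then be drawn with any 2D drawing library, so no diffusion model is involved in producing the final picture.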
But baldness is intrinsic to Homer, isn't tagging bald redundant?
You don't tag in every photo "person, man, Homer, eyes, nose, mouth, lips, teeth, beard, chin, neck, shoulders, arms, hands, fingers" etc, right?
However, if he were missing an arm, I might tag "amputated" or whatever descriptive word for that the model already knows.
new lora thing for XL models
https://b-lora.github.io/B-LoRA/
Below seems very relevant for auto captioning efforts
https://github.com/IDEA-Research/T-Rex
You don't tag in every photo "person, man, Homer, eyes, nose, mouth, lips, teeth, beard, chin, neck, shoulders, arms, hands, fingers" etc, right?
you kind of should. it has nothing to do with being intrinsic or extrinsic to homer. it's that when you train on a dataset of billions of images from the public, and for hundreds to tens of thousands of epochs, there are going to be weights in the unet that are "reused" both for generating homer and for generating bald men with a complete set of organs and limbs. when you are training for some BS number of epochs on a vanishingly small number of parameters and very little data, you are hijacking preexisting stuff and "multiplying" it a little bit to bias the whole, complex process towards making copies of your image. if you want the process to find those multiplying coefficients faster, you should say person, man, homer, etc., because an image with those details is more likely to look like your goal per step of denoising.
You don't tag in every photo
no, the community doesn't tag all these mundane details in every photo. but I do.
you still haven't said what an IRL homer should look like
until you specify that you are just stabbing in the dark following guides
That's the logic of tagging "what is unique to this image"
Pangloss' method of tagging every feature is similar to how Anime models are trained. Every feature is mentioned. And while prompting is more work afterwards, they are undeniably accurate, and ignore less keywords, while also hallucinating less.
For all of my big datasets, I tag every damn feature. Due to that, they are consistently better at following prompts, and have significantly less token bleed in general.
For super small datasets, the most efficient way is to have a trigger word, and write a short sentence description of everything visible.
There are models where training becomes different, harder, or easier.
But for all base sdxl derivatives, this holds true
btw, training via 'whats unique' is a pretty easy and good way of training. But there's a definite "quality ceiling" you'll hit with it.
So its not wrong to train like that, but it is important, that once you want to achieve certain details, or being able to train multiple concepts into a single lora, you'll need to change your method of captioning, to one which has a higher quality ceiling
I want to see this in reverse, making cartoon characters real
Fat bald man with a bad combover, goatee,
I'm wanting to make a checkpoint more than a lora, ideally.
But I was just thinking, I want fat to bleed into Homer. If I tag every Homer pic with "fat", then when I prompt "Homer" won't I also be required to prompt "fat"?
I suppose after training the model/lora, I could also train a TI, in order to get all those Homer details without using so many tokens
Thank you so so much this was a really helpful suggestion. So my approach could be like text captions to <roomtype, x, y, width, height> and then use instruct model to generate layouts ryt ?
Also do you have any suggestions for instruct models ?
Using seq2seq approach for text to <roomtype, x, y, width, height> a good approach ?
I have now trained a couple of loras, but I always have the issue that I can't continue the training. What do I need to do to make sure I can continue training a lora?
I want to do a first few epochs on very basic captioning and then continue training with a set of very detailed captions
where did you find this?
I saw it in another discord
kohya & onetrainer both have the option to "continue" from an existing lora.
you do need to keep the core parameters like net rank the same though. but just changing captions shouldn't be an issue (remember to not save captions to disk! as you'll want them to be generated new, when you switch them up later)
(if you have a different folder, with the images again, just different captions, then saving captions to disk is fine though)
here is what it looks like in onetrainer
which one
does onetrainer resize a base model to the new lora rank?
so if it's rank 32 base, will it resize to rank 128 in that field?
or does it error out
And how exactly do we need to prepare datasets for the separation?
laion
thanks, I finally found the option and could make good use of it
After improving my training and tagging I finally took on the challenge to create a Loha that captures the essence of the wonderful and fantastic s...
good morning.
I tried to create a LORA from a photo, but I can't.
I can't understand the meaning even if I translate the command prompt text, so could you please explain it to me in a simple way?
Supplement.
When replying, you will receive a notification if you use a quote, so it will be easier to understand.
Which trainer are you using?
Kohya, one trainer or some other trainer?
realistically, you should have at least 10 photos from different angles. If you can only get 1, then go with face swap/controlnet vs a lora
and you can cheat, using controlnet to generate photos to train a lora too
Good morning, thank you for your reply.
I use this method when creating LORA.
https://www.youtube.com/watch?v=N1tXVR9lplM
🐸 This video explains how to create your own training data for use with the image generation AI "Stable Diffusion webUI AUTOMATIC1111" (local version). It is based on the specifications as of April 2023.
🐸 This video was made for technical research purposes.
It only explains how to install the software and its features, and is NOT intended as a recommendation to use it.
...
Thank you for answering.
A total of 13 photos of people were used: front, diagonal, side, and back.
I am installing it right now
I personally use kohya ss gui which is the front end of this
yes
thank you.
First, read the page description.🙂
Sorry, I have an additional question, but when using photos of real people, does it affect learning if the image size is too large? Please let me know if you have a suitable image size.
i'm sorry, i can't get it to work. all the python files just die on me the second i open them
though just try kohya ss gui, it's slightly better to use since it has an interface instead of being a command line tool
my apologies for not being able to help with your issue
the installation process should be similar enough, with the key difference that you launch the setup.bat file and afterwards the gui.bat file
No, don't worry about it.🙂
I'm still working on something else and haven't read the page description, but is kohya ss gui an extension of Stable Diffusion?
Or is it another application?
it is another application
I would suggest making a new empty folder on your home screen
and follow the instructions described on the github page
It was no good...
"During the installation process, ensure that you select the option to add Python to the 'PATH' environment variable."
At this point, I no longer understand.😫
Oh that’s an easy fix
Uninstall python and reinstall it make sure that when you install it you check the box on add python to PATH
thank you.
I am currently installing python3.10.6 on drive F of my computer using Stable Diffusion1.5.
I'm planning to put kohya ss on another drive, but do I still need to uninstall python3.10.6?
yes uninstall python and reinstall it.
the changes should all apply to each drive / disk on your computer
when you are reinstalling python make sure these boxes have been checked
thank you.
If Python is successfully reinstalled, can I use Stable Diffusion without any problems?
Yes
(Sorry I didn't notice your reply)
I had to migrate from 3.10.6 to 3.10.9 and I had no issues with stable diffusion or any other program that uses python
What matters is that python is added to PATH so you can run the venv and the pip install lines
hi
anyone has experience
with finetuning stable diffusion
and is willing to help me out with a project?
I only have experience with kohya ss gui doing lora training
https://github.com/bmaltais/kohya_ss
finetuning is similar to training loras but takes a lot longer and needs a lot more data (about 10-100x that of a lora dataset) from what I have heard
I suggest you start with training a few lora's before you move on to full finetuning
@celest pier
thank you.
I don't have much time on weekdays, so I'll try again on the weekend.
Hello there! I get a strange result when training a LoRA on top of SDXL. When I use it, images are oversaturated, with banding artifacts
I don't have this issue with SD 1.5
I tried playing with the lora's weight, with no luck
Is there anyone with an idea?
i have a question about this https://civitai.com/models/4201?modelVersionId=130072: are the no vae checkpoints ones that require no additional vae, or ones that contain no real vae and require the user to use a separate vae? and are there any vae files on the page? if so, which ones are the vaes and which ones are the checkpoints?
using kohya_ss? perhaps try one of the presets for sdxl. make sure you select sdxl on first lora page, and that your resolution is set to 1024x1024 vs 512x512. same for your sample images if you're doing those during the training, noticed this one you shared is a strange resolution
Yes with Kohya. My samples should be at 1024? And if my training image are 512 , is it ok? ( I shared a print screen I think)
sdxl does not generate very well at resolutions less than 1024x1024
Hi everyone, I am very new to stable diffusion and looking to fine tune a model so that I can do image to image style transfer for many images, with a consistent pencil sketch style. I've tried DreamBooth, but found that the style I created is not what I am looking for. Does anyone know of any resources that I can look into?
this is the script I found and ran:
https://colab.research.google.com/drive/1hMXWO1f9Q344XiixDirZAPqbEXT2pRGF?usp=sharing
dreambooth style training is what I'd recommend, but style training can be tricky. Maybe look at this reddit post as a reference? https://www.reddit.com/r/StableDiffusion/comments/14rcr7t/kohya_ui_settings_as_asked_stylecharacter_training/
forgot to tag you above
subject training you are describing the scene, the pose, etc, with style training it's normally the opposite, maybe try no captions except "ohwx style" or some custom keyword you choose
Thank you for the link! I'm trying dreambooth but I will look into configuring it differently
Is 1 epoch with 30 repeats equal to 30 epochs with 1 repeat?
Hello,
I have a quick question: Any tips on training a model on reflective objects, such as stainless steel bottles? I've captured the dataset myself, but the issue is that in all the data, reflections on the bottle are visible, and the model seems to be generating them after training
Mathematically yes, but there is a subtle difference, as a completed epoch describes the point at which all images in the dataset have been processed.
Thanks, so it should be the same, as I thought.
Epochs produce more trash on your drive, but you can better pick "best trained model/Lora" candidates. Find a nice in-between.
I am implementing my own training script. I could skip some epochs to save the disk.
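a small sketch of the epochs vs repeats point for a custom script. the function names are hypothetical and this is not how kohya does it internally; it only shows that the number of images seen is the same either way, while the number of epoch boundaries (the natural places to save a checkpoint) is different, and how saving only every Nth epoch keeps the disk usage down.

```python
def samples_seen(num_images: int, repeats: int, epochs: int) -> int:
    """Total images processed over the whole run."""
    return num_images * repeats * epochs


def checkpoint_epochs(total_epochs: int, save_every: int) -> list:
    """Epochs after which to save: every `save_every` epochs, plus the final one."""
    epochs = list(range(save_every, total_epochs + 1, save_every))
    if not epochs or epochs[-1] != total_epochs:
        epochs.append(total_epochs)
    return epochs


# 20 images is an arbitrary example dataset size.
print(samples_seen(20, repeats=30, epochs=1))   # one long epoch
print(samples_seen(20, repeats=1, epochs=30))   # thirty short epochs
print(checkpoint_epochs(30, save_every=10))     # only 3 files instead of 30
```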
22、
does anyone have any opinions on how to render dice? this seems exceptionally hard
Im late but couldn't you use controlnet to influence its shape and layout? Though you'd need a jpeg for the dice frame
hello, please help me not mess this up, because i'm in the middle of captioning 11k images of people
do i have to caption and describe only the subject and the clothing, with the background or without the background?
what about tagging, do i add the tags after describing the image, like: (man wearing cowboy hat in farm with big mustache, man, mustache, cowboy, 4k, sunset)
i want the model to be flexible and focus on the subject not the background, but with 11k images maybe it's ok to caption the background?
what is your goal?
oh yeah
i remember now
sophisticated users of diffusion models describe everything that is concretely visible in the scene, in unambiguous terms.
let's say you have this image:
worst: spiderman
bad: spiderman on top of a skyscraper in london
bad-okay: spiderman, squatting, high in the air. behind him is the big ben in london on a sunny day.
okay: spiderman, a man dressed in a red and black spiderman costume, squatting with legs outstretched on top of girder high in the air, medium closeup level angle, backgrounded by an in focus shot of the big ben in london far away down below
best:
spiderman: a fit young adult male wearing a skintight nylon costume fully covering his body. the costume is red all over with black sections: black from the waist to below the knees, from the bicep to the elbows on both arms, from the tricep to the base of the fingers on his lower arms, with a widely spaced black grid on the red areas evocative of a spider web; a black arachnid symbol in the center of the chest about 2 inches in diameter; and graphic eye paint about three times as large as ordinary human eyes that resemble upside down almond shapes, with a white fill and a thick black stroke and curved accents of the stroke at the distant corner of the eye evocative of spider compound eyes. he is posed with his right knee bent and his left leg bent and outstretched in a low squat, right arm forward, three quarters profile, and left arm out in the [insert the tai chi stance that they are using here, i don't know what it's called]. he is squatting on an architectural, cropped triangular element that resembles 14 inch steel pipes welded together at the top of skyscrapers. he and the architecture are composed on the right side of the frame. behind him deep in the distance is a shot of [the london neighborhood with the big ben], showing the big ben near the bottom of the frame, westminster, then [more description of the concrete visible city elements of london]. the sky is an overexposed bright haze on the left and some sunset illuminated clouds on the right, it appears to be near dusk on a cloudy day.
@neon quail do you see
this isn't practicable
but this is how you get Ideogram quality results
thank you this is really helpful. I spent six days, 12 hours every day, separating the images into folders and sorting them by men, women, clothes, then captioning them and manually editing the captions. now I'm on the final 1000 images and i got scared and confused by other captioning tutorials 🌹 🌹
yeah
focus on stuff that matters to you
ultimately if you have a bad learning rate for your text encoder, you're not even going to use the captions correctly
you should test on 5 images and see what kind of results you get first
and try to understand what all the parameters do
if you plan to use prodigy, you are wasting your time captioning
are you doing a full fine tune or lora fine tune?
i tested on 500, learning rate 0.00001, 40 repeats, 1 epoch, and the result is good
full fine tune
captions don't matter for full fine tune ?
is 10 epoch 1 repeats good for 11k images ? or should I go higher with epoch 🤖
I'm curious, what is it about prodigy that means captioning is waste of time compared to other optimizers?
Is the approach to captioning different for a lora vs ft? (assuming same number of images). Is it possible to train a lora with 11k images and get good results?
every training run is different
there's no generalizable advice
i get that there exists generalized advice but it doesn't mean it's correct, in even any case
i don't know if for @neon quail 's particular problem, if he "just" prompted better, or used "just" clipvision, he could achieve most of what he wants
Maybe I misunderstood your statement. Is there any difference with prodigy compared to other optimizers regarding captioning?
Hey everyone! so.. did kohya newest updates break lora training? all of a sudden now when I make new Loras, they do not work (same settings as before)
Anyone here who is experienced at LoRA training?
I am looking for anybody who can train and upload a Ponydiffusion LoRA with best parameters, using a high-quality and diverse synthetic dataset I will provide. (No need to credit me)
I'm curious too,I know prodigy is pretty aggressive. I've had terrible results with it on sdxl, which is probably a setting, but I never figured out which. It worked great on 1.5. @dusky urchin
I know that's not specific, basically it burns out too fast
Hello i'm trying to train a model with my mother's artworks, I have a small dataset (to start) of 52 artworks
I would like to let the model hallucinate without any prompts and let it generate new artworks based on the one it has been trained on
Could you help me?
🙏
What kind of art is it? I suppose you would have more luck with abstract art if you don't prompt.
If you can please give me some help to start, I use automatic1111 but can learn using ComfyUI as well
I had in mind to start from scratch and train fully a model from my dataset but i've been told that I could train a lora and use random weights
What I would try is make a Lora and caption every image with "art by {whateverName}". Then use that at inference as a prompt and check what happens.
so the prompt would be empty or "art by name"?
Art by name
you're talking about the caption for the embedding right?
Caption for each image for training the Lora, yes
okay i'll try that
how long is it to train a LORA generally? I have a dataset of 52 pictures to start from
What graphics card?
I would say not more than 20 min
If you have problems you can DM me.
thanks 
Does someone know the best way to do a fine tuning with art style. Is indeed using dreambooth?
what kind or regularization images should be used for a style?
you dont need them, imo
so is it just me or do LoRAs seem to overfit really easily on generated images?
well overfit is not bad necessarily, unless you mean like artifacts. when training a lora, sometimes it should be overfit, meaning that you don't want a bunch of variation, you want exactly one specific output
the other thing to consider is that some non-base checkpoints dont work well with certain loras. the reason everyone loves juggernaut so much is that in general, it's extremely forgiving to use with loras
having said that I downloaded a likeness lora from civit recently, and it had artifacts even at like .5 strength, so not sure what that creator was thinking when they uploaded it
hi guys, i spent almost the last 2 weeks trying to get kohya_ss up and running
i tried windows 10 as well as ubuntu 22, however i can't seem to get it fully working with the GPUs
does anyone have proper instructions, especially on driver versions, os versions and cuda versions?
i basically tried cuda 11.8, which apparently is almost uninstallable on ubuntu 22 by now because the 520 drivers won't build
with 535 i can't get any real speed, even though the gpu is utilized
what Ubuntu version do you recommend?
and what cuda and driver version?
you didnt even mention your card, but you might want to post in tech-support room in any case
what I mean is like... when training with generated images it seems a lot harder to get it to pin down the intended concepts/styles without learning every single thing in the images
I mean that's natural for training yes, and normally captioning everything besides the main concept works fine with non-generated images. but every time I've tried training on my own generated images, captioned or not it always learns minute details to the extent that every resulting LoRA looks like an overcrowded mess. the concept is usually learned perfectly fine, but... with everything else along with it
in the past I could always skirt around it by simply training with the model used to generate the dataset, so it only learned the difference. but right now I'm in a conundrum where my generated image dataset was made with a combo of multiple base models and LoRAs per each image, so that.. doesn't work too well
@jade hornet im on RTX4090
anyone recently installed kohyass on ubuntu 22 or ubuntu20?
it appears its impossible
my next install with cuda11.8 and 520 drivers failed with blank display
this is getting really ridiculous
anyone can help with this
Hey, I am trying to fine-tune SDXL with 30 sets of images, each set having a specific building, I am thinking of using Dreambooth for this. If I am not wrong I think for each set I have to fine-tune the model separately. So, I am thinking of fine-tuning the model with one set of building images and then saving the model and using the saved model to do the next set of images and so on. Do you think doing so will cause issues with the model? Is there a better way for doing this?
if i wanted to do a batch of images, can i set it up so that i can choose which images out of the batch to upscale (in the same workflow)?
for example, a batch of 5 images takes 20 seconds on the first step, but the next step of upscaling takes about 2 minutes per image.
i dont want to upscale all 5 images, but choose images before proceeding to the next step. is it possible to set the workflow to "pause" and wait for me to choose images for the next step?
i guess the alternative is to just do the batch, save the files you want, and upscale them in a different workflow?
sudo ubuntu-drivers install nvidia:525
in kohya_ss venv:
pip install torch==2.1.2+cu121 torchvision==0.16.2+cu121 xformers==0.0.23.post1 --extra-index-url https://download.pytorch.org/whl/cu121
i think that there is a general problem in kohya with multi gpu
i tested 4 version: 24.x.x. 22.x.x. and 21.x.x.
same machine same cuda same driver same gpus (4x4090)
kohya 21 takes 1:50, kohya 22 and 24 take ~28:00
its like 15 times slower
any ideas?
i tested under ubuntu22.04, ubuntu20.04 and windows 10. no matter what i do i can not get the speeds back to the speeds of 21
i tested also with cuda12 and cuda11
anyone got any ideas?
i even tested on 2 systems, one with 4 gpus and one with 2 gpus, one intel one amd
so im quite sure that everybody will run into this issue
tested gloo and nccl
I had no idea multi gpu was supported
I am training a LoRA in OneTrainer with 450 images, 150 epochs and batch size 2, and it uses 10.4 GB of the 12 GB of VRAM on my RTX 3060 12GB. The problem is that it is very slow. Is there a way to increase the batch size while staying within the 12 GB of VRAM, to make the training faster?
is it working? if so rejoice. why stress about the speed, you got a hot date coming up?
Is it normal that it takes 36 hours to do that training? It's SDXL, 450 images, 150 epochs, batch size 2, 12 GB VRAM and the prodigy optimizer
that seems long yah, I'd think a 3090 would get through that in about 30 mins. I often do training on a cloud gpu for that reason
no, that's normal. You train way too many epochs. In total you have 33750 steps that could take between 20-40 hours depending on your gpu
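the 33750 figure can be checked with a few lines. this is a rough sketch that assumes 1 repeat per image and that a partial last batch still counts as a step; the helper name is made up and the 36 hours comes from the question above.

```python
import math


def total_steps(num_images: int, repeats: int, epochs: int, batch_size: int) -> int:
    """Optimizer steps for a run, counting a partial final batch as one step."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs


steps = total_steps(num_images=450, repeats=1, epochs=150, batch_size=2)
seconds_per_step = 36 * 3600 / steps  # 36 hours reported for the whole run
print(steps, round(seconds_per_step, 2))
```

so 36 hours works out to a bit under 4 seconds per step, and cutting the epoch count is the most direct way to shorten the run.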
what are you trying to generate?
hmm what do you mean you tested under the different operating systems? what is your goal?
use kohya 21 since it at least works, and then use geohot's p2p patch for the 4090
don't waste your time on updates in kohya
a concept
can you be more specific?
It is a specific style of photography with an aesthetic atmosphere in the landscape and some people posing in the background in an aesthetic way.
hmm that is pretty vague
do you like the results you are getting?
The training is not over yet, there are 21 hours left
without any more details about what you are trying to do, it is hard to say how to make things go faster
images similar to this and sometimes people appear, they are not just landscapes
geohot's p2p patch seems interesting but im currently on windows
however this with nccl could give quite a speed bump.
can anyone share his settings for dreambooth finetune sdxl from scratch
especially with ppl in it
Final 1000!!
how many are the complete set?
lol that's pretty fun
can you remove the text? then, do unet only training.
Yes, that's what I did, I removed all the text from each image
has anyone here tried using SDPA instead of xformers for cross attention, in Kohya training? https://pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html?
Does anyone know if 225 is a hard limit for max token length for lora training? I've been wanting to try out natural language captions for lora training and some of mine are a fair bit over that limit. kohya_ss seems to cap it at 225, but I don't know if that's just a UI limit
75 is the clip limit (77 tokens once the start and end tokens are counted)
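how 75, 77 and 225 fit together, as a rough sketch: CLIP's text encoder takes 77 token windows, 2 of which are the start and end tokens, leaving 75 for the caption. a trainer can accept longer captions by cutting them into 75 token chunks and encoding each chunk as its own window, so 225 would be 3 chunks. the code below is an illustration of that idea under those assumptions, not kohya's actual implementation, and the start/end ids are treated as placeholders.

```python
# Assumed ids for CLIP's start and end tokens; treat them as placeholders.
BOS, EOS = 49406, 49407


def chunk_token_ids(token_ids, max_token_length=225, chunk_size=75):
    """Split caption token ids into 77 token windows (BOS + 75 tokens + EOS).

    Tokens beyond max_token_length are dropped; short chunks are padded with EOS.
    """
    ids = list(token_ids)[:max_token_length]
    windows = []
    for start in range(0, max(len(ids), 1), chunk_size):
        body = ids[start:start + chunk_size]
        padding = [EOS] * (chunk_size - len(body))
        windows.append([BOS] + body + [EOS] + padding)
    return windows


# A fake 300 token caption: only the first 225 tokens survive, as 3 windows.
windows = chunk_token_ids(range(300))
print(len(windows), [len(w) for w in windows])
```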
I have installed the following extensions in Stable Diffusion, but Lora is not generated.
https://github.com/liasece/sd-webui-train-tools
It is not generated as attached and [Ending job] is displayed on the command, but
On train-tools
There is no change even after an hour has passed since "A: 4.34 GB, R: 4.48 GB, Sys: 6.1/23.9883 GB (25.4%)" is displayed.
What I confirmed
・Stable Diffusion version change
(1.8.0→1.6.0→1.5.1→1.7.0)
・Can the Train base model be used to generate Lora unless it is adapted?
hi all. i've lost my patience with Kohya. in my images folder i have pngs with corresponding .txt files that match filenames (UTF-8, permissions are fine). for whatever reason when i begin training, the caption files show up as missing. anyone else experience this? any solutions? training goes fine otherwise.
can you be more specific? what error do you see
what is your goal?
In Terminal I'm getting this message: "WARNING No caption file found for 140 images. Training will continue without captions for these images. If class token exists, it will be used. (train_util.py:1459)" even though there are indeed caption files for each of the images living in that same folder with the same filenames.
what are you trying to train?
do you have the caption extension specified as ".txt"? the default setting is ".caption"
training a set of illustrations (testing really, I don't have faith the style will work). the illustrations are pngs accompanied by txt files.
GAH that's it. thank you so much!
I updated caption.py so the default is .txt but still not working. Must be hardcoded somewhere else
Updated this file too: "merge_captions_to_metadata.py" but still didn't fix the problem :/
are you not using the gui? There's a setting in the gui to choose which extension to use
it's also a command line arg if you're using sd-scripts I think
I believe caption.py is used for kohya_ss's caption-file generating tool, so it wouldn't affect training for imagesets with existing captions. I'd have to take a closer look to see what merge_captions_to_metadata does, but from a skim, I think it is used to help generate the output lora metadata, so not directly involved in training
it's best not to modify the repo files in general. Changes will make your local copy out of sync with the repo, possibly introducing bugs that are hard to diagnose
(the caption extension parameter is under the "Parameters" section with the name "Caption file extension". The command line arg is --caption_extension)
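a quick way to check this before starting a run, as a standalone sketch (it does not use any kohya code, and the function name and the list of image extensions are assumptions): list every image in the folder that has no caption file with the extension the trainer is configured to look for.

```python
from pathlib import Path

# Image types to check; an assumed list, extend it as needed.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}


def find_missing_captions(folder, caption_extension=".txt"):
    """Return names of images in `folder` lacking a same-named caption file."""
    missing = []
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in IMAGE_EXTS:
            continue
        caption = image.with_suffix(caption_extension)
        if not caption.is_file():
            missing.append(image.name)
    return missing
```

running it once with ".txt" and once with ".caption" on the same folder shows right away which extension the files actually match.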
can you share one?
and a corresponding caption
have you tried an online service for this? a LoRA training costs like $0.50
I would like to be able to generate Lora with train-tools.
But if it doesn't seem to work now, I'll consider another Lora generation tool.
Omg, can't believe I missed that. You're right, I see it now under Parameters in the GUI. Guess I should undo my file edits! Thank you so much for your help 💚
Solved! Can't share one bc it's for a work project. Trying to learn to do everything on my own for learning's sake. Otherwise I'd give up and use something online 🙂
you're welcome!
Hey everyone 😊. Does anyone know if I can use 300dpi image to train LoRa? Or can point to some documentation that goes over size and resolution? Thanks
what matters for lora training is resolution and quality. sd1.5 loras should be trained on images with at least 512x512 resolution, and 768x768 is often recommended. sdxl should be trained on 1024x1024. You can train on higher res images and it'll be fine (they'll be bucketed by aspect ratio and scaled down to hover around the training resolution), but lower res images can be bad for the output lora quality.
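a simplified sketch of what "bucketed by aspect ratio and scaled down" means. the real bucketing in trainers has more rules (a fixed list of buckets, min and max sizes, cropping), so this function is an assumption-heavy illustration only: it keeps the aspect ratio, shrinks the image until its area fits the training resolution, and rounds each side down to a multiple of 64. note that dpi never appears; only pixel dimensions matter.

```python
import math


def bucket_resolution(width, height, target=1024, step=64):
    """Approximate training size for an image: same aspect ratio, area capped
    at target*target, sides rounded down to multiples of `step`. Never upscales."""
    max_area = target * target
    aspect = width / height
    if width * height > max_area:
        w = math.sqrt(max_area * aspect)
        h = math.sqrt(max_area / aspect)
    else:
        w, h = width, height
    w = max(step, int(w) // step * step)
    h = max(step, int(h) // step * step)
    return w, h


for size in [(2048, 2048), (3000, 2000), (512, 512)]:
    print(size, "->", bucket_resolution(*size))
```

the last case shows why low res images are a problem: a 512x512 image stays at 512x512, well below what an sdxl run is aiming for.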
for documentation, I dunno if there's any one guide that's been helpful for everything, but there're a lot of videos and articles online covering some basics. https://www.reddit.com/r/StableDiffusion/comments/11vw5k3/lora_training_guide_version_3_i_go_more_indepth/ is a popular one, and https://rentry.co/59xed3 for a bit more in-depth details. https://civitai.com/articles/3522/valstrixs-crash-course-guide-to-lora-and-lycoris-training seems nice too
Stable Diffusion LoRA training science and notes
By yours truly, The Other LoRA Rentry Guy.
This is not a how to install guide, it is a guide about how to improve your results, describe what options do, and hints on how to train characters using bad or few images.
Due to the higher prevalence of...
does anyone have experience training with lion optimiser for style models?
anyone know if there's a trainer that allows multiple text captions for the same image? I'm experimenting with natural language captioning and the regular "keep tokens" and "shuffle caption" functionality won't work for that. I know you can just copy the image and write new captions for each image, but that messes with the repeats and adds more images to be cached
EDIT: nvm, figured out the option: sd-scripts has --enable_wildcard which looks like it can do this.
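For anyone wondering what "multiple captions for one image" amounts to, here is an illustrative sketch: keep several captions in one file, one per line, pick one line at random each time the image is used, and resolve `{a|b|c}` groups by picking one option. This is not sd-scripts code, and the exact syntax that --enable_wildcard accepts should be checked in the sd-scripts docs; `pick_caption` is a hypothetical helper.

```python
import random
import re


def pick_caption(caption_text, rng=random):
    """Pick one non-empty line at random, then resolve {a|b|c} groups."""
    lines = [ln.strip() for ln in caption_text.splitlines() if ln.strip()]
    if not lines:
        return ""
    line = rng.choice(lines)
    # Replace each {opt1|opt2|...} group with one randomly chosen option.
    return re.sub(
        r"\{([^{}]*)\}",
        lambda m: rng.choice(m.group(1).split("|")),
        line,
    )
```

The point of doing it this way rather than duplicating the image is that the image is cached once and the repeat count stays the same.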
stupid question when someone has a moment. is it possible to train simple 2D vector icons (after converting them to png)? i'm talking like, very simple as in graphic of a globe/smartphone/flower/envelope, on white/black backgrounds, etc. All in the same color palette and not very detailed. they're not truly "images" so i assume not...
I'm not sure what you mean by "not truly images". icon loras are out there though: https://civitai.com/models/49021/minimalist-icons
https://civitai.com/models/141066/game-icon
This is a Lora-model for creating minimalist icons. You will no longer have copyright issues. Just generate icons and use them! Trained on the Deli...
Prompt 2d icon. {your prompt}. lora:game_icon_v1.0:1 The shorter the prompt, the better. Use the "2d icon" modifier at the beginning. Then ...
I'm using Kohya, trying various types of LoRAs. So usually, the sample images are at best, a rough indication of how the training is going and a means to tell when the model is over fitted. We expect them to be pretty bad.
What do you do when the training samples look more like the target than anything you can generate in A1111 or Comfy?
Just to be clear, the training samples look rough, but they get the face right. She has an uncommon face, but the sample images look like bad pictures of her, while in A1111 and Comfy, the facial structure, lips, nose just isn't right. Like it's been overly normalized by the rest of the model.
Thanks. I've been wanting to try something like that.
I'd say double-check that the checkpoint you're generating on is the same as the one you're using the lora on (default training checkpoint is base 1.5). After that, check if the sampler, cfg scale, etc. on the sampler settings is the same as the ones you're using on automatic/comfy.
thanks, those are some great references. "not truly images" meaning they are just flat solid color 2d svgs. nothing really photorealistic about them. but seeing there are icon loras out there gives me some hope that it's possible!
Just more steps then, and if it burns out before it gets it right, you need to either slow down the LR or tweak your data set to have better angles
i've been reading some hero's thread about improving the loss function for fine tuning, which is definitely appearing to work well for my training
i don't think arbitrarily stopping the training is a good idea in general and it's been limiting the capability of fine tuning for a long time
I am trying to create a fine-tuned model based on SDXL (using either KohyaSS or Huggingface libraries). I have a captioned set of ~500k images in ~400 categories that I want to train on to create an initial checkpoint.
I have a couple of questions regarding how to prepare images as far as cropping/resizing the image dataset to prepare it for training:
All of my images are available in full 1MP resolution but in a variety of different aspect ratios (e.g. 1024x1024, 1280x720). Are square images generally best? Should I train on both the uncropped full version and a version that is cropped to square? For cropping to square, is it better to use random cropping, center cropping, or to use object segmentation and try to crop around important subjects?
Also, is it beneficial to train on smaller copies of images, so that the model is getting trained on generating the subject at all resolutions? For instance, if I have a 1024x1024 image, is there any benefit to also creating a 512x512 and 256x256 copy of this image and training on those too? I was thinking this might improve generation of the subject at lower resolutions, but was worried about it overfitting with repeated use of same image.
jesus chill tf out
Stop writing bibles here
lol, it's you that needs to chill my dude. You're spending too much time on twitter if you think writing a couple short paragraphs about a complex issue is a "bible"
Do something about it big guy
I don't need to do anything, you're just an angry little yapper with nothing better to do than be an annoyance to someone seeking info on a technical question. Get a life.
kohya_ss implements bucketing for a variety of aspect ratios, but you could also do the bucketing with your own implementation before feeding your dataset to the kohya_ss script. Different cropping strategies make very little difference. The default is center crop, and again, if you implement your own bucketing script, you could try different strategies.
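If you do roll your own preprocessing, the two simplest cropping strategies come down to choosing a crop window. A minimal sketch, with both function names made up for illustration; the returned (left, top, right, bottom) box is in the order most image libraries expect for a crop call.

```python
import random


def center_crop_box(width, height, crop_w, crop_h):
    """Return (left, top, right, bottom) of a centered crop window."""
    if crop_w > width or crop_h > height:
        raise ValueError("crop is larger than the image")
    left = (width - crop_w) // 2
    top = (height - crop_h) // 2
    return left, top, left + crop_w, top + crop_h


def random_crop_box(width, height, crop_w, crop_h, rng=random):
    """Return a crop window placed uniformly at random inside the image."""
    if crop_w > width or crop_h > height:
        raise ValueError("crop is larger than the image")
    left = rng.randint(0, width - crop_w)  # randint is inclusive on both ends
    top = rng.randint(0, height - crop_h)
    return left, top, left + crop_w, top + crop_h
```

For a 1280x720 image cropped to 720x720, the center crop drops 280 pixels from each side, so anything important near the left or right edge is lost; that is the main case where segmentation-based cropping might be worth the effort.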
I have an idea. I want to transfer the pose from the anime style to a realistic style. Both images are created by the same model with the same prompt but a different style. Does anyone have an idea how to achieve that via training?
Hi, I would like to learn to fine-tune a model to modify existing images. Where can I start to learn to do this?
You can train a pose using subject style captions, but why not just use controlnet with canny and openpose
I have done a training run with some img2img and the result is pretty good.
Hello. Are the stable diffusion models not available for fine-tuning through the API?
they should be, if you're into that it should be on modelslab or whatever. but if there's an issue, you dont need the api
Thanks for the reply. Do you mean that we don't need the api to fine tune the diffusion models?
correct
I found that we could fine tune the models through HuggingFace. But I am supposed to do it on a huge dataset. That is why I was looking for a solution where I don't have to write the code to do the distributed training and manage the necessary compute myself.
are you trying to say training on the noun project stuff?
what is your goal?
Hi guys, I am new here. Can anybody help me prevent text generation on images from stable diffusion?
tell me if we can finetune SDXL to not produce text on any image, because stable diffusion is very bad at producing text. Can we somehow stop sdxl from producing any kind of text on the generated image? Currently I am using a negative prompt
NEGATIVE PROMPT = "text, fonts,words,3d, cartoon, anime, (deformed eyes, nose, ears, nose), bad quality,bad anatomy, ugly"
but it does not listen to the negative prompt so well
i want to generate canva templates for christmas, halloween etc. with no text on them, but it always puts text with wrong spelling
Hello everyone. Can anyone suggest a feature with which I can create layers through AI, similar to photoshop?
I think I was able to install kohya_ss, but it seems like the version is wrong and the installation URL is displayed.
Which one should I download?
Please let me ask you an additional question.
After updating to Python 3.10.11, Stable Diffusion no longer starts...
I was prompted to "Press any key" on the command prompt screen, so I pressed the enter key, but the screen just closed and the SD did not start.
I will write down my machine specs.
Model number: ILeDEs-M07M-A134-SASXB
CPU: Intel Core i5-13400
Memory: 16GB
HDD: 8TB
graphic board:Ge Force RTX 3060Ti 8GB
If you receive a reply with a quote, you will receive a notification and it will be easier to notice.
Hello can anyone suggest same feature that Runway uses for Erase and replace (ai-tools/erase-and-replace) in Stable Diffusion sdxl? I have used inpainting but i cannot replicate the same through prompt which runway does.
Inpainting is the answer, despite it working differently, that's the workflow you would use
Iirc you just want to downsize it to say 1024x1024 along with the rest of the images if you desire the model to be SDXL, or use OG resolution tall, if you desire it to natively be higher res
Thank you 😊
Hi guys, can anyone help me? I want to finetune sdxl so it generates every image with a "solid color empty background" where I can put text in the future. I want stable diffusion to give me results like this :
I generated this image with this prompt "solid color background, christmas sales template,soft lightning,8k" but this type of prompt does not work if I want to make a father's day template or a halloween template, and I need this type of thing for every image generation, where I have room for the text on the image
is it possible to finetune sdxl to get this type of result? and one more thing, there should not be any text on any image.
would fine-tuning a lora be better or the other one?
is there a captioning tool or web based UI that anyone likes?
you do not need to train this. have you tried layered diffusion?
no i never tried layered diffusion
can you tell me what layered diffusion is?
is it a type of finetuning method?
did you try googling it?
also have you tried ideogram? what is this for?
seriously i don't even know ideogram, i am a new guy, i only know lora finetuning
what is your goal?
i mean why do you want to generate greeting cards?
@dusky urchin i want to generate greeting cards and put text on these types of templates so i can use them anywhere, like in my shop to display a christmas discount offer or anything
this ic_light model might change the game for synthetic dataset creation
So blip 3 just dropped, looks like another step up for captioning.
We introduce #BLIP3, a series of large multimodal models (LMMs) developed by Salesforce AI Research.
#BLIP3 is a new SOTA model under 5B on few-shot learning and multimodal benchmarks.
Check our first HF release at https://t.co/uyZmC33zak, and stay tuned for the coming technical…
I installed Forge, but the following problem occurs.
・“Error Connection errored out.” occurs frequently.
・I installed sd_xl_base_1.0.safetensors and Pony Diffusion V6 XL, but LORA does not appear (F:\webui_forge\webui\models\Lora)
I've been struggling for about a week now.
help me! (It will be easier to understand if you reply with a quote)
Who do I have to beg to try an implementation of this? https://twitter.com/rasbt/status/1758502685995589698
I noticed it months ago but haven't seen any support for it in training repos and don't have the skill to implement it myself
While everyone is talking about Sora, there's a potential successor to LoRA (low-rank adaptation) called DoRA. Here's a closer look at the "DoRA: Weight-Decomposed Low-Rank Adaptation" paper: https://t.co/Mmjhy3xTpd
LoRA is probably the most widely used parameter-efficient…
nvm it's actually been supported for a while I just missed it
Hi! Is it possible to pause and resume train in OneTrainer? I would like to pause the training and test the model in a real workflow and if it is not trained enough, then resume training from the same place
Hi everyone! I need some help fine tuning stable diffusion for a product, is there anyone that can help?? Appreciate you all!
I just added a standalone version of the greedy search bad caption detection script that my CogVLM captioning tool uses: https://github.com/ProGamerGov/VLM-Captioning-Tools/blob/main/bad_caption_finder.py
You can use the script to determine the extent of the issue in your own datasets!
Note that the greedy search caption failure issue is present in all automatic captioning tools to varying degrees, and it can impact up to 3% or more of your total dataset
For those who don't know what greedy search is, all you have to know is that the greedy search caption failure occurs when you come across captions with endlessly repeating letters, characters, phrases, and sentences. Greedy search is used in all VLMs and captioning models currently available
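To show what such a failure looks like in code terms, here is a simple sketch that flags captions where the same short word sequence repeats back to back several times. It is not the method used by the linked bad_caption_finder.py script; the function name and both thresholds are assumptions.

```python
def has_runaway_repetition(caption, max_ngram=4, min_repeats=4):
    """Return True if some n-gram (n <= max_ngram) occurs at least
    `min_repeats` times in a row, e.g. "a cat a cat a cat a cat"."""
    words = caption.split()
    for n in range(1, max_ngram + 1):
        # Only starts that leave room for min_repeats copies are worth checking.
        for start in range(len(words) - n * min_repeats + 1):
            gram = words[start:start + n]
            if all(
                words[start + k * n:start + (k + 1) * n] == gram
                for k in range(1, min_repeats)
            ):
                return True
    return False
```

A check like this only catches word-level loops; repeated single characters inside one long token would need a separate character-level pass.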
Hey! 🆘 I'm working on a project with stable diffusion Finetuning and ControlNet and need some HELP. If you're experienced with these, I’d appreciate your input. Thanks!
GPT4O is a lot less restrictive with what it's willing to caption
new cogvlm is out
https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B
Does anyone know a guide on dreambooth for artstyle? I want to do the fine-tuning with some real paintings and the trained model must match the artist's painting technique!
Does anyone know how to create exact variants of a particular image. For example, if I want to create exact variants of a shirt in this format:
When remixing, I have not been able to get the exact orientation of the template
Please PM me!
Maybe start with a blank one and use controlnet to generate the remixes? That'd be my first thought.
still has all the same issues with lack of batch inputs, CCP license, and others
give me a office task for employee working and decrease heat
I am curious, are there any good tutorials on how to fine tune a lora? I am in the process of redoing an old one and I want to fix it up, how do I go about this? It was trained on Kohya at basically the defaults, 22 images, no regularization images and at 4400 steps - I would like to know how to... fix it up so it looks a bit cleaner ((at the moment it looks quite whack)) -- Images used were large and clear
Determining what caused the training to not produce what you intended could come down to many things. Maybe a different training model. Maybe it just needed more steps. Maybe some of the photos are too far apart in concept and the AI was confused about what it was supposed to learn. Maybe your captions need to be tweaked so there is more guidance on the particular scenarios that you wish to produce later. There's probably not a guide you will find that tells you exactly how to improve your specific scenario
Hey, I was wondering if it is possible to train a lora to a person( full body, close ups, etc.) and if so, how to go about it and what checkpoint is better at realistic photography for this job.
1. yes it's possible 2. you would find a tutorial on kohya_ss, I can find you a decent one if you struggle 3. I'd say juggernaut is pretty good at people and realism and responds well to training
what is the max number of images I can use to train a style using dreambooth?
infinite?
I always train on sdxl_base. As long as you don't use any extremely fancy model you can use your base loras on other checkpoints, too. In particular with Juggernaut I had really bad experiences, training it didn't work at all. But maybe that changed with newer versions. At least old versions of juggernaut used some stupid pyramidal noise that fucked up any finetuning attempts
at some point for a single subject you may have diminishing returns, but if you just use supervised finetuning (labeled images without the "regularization" images proposed by dreambooth) you can train as many things at once as you want
Thanks! I separated some paintings into several 512x512 samples for the model to focus on the artist's technique, but I have like 700 samples to use in the finetuning
Hello everyone, can you suggest me a good model for open pose ?
Have you considered doing the 700 samples as a lora?
Hello, I have a problem with kohya_ss. I started making my first model today, I did everything as in the guide, and my model was ready in 3 seconds while in others it takes up to 9 hours
my monitor is small so sorry for this logs:(
i will try it but i will train on a good pc, so probably i will try to train using the 700 samples in normal way too
Hi, I'm using an ID-preserving pipeline with ControlNet (canny|pose) and SDXL for a use case that involves only one reference image for the pose and a face image to interpolate that face onto the pose. What is a good approach to fine-tune my SDXL or any other components in the pipeline to achieve better throughput? We are inferring on an image with the same prompt, so the model should generate only one use case but with a consistent pose.
i'm having trouble opening stable diffusion I keep getting this message
I was able to train an SD3 LoRA using the example code train_dreambooth_lora_sd3.py, but the generated file seems to be in diffusers format, not webui/kohya. In ComfyUI I get errors like:
lora key not loaded: transformer_transformer_blocks_6_attn_to_v.alpha
lora key not loaded: transformer_transformer_blocks_6_attn_to_v.lora_down.weight
lora key not loaded: transformer_transformer_blocks_6_attn_to_v.lora_up.weight
lora key not loaded: transformer_transformer_blocks_7_attn_to_k.alpha
lora key not loaded: transformer_transformer_blocks_7_attn_to_k.lora_down.weight
lora key not loaded: transformer_transformer_blocks_7_attn_to_k.lora_up.weight
So I tried convert_diffusers_sdxl_lora_to_webui.py but it doesn't seem to convert it properly. Any ideas or tips on how to make progress with that?
This model: https://civitai.com/models/512239/pixel-art-medium-128 seems to have the same issue - the file header looks similar to my generated file. I bet it was trained with the same method.
i started using kohya to finetune sdxl on a dataset of 343 images of my art. the resulting checkpoint kinda had the style i was looking for but lacked the details i'm hoping for, it seemed like it might require more training. since i only did 10 epochs, i'd like to train it with another 10 to see if that helps. to pick up where i left off, would i choose the first version of the model as the source model instead of sdxl base? if not, how would i go about training more on top of what i already did?
are there any newer tools for training SDXL lora's that are more simplistic than Kohya or Invoke-Training?
OneTrainer
was just reading about that one, will check it out more
thanks!
How are people training sd3 lora already can anyone tell me where they are training lora from
The diffusers github repo has a script
You will need a 3090/4090 to make a lora, and an A100 or H100 to do a full finetune
Is there any tuning sample dataset available?
Apart from those dog and other small ones. I would like to see a full scale one.
damn i'll have to learn how to do that
Hello everyone, is it possible to locally finetune an SDXL model on an RTX 4060 ti 8 gb GPU if the dataset has 300 images? If this is possible, then approximately how long will it take?
I dont think you can make a lora for sdxl in 8gb, let alone a full finetune
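(I'm replying a day late 😅 )
About SD3 LoRA: right now it's really just something to try out early. So far the only training method that has been released is training with diffusers, and that method has a lot of limitations, including only supporting training of a single concept. That said, we now have the SimpleTuner dev supporting SD3 LoRA training. The prerequisite is that you have a 3090 or a card with even more VRAM; the diffusers training method is still pretty weak.
There isn't much good documentation about SD3 in the Chinese-speaking community either. But if you really do have a 3090, you can run this doc (official Hugging Face) through Google Translate, or go to SimpleTuner ( https://github.com/bghira/SimpleTuner )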
https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_sd3.md
at this point you can. well kinda
OneTrainer added fused back pass support. it means the absolute minimum for actual sdxl finetuning went down to 6.3gb
its still a lot of work though, and takes quite literally forever to train. but just wanted to point out that its at least possible now
^ In onetrainer it's possible.
but its a lot of effort to set up correctly, and will prob take an entire day to train for each attempt. (like 12~24h)
So if you really really wanna do it. it's possible. but I would recommend 16gb vram to actually have fun with training, and to avoid time consuming frustrations due to vram limits
LoRA would be easier to train with your limits. and will take like 1~7h to train depending on your settings. will still be a lot of effort to get it working the first time though. (but after first time, settings barely change - so if it works, it works)
is it possible to use sdxl lora with sd3?
You can only use the clip from the lora
and btw is diffusers only cloud based? (newbie question)
also can you give me an article if there's one, or a quick guide on how to install it
Diffusers is a python library
so I use it inside a terminal right?
and the dataset required for making sd3 lora should be only about 5 images or we can add more like 50?
I find that lora will happily accept lots of images
alright thanks
How much do captions help with training lora? Are they important? Mostly for style not learning a new concept
always use captions describing the image. You can/should still use a small caption dropout
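In case "caption dropout" is unclear: the idea is that for a small fraction of training samples the caption is replaced with an empty string, so the model also sees the images without text guidance. A minimal sketch of that idea follows; the 5% rate is an assumed example value, not a recommendation from this thread, and `apply_caption_dropout` is a hypothetical helper rather than any trainer's API.

```python
import random


def apply_caption_dropout(caption, dropout_rate=0.05, rng=random):
    """Return an empty caption with probability `dropout_rate`,
    otherwise return the caption unchanged."""
    # rng.random() is in [0.0, 1.0), so rate 0 never drops and rate 1 always drops.
    if rng.random() < dropout_rate:
        return ""
    return caption
```

Trainers that support this usually expose it as a single rate setting, so in practice you set a number rather than write code.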
Meta FAIR is releasing several new research artifacts. Our hope is that the research community can use them to innovate, explore, and discover new ways to apply AI at scale.
Hey guys. Does it happen that the style from the dataset is not picked up in fine-tuning at low epochs (1-2 epochs)?
Hello everyone, can anyone suggest why my generated mask is not being applied on the left side of the image? I am keeping the dimensions at 1016 x 504 but this black patch on the left keeps coming up again and again
What if i go really crazy on the descriptions using gpt4o
Like super descriptive. Will this help or is a simple obvious description all you need
use the description you want to prompt with when using the model
How do I extract the parameters from a lora? like comfyui images have workflow
I trained a lora once and I can't remember its settings now
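One thing worth trying, sketched below with only the standard library: a .safetensors file starts with an 8-byte little-endian length followed by a JSON header, and that header may contain a `__metadata__` entry with free-form string values. Whether your trainer wrote its settings there is an assumption; kohya-based trainers commonly store them under keys starting with `ss_`, but check what your file actually contains. The function names are made up for this example.

```python
import json
import struct


def read_safetensors_header(path):
    """Read and parse the JSON header at the start of a .safetensors file."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned 64-bit little-endian header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def training_metadata(path):
    """Return the optional __metadata__ dict (empty if the file has none)."""
    return read_safetensors_header(path).get("__metadata__", {})


def tensor_names(path):
    """Return the names of the tensors stored in the file."""
    return [k for k in read_safetensors_header(path) if k != "__metadata__"]
```

If `training_metadata` comes back empty, the settings were not saved in the file and would have to be recovered from the trainer's own config or logs.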
i've been using blip2, which is pretty accurate and verbose ... i think it helps to have lengthy descriptions. I add my own tags in, as well as wd14 tags
Do these captioning models beat out llm vision models like gpt4o and claude 3.5 sonnet?
Also is this the blip-2 you are referring to: https://replicate.com/smartinezbragado/salesforce-blip2
For example, which would be a better caption and why? Afterwards I'll provide you with more images to show the kinds of images I'm wanting to train for
Llava-13b:
The image is a colorful illustration featuring a woman with pink hair, wearing a yellow raincoat and a frog hat. She appears to be staring at the viewer with a somewhat angry expression. The woman is also wearing a nose ring, adding to her unique appearance. The illustration is likely created using a digital medium, as it has a vibrant and detailed style. The combination of the frog hat, pink hair, and yellow raincoat gives the image a whimsical and quirky vibe.
Gpt4o:
The image depicts a person with vibrant, neon-pink hair, accentuating a striking and bold fashion statement. They wear a distinctive frog hat with large eyes, combined with pink-tinted glasses and a red clown-like nose, enhancing the unique and eclectic style. The individual dons a yellow and black jacket with a high collar, further contributing to the bold and modern aesthetic. The background is bright yellow with white Japanese characters, creating a vivid and eye-catching contrast. The overall style is reminiscent of modern digital illustration and graphic art, heavily influenced by Japanese pop culture and cyberpunk elements. The neon colors and street fashion sensibilities give the portrait a contemporary and edgy feel.
These are the type of images i want to tune for:
hey there is there any actual way right now to fine tune stable diffusion with 2 input images to give out one output image??
I want to combine some aspects of each input image to give an output if that makes sense
From personal experience, SD/SDXL doesn't understand a majority of the LLM caption junk.
What I would take from Llava, then Gpt4o.
colorful illustration, colorful, illustration, woman, pink hair, yellow raincoat, frog hat, looking at the viewer, angry expression, nose ring, vibrant, whimsical and quirky vibe.
-> whimsical and quirky vibe might be a terrible choice, as these prompts already have a pre-existing function.
vibrant hair, neon-pink hair, frog hat, pink-tinted glasses, pink glasses, yellow jacket, black jacket, black and yellow jacket, bright yellow background, digital illustration, cyberpunk, neon colors, street fashion, portrait
-> You can consider removing the "bright yellow background"
Depending on the images you have, if they're side-facing, consider making a copy of them & flipping them horizontally, to not create a bias
that's the one, it's also integrated into Kohya_SS utilities tab now ... so you just point that to the directory of images, and let it rip, and it will create caption files
gpt4o and claude 3.5 sonnet are the latest hotness ... but blip2 is pretty good at describing what is in the image, that's what it was designed for and it's optimized for doing that, so it may be faster ... whereas gpt4o and claude are general purpose. Also blip2 is free and already integrated into Kohya. I haven't compared them head to head, but I'm guessing they are similar ... i'm not sure if Claude or gpt4o use a separate vision model to ingest images
Thanks guys, i receive my training pc tmrw so im excited to hit the ground running with your tips
My process is: 1) use blip2 in Kohya_ss, then 2) I run exiftool from the command line to extract exif tags that I put in the images via lightroom; these are appended to the .txt caption file. Then 3) back to kohya_ss and run the wd_14 tagger with the append option, which puts the wd_14 tags on the end. This last one tends to produce some nsfw images, but also seems to pick up a lot of creative and artwork anime characters as well.
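When several tools append to the same caption file, duplicate tags can pile up. A small sketch of the merge step, assuming the sentence caption and each tool's tags are already available as Python strings and lists; `merge_caption` is a hypothetical helper and does not call blip2, exiftool, or the wd14 tagger.

```python
def merge_caption(base_caption, *tag_lists):
    """Join a sentence caption with tag lists from several sources,
    keeping first-seen order and skipping case-insensitive duplicates."""
    base = base_caption.strip()
    parts = [base] if base else []
    seen = {p.lower() for p in parts}
    for tags in tag_lists:
        for tag in tags:
            tag = tag.strip()
            if tag and tag.lower() not in seen:
                seen.add(tag.lower())
                parts.append(tag)
    return ", ".join(parts)
```

Keeping the sentence first and the tags after it matches the order described above; whether that order matters for training is not something this sketch settles.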
Also really helpful ... if you are training a lora of a single person, it helps to use the "starbyface" website ... this will give a random celebrity that looks a lot like your person. Include this celebrity doppelganger name in the caption file ... it will put your subject in more scenes and give more character.
I missed this post, but these captions seem far better than blip2, although it is only set to write a short caption by default; I haven't set it to write more than just one or two sentences. Keep an eye on your caption size limit for training.
specs?
Intel I914400k, nvidia rtx 4090 24gb vram, 64gb ram, 4t ssd
So pretty much the standard consumer build
Hi guys, i want to try out finetuning and i am following the following:
https://next.platform.stability.ai/docs/features/fine-tuning
however I get a ModuleNotFoundError: No module named 'stability_sdk.finetune' even after installing stability-sdk.
Is there another package that needs to be installed?
stability-sdk version is 0.8.6
if I increase my batch size in kohya_ss lora training from 1 to 4 on a 3090 will it affect the output quality of the lora
try it? I tend to run with a batch size of 4
in theory no it's just how many you're doing at once, some swear it makes it better. maybe there's more blending when you're doing them at the same time, hard to say.
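One concrete thing that does change with batch size is the number of optimizer updates for the same data. A back-of-the-envelope sketch, assuming the common convention that one step is one batch; the repeats and epochs values in the usage note are made-up numbers, and trainers that bucket by aspect ratio may count slightly differently.

```python
import math


def optimizer_steps(num_images, repeats, epochs, batch_size):
    """Estimate total optimizer steps when one step processes one batch."""
    samples_per_epoch = num_images * repeats
    batches_per_epoch = math.ceil(samples_per_epoch / batch_size)
    return batches_per_epoch * epochs
```

For example, 22 images with an assumed 20 repeats and 10 epochs is 4400 steps at batch size 1 and 1100 steps at batch size 4. The model sees the same images the same number of times, but gets a quarter as many weight updates, each averaged over 4 samples, which is one reason people sometimes adjust the learning rate along with the batch size.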
Hello everyone! I have a question and wonder if anybody knows how to do this. I am an architect and want to fine tune a model for a specific case study: a tower ruin where the designer is required to create a new design within it, like adding new structures to it (adaptive reuse). I have photos of the tower ruin and 300 images of design alternative renders for the tower (there was a competition and I took images from there). So the question is: can I fine tune stable diffusion so that it can generate new design solutions, keeping the site and ruins but adding new structures to them?
Which model would be best for it
@tired wind Hi, I saw your repo https://github.com/ratulrafsan/Comfyui-SAL-VTON but my question is, does this only work for upper body garments? and also, does it work only on women? thanks for your awesome work 🙂
@fading sail Hi, Thank you, all credits goes to the original author. My node is just a wrapper around their implementation.
https://openaccess.thecvf.com/content/CVPR2023/papers/Yan_Linking_Garment_With_Person_via_Semantically_Associated_Landmarks_for_Virtual_CVPR_2023_paper.pdf
Their paper only discusses finding landmarks on upper body garments. The dataset VTON HD (https://www.kaggle.com/datasets/marquis03/high-resolution-viton-zalando-dataset/data) is upper body only as well. So current implementation & model won't work for lower body garments,
And, it should work for both man and woman, as long as the model can detect the landmarks appropriately.
You might want to take a look at this as well
https://rlawjdghek.github.io/StableVITON/
ah thanks for replying, il check it out. thx 🙂
im mainly looking for something that does all the body parts, so like upper, lower or full dress, i think IDM VTON does it and the magicclothing one, but they seem to require a lot of VRAM, but il keep searching
Whats your guys workflow for captioning many images and making sure they are correct. What app do you guys use? And what features are your favorites?
Specifically to train loras and checkpoints
Cuz vision models often get things wrong
hey all, i'm attempting to train a sd(1.5) u-net from scratch, on a small (~2000 images) dataset that's not varied (specific subject).
my theory is i can use kohya_ss modified to re-initialize the weights at the start of the training loop to effectively reset the u-net.
technically this is working, but the output images aren't sensical yet. Wondering if there's anyone here who I can talk to, to explore this further.
I guess you would have to define many. If I was doing thousands, I would put them in categorized folders so I could tag them by similar groups and styles. I like wd14 convnext v2 even for realistic, and I tend to use either dataset tag editor or taggui. If I'm doing realistic, I get rid of certain tags like 1girl that are strictly danbooru
There are lots of attempts to make something completely automated, but forget that, dataset prep is where you should be spending the most time
I mean you cannot train anything reasonable with 2000 images. That not even the data is memorized might be because transformers are very unstable at the beginning and need a long training time and a low lr at the start
How hard it is for a beginner to create a pixelart model that will output similar results to this? I can get thousands of graphics in similar style and exact same resolution and format to train it.
Hello everyone,
Does anybody know any good SDXL checkpoint training guide?
We have a nice dataset of 5,000 images in a specific style, but we just can't find any tutorials and articles about checkpoint training (not LORA)
/imagane Real style, exterior, day, school building in long shot, a group of four 11-year-old children are standing on the far balcony, on the second floor by a railing, the children are talking excitedly. The first child is fair skinned with curly black hair and is wearing a red t-shirt and gray pants And talking to Solo, the second boy is tanned with straight blond hair he is wearing a green button up shirt and gray pants he is talking to the first boy, the third boy is light skinned with straight black hair he is wearing a light orange t-shirt and gray pants he is looking at Solo
The boy in the center: solo, tanned skin tone devilish blond wavy hair, blue t-shirt, short jeans and a gray school bag on his back
.
I just finished training my first lora. I am now left with 800 safetensor files. How do i test them all to see which is most efficient?
just test the last. If it's ok, boom.
Ok lol
Hello everyone, what should be the input prompt to extend the below images. I am trying to automate the extend feature so I require one specific prompt for the extension of both the images.
for patterned images, it's working absolutely fine. But for plain background images, it's adding some background which is not matching with the original image bg.
Prompt used: Generate creative background scene matching original image. Environmental scene, city life, dressing style, nature, building, non-living objects.
Neg prompt used: Blurry, bordered, zoomed, solid color, monotonic background, disfigured, human figure, living objects, gore, dead, hazy, dull.
Ty to anyone who helped answer my questions. My first lora came out great. I just posted in #✨|sdxl. Now I'm wanting to mix 2 models together, because I find myself having to make an image and then denoise that image with another model to get the result I want. Would merging the models save me from having to do this?
Imagine a sleek, modern laptop displaying a vibrant and futuristic website interface. The screen showcases innovative design elements like smooth animations, interactive features, and bold typography. In the background, there are creative tools scattered around, symbolizing the process of crafting cutting-edge digital experiences. The scene conveys a sense of forward-thinking design, blending creativity with technology to shape the future of web development.
what is the prompt you gave to Llava and gpt4?
anyone here has information about fine-tuning? I have no idea how much more/less data it needs. the example I had from dreambooth dataset was like 10 images per object/class.
I've never done a checkpoint before, but Perplexity said about 200 images. Anything less should be a lora
Hi, I'm fine-tuning my own style (with a fixed pose and background) using about 30K images. These images were generated using the SDXL model. Now, I want to fine-tune the SD15 model to adapt to that style for better performance.
All of my samples in this dataset use the same prompts. However, the results after one epoch are bad. The model seems too vibrant. I don't know if this is due to my dataset preparation (one prompt for all) or something else. Has any developer struggled with the same issue?
I need help setting up a workflow to train a Lora. I can get the text description files to generate but It will not generate the actual Lora file at the end. Is anyone familiar with doing this?
Hi everyone
I'm trying to train a Canny ControlNet to generate clothing images. My training set (100K images) contains only clothing items, but at inference SD 1.5 draws the clothes along with humans. I'm using HF diffusers to train my ControlNet.
- Is it normal for SD 1.5 to generate humans even if there are none in the training set?
- Currently I am not using any negative embedding. Can a negative embedding help remove the unwanted humans?
Hey friends! I'm looking for paid lessons from someone with experience in training masked multi-resolution SDXL fine-tuning using OneTrainer.
I've been attempting to create a photo-realistic fine-tune to generate images of Miranda Kerr, the Australian model. For my base model, I used epicrealismXL_v7FinalDestination. This model can produce good pictures at both 1024x1024 and 1024x1280 resolutions. I thought that training on a mix of images in these resolutions might improve the fine-tune, but the results have been disappointing, indicating I might be missing something crucial.
Here's what I did:
- I prepared 20 images at 1024x1024 resolution (half face close-ups and half-body shots) and 38 images at 1024x1280 resolution (a mix of half-body and full-body shots).
- All images were masked with SAM masking, and captions were generated using "WD14 VIT v2," with "mirandakerr" as the first word.
- For the first training concept, which used only 1024x1024 images, I set "Resolution Override" to "1024x1024". For the second concept, which used only 1024x1280 images, I set "Resolution Override" to "1024x1280".
- I enabled "Aspect Ratio Bucketing" during training, hoping for at least decent results.
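For anyone trying to picture what aspect ratio bucketing does with a two-resolution dataset like this one, here is a minimal sketch. It is not OneTrainer's actual implementation; the bucket list and the function name `assign_bucket` are made up for illustration, and resolutions are assumed to be (width, height).

```python
import math

# Hypothetical bucket list: the two training resolutions from the post,
# written as (width, height). A real trainer usually generates many more.
BUCKETS = [(1024, 1024), (1024, 1280)]


def assign_bucket(width, height, buckets=BUCKETS):
    """Return the bucket whose aspect ratio is closest to the image's.

    Ratios are compared in log space so that "twice as wide" and
    "twice as tall" count as equally far from square.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    log_ratio = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - log_ratio))


if __name__ == "__main__":
    # A square source image and a 4:5 portrait source image.
    print(assign_bucket(2000, 2000))
    print(assign_bucket(800, 1000))
```

Each batch is then drawn from a single bucket, so every image in the batch shares one resolution after resizing and cropping.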
Despite this, the output quality is poor. Starting from epoch 1, I see degraded quality, and by epoch 5, the quality resembles a painting style (really bad).
The dataset used:
- Concept 1 (20 masked images): 1024x1024 dataset: Google Drive Link
- Concept 2 (38 masked images): 1024x1280 dataset: Google Drive Link
In total, I trained with 58 masked images per epoch.
I tried training for 100 epochs, and it can be seen that after the first 5 epochs, the quality degrades and stays at the degraded level (paint-like style). I'm attaching 4 samples of 1024x1024 resolution (epochs: 0, 5, 50, 100) and 4 samples of 1024x1280 resolution (same epochs: 0, 5, 50, 100). It is clear that the image quality at epoch 0 is good, but anything after the first epoch is unusable due to the really bad quality. Learning rate used "3e-6". Learning rate "2e-6" gives similar results (but slower training).
I'm also attaching the training preset "Tuned SDXL FineTune BFloat16 3.json." I have 16GB VRAM and used ADAFACTOR with the bfloat16 data type. However, I can rent a better machine if you believe the results will be better on a bigger machine.
Note, "epicrealismXL_v7FinalDestination" can already produce "Miranda Kerr". However, I'm trying to teach the model to recognize her using the unique trigger word "mirandakerr".
In fact, I'm not interested in training this specific model and I'm not interested in the final checkpoint. My goal is to learn how to train photo-realistic images of real people.
Currently, I'm concerned about the poor quality (fried skin).
The reason for using masked training is that I need to train the model on photo studio face shoots where the background is always white. In such cases, I don't want the model to learn to reproduce a white background in all photos.
In the training dataset, you may find black bars on the sides - these were created to match the training resolution. However, these black bars are masked and thus not included in the training.
I have never used Kohya_ss, so I can't say if it gives better results, but I think I will experiment with Kohya_ss as well because currently, I'm a bit stuck with OneTrainer.
I would really appreciate any help or advice on effectively training a photo-realistic model using OneTrainer.
Disclaimer: "Voplica" is the project where I'm conducting my experiments. The final goal is to create a service for photo-realistic model training. I am personally a software developer. All content at Voplica is AI-generated.
Looking forward to finding a teacher and potential partner.
Thank you!
Does anyone have advice for Logo prompts and how high the CFG should be?
Been using Redmonds Logo Lora
with prompts like this
logo, claw reaching out of the screen (animal paw with claws), neon claw, 3d realistic <lora:LogoRedmondV2-Logo-LogoRedmAF:1>
but i get mostly shit
using fp8 to not run out of vram idk if thats an issue
and sdxl
Hey all! Been a while since I did any finetuning, so I was wondering if you know any great guides on finetuning and dataset building (mostly how to make a quality dataset). Thanks for your help!
are there any good guides, like not tutorials, just in depth information regarding the building and structuring of a large dataset of thousands or 100s of thousands of images
Hey all, asked this question over in tech support and we could not land on an answer... Maybe you guys can figure it out..
I am training a lora and saving samples during the process. The samples show very clear influence of the training data, so the lora is for sure learning from the training. Fast forward to using the lora and no matter what I do, I can't get the image output with the lora to differ at all from the image generated with the same prompt/seed without the lora.
Using other loras, this is not the case, it only seems to impact the ones I have trained. I see the lora listed in the prompt details so A1111 is picking up the lora is there but it just seems to have zero effect.
Not sure where else to look at this point. Any thoughts or advice is greatly appreciated.
Could be that you are training the Lora on a different type of base model than the one you are using in A1111. For example, if you are training on SD1.5 but using SDXL or vice versa.
Does the meta data of the images you generate acknowledge the Lora? Does the console record the Lora loading successfully or does it throw errors? does the size of the Lora seem right from your training parameters?
I am training on the same model I am using to produce the images, both 1.5 model.
Yes, the metadata shows a "Lora hashes:" section that is populated with the lora I trained and a hash. I don't see any errors in the console loading the lora. The lora is just over 2Gb.
Update: The position is now closed. Thank you everyone who applied ❤️
Leaving the below information for reference. This position is now filled, but we will notify when we have new positions. Moreover, we are always open for collaboration, partnerships, or just meet other tech enthusiasts in this field.
Voplica is a startup specializing in AI model training, image generation, and image enhancement. We are seeking a Python Engineer with strong expertise in Stable Diffusion (SD) models for a full-time position.
Job Responsibilities:
- Develop automated model training processes that require minimal human oversight.
- Develop pipelines for dataset preparation, including cropping, masking, and image analysis using various models.
- Create inference processes utilizing fine-tuned Stable Diffusion models.
- Build pipelines for image restoration and enhancement, such as upscaling and fixing details in hands, feet, and faces.
- Contribute to the kohya_ss/sd_scripts project by automating parameter tuning and improving model training processes. This includes enhancing training on non-diverse datasets, such as masked training for photo studio datasets with consistent backgrounds.
- Quickly learn and integrate new AI technologies into our projects.
Qualifications:
- Strong expertise in Python programming.
- Extensive experience with Stable Diffusion models.
- Knowledge of data structures and algorithms (Big O notation, space and time optimization, hash maps, trees, heaps).
- Preferred: Experience with SDXL hyperparameters tuning, learning rate loss analysis using TensorBoard.
- Preferred: Knowledge of microservices architecture, developing Python workers for job queues (RabbitMQ, Kafka), experience with S3 storage or other object storages (OpenStack Swift, Ceph, etc.), gRPC.
- Preferred: Experience with context switching optimizations for LLMs.
If you are passionate about advancing AI technology and improving image processing capabilities, we would love to hear from you.
Best regards,
Alex
Hey friends,
Some of you have sent me DMs here, but your privacy settings don't allow me to reply or send you a friend request. Please ensure your privacy settings are adjusted, or send me an email instead.
Best regards,
Alex
probably an issue when the lora is saved
either you messed up a setting, in how it is saved in kohya/onetrainer/diffusers
or your venv/torch got messed up.
I'd recommend for you to check the settings for how your lora is saved, and if you dont see any obvious issues, then wipe your current venv (virtual python environment), then let it install fresh, then try again.
nop 😦
but based on the large scale finetuners (not companies) I've met, they all fall into 3 camps:
Amazon S3 Buckets. Example: https://huggingface.co/datasets/ptx0/photo-concept-bucket which is then precomputed and hosted on S3 and loaded via a json file
JSON only Datasets: https://huggingface.co/datasets/CaptionEmporium/furry-e621-sfw-7m-hq which include multiple captions for each image, enabling multi-caption style training (very effective, but also costly to train)
For complete local storage, but datasets of above 100k images, you either do the lots of folders + json solution, or go the hydrus network route. Both are equally painful.
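A rough sketch of the "lots of folders + json" option, for reference. This is only one possible layout, not what any of the linked datasets actually use: it assumes each image has an optional sidecar `.txt` file with one caption per line, and the names `build_manifest` and `write_manifest` are hypothetical.

```python
import json
from pathlib import Path

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}


def build_manifest(root):
    """Walk `root` and pair every image with the captions in its sidecar .txt."""
    root = Path(root)
    entries = []
    for image in sorted(root.rglob("*")):
        if not image.is_file() or image.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        caption_file = image.with_suffix(".txt")
        captions = []
        if caption_file.exists():
            # One caption per non-empty line allows multi-caption training.
            text = caption_file.read_text(encoding="utf-8")
            captions = [line.strip() for line in text.splitlines() if line.strip()]
        entries.append(
            {"image": image.relative_to(root).as_posix(), "captions": captions}
        )
    return entries


def write_manifest(root, out_path):
    """Write one JSON object per line and return the number of entries."""
    entries = build_manifest(root)
    with open(out_path, "w", encoding="utf-8") as fh:
        for entry in entries:
            fh.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return len(entries)
```

Keeping paths relative to the dataset root means the same manifest works whether the images live on a local disk or get synced to a bucket later.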
If you were talking about the balancing act of said dataset, then there's just no agreed upon solution.
and most companies are just winging it (terribly) - which is why we have these extreme biases to begin with.
another victim of SECourses/Furkan? 🥲
You've prob already found someone to help you.
But just wanted to point out key things:
• Training photorealistic is very very different from general training. You actually wanna avoid training things like details, but instead the concept of how a person looks.
Example I trained using 9 images, to make a point of this: https://civitai.com/models/349773/oc-aline
• a "full finetune" opens up more possibilities for training, but that's not always a good thing. Unless you at least roughly know what you're doing, a full finetune will often do more harm than good. A simple lora training in 10~30 minutes will often do the trick just fine.
• Experiences cannot be transferred 1:1 for training models. Training a lora of a white woman, will usually have a similar pipeline. But training one of an indian or chinese woman or man will have significant differences. (Hence why fully automated solutions don't work, unless you have a genuinely balanced base model)
to recognize her using the unique trigger word "mirandakerr"
trigger words, especially using real names, are a very complex topic which I've literally witnessed kill an AI startup.
tokens already have meanings assigned to them. Meaning there is no "one size fits all" fully automated solution.
For the sake of simplicity, you need to train both the unet and the clip models correctly without overfitting either, if you want custom trigger words to work.
(It's not that custom trigger words dont work - THEY DO! just that they sometimes work better, and sometimes worse, and that's something you need to be aware of. Sometimes a lora fails completely, simply due to the chosen triggerword, so you have to pick a different one)
Currently, I'm concerned about the poor quality (fried skin)
Your training is picking up the fine details before picking up the general concept. Min SNR 5 is your friend here. Also, doing only a net rank 32 lora will help as well, since you affect fewer parameters (meaning fewer details can be messed up).
Kohya and Onetrainer give fairly similar results if you've got a grasp on the parameters. Masked training, while useful, does also have negative side-effects. so relying on it completely isn't a good idea either.
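To make the "Min SNR 5" advice concrete, here is a minimal sketch of Min-SNR loss weighting with gamma = 5 for an epsilon-prediction model. The noise schedule values (scaled-linear betas from 0.00085 to 0.012 over 1000 steps) are an assumption based on the commonly used SD configuration, and the function names are made up; check what your trainer actually does.

```python
def scaled_linear_alphas_cumprod(num_steps=1000, beta_start=0.00085, beta_end=0.012):
    """Cumulative product of (1 - beta_t) for a scaled-linear beta schedule.

    Assumed schedule: betas are linear in sqrt space, then squared.
    """
    alphas_cumprod = []
    running = 1.0
    sqrt_start, sqrt_end = beta_start ** 0.5, beta_end ** 0.5
    for i in range(num_steps):
        frac = i / (num_steps - 1)
        beta = (sqrt_start + frac * (sqrt_end - sqrt_start)) ** 2
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod


def snr(alpha_bar):
    """Signal-to-noise ratio of a timestep: alpha_bar / (1 - alpha_bar)."""
    return alpha_bar / (1.0 - alpha_bar)


def min_snr_weight(snr_value, gamma=5.0):
    """Min-SNR weight for epsilon prediction: min(SNR, gamma) / SNR."""
    return min(snr_value, gamma) / snr_value


# Per-timestep multipliers that would be applied to the MSE loss.
weights = [min_snr_weight(snr(a)) for a in scaled_linear_alphas_cumprod()]
```

Low-noise timesteps (high SNR, where the model is mostly refining fine detail) get a weight well below 1, while noisy timesteps keep a weight of 1. That lines up with the point about not letting training chase details before the general concept.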
The final goal is to create a service for photo-realistic model training
If you wanna match the currently existing services (phone apps and a few online sites), that's not too hard. They rely mostly on overfitting a fair bit, and thus don't bother with the typical issues that might pop up. This can be automated fairly simply.
If you wanna beat them in quality, then that's genuinely hard, due to the issues I mentioned earlier, which will all occur if you don't cheat while training.
The Pareto principle really applies here. You can get 80% of the way, with 20% of the effort. But the closer you wanna get to 100%, the amount of time and effort you invest will rise exponentially.
Wow. So much useful information you pointed here. Thank you so much!
I haven’t checked SNR yet, but will definitely take a look at it.
Regarding masked training: many datasets I’m working with may be photo studio shots with a constant background (white background), and I’m afraid the model will learn the white background during training (which I don’t want it to do).
I noticed you trained DoRa models as well. Have you noticed differences in quality between LoRa, DoRa and full fine tunes by any chance?
and I’m afraid the model will learn white background during the training
• With a very simple photoshop script, you can automatically replace the background <- good enough to add random backgrounds to images, so that none of it gets picked up by training
• "simple background, white background" can be added to the captions of the whole dataset, then used as a negative during inference. that also gets rid of the background. <- change white to whatever background color you actually have
• masks can be used, but doing so means that model loses the context of "where" to generate your new data. <- also works, but isnt less work than the other 2 options. best to try all 3 for your unique scenario
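The second option (adding background tags to every caption) is easy to script. A minimal sketch, assuming comma-separated tag captions stored as one `.txt` file per image; the helper names `add_tags` and `tag_folder` are hypothetical, and the tags should be changed to match the real background.

```python
from pathlib import Path

# Tags to append to every caption; swap "white" for the actual background color.
BACKGROUND_TAGS = ["simple background", "white background"]


def add_tags(caption, tags=BACKGROUND_TAGS):
    """Append any missing tags to a comma-separated caption."""
    existing = [t.strip() for t in caption.split(",") if t.strip()]
    for tag in tags:
        if tag not in existing:
            existing.append(tag)
    return ", ".join(existing)


def tag_folder(folder, tags=BACKGROUND_TAGS):
    """Rewrite every .txt caption in `folder`; return how many files changed."""
    changed = 0
    for path in sorted(Path(folder).glob("*.txt")):
        old = path.read_text(encoding="utf-8")
        new = add_tags(old, tags)
        if new != old.strip():
            path.write_text(new, encoding="utf-8")
            changed += 1
    return changed
```

At inference the same tags go into the negative prompt, as described above. Running the script twice is harmless because tags that are already present are not added again.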
Have you noticed differences in quality between LoRa, DoRa and full fine tunes by any chance
Yeah. A lot.
DoRA is a massive improvement to everything, on the same scale as full finetuning, then extracting a lora. But it does come with the same possible downfalls of full finetuning, where more can get messed up. But once you have experience with captioning & a basic preset that works, DoRA is basically as easy to use as a LoRA, just better in a lot of ways.
Hi everyone,can you advise for 370 images, how many minimum epochs and steps are needed for average quality?
sdxl base, sdxl pony, sd1.5?
sdxl pony
using kohya?
no,one trainer
then you can just copy these values to onetrainer
names are pretty much identical in onetrainer, so should be fairly easy
just remember to turn on min snr 5 in onetrainer, since that ones hidden away in a dropdown
epoch 100 will be your target
I usually let it run to 150, in case I prefer a slight overfitting
https://civitai.com/models/597892/juno-overwatch-for-pony-properly-trained
and
https://civitai.com/models/596487/sonoshee-mclaren-for-pony-redline
were both trained with those settings, on pony sdxl. so you can look at those for what quality to expect from that preset
Thank you
you're welcome. and remember to adjust learning rate in case you change the batch size
your learning rate / batch size = 0.0001
thats why the preset has 0.0008 with a batch size of 8
works if you have a 3090 or 4090.
if you have a smaller gpu, then just adjust it using that simple formula
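The rule of thumb above as a tiny sketch. The 0.0001 per-sample value and the batch size of 8 come from the messages; the step count helper is an extra assumption (plain images times epochs divided by batch size, ignoring repeats and bucketing, which real trainers may count differently).

```python
import math

# From the preset described above: learning rate / batch size = 0.0001.
LR_PER_SAMPLE = 1e-4


def scaled_lr(batch_size, lr_per_sample=LR_PER_SAMPLE):
    """Learning rate that keeps lr / batch_size constant."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return lr_per_sample * batch_size


def total_steps(num_images, epochs, batch_size):
    """Rough optimizer step count: ceil(images / batch) steps per epoch."""
    return math.ceil(num_images / batch_size) * epochs


if __name__ == "__main__":
    for bs in (1, 2, 4, 8):
        print(bs, scaled_lr(bs), total_steps(370, 100, bs))
```

With a batch size of 8 this gives back roughly the preset's 0.0008, and a smaller GPU running batch size 2 would use about 0.0002.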
Lol I actually didn't. thank you!
Hi, could you please advise me on how to format/partition? I got a new 4TB SSD. I want to install Linux and automatic1111 on it and use it to store all my big files (models, loras, etc.), AND use it sometimes for webuis in Windows as well (SD.next, forge). So I want to reuse the same directory for loras and models across a Windows and Linux dual boot. The issue is that sharing requires NTFS, and I understand having the models on an NTFS partition will slow down the Linux performance of automatic1111? Is that correct?
Little chubby guy eats fruits
for linux distro, use:
pop!os 22.04 LTS Nvidia
^ that solves your drivers + cuda + torch issues
On that, set up a folder for "AI"
under models, is where I have all checkpoints, loras and stuff saved. its also where I automatically store things I train myself
And every application has some way of being able to point at a different directory for models.
For A1111, you edit the webui-user.sh
export COMMANDLINE_ARGS="--data-dir '/mnt/md/0/AI/A1111/stable-diffusion-webui' --ckpt-dir '/mnt/md/0/AI/MODELS/checkpoints' --lora-dir '/mnt/md/0/AI/MODELS/lora' --embeddings-dir '/mnt/md/0/AI/MODELS/embeddings' --listen --port 16999"
comfy, swarmUI, and all the other generators can also have different locations set up - you just point them at that models folder
as for windows and linux sharing a drive - there's info on that to be found online, and is not specific to stable diffusion. You can basically just "mount" the shared partition in linux. pop!os makes that fairly simple - so read up on their site or ask on their discord
Hi, I have a question about finetuning stable diffusion for inpainting.
I came across this paper which is similar to the work we are doing: https://arxiv.org/abs/2312.03606. They finetuned the Stable Diffusion model to create synthetic satellite images. After reading this paper I was wondering it would be possible to finetune the Stable Diffusion Inpainting model as well. For instance, we could mask out an area in an image and then our finetuned model could fill in that area with a road, river, or something of that nature.
I currently have input images and masks of land cover features on hand. I do not have any prompts but I am under the assumption I could possibly create some myself if needed. What would be the best way to go about finetuning the stable diffusion for inpainting model on my dataset (~10000 images)?
For reference, I have trained a custom UNet2DConditionModel for inpainting from scratch with simple labels such as grass, trees, water, etc. using my dataset, which I posted about here: https://discuss.huggingface.co/t/custom-pipeline-inference-speed-extremely-slow/89642 and was able to get some decent results (I posted this because the inference speed was slow but that has now been fixed). I pulled the off-the-shelf stable diffusion inpainting model and attempted to inpaint certain features into the images; it did pretty well but could definitely use some work. With that being said, I was hoping that finetuning the stable diffusion inpainting model could outperform my current model.
Hi all, looking to make my first dataset. Are there any web-based or mobile-friendly tagging tools? had a look at taggui but it's desktop only from what i can tell
any 4090 guys here trained flux yet?
still trying. It works but results are not really good yet
Has anyone tried applying ReFT to any diffusion model yet? Seems to beat DoRA and with much fewer parameters for LLM benchmarks https://arxiv.org/abs/2404.03592
which method is the best for train a artistic style with a lot of images? (Lora,Dreambooth,Custom Diffusion, etc)
hey not sure if this is the right place to ask such a question, but i am currently trying to finetune sd1.5 with python. i dont get any errors, but my output test images are exactly the same (pixel comparison). i also compared the unet .safetensors files with the cmd "fc /b file1 file2" and no changes were detected.
my idea for the dummy data was to just use the same image several times with the same description as a proof of concept. I then use exactly the same prompt to generate test images, hoping to see some resemblance to my dummy input image
any help would be highly appreciated ❤️
also tips on where to ask would help 😄
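A cross-platform version of that byte-for-byte check, as a minimal sketch using only the standard library (the function names are made up). If the saved unet really hashes identically to the original, the weights were never updated or the wrong object was saved, so the training loop and the save call are the places to look.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large checkpoints fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b):
    """True if both files have the same size and the same SHA-256 digest."""
    if Path(path_a).stat().st_size != Path(path_b).stat().st_size:
        return False
    return sha256_of(path_a) == sha256_of(path_b)
```

Usage would be something like `files_identical("unet_before.safetensors", "unet_after.safetensors")`, where both file names are placeholders.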
why 1.5? Old model. Anyway maybe use kohya vs trying to write it yourself
Hello!
Has anyone tried training on Juggernaut XL, this version: https://civitai.com/models/133005?modelVersionId=456194
I did, but the results were not good enough. The dataset was captioned with the same tool as jugger, around 300 images.
Can you give me some advice how to improve it?
Some examples of captions:
A cocktail, amber fluid color, old-fashioned glass, ice cubes 75% ice, lemon slice garnish in glass, reflective surface, vibrant surroundings, illuminated counter, glistening drink
A cocktail, vibrant magenta fluid color, coupe glass, lime wheel garnish on the rim, rose petals around glass, blurred background with bokeh lights, indoor setting
Its a very good model, and always had a lot of respect for it. But of recent they seem to be slipping from the number one spot with the training .
Yeah its good model, but looks like not good for lora training
it used to be, I trained a lot of my first SDXL loras with jugg and copax
back before I was training checkpoints
Cool, what did I do wrong with jugg? Any ideas why my loras give such bad results?
if you want a model to train a lora really well use this SDXL lightning https://tensor.art/models/751519943066912725/LIGHTNING-Dream-Diffusion-By-DICE-v1
I'd prefer to use jugg, it knows more concepts related to cocktails. Also it's trained on gpt4v captions.
I want to train lora on jugg for now.
ok but lightning models are the best to train with as you need a lot of images and lightning runs CFG 1 and steps 2
I feel fine, to rent h100 for 16hours.
Try to render a cocktail, sdxl knows fewer concepts for cocktails.
what a cocktail drink?
A close-up of a champagne flute on a marble surface, filled with sparkling champagne. A lemon twist spirals around the inner rim of the glass. The background is dark, emphasizing the clarity and bubbles in the drink, creating an elegant and sophisticated atmosphere.
Something like that
Im targeting to something like that, but without bottle
i have just written this and its rendering now on the beach in the sand is a cocktail glass with a colourful drink inside, with ice cubes and a light condensation on the glass, in the back ground is a stunning sunset on the ocean horizon,
The main problem is another one. I need an image to match a recipe. So I need full control over glass type, fluid color, soft drink or not, with/without ice, garnish, etc.
tell me what glass and colur drink
glass: champagne flute
fluid color: transparent golden
extras:
without ice
with lemon twist garnish on the glass rim
if you're training you need every type of colour, not necessarily the brand of drink, as the checkpoint will know that. you just need primary colours and different glasses
ice, garnish is all handled by the checkpoint
Some of my captions
A cocktail, amber fluid color, old-fashioned glass, ice cubes 75% ice, lemon slice garnish in glass, reflective surface, vibrant surroundings, illuminated counter, glistening drink
A cocktail, pink fluid color, lowball glass, crushed ice 90%, cucumber slices in glass, grapefruit slice on rim, mint sprig on rim, tray surface, daytime
A cocktail, white fluid color with foam on top, coupe glass, cinnamon stick garnish on rim, pink and yellow background
334 image total
the lora is a very basic guide for more custom characters, as the newer models are so well trained they don't really need a lora
ill put those prompts into flux and you will see no need for a lora
ok used this one A cocktail, pink fluid color, lowball glass, crushed ice 90%, cucumber slices in glass, grapefruit slice on rim, mint sprig on rim, tray surface, daytime
spot on ,,,, no need for a lora or neg prompt
could make it more real with 1 word added to your prompt
thats not what your prompt ask for tho lol
Flux 😄
it's another for example, another prompt, another generation should be. Not about pink
oh
The source image for that prompt was:
At least that fluid color is not pinkish enough
Flux works better, it's true.
see cucumber slice, in the generation is citrus+cucumber slice.
rendering for your red drink
Flux works better, it's true.
my flux hyper on the left your google image to the right
A cocktail, red fluid color, tumbler glass, medium ice cubes 90%, lime slices in glass, rosemary, pommegranite, white marble table, daytime
What it did with the pomegranate xD
Flux better, but you see )
well trained, it even knew what i meant from that shambles
running it correctly now
wah lah better
lol
the next problem is the layers (real photo below)
A cocktail, amber and green gradient fluid color, highball glass, crushed ice, 80% ice, mint sprig garnish on top, tequila bottle in background, ice on plate, bar setting
ok here goes
A cocktail, amber and green gradient fluid color, highball glass, crushed ice, mint sprig garnish on top, tequila bottle in background, ice on plate, bar setting, UHD, microscopic photography, magnified, molecular, unseen worlds revealed, scientific exploration, capturing molecular details, professional imaging techniques, precise focusing, revealing hidden beauty, scientific discovery, artistic interpretation, nano scale, revealing the wonders of the unseen
img2img looks like cheating
ip adapter
not too much difference. I can't do it in my case. Only txt2img.
Mia Khalifa serving food in a restaurant wearing a protective revealing leather outfit
Does someone know If I can caption my images with llama 3.1 to create some loras?, (in comfy ui) (I will just use 20-30 images)
Commercial photography, powerful yellow powder explosion, fried chicken, black background, bright environment, white lighting, studio lighting, OC rendering, super detail, solid color isolation platform, professional photography, color gradinging About Midjourney Parameters --ar 9:16 --v 5.2 --s 750 --c 0 --q 1
hi guys, i learned recently a bit about finetuning and i have some questions about how dynamic it is.
im building an app where one could input their biz/startup/game idea and go through steps where they will be generating context about it (objective, target audience, biz model, etc)
every time you generate something, it is used as context for next time you interact with an AI
rn im also working on a step to generate a logo for that brand.
what im doing currently is im dumping the brand context into GPT and asking it to create a prompt for SDXL, which i then insert with some other prompt keywords to make sure it looks like a proper logo
the issue is GPT's prompts are kind of trash.
i was wondering how well fine tuning would work if i:
- Got a bunch of pairs of brand context dump -> ideal logo image
- Trained an SDXL model on it
i.e would it work well if the "prompt" is a huge context about a brand, and not rather some small trigger keyword?
or is that not how it works?
I fine-tuned a model with images of an actor from an old movie. All of them have a characteristic look due to the camera quality. Now the model generates images in that particular style, not with the face that I wanted to achieve. What did I do wrong? Should I improve captioning or prompting? I don't want to replicate the style, but face.
Hey everyone
i want to train an LLM SDXL fine-tune model with about 100k images. i have trained a model on 84k images so far but haven't gotten any better results. can anyone tell me how to approach it?
what's an LLM SDXL model? like, you use captions from an LLM?
don't use an LLM with vision capabilities. they can be good but they're overkill for flux training. This is out. Florence 2 is a Vision Language Model that's more lightweight and specialized. Look into that.
why, when i installed controlnet in stable diffusion, didn't i see a tab for controlnet?
yo i got a question about dataset preparation.
i am currently distilling a lora by making a big amount of wildcard prompt based images and simply discarding low quality and bad character features.
the idea i'm having to make the lora less biased towards one style is to try and generate multiple different styles using the already existing lora.
so i'll make an equal amount of different style images of the character using the old lora and sdxl, take the best images and then train a flux lora on those images
how many images does it usually require for a flux lora and how can i make it learn the characters features abstractly and not directly associate it to one particular style
You mention style and you mention character, normally a character training would use subject captioning, where you describe as much in the image as possible except the character, because you want the ai to learn the character. Style training is different, you want to describe as little as possible, because you want it attempt to learn the style which is a more abstract concept.
yeah i dont describe the features of the character, just what the character is doing and emotions and environment
i used joycaption to caption the dataset and then removed things that describe the recurring features of the character
nevertheless, I feel like your goals are diametrically opposed. style training and character training are different methods. Either try to train with no captions and just see what happens, or do them separately, or with a multiple concept lora with one trigger trying to capture the style and one trying to capture the character. different triggers for both. or you could do 2 separate loras worst case. remember that flux is very new to all of us still, so any advice is to be taken with a grain of salt
huh i just want to capture the character and nothing else, no style
i want a token to be associated to the character
i dont want to have to describe my character in detail just to get him
as it is with the pure joy caption loras where character features havent been purged
i did a new dataset and new captions and this time i'm just referring to the character as if the model already knew it
using the token i chose
the description that has the most content is the background and style of the image
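for anyone wanting to script that caption cleanup, here's a rough sketch of the idea: strip the phrases that describe the character's fixed features and lead each caption with the trigger token. the function name, the `mychar` token and the phrase list are all made up for illustration, and plain phrase matching is an assumption, real captions will probably need a longer list or manual review.

```python
import re


def clean_caption(caption: str, trigger: str, feature_phrases: list[str]) -> str:
    """Remove character-feature phrases and prepend the trigger token."""
    for phrase in feature_phrases:
        # Case-insensitive removal of the phrase plus an optional trailing comma.
        pattern = rf"\b{re.escape(phrase)}\b,?\s*"
        caption = re.sub(pattern, "", caption, flags=re.IGNORECASE)
    # Collapse leftover whitespace and trim stray commas at the ends.
    caption = re.sub(r"\s+", " ", caption).strip(" ,")
    return f"{trigger}, {caption}"


if __name__ == "__main__":
    # Hypothetical caption and token, not taken from any real dataset.
    raw = "A man with red hair, blue eyes, standing in a forest, watercolor style"
    print(clean_caption(raw, "mychar", ["with red hair", "blue eyes"]))
```

what stays in the caption (pose, background, style) is what you're telling the model is variable, what's removed gets absorbed into the token.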
I see, well character training is something flux should handle easily, even with few images
because i want a lora that can portray the character in different styles
not just in the original one it was made in
there's lots of strategies here. If your dataset is consistent enough, you can get away with just using one trigger word to train. If the dataset is varied, you'll want to use captioning. This guy has been writing tons of training diaries. this and his other writings about flux are insightful
https://civitai.com/articles/6868/flux-character-caption-differences-training-diary
if anyone wants to try it out, here's a new optimizer. Modified adamw iirc.
https://github.com/lodestone-rock/compass_optimizer
I'm not technical enough to explain it properly, but the author described it as rmsprop with a low pass filter on top.
Hi:) I just finished a dataset of around 40gb of architectural images. Tagged by scraping the original captions, extracting keywords + florence2 descriptions.
The goal is ideally to make a full fine tune of Flux, since with so many images a lora might not make sense. Any tips, guides?
Around 75k images
you would typically be right with that approach. 75k images dataset probably not suitable for dreambooth/lora. The problem is flux is so new I dont know anyone that has done a full finetune on it yet. basically, you're treading into new territory here
hey hey quick question if anyone knows! kinda new to this. I am using kohya to train loras, but want to queue up the trainings to test different parameters (when one training ends, the other starts up without me having to manually start it). All i have found so far is to print the training command for each configuration and then paste them into a script or bat file. is that the way to do it? 🙂
nvm i just figured it out! in case anyone else asks, after hitting print training command, your terminal will show a line that starts with "kohya_ss\venv\Scripts\accelerate.EXE launch ...." and ends with your .toml file (any arguments you have will show after the toml, for example, --network_train_unet_only)
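if a bat file feels clunky, a small python script can do the same queueing. this is only a sketch: `run_queue` is a made-up helper, and the two commands below are stand-ins so it runs anywhere. you'd swap each one for the full line kohya prints for that configuration.

```python
import subprocess
import sys


def run_queue(commands: list[list[str]]) -> list[int]:
    """Run each command to completion before starting the next; return exit codes."""
    exit_codes = []
    for cmd in commands:
        print("starting:", " ".join(cmd))
        # subprocess.run blocks until the process exits, which gives the queueing.
        result = subprocess.run(cmd)
        exit_codes.append(result.returncode)
    return exit_codes


if __name__ == "__main__":
    # Placeholder commands (assumption): replace with your printed training commands.
    queue = [
        [sys.executable, "-c", "print('training run A')"],
        [sys.executable, "-c", "print('training run B')"],
    ]
    print(run_queue(queue))
```

a failed run doesn't stop the queue here, it just shows up as a non-zero exit code in the returned list.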
hi all, I am a complete beginner to finetuning SD models. I have tried to set up the Automatic1111 web ui on my mac but failed badly as I was getting some error related to device type being mps which I was unable to fix.
do you have any advice of where to start if I am a complete beginner with not much knowledge in software development and need an easy way to finetune the sd model? The use case is teaching it to generate images of a specific product. Thank you in advance. And sorry if this is written anywhere, I was unable to find it
1. don't do it on mac if you can avoid it. what matters most is gpu, and unfortunately apple does not shine here. 2. automatic1111 won't help with finetuning, it's only for doing image generation locally. 3. check out this link for a good training app. maybe you should look into doing it on a cloud service such as vast.ai, if you have limited compute choices locally. https://github.com/bmaltais/kohya_ss
thank you! Have you ever tried running the colab notebook? I am getting this error and no idea what to do with it:
ImportError: cannot import name 'split_torch_state_dict_into_shards' from 'huggingface_hub' (/usr/local/lib/python3.10/dist-packages/huggingface_hub/__init__.py)
I stopped using colab when they changed their EULA to forbid it on their free tier. you can pay for colab pro, but honestly vast is better imo
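on that ImportError: that kind of message usually means the installed huggingface_hub doesn't export the name another library expects, often because the two versions are out of step. a quick way to check what the notebook actually has, as a sketch (both helper names are made up, and whether upgrading huggingface_hub fixes that specific notebook is an assumption, not something verified here):

```python
import importlib
from importlib import metadata
from typing import Optional


def has_symbol(module_name: str, symbol: str) -> bool:
    """Return True if the module imports cleanly and exposes the given name."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, symbol)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("huggingface_hub version:", installed_version("huggingface_hub"))
    print("has symbol:", has_symbol("huggingface_hub", "split_torch_state_dict_into_shards"))
```

if the symbol check comes back False, the first thing to try is upgrading huggingface_hub and restarting the runtime.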
first time i manage to make my loss graph look this beautiful lmao
.08? wow...hopefully the actual results match that
umm, well depending on what you're going for... 😄
I think best epoch was this one:
I'm trying to revive a dead artist style, but sdxl can't keep up with original, I've been trying for weeks
this is source, look at those patterns.. wish sdxl could do it lmao
that's a lot of detail
yah.. might have better luck with flux, but I don't got the hardware/patience for it
hard to say, I'm still working on flux loras, and kohya only just recently added the ability to train the text encoder.
any luck with text encoder training? I'm used to training unet only, not sure if text enc benefits this style use case
well it converges quicker I can say that, but my current training is still in progress, so I'm not satisfied with it yet we'll say
I see, pretty nice to share insights into this black box that is training lmao
I'm doing one with multiple concepts just to see if it works...when I get strange defects I cant tell if it's going off course or if I just need more steps, so onward
yah, too many params, overfitting can take many forms. I tried the 2 layer flux training on civitai and it created carbon copies of source material, tried again burning a heap of buzz, 32dim,16 alpha, and got decent results
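side note on the 32 dim / 16 alpha numbers: in the usual LoRA setup the learned update is multiplied by alpha / dim before it's added to the base weights, so alpha at half the dim halves the update's effective strength. tiny sketch of that arithmetic (the function name is just for illustration):

```python
def lora_scale(alpha: float, dim: int) -> float:
    """Scaling factor applied to the LoRA update: alpha divided by rank (dim)."""
    if dim <= 0:
        raise ValueError("dim must be a positive integer")
    return alpha / dim


if __name__ == "__main__":
    print(lora_scale(16, 32))  # 0.5
    print(lora_scale(32, 32))  # 1.0
```

so changing alpha interacts with learning rate: a lower alpha at the same dim behaves a bit like training with a smaller step.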
can't beat good ol adamw with low learning rate
I kinda like it, but not good enough for me..
Maybe lowering learning rate could help with the patterns and details?
already running Loha at 32rank so pretty high
Yah.. gave a shot on flux..
I'll say that's definitely an interesting and complex style to try to mimic, it's not bad even if it misses the mark
yah.. I got some interesting results after training.. much closer to source
just released it on civitai, "ayahuasca dreams"
but ofc there's no soul.... original artist had intention over each detail
Hey there everyone!
I want to develop an application which converts hand drawn sketches to images of clothes/garment etc
Can i achieve this by finetuning StableDiffusion? If yes, how so?
I would appreciate resources on this
Moreover, my model will have image+text input
Is there any other better approach than stable diffusion?
Open to all suggestions, thank you!
SD can already do this using img2img. Take a look into controlnets, "canny" and "lineart"
by using prompt + img2img + lower strength controlnets you can achieve this
Finetuning or LoRAs are used in case you want to teach new concepts to the models, but there are a bunch of models with photorealistic or fantasy styles already, so unless you need something very specific you wouldn't use it.
Something of note is that there are "inpaint" models, which work better in some img2img scenarios. from personal experience "pony" models work best too.
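if you'd rather script it than use a UI, here's roughly what prompt + img2img + a canny controlnet looks like with the diffusers library. treat it as a sketch under assumptions: the two model ids, the 512x512 resize, the canny thresholds and the default strengths are illustrative picks, not tuned values, and it needs a CUDA gpu plus diffusers, torch, opencv and pillow installed.

```python
def sketch_to_garment(
    sketch_path: str,
    prompt: str,
    strength: float = 0.8,          # how far img2img may move away from the sketch
    controlnet_scale: float = 0.6,  # "lower strength" controlnet, below 1.0
    base_model: str = "stable-diffusion-v1-5/stable-diffusion-v1-5",  # assumed id
    controlnet_model: str = "lllyasviel/sd-controlnet-canny",         # assumed id
):
    """Turn a hand-drawn sketch into an image, guided by its canny edges."""
    if not 0.0 < strength <= 1.0:
        raise ValueError("strength must be in (0, 1]")

    # Heavy imports live here so the module can be imported without a GPU stack.
    import cv2
    import numpy as np
    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetImg2ImgPipeline

    sketch = Image.open(sketch_path).convert("RGB").resize((512, 512))

    # Build the control image: canny edges repeated across three channels.
    edges = cv2.Canny(np.array(sketch), 100, 200)
    control = Image.fromarray(np.stack([edges] * 3, axis=-1))

    controlnet = ControlNetModel.from_pretrained(controlnet_model, torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
        base_model, controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")

    result = pipe(
        prompt=prompt,
        image=sketch,
        control_image=control,
        strength=strength,
        controlnet_conditioning_scale=controlnet_scale,
        num_inference_steps=30,
    )
    return result.images[0]
```

usage would be something like `sketch_to_garment("sketch.png", "a denim jacket, product photo").save("out.png")`, with the file names being placeholders.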
this is a pretty bad example using canny only, not even img2img with a paint sketch
