Excellent Tips for LoRA Training (Patreon)
Patreon exclusive posts index to find our scripts easily
Join our Discord to get help, chat, and discuss, and tell me your Discord username to get your special rank : SECourses Discord
Please also Star, Watch and Fork our Stable Diffusion & Generative AI GitHub repository and join our Reddit subreddit
These tips are from GitHub user madman404. I have slightly modified and improved the tips.
So if you want to do LoRA training, here are excellent tips for you. I still suggest full model training and then extracting a LoRA, but I plan to do an intensive hyperparameter search for LoRA training as well for 8GB GPU owners, as I did for Fine-Tuning / DreamBooth
You should read this article of ours (public) : Full Workflow For Newbie Stable Diffusion Trainers For SD 1.5 Models & SDXL Models Training With DreamBooth & LoRA
If you don't use regularization images and use proper captions, Kohya DreamBooth becomes fine-tuning.
With OneTrainer, adding class images to the training mimics the DreamBooth effect.
For SD 1.5 Models best Kohya DreamBooth Workflow : https://www.patreon.com/posts/very-best-kohya-97379147
For SDXL Models best Kohya DreamBooth Workflow : https://www.patreon.com/posts/very-best-for-of-89213064
For SD 1.5 Models best OneTrainer Fine-Tuning Workflow : https://www.patreon.com/posts/very-best-config-97381002
For SDXL Models best OneTrainer Fine-Tuning Workflow : https://www.patreon.com/posts/96028218
Tips And Golden Information For LoRA Training
Network Rank - Larger networks, all else equal, need a lower learning rate to be stable. This relationship seems to hold at scale, i.e. LoRAs usually need learning rates ~10x higher than the original full model.
Network Alpha - This is literally just a scalar on the effective learning rate, and consequently any learning rate someone else suggests is meaningless unless they also provide the alpha and the rank. Your chosen learning rate is effectively multiplied by (alpha/rank) to get your "real" learning rate.
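The alpha/rank scaling above can be made concrete with a tiny sketch. The function name is my own; the formula is the (alpha/rank) multiplier described in the tip:

```python
def effective_lr(lr: float, alpha: float, rank: float) -> float:
    """A LoRA's "real" learning rate: the configured lr scaled by alpha/rank."""
    return lr * (alpha / rank)

# Two configs that look different but apply the same effective step size:
# rank 32 / alpha 16 at lr 1e-4 matches rank 32 / alpha 32 at lr 5e-5.
print(effective_lr(1e-4, alpha=16, rank=32))  # 5e-05
print(effective_lr(5e-5, alpha=32, rank=32))  # 5e-05
```

This is why a shared learning rate is meaningless without the alpha and rank that went with it.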
Optimizer - Valid learning rates are not compatible across optimizers; different optimizers require different learning rates. So Adafactor and AdamW will not use the same values.
Batch Size - Increasing the batch size (or gradient accumulation steps, which act as a multiplier on both the effective batch size and the time per optimizer update step in equal measure) decreases the overall gradient noise by sampling the dataset more representatively, and those less "noisy," more useful gradients allow you to use marginally higher learning rates. Adam mostly diminishes this effect, though. What this means is that a higher batch size reduces your learning rate's impact: if you get overtraining at batch size 1, you may not get overtraining at batch size 8 with the same learning rate.
Precision - If you change the LoRA weight dtype from FP32, you will probably have to adjust the learning rate. BF16 has low precision but high range, and compared to FP16 or even FP32 it will need a higher learning rate for the update steps to actually do anything. What this means is that the learning rate changes with the precision used: FP32, FP16, and BF16 each require different learning rates. Therefore, either follow my workflows exactly or expect to do more training experimentation.
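You can see BF16's low precision directly. BF16 keeps FP32's 8-bit exponent but only a 7-bit mantissa, so a small weight update can be rounded away entirely. This sketch truncates an FP32 value to BF16 (dropping the low 16 bits; real hardware rounds to nearest, which doesn't change the point for this example):

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate an FP32 value to BF16 by keeping only its top 16 bits
    (sign + 8-bit exponent + 7-bit mantissa)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

# A small LoRA weight update vanishes entirely in BF16:
w, delta = 1.0, 1e-4
print(to_bf16(w + delta) - w)  # 0.0 -- the update was rounded away
```

This is why BF16 training tends to need larger update steps (a higher learning rate) than FP16 or FP32 to make visible progress.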
As for identifying a learning rate:
There is no easy way to do this. All you can do is run trainings at various learning rates until it works. I like to sweep 1e-7, 1e-6, 1e-5, 1e-4, and 1e-3 first to see which is stable, and then go halfway between the two most stable results and repeat until I am satisfied. Some things to look out for when sweeping learning rates:
A learning rate that is too low will make little to no progress.
A learning rate that is too high will diverge, making oversaturated, ugly, or generally non-representative samples that do not appear to even be moving generally in the direction of your dataset.
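The sweep-then-bisect procedure above can be sketched as follows. One assumption on my part: since the sweep values are spaced by powers of ten, "halfway between" two learning rates is taken on a log scale (the geometric mean), not the arithmetic mean.

```python
import math

# The initial sweep values from the text:
sweep = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3]

def halfway(lr_low: float, lr_high: float) -> float:
    """Midpoint of two learning rates on a log scale (geometric mean).
    Halfway between 1e-5 and 1e-4 is ~3.16e-5, not 5.5e-5."""
    return math.sqrt(lr_low * lr_high)

# If 1e-5 and 1e-4 were the two most stable runs, try next:
print(halfway(1e-5, 1e-4))  # ~3.16e-05
```

Repeat: run the midpoint, keep the two most stable of the three, and bisect again until satisfied.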
There is a limit to the learning rates that will let your model converge stably. Once you've identified it (ideally, the learning rate that performs best on a short test run), run that learning rate and instead increase the length of the training until it converges at a result you think fits well enough. I usually aim for 150 epochs when training a person, and this generalizes well to almost everything.
Other notes you may find helpful:
If you aren't already, use Min-SNR gamma. It's pretty much a free lunch, and using a value of 5 (the default) or 1 (recommended by birch-san for latent models like Stable Diffusion, and stable in my own testing) will let your training converge faster.
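For reference, a sketch of the Min-SNR weighting itself (from the Min-SNR paper, for the epsilon-prediction case that SD 1.5 and SDXL use; the function name is mine): each timestep's loss is weighted by min(SNR, gamma) / SNR, so easy low-noise timesteps get down-weighted instead of dominating training.

```python
def min_snr_weight(snr: float, gamma: float = 5.0) -> float:
    """Min-SNR loss weight for epsilon-prediction: min(SNR, gamma) / SNR.
    Low-SNR (very noisy) timesteps keep weight 1.0; high-SNR (nearly
    clean) timesteps are down-weighted, which speeds up convergence."""
    return min(snr, gamma) / snr

print(min_snr_weight(0.5))   # 1.0  -- low SNR, weight unchanged
print(min_snr_weight(20.0))  # 0.25 -- high SNR, down-weighted
```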
Weight Decay (0.01) usually provides better results when doing DreamBooth / Fine-Tuning, but the effect may depend on the optimizer used. Give it a try too.
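What weight decay actually does, in a minimal sketch of the decoupled (AdamW-style) form: each step, every weight is shrunk slightly toward zero, independently of the gradient, which acts as mild regularization against overfitting. The function here is illustrative, not a trainer's actual code:

```python
def decoupled_weight_decay(weight: float, lr: float, weight_decay: float = 0.01) -> float:
    """One AdamW-style decay step: shrink the weight by lr * weight_decay
    of its own magnitude, applied separately from the gradient update."""
    return weight * (1.0 - lr * weight_decay)

# At lr 1e-4 and weight_decay 0.01, each step shrinks a weight by 1e-6 of itself:
print(decoupled_weight_decay(1.0, lr=1e-4))  # 0.999999
```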