
Patreon exclusive posts index

Join Discord and tell me your Discord username to get a special rank : SECourses Discord

How to use NGROK to connect Gradio apps on free Kaggle notebooks : https://youtu.be/iBT6rhH0Fjs
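
For reference, a minimal sketch of how such an ngrok tunnel can be opened from a notebook cell, assuming the pyngrok package is installed and you have your own ngrok auth token (the token string and port below are placeholders; 7860 is Gradio's default port):

from pyngrok import ngrok, conf

conf.get_default().auth_token = "YOUR_NGROK_TOKEN"  # placeholder, replace with your own token

# open a tunnel to the port the local Gradio app will listen on
tunnel = ngrok.connect(7860)
print(tunnel.public_url)  # open this URL only after the Gradio app has started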

12 March 2024 Update

22 February 2024 Update

  • I have found the bug in Kohya SS scripts and fixed it
  • SDXL 1.0 Base DreamBooth is working again on dual T4 GPUs - super fast since the batch size is 2
  • However, I suggest the newest SD 1.5 workflow for model training on Kaggle, since SD 1.5 can be trained in FP32 while SDXL can only be trained in FP16 (BF16 yields better results)
  • The very best Kaggle configs for both SDXL and SD 1.5 will be auto-downloaded into the Kaggle working directory
  • Read more about the best configs and why SD 1.5 is better on Kaggle here : https://www.patreon.com/posts/full-workflow-sd-98620163

30 January 2024 Update


Tutorial link for this notebook file : https://www.youtube.com/watch?v=16-b1AjvyBE

Register a free Kaggle account : https://www.kaggle.com/

Verify your Phone number : https://www.kaggle.com/settings

Start a new notebook by clicking the + Create button

Upload the notebook attached below (Import Notebook)

Click here to open the GitHub readme file of this tutorial

Comments

Rasika Singal

please make changes to the code: !wget https://www.pokemonpets.com/woman_3786_imgs_1024x1024px.zip,man code is repetitive

Anonymous

I have a question regarding the usage of Kaggle notebooks. Is it possible to run the notebook on Kaggle, turn off my own computer, and then come back to Kaggle after the estimated training time to download the created LoRAs? When I tried this, all data was deleted and the training had stopped.

San Milano

If I install everything specified in the tutorial, will Kaggle store it for next time or do I have to install everything again?

Furkan Gözükara

You need to reinstall. It is fast. I will also hopefully update the notebook for even more performance if I can. The Automatic1111 notebook will also get SDXL ControlNet support. Working on it.

Anonymous

I tried adding the new parameters, but I'm still getting the error: Your notebook tried to allocate more memory than is available. It has restarted.

Furkan Gözükara

Please show your full command here. A lot of people have already done successful training, so something must be different. You can also message on Discord, which is much better.

Anonymous

accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" --train_data_dir="/kaggle/working/results/img" --reg_data_dir="/kaggle/working/results/reg" --resolution="1024,1024" --output_dir="/kaggle/working/results/model" --logging_dir="/kaggle/working/results/log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=32 --output_name="kaggle_mjero_1" --lr_scheduler_num_cycles="8" --no_half_vae --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="4800" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.0 --lowram

Anonymous

It seems the loss becomes nan when I train my LoRA on this special SDXL model. How can I solve this problem? The SDXL base model I chose is https://civitai.com/models/139562/realvisxl-v10

Anonymous

import network module: networks.lora
create LoRA network. base dim (rank): 32, alpha: 1.0
neuron dropout: p=None, rank dropout: p=None, module dropout: p=None
create LoRA for Text Encoder: 264 modules.
create LoRA for U-Net: 722 modules.
enable LoRA for text encoder
enable LoRA for U-Net
use Adafactor optimizer | {'relative_step': True}
relative_step is true / relative_stepがtrueです
learning rate is used as initial_lr / 指定したlearning rateはinitial_lrとして使用されます
unet_lr and text_encoder_lr are ignored / unet_lrとtext_encoder_lrは無視されます
use adafactor_scheduler / スケジューラにadafactor_schedulerを使用します
override steps. steps for 10 epochs is / 指定エポックまでのステップ数: 1840
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 275
num reg images / 正則化画像の数: 4318
num batches per epoch / 1epochのバッチ数: 184
num epochs / epoch数: 10
batch size per device / バッチサイズ: 3
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 1840
steps: 1%|█ | 1/1840 2.97s/it, loss=nan

Anonymous

How do I train a LoRA for other large models? I don't want to use sdxl0.9vae; for example, I want to use this one to train the LoRA: https://civitai.com/models/139562?modelVersionId=154590 What should we do?

Furkan Gözükara

It is working. Here is the change you need to make. I also shared it on Discord, it is easier to see there : https://i.ibb.co/yQPmRwP/image.png

Paula Kühn

Hey there! First of all: thanks for your great work! I have a question: I want to train a little clay figurine, and I am following your LoRA Kaggle tutorial. I trained it successfully with the previous SD version without regularization images; can I skip them here as well?

Furkan Gözükara

Yes, you can. Moreover, I have completed my SDXL DreamBooth workflow and am about to publish it on Patreon. Check it out once published; it is much better than LoRA.

Paula Kühn

Why do you think it is much better than LoRA? Do you mean specifically for non-human characters? I've got so much to learn. :)

Anonymous

Hi, thanks for your notebook. I ran it successfully yesterday, but when I run the webui command in Kaggle today it kills the server. What I did yesterday worked fine. Does anyone know what happened? "Keyboard interruption in main thread... closing server."

Furkan Gözükara

Yes, Google doesn't allow web UIs anymore. Now you need to prepare the dataset manually and execute the training command. You can also prepare it on your computer, upload it, and use it there with the correct path > https://twitter.com/GozukaraFurkan/status/1702476057880756733
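
For anyone preparing the dataset by hand, a hypothetical notebook cell sketching the folder layout the training commands in this thread point at (the trigger word, class name and repeat counts are placeholders, following the <repeats>_<trigger> <class> folder naming convention):

!mkdir -p "/kaggle/working/results/img/25_ohwx woman"    # training images: <repeats>_<trigger> <class>
!mkdir -p "/kaggle/working/results/reg/1_woman"          # optional regularization images: <repeats>_<class>
!mkdir -p "/kaggle/working/results/model" "/kaggle/working/results/log"
# upload your images into the img subfolder (or wget a zip and unzip it there), then pass these
# paths to --train_data_dir, --reg_data_dir, --output_dir and --logging_dir in the training command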

Anonymous

First of all, thanks for the hard work you put into this. It would be nice if you could make an update on the post about this. I was trying to make it work for several hours yesterday.

Furkan Gözükara

Added to the very top. Maybe I can make a short tutorial on how to prepare the dataset and the command on your PC and run it on Kaggle.

Anonymous

My notebook was supposed to run for 16h, but they stop it after 12h on the free version. Is there a way to feed our existing safetensors files and continue the process from there?

Anonymous

failed to launch

Furkan Gözükara

How did you do it? Can you show it as a screenshot? This was tested today and is working. After ngrok you need to start Kohya.

Quentin Guittard

Hi, I got a memory error message when running my training command: "Your notebook tried to allocate more memory than is available. It has restarted." My training dataset is 10 1024x1024 images, and I followed the tutorial to free the RAM, but it reaches the 13 GB RAM limit. How can I continue to optimize my training: review the parameters, reduce the number of images, or both? Without reducing the result quality too much.

Quentin Guittard

!accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" --train_data_dir="/kaggle/working/results/img" --reg_data_dir="/kaggle/working/results/reg" --resolution="1024,1024" --output_dir="/kaggle/working/results/model" --logging_dir="/kaggle/working/results/log" --network_alpha="1" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0004 --unet_lr=0.0004 --network_dim=32 --output_name="kaggle_test_1" --lr_scheduler_num_cycles="8" --no_half_vae --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="4000" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --gradient_checkpointing --xformers --bucket_no_upscale --noise_offset=0.0 --lowram

Quentin Guittard

Hmm, I see where the problem is. I need to apply the updates from the 4th of September. Thanks! I will let you know if it works :)

Mikael Svenson

Getting 4.32s/it when using dimension=64. GPU seems to be at 12.4/14GB per GPU. Let's see what the results are once complete - if my training set was good :)

Mikael Svenson

"#$"#$ I managed to turn off the book before downloading all the checkpoints and did not have file persistence.. got two of them at least. Oh well. I'll restart another one later.

Furkan Gözükara

Sorry, I mistakenly deleted your comment :/ As you said, higher DIM will run out of RAM, so above 64 causes a RAM error. Thanks for letting us know.

Anonymous

Sorry, I'm new to your tier. What do you mean in this entry when you say using "ngrok"? Is the Kaggle option still working?

Furkan Gözükara

ngrok has been added to the notebook. We now use ngrok tunneling to reach the locally running Automatic1111, since the public Gradio share is banned by Google Colab and Kaggle.

Anonymous

Can we use custom models already, or is the RAM issue in Kohya still happening?

Anonymous

Got it, thanks! I am using a 3080 (10 GB VRAM) and with these settings it will take ~3 days to complete. Not sure if this is normal or not with my card, but it feels excessive.

Anonymous

Is it possible to use this LoRA with ComfyUI? When running it in ComfyUI (using the LoRA created via Kaggle) it just results in a black image.

Furkan Gözükara

Yes, you can use it. You must have a config error. I used them on Auto1111. You can try with Auto1111 and see if the error is from ComfyUI or from the generated LoRA file.

Anonymous

Yeah, I can't load the Auto1111 SDXL model due to my RTX 4060 laptop card. I will try running the training again!

Juan Chen

A question: I see you use many regularization images, and this is optional. But if I want to train a cartoon animal, where do I find regularization images for this kind of subject? Please let me know. Thanks.

Anonymous

8, anyways I just got it to work and generate images in comfyUI, lovely!

Anonymous

The new code is not working; it just makes a private URL (not a public one), and when I try to open the local one it says "can't connect to the server". I even tried changing !bash gui.sh --headless to !bash gui.sh --share --headless

Furkan Gözükara

If you use share it won't work; Kaggle banned it. Watch this to learn how to use ngrok : https://huggingface.co/MonsterMMORPG/SECourses/resolve/main/2023-09-23%2014-35-29.mkv

Anonymous

Hi, thanks for the support. I tried to follow this guide: https://www.youtube.com/watch?v=JF2P7BIUpIU&ab_channel=SECourses but I don't understand when the GUI has started, because when I try to navigate to http://127.0.0.1:7860/ I receive this error: dial tcp 127.0.0.1:7860: connect: connection refused. Can you help me please?

Anonymous

Hi, thanks for doing this tutorial! I am unfortunately unable to get this to run.
1. I am unable to get past this point without pressing the "play" button again (It does ^c which gets past the message): N: This must be accepted explicitly before updates for this repository can be applied. See apt-secure(8) manpage for details. Do you want to accept these changes and continue updating from this repository? [y/N]
2. The Kohya GUI looks different than yours and is missing a few Parameters (v.22.0.1): a. Lora Type dropdown, b. Text Encoder Rate, c. Unet learning rate, d. Network Rank
3. It fails when caching latents: RuntimeError: NaN detected in latents: /kaggle/working/results/img/25_ohwx woman/00100lrPORTRAIT_00100_BURST20200125175208649_COVER.jpg
Here's my output: accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train.py" --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" --train_data_dir="/kaggle/working/results/img" --reg_data_dir="/kaggle/working/results/reg" --resolution="1024,1024" --output_dir="/kaggle/working/results/model" --logging_dir="/kaggle/working/results/log" --save_model_as=safetensors --output_name="test_lora_1" --lr_scheduler_num_cycles="8" --max_data_loader_n_workers="0" --learning_rate="0.0004" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="4400" --save_every_n_epochs="1" --mixed_precision="fp16" --save_precision="fp16" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_data_loader_n_workers="0" --bucket_reso_steps=64 --gradient_checkpointing --full_fp16 --xformers --bucket_no_upscale --noise_offset=0.0 --lowram

Furkan Gözükara

You have 2 issues. 1: Your command is wrong; you are using the DreamBooth tab, not LoRA. 2: Your images are corrupted. Can you do this? Install paint.net (it is a free open source tool), open every one of your images, save them as PNG with paint.net, give them shorter names, and try again please.

Anonymous

I watched 3 of your videos about this (including the short update video for the latest update on the 9th of Oct). I'm following you step by step but the very first command gives me the same error I don't know how to fix it. The error says: N: This must be accepted explicitly before updates for this repository can be applied. See apt-secure(8) manpage for details. Do you want to accept these changes and continue updating from this repository? [y/N] How can I say "yes" to this? It adds ^C in the end and then continues if I hit the "play" button for that cell. Please help! I already spent hours on this.

DAVID PEREZ

Hello, when I try to download the v4 file it gives this error in a new web browser tab: {"errors":[{"code":902,"code_name":"AttachmentNotFound","detail":"Attachment with id 16135838 was not found.","id":"abb94cdb-b94c-5571-80d5-e685c3b63769","status":"404","title":"Attachment was not found."}]} Could you fix it please?

Juan Chen

Has v6 come out? I just came across this:
03:36:11-458929 INFO nVidia toolkit detected
03:36:13-934002 INFO Torch 2.0.1+cu118
03:36:14-033892 INFO Torch backend: nVidia CUDA 11.8 cuDNN 8700
03:36:14-055546 INFO Torch detected GPU: Tesla T4 VRAM 15110 Arch (7, 5) Cores 40
03:36:14-056912 INFO Verifying modules instalation status from /kaggle/working/kohya_ss/requirements_linux.txt...
03:36:14-060349 INFO Verifying modules instalation status from requirements.txt...
Traceback (most recent call last)
/kaggle/working/kohya_ss/kohya_gui.py:13 in
  12 from library.custom_logging import setup_logging
❱ 13 from library.localization_ext import add_javascript
  14
See the Kaggle notebook; it appears to not be working again. It worked well yesterday.

Anonymous

Hi, as I'm running the training I got the following error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.76 GiB total capacity; 13.17 GiB already allocated; 7.75 MiB free; 13.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF I'm using more training images (16) so I have 6400 training steps. But this failed very early on

Anonymous

seems like the issue was that somehow --gradient_checkpointing was dropped from my list of params. I'm trying again with it

DAVID PEREZ

Thank you for this huge update, Furkan! Please, could you make an updated GitHub text guide or a short updated video tutorial for the best settings to train SD 1.5 models?

Anonymous

Hey, can I install TensorRT on RunPod in Automatic1111?

Anonymous

I am having this same problem. I can't view or open the link you provided above on huggingface. I also downloaded the ngrok link and entered my ngrok token into the script. I still don't get a gradio link or ngrok link.

Furkan Gözükara

please use kohya-sdxl-lora-training-on-a-free-kaggle-notebook_v7.ipynb and watch this : https://youtu.be/_xVq23d2pgE

Anonymous

Ok. Thanks for the response. {edit} It seems to be working now. :)

DAVID PEREZ

Could you make a quick video tutorial for SDXL dreambooth training using Kaggle, or GitHub tutorial or a notebook update, please Furkan

Furkan Gözükara

We have a how-to-use tutorial here and we have a Kaggle notebook too : https://youtu.be/EEV8RPohsbw - https://youtu.be/_xVq23d2pgE - If you need any other help let me know. By the way, you don't have to kill the Kaggle Kohya GUI anymore; you can run it from the GUI - https://www.patreon.com/posts/kohya-sdxl-lora-88397937 - Hopefully I will make a new video after the Kohya GUI is updated into master.

Juan Chen

ValueError: Pipeline expected {'scheduler', 'vae', 'tokenizer_2', 'text_encoder_2', 'tokenizer', 'unet', 'text_encoder'}, but only {'scheduler', 'vae', 'tokenizer_2', 'text_encoder_2', 'tokenizer', 'unet'} were passed. Traceback (most recent call last):

Juan Chen

possible to show me settings for checkpoint training on kaggle this weekend?

Anonymous

help! ( error training File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ===================================================== ./sdxl_train.py FAILED ----------------------------------------------------- Failures: ----------------------------------------------------- Root Cause (first observed failure): [0]: time : 2023-11-27_19:08:29 host : c7aaca832a42 rank : 1 (local_rank: 1) exitcode : -6 (pid: 1053) error_file: traceback : Signal 6 (SIGABRT) received by PID 1053 =====================================================

Juan Chen

i tried to install version 1.16.5, more errors come out, please double check

Furkan Gözükara

Well, you are not giving any information that lets us understand your problem. Please join Discord and message there in the channel.

Juan Chen

Then I was told scipy and numpy are incompatible.

Anonymous

Hi, I am having issues running the flask/ngrok cell where it says ## first put your ngrok token to the below and then run this code ## it will give a link like this at below : https://2fc5-34-134-226-xxx.ngrok-free.app ## open it and then run web ui and once web ui started that link will start working. I have put my auth token in there and it runs without error, but when I visit the site from 'https://2fc5-34-134-226-xxx.ngrok-free.app' I just get a message saying 'Hello from Colab!'. From the video tutorial it seems I should be getting some web-based UI with DreamBooth on it... what am I missing?

Furkan Gözükara

The order is like this: put in the token, start ngrok, get the link, don't click visit site, then start the Kohya GUI. Once it has started, click visit site.

Alex

Hi! I have successfully trained Dreambooth, but I didn't like the results. Now I'm trying to train LORA, they make the settings right, but when I turn on the start of training, I get this error. epoch 1/16 Traceback (most recent call last): File "/kaggle/working/kohya_ss/./sdxl_train_network.py", line 185, in trainer.train(args) File "/kaggle/working/kohya_ss/train_network.py", line 825, in train accelerator.backward(loss) File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1983, in backward self.scaler.scale(loss).backward(**kwargs) File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward torch.autograd.backward( File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED Traceback (most recent call last): File "/kaggle/working/kohya_ss/./sdxl_train_network.py", line 185, in trainer.train(args) File "/kaggle/working/kohya_ss/train_network.py", line 825, in train accelerator.backward(loss) File "/opt/conda/lib/python3.10/site-packages/accelerate/accelerator.py", line 1983, in backward self.scaler.scale(loss).backward(**kwargs) File "/opt/conda/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward torch.autograd.backward( File "/opt/conda/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED steps: 0%| | 0/9600 [00:06 sys.exit(main()) File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main args.func(args) File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command multi_gpu_launcher(args) File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher distrib_run.run(args) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run elastic_launch( File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ ./sdxl_train_network.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2023-12-06_08:36:19 host : 5fdccbe9f1b0 rank : 1 (local_rank: 1) exitcode : 1 (pid: 1136) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2023-12-06_08:36:19 host : 5fdccbe9f1b0 rank : 0 (local_rank: 0) exitcode : 1 (pid: 1135) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================

Furkan Gözükara

I would say start fresh and do the LoRA; it should work. It looks like for some reason it failed to use the CUDA device. Make sure that the T4 GPUs are selected.

Anonymous

File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main args.func(args) File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command multi_gpu_launcher(args) File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher distrib_run.run(args) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run elastic_launch( File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ===================================================== ./sdxl_train.py FAILED ----------------------------------------------------- Failures: ----------------------------------------------------- Root Cause (first observed failure): [0]: time : 2023-12-10_10:31:23 host : 3e6bcd590114 rank : 1 (local_rank: 1) exitcode : -6 (pid: 1051) error_file: traceback : Signal 6 (SIGABRT) received by PID 1051 ===================================================== looks like I have some issue with John Ceed

Furkan Gözükara

This doesn't show the error reason. If caching takes more than 30 minutes, click start training again after loading the config.

Anonymous

Hey I am getting this error since yesterday. Before everything worked fine. I have used the same model, same setting etc. Just different training dataset now. I have the 2x T4 selected and attempted a Lora training. Also whats weird is that the 2x T4 GPUs don't show any sign of life in the resource monitor tab. VRAM is at 0 bytes all the time and for usage its the same... loading image sizes. 2%|█ | 1/40 [00:00<00:01, 22.09it/s] Traceback (most recent call last): File "/kaggle/working/kohya_ss/./sdxl_train_network.py", line 185, in trainer.train(args) File "/kaggle/working/kohya_ss/train_network.py", line 192, in train train_dataset_group = config_util.generate_dataset_group_by_blueprint(blueprint.dataset_group) File "/kaggle/working/kohya_ss/library/config_util.py", line 495, in generate_dataset_group_by_blueprint dataset.make_buckets() File "/kaggle/working/kohya_ss/library/train_util.py", line 763, in make_buckets info.image_size = self.get_image_size(info.absolute_path) File "/kaggle/working/kohya_ss/library/train_util.py", line 996, in get_image_size image = Image.open(image_path) File "/opt/conda/lib/python3.10/site-packages/PIL/Image.py", line 3298, in open raise UnidentifiedImageError(msg) PIL.UnidentifiedImageError: cannot identify image file '/kaggle/working/Lora/img/20_teveo leggings leggings/Teveo_Leggings_10).png' Traceback (most recent call last): File "/opt/conda/bin/accelerate", line 8, in sys.exit(main()) File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main args.func(args) File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command multi_gpu_launcher(args) File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher distrib_run.run(args) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run elastic_launch( File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ ./sdxl_train_network.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2023-12-16_09:37:01 host : 66e4585c1664 rank : 1 (local_rank: 1) exitcode : 1 (pid: 1040) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2023-12-16_09:37:01 host : 66e4585c1664 rank : 0 (local_rank: 0) exitcode : 1 (pid: 1039) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================

Furkan Gözükara

Hello. What are the resolutions of the images? Are you using bucketing? This looks more like the images are corrupted. Change all the images into a different format such as JPEG and try again.
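
A rough sketch of doing that conversion in a notebook cell, assuming Pillow is available; the folder path is only an example taken from the error above, and the original files should be removed afterwards so that only the converted JPEGs remain:

import os
from PIL import Image

src = "/kaggle/working/Lora/img/20_teveo leggings leggings"  # example path, use your own dataset folder
for name in os.listdir(src):
    path = os.path.join(src, name)
    try:
        # re-save as JPEG; any file that cannot be opened is one of the corrupted images
        Image.open(path).convert("RGB").save(os.path.splitext(path)[0] + ".jpg", "JPEG", quality=95)
    except Exception as e:
        print("could not read:", name, e)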

Anonymous

Yes, I am using bucketing. The resolution is all over the place, ranging from about 800 to 1400 pixels. Yes, I will try fixing the images, hopefully that works. Thanks!

Anonymous

I converted all the images to JPG again and now it's working normally!! :D

Ec Jep

Quick question: in this Kaggle JSON file you use fp16 and xformers, whereas in other SOTA JSON setups you use bf16 and no xformers? I assume that is due to Kaggle GPUs and available RAM? With my 4090 24 GB GPU, I am still using bf16 and no xformers for excellent training, so I just wanted to check with you. Also, do you prefer questions here or on Discord?

Ec Jep

you answered my question in the later part of the video.

Anonymous

I get this with V12 right in the first step:
Err:1 http://packages.cloud.google.com/apt gcsfuse-focal InRelease Temporary failure resolving 'packages.cloud.google.com'
Err:2 http://archive.ubuntu.com/ubuntu focal InRelease Temporary failure resolving 'archive.ubuntu.com'
Err:3 http://security.ubuntu.com/ubuntu focal-security InRelease Temporary failure resolving 'security.ubuntu.com'
Err:4 https://packages.cloud.google.com/apt cloud-sdk InRelease Temporary failure resolving 'packages.cloud.google.com'
Err:5 http://archive.ubuntu.com/ubuntu focal-updates InRelease Temporary failure resolving 'archive.ubuntu.com'
Err:6 http://archive.ubuntu.com/ubuntu focal-backports InRelease Temporary failure resolving 'archive.ubuntu.com'

Anonymous

I was able to successfully complete a training using RealVisXL_V3.0. Thanks for the excellent instructions! I tried again using a model I uploaded to Hugging Face myself, and the training failed because it couldn't find the model. I think it's because my folder is missing a model_index.json file, but I'm not sure. How can we do this with models we have downloaded elsewhere but aren't already on Hugging Face the way the RealVisXL models are?

Furkan Gözükara

You can get them into the Kaggle working directory with the wget command, then give their Kaggle path. Do that before starting the GUI.
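
As a hedged example of what that could look like in a notebook cell (the URL and filename below are placeholders for whichever checkpoint you actually want to use):

# placeholder URL: point this at the real .safetensors download link of your model
!wget -O /kaggle/working/custom_base_model.safetensors "https://example.com/path/to/your_model.safetensors"
# then enter /kaggle/working/custom_base_model.safetensors as the pretrained model path in the Kohya GUI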

Anonymous

I was able to wget the model into the Kaggle working directory and tell the GUI where to find it, but I get an out of memory error before any checkpoints are saved. Training with the SDXL base model still works fine, so maybe the custom one is just too big. Thanks for the extra help!

Anonymous

I am trying to run the code for the token, but it keeps giving me this error:
ModuleNotFoundError Traceback (most recent call last)
Cell In[1], line 8
      5 import threading
      7 from flask import Flask
----> 8 from pyngrok import ngrok, conf
     10 conf.get_default().auth_token = "2aAM5cepa9mI4BDrXweLQbWfmjM_7oBKhHhjoAQGSn29zPMn8"
     12 os.environ["FLASK_ENV"] = "development"
ModuleNotFoundError: No module named 'pyngrok'
What should I do?

Furkan Gözükara

Hello. This happens when something went wrong in the installation. Turn off the session and start again from the beginning. Also, the order matters. The newest video is here : https://youtu.be/16-b1AjvyBE

Anonymous

Hey! I keep getting this error. Any suggestions? running training / 学習開始 num examples / サンプル数: 1300 num batches per epoch / 1epochのバッチ数: 1300 num epochs / epoch数: 2 batch size per device / バッチサイズ: 1 gradient accumulation steps / 勾配を合計するステップ数 = 1 total optimization steps / 学習ステップ数: 2600 steps: 0%| | 0/2600 [00:00 train(args) File "/kaggle/working/kohya_ss/./sdxl_train.py", line 512, in train encoder_hidden_states1, encoder_hidden_states2, pool2 = train_util.get_hidden_states_sdxl( File "/kaggle/working/kohya_ss/library/train_util.py", line 4100, in get_hidden_states_sdxl enc_out = text_encoder2(input_ids2, output_hidden_states=True, return_dict=True) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 1230, in forward text_outputs = self.text_model( File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 740, in forward encoder_outputs = self.encoder( File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 654, in forward layer_outputs = encoder_layer( File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 393, in forward hidden_states = self.mlp(hidden_states) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py", line 350, in forward hidden_states = self.fc2(hidden_states) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl return forward_call(*args, **kwargs) File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 114, in forward return F.linear(input, self.weight, self.bias) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 14.76 GiB total capacity; 12.34 GiB already allocated; 3.75 MiB free; 13.47 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. 
See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF steps: 0%| | 3/2600 [00:19<4:47:58, 6.65s/it, avr_loss=0.176] Traceback (most recent call last): File "/opt/conda/bin/accelerate", line 8, in sys.exit(main()) File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main args.func(args) File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command multi_gpu_launcher(args) File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher distrib_run.run(args) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run elastic_launch( File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ ./sdxl_train.py FAILED ------------------------------------------------------------ Failures: ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2023-12-31_12:57:20 host : a9b77c1db37b rank : 0 (local_rank: 0) exitcode : 1 (pid: 1046) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================

Furkan Gözükara

Hello. After this happens, please start the GUI again and start training again. The first time it caches the images it leaves some VRAM used; the second time you start there is no more caching, so it starts training.

Anonymous

Do you have any advice on how to train DreamBooth / LoRA with turbo models? It seems like it doesn't work well out of the box, giving low quality results.

Anonymous

I have tried twice to train lora and got the same error I am seeing others are having. AssertionError: full_bf16 requires mixed precision='bf16' / full_bf16を使う場合はmixed_precision='bf16'を指定してください。 Traceback (most recent call last): File "/kaggle/working/kohya_ss/./sdxl_train_network.py", line 189, in trainer.train(args) File "/kaggle/working/kohya_ss/train_network.py", line 234, in train model_version, text_encoder, vae, unet = self.load_target_model(args, weight_dtype, accelerator) File "/kaggle/working/kohya_ss/./sdxl_train_network.py", line 47, in load_target_model ) = sdxl_train_util.load_target_model(args, accelerator, sdxl_model_util.MODEL_VERSION_SDXL_BASE_V1_0, weight_dtype) File "/kaggle/working/kohya_ss/library/sdxl_train_util.py", line 21, in load_target_model model_dtype = match_mixed_precision(args, weight_dtype) # prepare fp16/bf16 File "/kaggle/working/kohya_ss/library/sdxl_train_util.py", line 167, in match_mixed_precision weight_dtype == torch.bfloat16 AssertionError: full_bf16 requires mixed precision='bf16' / full_bf16を使う場合はmixed_precision='bf16'を指定してください。 Traceback (most recent call last): File "/opt/conda/bin/accelerate", line 8, in sys.exit(main()) File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main args.func(args) File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command multi_gpu_launcher(args) File "/opt/conda/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher distrib_run.run(args) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run elastic_launch( File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ ./sdxl_train_network.py FAILED ------------------------------------------------------------ Failures: [1]: time : 2024-01-03_08:36:15 host : d32f44ffac1b rank : 1 (local_rank: 1) exitcode : 1 (pid: 1068) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2024-01-03_08:36:15 host : d32f44ffac1b rank : 0 (local_rank: 0) exitcode : 1 (pid: 1067) error_file: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html Is there a fix for this? I followed the tutorial exactly both times and watched the updated section with additional parameters. I deleted the notebook and started from scratch.

Anonymous

one strange trick I found is keeping the CFG high at low number of steps somehow helps, whereas for turbo you'd expect CFG 1-2 to be best

Anonyme pas trop anonyme

Hello Dr. Gozukara. This message is not about the post, but it is a question. I'm new to Patreon and I was wondering something while watching all the tutorials. Have you made, or will you make, a video on training with an AI-generated dataset? Can AI produce its own photorealistic models (of course we do the training)? Thanks for your answer and have a great day!

Furkan Gözükara

Hello. This is a good question. Actually, I am doing research similar to this for a company right now. I don't have data yet, so I can't answer. But I assume that if you can manually generate your dataset, you can fine-tune a model for that purpose, like generating a certain type of donut and training a model to generate it.

Rick B

I get as far as starting the ngrok session, but get an error. Can anyone explain this error:
* ngrok tunnel "https://a2a1-104-197-102-215.ngrok-free.app" -> "http://127.0.0.1:7860/"
* Serving Flask app '__main__'
* Debug mode: off
Address already in use
Port 5000 is in use by another program. Either identify and stop that program, or start the server with a different port

Furkan Gözükara

It is not an error. Open this link : https://a2a1-104-197-102-215.ngrok-free.app but don't click visit site yet. Start the Kohya GUI and then click visit site and it will work.

Anonymous

Hello Dr. Gozukara, thank you for your detailed SDXL DreamBooth training tutorial! I trained one model yesterday. It took me around 5 hours and I ended up getting 6 .safetensors files (around 6 GB each) and uploaded them to my Hugging Face account. Now I have trouble running them. I do not have a powerful GPU, so I decided to use Kaggle to run Automatic1111. Again I followed your tutorial and Kaggle script. I used the wget command to pull my 6 models from the Hugging Face website, and after that they appeared in my /kaggle/working/models folder. The problem comes after I launch Automatic1111: I could not get my model from the checkpoint, even though they are all in there. I can get stable-diffusion-xl-base-1.0 running but not my fine-tuned ones. Please help me. Thank you!

Anonymous

I do not have a powerful local GPU, which is why I chose to run Automatic1111 and train models with Kaggle. On Kaggle I use GPU T4*2, which is fine for both model training and Automatic1111. The problem is I cannot launch the models from checkpoints in Automatic1111; it just does not respond. I also tried ComfyUI with Google Colab. It did not work either. Here is the error I got from ComfyUI: Error occurred when executing CheckpointLoaderSimple: Error while deserializing header: HeaderTooLarge. So I got an error from ComfyUI. For Automatic1111, no errors, but I could not get it to respond when trying to launch my models from the checkpoint.

Furkan Gözükara

There must be an error somewhere. You can send me the model and I can try it locally to see whether the model is accurately trained or not.

Anonymous

Thank you, Dr.! Here is the link on my Hugging face. I make it public and you should be able to download it: https://huggingface.co/Terresa/SDXL_training/tree/main

Furkan Gözükara

I just tested it on my computer and it works. I tested step00002084. Can you record a video of how you are trying to use it on Kaggle?

Anonymous

Thank you, Dr. Gozukara! I screen recorded my whole process of launching Automatic1111 with Kaggle and edited a 3-minute video. I pointed out all the problems that happened during the process. Here is the video link on Hugging Face: https://huggingface.co/datasets/Terresa/video/tree/main. Please check it! Thank you!

Furkan Gözükara

Your links are incorrect; you can see it downloaded only 36 KB. Here is a correct link : https://huggingface.co/Terresa/SDXL_training/resolve/main/My_DB_Kaggle-step00002084.safetensors

Anonymous

Hi Dr. Gozukara, thank you for your instruction! It was my bad. What a silly mistake! Now it works. For Automatic1111 with Kaggle, I found checkpoint comparison via X/Y plot computationally extremely expensive. The system crashed several times. I will try it again. Do you have any thoughts on that?

Anonymous

I launched the checkpoints one by one, used them individually, and manually compared the results. Problem solved! Again, thank you for your help!

Art

Hi... yesterday I finally got my favourite model... it's a checkpoint, and I would like to extract a LoRA finetuned version from that one: can you suggest how to achieve that using a Kaggle notebook?

Art

thank you... but it seems I'm a little too dumb to find that resource... can you point me to that one, please?

Furkan Gözükara

hello here : https://cdn-uploads.huggingface.co/production/uploads/6345bd89fe134dfd7a0dba40/TT9D-nOtJqop9nozTGZOA.png

Art

Can I do this with the Kaggle/Kohya notebook I used to generate my checkpoint?

Art

I tried, but I was not able to get an output folder for the LoRA... I uploaded the checkpoints (mine and SDXL 1.0), but setting an output folder from the web UI returns a non-writable folder on the Kaggle notebook... maybe it's an easy fix, but I'm really a beginner... any help?

Art

Hi... I uploaded my checkpoint and the SDXL 1.0 in the root of the working folder and ran the GUI, then I made the settings and tried to output the LoRA to the same folder as the uploaded checkpoints, to the output folder, to the temp folder of the six checkpoints of the normal workflow, and to every other folder of the Kohya installation, but none of them were writable... Maybe I have to run all the cells of the normal workflow to get the LoRA in the temp folder?

Art

I'm noticing something strange... after 5 complete training sessions (and a couple of failures), I notice that out of the 6 safetensors files generated, the ones that most resemble the person are the ones in the middle of the training (the third or fourth generated) ...does this happen to you too? do you have an explanation for this?

Art

Hi, Doctor. Running the LoRA cell after the 6 safetensors are generated, will it extract the LoRAs from the checkpoints? Can you explain this step? And do you have the right parameters to do the same with SD 1.5 generated models?

Furkan Gözükara

No, you have to manually extract the LoRA. Use the Kohya SS GUI with these settings : https://cdn-uploads.huggingface.co/production/uploads/6345bd89fe134dfd7a0dba40/NtToBFK2uumY_YMHOZ7MW.png
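
If you prefer a notebook cell over the GUI, roughly the same extraction can be done with the script that ships with the Kohya sd-scripts; this is only a sketch, and the paths and dim below are examples, not the exact values from the screenshot:

# rough sketch, run from inside /kaggle/working/kohya_ss; all paths and the dim value are examples
!python networks/extract_lora_from_models.py \
  --model_org "/kaggle/working/v1-5-pruned.safetensors" \
  --model_tuned "/kaggle/working/my_trained_checkpoint.safetensors" \
  --save_to "/kaggle/working/extracted_lora.safetensors" \
  --dim 64 --save_precision fp16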

Art

I already tried with an SD 1.5 checkpoint (uploaded in the working folder of Kohya): it saves a 600 MB model (a file without extension), but it doesn't generate images of myself (I was the subject of yesterday's training: the safetensors checkpoint works well on Automatic1111); where am I going wrong? Do I also need to input the path to the safetensors base model in the field "Stable Diffusion Base Model"? Do the checkpoints of the SDXL training I'm doing right now have to be retrieved from the /kaggle/temp/models directory one by one?

Furkan Gözükara

You need to use the model you trained yourself on as the base model, whichever model you used for that training. By the way, let's say you trained yourself on model A; extracting a LoRA and then using that LoRA on model B may not work very well.

Art

Yes, now I'll explain in detail: yesterday I trained myself with the SD 1.5 workflow, then I tested and found the one that worked best; today I launched the Kohya Kaggle notebook and ran it, uploading into the kaggle/working directory my own checkpoint and v1-5-pruned.safetensors (from HF), then set up the LoRA extraction according to your image and extracted into the Kohya model directory (copying the path from the LoRA training script in the notebook); the result is a 600 MB file called "model" without an extension; I renamed it with the .safetensors extension and tested it in Automatic1111, but that LoRA generates nothing but random images (houses, persons, objects)... how do I resolve this?

Art

I trained with the workflow you updated on the last version of the kaggle/Kohya notebook... I assumed it was the base version of SD1.5... do I have to use another one? and which one? Can you give links of the base models used for training both SDXL and SD1.5, so that I can upload them together with checkpoints of myself? Thank you...

Art

Thank you Furkan! Can you also tell me what SDXL base model is used by Kohya in the Kaggle notebook of this thread?

Art

Hello everyone (and Furkan in particular); after having made a series of models with Kohya+Kaggle, I have some considerations:
- the notebook works very well, and the materials provided by Furkan are very valid (I'm thinking of the reg images and the JSON)
- among the 5 checkpoints generated, the best (by similarity to the subject) are always the third or fourth (but I would say more often the third)
- I tested checkpoints on Automatic1111 on both Kaggle and Colab Pro: the outputs on Kaggle are much superior; and I'm wondering if I can get better quality on RunPod, so I might consider switching from Colab to RunPod
- this notebook by Kohya (but also reading elsewhere) does not extract a LoRA from a checkpoint; it's absolutely impossible to do this, and it seems to have been a Kohya problem for some time now: can you tell me an effective method to correctly extract a LoRA from a checkpoint?
I greet everyone and await your feedback!

Furkan Gözükara

Use RunPod and you will have no issues with quality or with extracting a LoRA. We have auto install/update scripts for Automatic1111 on RunPod. LoRA extraction currently uses too much RAM or VRAM, therefore I just opened an issue on the GUI GitHub to reduce it : https://github.com/bmaltais/kohya_ss/issues/1933

Anonymous

Hello, I'm new to this field but already passionate. I have started testing various things by following your tutorials, which are brilliant!!!! I have a problem with this tutorial: everything works fine until this error message. Can you help me?
venv folder does not exist. Not activating...
16:58:24-113538 INFO Version: v22.6.0
16:58:24-120372 INFO nVidia toolkit detected
16:58:28-413512 INFO Torch 2.0.1+cu118
16:58:28-490005 INFO Torch backend: nVidia CUDA 11.8 cuDNN 8700
16:58:28-519266 INFO Torch detected GPU: Tesla T4 VRAM 15102 Arch (7, 5) Cores 40
16:58:28-521168 INFO Verifying modules installation status from /kaggle/working/kohya_ss/requirements_linux.txt...
16:58:28-525550 INFO Verifying modules installation status from requirements.txt...
Traceback (most recent call last):
File "/kaggle/working/kohya_ss/kohya_gui.py", line 1, in import gradio as gr
File "/opt/conda/lib/python3.10/site-packages/gradio/__init__.py", line 3, in import gradio.components as components
File "/opt/conda/lib/python3.10/site-packages/gradio/components/__init__.py", line 1, in from gradio.components.annotated_image import AnnotatedImage
File "/opt/conda/lib/python3.10/site-packages/gradio/components/annotated_image.py", line 8, in from gradio_client.documentation import document, set_documentation_group
ImportError: cannot import name 'set_documentation_group' from 'gradio_client.documentation' (/opt/conda/lib/python3.10/site-packages/gradio_client/documentation.py)

Anonymous

Great, I will try again ;) Thank you very much for your help ;)

Art

Hello, Furkan... I want to try another 1.5 training, but it's not clear which training images I have to use: is it better to train with 1024x1024 or 512x512 images? Is the saving steps calculation the same as in the SDXL workflow?

Art

Thank you! So: 1024x1024 training images, 768x768 reg images, steps calculation + 1...

Anonymous

This error comes out:
07:48:02-762329 INFO Version: v23.0.1
07:48:02-769461 INFO nVidia toolkit detected
07:48:04-589434 INFO Torch 2.1.2+cu118
07:48:04-630076 INFO Torch backend: nVidia CUDA 11.8 cuDNN 8700
07:48:04-654485 INFO Torch detected GPU: Tesla T4 VRAM 15102 Arch (7, 5) Cores 40
07:48:04-714962 INFO Submodule initialized and updated.
07:48:04-716439 INFO Verifying modules installation status from /kaggle/working/kohya_ss/requirements_linux.txt...
07:48:04-720915 INFO Installing package: torch==2.1.2+cu118 torchvision==0.16.2+cu118 xformers==0.0.23.post1+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
07:48:17-363068 INFO Verifying modules installation status from requirements.txt...
Traceback (most recent call last)
/kaggle/working/kohya_ss/kohya_gui.py:7 in
  6 from textual_inversion_gui import ti_tab
❱ 7 from library.utilities import utilities_tab
  8 from lora_gui import lora_tab
ModuleNotFoundError: No module named 'library.utilities'

BecauseReasons

Thanks. The Kaggle dreambooth link is broken though.

Art

Hi, Furkan... after some successful 1.5 models, I wanted to try SDXL on one of my best datasets; I noticed that with the new configuration JSONs, the step calculation is (2000 / 1 / 1 * 1 * 1) = 2000 (I'm using 20 training images) and not (2000 / 1 / 1 * 1 * 2) = 4000 like the last time I used it... Is that ok, or is there some mistake in the parameters?

Furkan Gözükara

Hi, that would mean the regularization images folder is missing. Have you set it? Hopefully I will make an updated tutorial; I am waiting for Kohya to finish the newest interface. By the way, on Kaggle you should use half the number of steps, because it uses 2 GPUs.
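
As a rough worked example of that halving rule (all numbers are illustrative, and the x2 factor only applies when a regularization images folder is set, matching the formula quoted in the comment above):

# illustrative step calculation, not a recommendation for any particular dataset
train_images = 20
repeats = 100
batch_size = 1
epochs = 1
reg_multiplier = 2   # x2 when a regularization images folder is set
num_gpus = 2         # Kaggle gives dual T4 GPUs

steps_single_gpu = train_images * repeats * reg_multiplier * epochs // batch_size  # 4000
steps_on_kaggle = steps_single_gpu // num_gpus                                     # 2000
print(steps_single_gpu, steps_on_kaggle)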