
Downloads

Content

I will update this page as I progress. I am just getting started, so I will keep adding information; nothing is finalized yet.

27 August 2024 Update

  • Download latest : Kohya_GUI_Flux_Installer_v17.zip

  • Batch size experiments section added, along with a config for 4x GPUs

  • Configs are inside the Best_Configs folder in the zip file above

  • 34 distinct test prompts added to the zip file

  • I hope to test time step shift further, plus CLIP-L Text Encoder training (which Kohya just announced)

Newest Configs :

  • Quality ranking: Rank 1 > Rank 2 > Rank 3, and so on

  • The quality difference from Rank 1 to Rank 6 is small, but starting from Rank 7 quality degrades, since training resolution and LoRA rank are reduced

  • So you can pick a faster config over a slower one if you are in a hurry

  • These are early-step speeds (roughly the first 50 steps; Kohya doesn't display the latest step speed), so as you train longer you will get at least ~25% faster speeds

  • These are raw VRAM usages, so check your current VRAM usage before starting training

  • The tests were made on Linux on Massed Compute with an RTX A6000 GPU - 1024x1024 resolution - 128 LoRA rank

  • Rank_1_28700MB_Slow.json - 16bit - 8.53 second / it

  • Rank_2_27360MB_Fast.json - 16bit - 4.49 second / it

  • Rank_3_18246MB_Slow.json - 8bit - 8.62 second / it

  • Rank_4_16960MB_Fast.json - 8bit - 4.61 second / it

  • Rank_5_11498MB_Slow.json - 8bit - Single Layers - 12.12 second / it

  • Rank_6_10222MB_Fast.json - 8bit - Single Layers - 9.42 second / it

  • Rank_7_9502MB.json - 8bit - Single Layers - 8.61 second / it - 896px

  • Rank_8_15406MB.json - 8bit - 3.71 second / it - 64 LoRA Rank - 896px

  • Rank_9_7514MB.json - 8bit - Single Layers - 5.8 second / it - 64 LoRA Rank - 512px

My suggestions for GPUs

  • 8 GB GPUs : Rank_9_7514MB.json

  • 10 GB GPUs : Rank_7_9502MB.json

  • 12 GB GPUs : Rank_5_11498MB_Slow.json

  • 16 GB GPUs : Rank_5_11498MB_Slow.json - if you need speed : Rank_8_15406MB.json

  • 24 GB GPUs : Rank_3_18246MB_Slow.json

  • 48 GB GPUs : Rank_1_28700MB_Slow.json
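
If you want to apply this table programmatically, here is a minimal Python sketch; the VRAM tiers and file names come straight from the list above, and the prefer_speed switch (the "if you need speed" option for 16 GB cards) is the only added logic:

def pick_config(gpu_vram_gb: float, prefer_speed: bool = False) -> str:
    # VRAM tiers and file names taken from the GPU suggestions above
    if gpu_vram_gb >= 48:
        return "Rank_1_28700MB_Slow.json"
    if gpu_vram_gb >= 24:
        return "Rank_3_18246MB_Slow.json"
    if gpu_vram_gb >= 16:
        # Rank_8 trades a little quality for a large speedup
        return "Rank_8_15406MB.json" if prefer_speed else "Rank_5_11498MB_Slow.json"
    if gpu_vram_gb >= 12:
        return "Rank_5_11498MB_Slow.json"
    if gpu_vram_gb >= 10:
        return "Rank_7_9502MB.json"
    return "Rank_9_7514MB.json"

print(pick_config(24))  # Rank_3_18246MB_Slow.json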

Batch Size Experiments And Multi GPU Usage

  • I used the Rank_2_27360MB_Fast.json config to test the impact of batch size on an RTX A6000

  • The speed gain from a larger batch size is almost none, so I don't suggest it, since you lose quality

  • Smaller batch size = better quality

  • Only batch size 2 gives you some gain, so you may use it if you wish

  • Batch size 1 : 4.54 second / it : effective speed same

  • Batch size 2 : 7.98 second / it : effective speed per step 3.99 second / it

  • Batch size 3 : 12.43 second / it : effective speed per step 4.14 second / it

  • Batch size 4 : 15.28 second / it : effective speed per step 3.82 second / it

  • Batch size 5 : 20.18 second / it : effective speed per step 4.03 second / it

  • Therefore I added a batch size 1 config for 4x A6000 GPUs: 4x_GPU_Batch_Size_1.json

  • With this config you get 5.75 second / it, and the effective speed per step is 1.4375 second / it

  • When you use multiple GPUs you need to divide the epoch count by the number of GPUs

  • So 200 epoch becomes 50 for 4x GPU

  • Also increase the LR with this formula: best LR x (batch size * number of GPUs / 2), so in this case 0.00005 * (1 * 4 / 2) = 0.0001 (see the sketch after this list)

  • So if you use batch size 2 it becomes 0.00005 * (2 * 4 / 2) = 0.0002

  • The zip file now has 4x_GPU_Batch_Size_1.json and 4x_GPU_Batch_Size_2.json

  • I suggest using 4x_GPU_Batch_Size_1.json on a 4x RTX A6000 GPU machine
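
A minimal Python sketch of the arithmetic above (effective per-image speed, the LR scaling rule, and the epoch split); the 5e-5 base LR and the timings are the values quoted above, the rest is plain arithmetic:

def effective_seconds_per_image(seconds_per_it: float, batch_size: int, num_gpus: int = 1) -> float:
    # Each iteration processes batch_size images on every GPU
    return seconds_per_it / (batch_size * num_gpus)

def scaled_lr(base_lr: float, batch_size: int, num_gpus: int) -> float:
    # Rule from above: best LR x (batch size * number of GPUs / 2)
    return base_lr * (batch_size * num_gpus / 2)

def epochs_per_gpu(total_epochs: int, num_gpus: int) -> int:
    # With multi-GPU training, divide the epoch count by the number of GPUs
    return total_epochs // num_gpus

print(effective_seconds_per_image(5.75, 1, 4))  # 1.4375 - 4x_GPU_Batch_Size_1
print(scaled_lr(5e-5, 1, 4))                    # 0.0001
print(scaled_lr(5e-5, 2, 4))                    # 0.0002
print(epochs_per_gpu(200, 4))                   # 50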

Inference Config for Evaluating Results of Experiments

  • I use the FP16 FLUX dev model at 16-bit precision with the UniPC sampler, 30 steps, and the SwarmUI grid system

You Don't Necessarily Need a Class Prompt With FLUX

Different LoRA Ranks Experiments

Higher Resolution Training Impact

QKV Split Attention and JoyCaption Detailed Captioning Trainings Results

  • I have comprehensively tested QKV split attention and detailed-caption dataset trainings

  • I find that split QKV brings no benefit and causes a huge speed loss

  • For person training, detailed captions reduced resemblance and didn't improve flexibility or quality. However, the resemblance loss was far lower than with SDXL

  • So my conclusion is: do not use split QKV or detailed captions when training a person

  • For style training, detailed JoyCaption captions will probably work great - I should test that too

  • Full grid results are below

  • QKV Split and JoyCaption Result 1024px.jpg , QKV Split and JoyCaption Result 1920px.jpg

Time Step Sampling Shift Experiments

  • This hugely improved the stylization capability of the model but reduced resemblance and realism

  • I plan to hopefully do more research on this - 8 more trainings

  • Wait for them. Full grids below

  • Time_Step_Sampling_Shift_Grid_Full_Grid.jpg

Apply T5 Attention Mask and Detailed Regularization / Classification Images Impact Experiments

More experiments are shared chronologically below - read the entire thread

-

24 August 2024 Update V2

  • Windows_Download_Training_Model_Files.bat added; it downloads the necessary training model files into the directory the bat file is run from

  • A 1-min video published to show how to set Accelerate for Kohya SS GUI : https://youtu.be/adVhm9aI9Gc

  • This setup fixes the problem of caching getting stuck

24 August 2024 Update

  • Massed Compute installer fixed and greatly simplified: just 1 step, and it auto-opens the browser

  • Massed Compute instructions updated and an automatic model downloader added - it downloads all necessary models (Dev FP16, T5 FP16, VAE and CLIP) automatically

  • Fix_For_FLUX_Step_2.bat improved - don't forget to run this for lower VRAM usage and 10GB Config

  • 10 GB config updated for latest GUI

  • Solution for KeyError: 'time_embed.0.weight' error

  • When you load the config, make sure the FLUX.1 checkbox is selected (see the screenshot at the top of the page)

23 August 2024 Update

  • Massed Compute and RunPod Installers fully added (read Massed_Compute_Kohya_FLUX_Instructions and RunPod_Install_Instructions)

  • 16_GB_Config_May_Be_Lower_Quality_In_Test added - still being tested - the changes are rank 32 and 896-pixel resolution training, but it is super fast compared to the lowest-VRAM config

  • Download newest Kohya_GUI_Flux_Installer_v13.zip and extract into any folder

  • 48_GB_GPUs_v2 (27 GB VRAM) and 24_GB_GPUs_v2 (17 GB VRAM) yield almost the same quality

  • 10_12_16GB_GPUs_v2.json (yields almost the same quality as the best configs but is 3-5 times slower due to VRAM optimizations) - uses 10183 MB peak, so if it spills into shared VRAM and becomes too slow, reduce LoRA rank or resolution to e.g. 896x896 - I am testing LoRA rank impact at the moment

  • 16_GB_Config_Slightly_Lower_Than_10GB_But_5_Times_Faster.json yields lower quality than 10_12_16GB_GPUs_v2.json but is 3-5 times faster

  • The newest test grids are below

  • Reg_T5_Attention_Mask_CivitAI_50_Epoch.jpg, Reg_T5_Attention_Mask_CivitAI_100_Epoch.jpg, Reg_T5_Attention_Mask_CivitAI_150_Epoch.jpg, T5_Attention_Mask_v1.jpg, Civit_Reg_15GB_150_Epoch_Compare.jpg

  • Tested-configs archive updated: Tested_All_Configs_V6.zip

  • So what are the newest changes? I have enabled the T5 Attention Mask, which I believe slightly improves results

  • T5 Attention Mask Tradeoffs 

    • However, the T5 Attention Mask increases VRAM usage by about 1 GB, which is why I didn't enable it for the 16 GB config. With attention masking, the 48 GB config now uses 28 GB and the 24 GB config uses 18 GB

    • Moreover, T5 Attention Mask slows down the training significantly so that is another tradeoff

    • What it does is explained here : https://poe.com/s/EHIviAdVuZds5XBGDYWP

  • What's next? I am going to test the quality impact of different LoRA network ranks: 4, 8, 16, 32, 64, 128

  • 16 GB Config yielded slightly worse results than 10 GB config probably due to reduced resolution and LoRA network rank

  • Training with CivitAI default settings performed worse than our very best config with exactly the same dataset and captions

  • Regularization / classification images tests failed: they still yield non-resembling images, so I don't see their benefit at this point. I tested different prior loss weight parameters, but none yielded better results

  • Windows training is still significantly slower than Linux training. I am still researching a fix for this issue and have asked on the PyTorch and Diffusers GitHubs

  • Full Kohya SS GUI setup screenshots added to the installer zip file

22 August 2024 Update

  • How to setup Kohya interface for FLUX full screenshot : Example Full Setup.jpg

  • FLUX LoRA training almost perfected

  • Use the LoRA tab to load the configs, not the DreamBooth tab

  • I also made a 1-click Kohya SS GUI installer for FLUX

  • Run Windows_Install_Step_1.bat - select option 1, then exit with option 7 once completed - no other options are needed

  • Run Fix_For_FLUX_Step_2.bat and then use Windows_Start_Kohya_SS.bat to start

  • Step 2 improves training speed by roughly 100% and VRAM usage by roughly 25%, just by upgrading to the proper libraries

  • The same upgrade was added for Massed Compute - read the Massed_Compute_Kohya_FLUX_Instructions file

  • You are now ready to run FLUX training in the most performant way

  • From the newest configs, I feel like Best_v1_5e_5_max_grad_norm_0 is the best

  • But Best_v1_5e_5 is also very good; there is only a slight lighting difference

  • When you extract the zip file above you will see the following configs:

  • 48_GB_GPUs.json for 48 GB GPUs

  • 24_GB_GPUs.json for 24 GB GPUs - uses around 17-18 gb with newest libraries

  • 16GB_Compare_Speed_With_10_12_16_Config.json - If you have 16 GB try this and compare speed with 10_12_16GB_GPUs.json

  • 10_12_16GB_GPUs.json - for 10, 12, and 16 GB GPUs. If you have 10 GB and it is very slow, you can reduce the LoRA rank and also the resolution to 512x512

  • 512px training yields inferior quality with almost the same VRAM usage - but 2.5x the speed

  • New tested configs full quality grids : 50 epoch (750 steps) , 100 epoch (1500 steps) , 150 epoch (2250 steps)

  • For 24 GB Config RTX 3090 Speed should be close to 4 second / it - 1024x1024px

  • For the 10_12_16 GB config, speed drops 3 to 5 times due to the full optimizations

  • Training dataset: 15 images, 1 repeat, trained for around 200 epochs (see the step-count sketch below)
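
As a sanity check, the total step count follows the formula Kohya prints in its log (e.g. "max_train_steps (540 / 1 / 1 * 200 * 1) = 108000" in the comments below); a minimal Python sketch:

def max_train_steps(images: int, repeats: int, epochs: int, batch_size: int = 1, grad_accum: int = 1) -> int:
    # Repeats come from the dataset folder name prefix, e.g. "1_ohwx man" = 1 repeat.
    # Kohya logs this as (images*repeats / batch_size / grad_accum * epochs).
    return images * repeats // batch_size // grad_accum * epochs

print(max_train_steps(15, 1, 50))   # 750  - matches the 50-epoch grid above
print(max_train_steps(15, 1, 200))  # 3000 - the suggested ~200-epoch run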

22 August 2024 Update

  • Section "FLUX Huge Sampler + Scheduler Test For a Very Hard Prompt" added to the post

  • Section "FLUX Guidance Scale Grid Test on LoRA" added to the post

  • Newest tests completed and new best configs uploaded:

  • Surprise: we now have a 10 GB config (LoRA rank 128) - it will hopefully be published in 8-9 hours

  • If you have a 10 GB GPU, try reducing the LoRA rank until it works better, and also minimize your VRAM usage before starting the training

  • So the current best configs are as follows

  • For 48 GB GPU : 6e_05_best_raw_sigmoid.json

  • For 24 GB GPU : 6e_05_best_raw_sigmoid_24GB.json,

  • For 10, 12, 16 GB GPUs : lowest_vram.json

  • For the lowest-VRAM config to work, you have to activate the Kohya GUI venv and execute the commands below one by one

  • This will speed up training by almost 100% and greatly reduce VRAM usage, but the results are unknown yet (training at the moment). You can also do this if you have a 24 GB GPU like a 4090

  • First update Kohya to the latest version and run it once. Then activate the venv and install the packages below. Finally, edit the gui.bat file and comment out or remove `python.exe .\setup\validate_requirements.py` and `if %errorlevel% neq 0 exit /b %errorlevel%` to prevent it from auto-reinstalling the requirements

  • pip install torch==2.4.0+cu124 --index-url https://download.pytorch.org/whl/cu124

  • pip install torchvision==0.19.0+cu124 --index-url https://download.pytorch.org/whl/cu124
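
After installing, you can verify inside the same venv that the upgrade took effect; `torch.__version__` and `torch.version.cuda` are standard PyTorch attributes, so this check is safe:

python -c "import torch; print(torch.__version__, torch.version.cuda)"

It should print 2.4.0+cu124 and 12.4; if gui.bat silently downgrades you back to torch 2.1.2 / cu118, re-check the gui.bat edit described above.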

21 August 2024 Update

  • So far 16 different tests completed

  • For 48GB GPUs use 6e_05_bf16_128_rank_full_bf16.json - train up to 200 epochs and compare checkpoints

  • For 24GB GPUs use 6e_05_fp8_bf16_accelerate_full_bf16_32_rank.json - train up to 200 epochs and compare

  • I will try to get better results and reduce number of necessary epochs

  • I would like to see whether reg images help, but I haven't tested that yet

20 August 2024 Update

  • So far tested configs added to the attachment and more results added to the post

  • More configs are training now

Windows Requirements

  • Python 3.10, FFmpeg, Cuda 11.8, C++ tools and Git

  • If it doesn't work, follow the tutorial below and install everything exactly as shown

  • https://youtu.be/-NjNy7afOQ0

How To Install

  • I am going to use the GUI version of Kohya

  • https://github.com/bmaltais/kohya_ss

  • Currently it doesn't have FLUX support on the main branch

  • So clone it into a new folder like c:/flux_train

  • Before installing, open a cmd window in that folder

  • Type : git checkout sd3-flux.1

  • Then install as usual
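
Putting these steps together, the full sequence looks like this (the clone URL is the repo linked above; setup.bat as the final "install as usual" step is an assumption based on Kohya's standard Windows installer):

git clone https://github.com/bmaltais/kohya_ss.git c:/flux_train
cd c:/flux_train
git checkout sd3-flux.1
setup.bat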

FLUX Training Discussions With Lots of Info

Model Links and Downloads

FLUX Fine Tuning Lower VRAM Optimizations

  • --optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False"

  • --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0

  • --blockwise_fused_optimizers --double_blocks_to_swap 6 --cpu_offload_checkpointing

  • Options are almost the same as LoRA training. The difference is `--blockwise_fused_optimizers`, `--double_blocks_to_swap` and `--cpu_offload_checkpointing`. `--single_blocks_to_swap` is also available.

  • `--blockwise_fused_optimizers` enables the fusing of the optimizer for each block. This is similar to `--fused_backward_pass`. Any optimizer can be used, but Adafactor is recommended for memory efficiency. `--fused_optimizer_groups` is deprecated due to the addition of this option for FLUX.1 training.

  • `--double_blocks_to_swap` and `--single_blocks_to_swap` are the number of double blocks and single blocks to swap. The default is None (no swap). These options must be combined with `--blockwise_fused_optimizers`.

  • `--cpu_offload_checkpointing` offloads gradient checkpointing to the CPU. This reduces VRAM usage by about 2 GB.

  • --network_args train_blocks=single - reduces VRAM

  • All these options are experimental and may change in the future.

  • Increasing the number of blocks to swap may reduce memory usage, but training speed will be slower. `--cpu_offload_checkpointing` also slows down training.

  • Swapping 6 double blocks and using CPU offload checkpointing may be a good starting point. Please try different settings according to your VRAM usage and training speed.
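
For illustration, a hypothetical command line combining the flags above (flux_train.py and the --config_file usage appear in the logs quoted in the comments below; the .toml path is a placeholder):

accelerate launch --mixed_precision bf16 sd-scripts/flux_train.py ^
  --config_file my_flux_finetune.toml ^
  --optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" ^
  --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 ^
  --blockwise_fused_optimizers --double_blocks_to_swap 6 --cpu_offload_checkpointing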

Previous Master Kohya Tutorials

Massed Compute

  • Massed compute A100 discount : A100_2ybG1e9StAdA

Auxiliary Scripts and Tutorials

FLUX Guidance Scale Grid Test on LoRA

  • As you know, FLUX has CFG fixed at 1

  • The FLUX dev model additionally has a FLUX guidance scale

  • You can download the full grid test from this link and check it out

  • I feel like FLUX Guidance 4.0 is best, but you can push it higher to get more accurate prompt following

FLUX Huge Sampler + Scheduler Test For a Very Hard Prompt

  • Prompt is : A cinematic shot featuring a photo-realistic image of an Ohwx man riding an enormous and terrifying Tyrannosaurus rex in the heart of Jurassic Park. The scene is set against a backdrop of dense, prehistoric jungle with towering trees and lush greenery, creating an intense and thrilling atmosphere. The T-rex is depicted in all its majestic and fearsome glory, with its massive jaws open wide, sharp teeth glistening, and powerful muscles rippling as it charges forward. The Ohwx man, in contrast, is shown with a determined and fearless expression as he expertly rides the gigantic predator, his posture exuding confidence and control. The image is captured in a highly cinematic style, with dramatic lighting and dynamic angles that emphasize the scale and intensity of the scene. The focus is partially directed towards the face of the Ohwx man, ensuring that his features are clearly visible and convey a sense of resolve and bravery, while maintaining the overall composition of the dynamic action scene.<segment:face,0.7>photo of ohwx man

  • So far, my findings on promising sampler + scheduler combinations are as follows

  • Full grid link : click here

    • euler + normal - default

    • euler + karras - cinematic look

    • heun + karras , heunpp2 + karras

    • dpm_2 + karras or exponential

    • dpm_fast + normal

    • lcm + normal or karras : cartoon look

    • unipc + normal or karras

  • I did bigger tests and my conclusion is that, sadly, the Karras scheduler, which makes results more realistic, is only usable with DPM_2, and even that is not great

  • Also, I think the best sampler is UniPC

  • All test results are below

  • LCM_Normal_vs_Karras.jpg , Eular_A_Normal_vs_UniPC_Normal.jpg , Eular_Normal_vs_Karras.jpg , UniPC_Normal_vs_Karras.jpg , DPM_2_Normal_vs_Karras.jpg

Experiments and VRAM Usages

  • A6000

  • Accelerate launch Mixed precision BF16 - highvram - 128 rank : 44.18 GB - 8.25 second / it

  • Accelerate launch Mixed precision BF16 - full BF16 - highvram 128 rank : 42.58 GB - 8.25 second / it

  • Accelerate launch Mixed precision BF16 - lowvram - 128 rank : 44.18 GB - 8.25 second / it

  • Accelerate launch Mixed precision BF16 - full BF16 - lowvram 128 rank : 42.58 GB - 8.25 second / it

  • Accelerate launch Mixed precision FP16 - highvram - 128 rank : 44.14 GB - 7.83 second / it

  • Accelerate launch Mixed precision FP16 - full FP16 - highvram - 128 rank : 42.55 GB - 7.83 second / it

  • Accelerate launch Mixed precision BF16 - highvram - 128 rank - Memory efficient attention : 44.18 GB - 8.25 second / it

  • Accelerate launch Mixed precision BF16 - highvram - 128 rank - Split Mode : error

  • Accelerate launch Mixed precision BF16 - highvram - 128 rank - Split Mode train blocks double : error

  • Accelerate launch Mixed precision BF16 - highvram - 128 rank - Split Mode train blocks single : error

  • Accelerate launch Mixed precision FP16 - fp8 base - highvram - 128 rank : 33.53 GB - 7.99 second / it

  • Accelerate launch Mixed precision FP8 - fp8 base - highvram - 128 rank : error

  • Accelerate launch Mixed precision FP8 - fp8 base - full FP16 - highvram - 128 rank : error

  • Accelerate launch Mixed precision FP16 - fp8 base - highvram - full FP16 - 128 rank : 24.39 GB - 7.96 second / it

  • Accelerate launch Mixed precision BF16 - fp8 base - highvram - 128 rank : 33.55 GB - 8.38 second / it

  • Accelerate launch Mixed precision BF16 - fp8 base - full BF16 - highvram - 128 rank : 24.4 GB - 8.35 second / it

  • Accelerate launch Mixed precision FP16 - fp8 base - highvram - full FP16 --network_args train_blocks=single - 128 rank : 18.39 GB - 18 second / it

  • Accelerate launch Mixed precision BF16 - fp8 base - highvram - full BF16 --network_args train_blocks=single - 128 rank : 18.39 GB - 21 second / it

Running Experiments

  • 1e_04_bf16_128_rank : completed - 48 gb GPU

  • 1e_04_fp16_128_rank : completed - 48 gb GPU

  • 1e_04_bf16_128_rank_full_bf16 : completed - 48 gb GPU

  • 1e_04_fp16_128_rank_full_fp16 : completed - 48 gb GPU

  • 1e_04_fp8_32_rank : completed - 23.58 GB

  • 1e_04_fp8_full_fp16_32_rank : completed - 23.28 GB

  • 1e_04_fp8_bf16_accelerate_32_rank : completed - 23.57 GB

  • 1e_04_fp8_bf16_accelerate_full_bf16_32_rank : completed - 23.28 GB

Early Testing Results

  • It looks like FP16 Accelerate training yields far inferior results compared to BF16 Accelerate training

  • These results are from very early experiments, only 50 epochs

 

More results obtained

  • So far the best-working cases are 1e_04_bf16_128_rank_full_bf16 and 1e_04_fp8_bf16_accelerate_full_bf16_32_rank

  • FP16 training is broken

  • Starting 8 more tests right now

  • The images below are raw outputs with no face inpainting

 

  • These are early results of my extensive FLUX LoRA training; 8-bit is far inferior to 16-bit at the moment

  • 8-bit perhaps looks undertrained, and 16-bit is overtrained with the current LR and 150 epochs

  • Gonna do more testing now

  • Raw 1024x1024 images, no face inpainting

Some more full comparisons are attached.

New Started Trainings

48 GB Configs - uses 42.6 GB VRAM

  • 9e_05_bf16_128_rank_full_bf16 - completed

  • 8e_05_bf16_128_rank_full_bf16 - completed

  • 7e_05_bf16_128_rank_full_bf16 - completed - best at 150 epochs

  • 6e_05_bf16_128_rank_full_bf16 - completed - very best at 200 epochs

24 GB Configs - uses 23.30 GB VRAM

  • 2e_04_fp8_bf16_accelerate_full_bf16_32_rank

  • 3e_04_fp8_bf16_accelerate_full_bf16_32_rank

  • 9e_05_fp8_bf16_accelerate_full_bf16_32_rank

  • 8e_05_fp8_bf16_accelerate_full_bf16_32_rank

New Training Results for 48 GB Configs

24 GB Experiments

Started 7 New Trainings

Files

Comments

Yannis

Hey :) Can you help us install the latest version of ForgeUI for FLUX on RunPod? :) There are no updated templates for now...

Josh Baker

I'm not clear on how to install FLUX into Kohya. Do I install Kohya first, and then how do I add FLUX into it as an option?

Furkan Gözükara

you need to git clone, then do git checkout sd3-flux.1, and then install. i am preparing a tutorial and will show it there

Vlad

Hey! Is it possible to train on 12GB VRAM ?

Sugar Coat VFX Design

Thanks~! Can you teach us how to train multiple concepts at the same time in the coming tutorial?

Furkan Gözükara

that can be another time. but can you give me an example of which concepts? so i may make another tutorial for that

Manpreet Singh

Nice. When is a video coming?

楠 陈

Which is the json of the best training parameters at present?

楠 陈

Is a regularization training set currently supported?

Sugar Coat VFX Design

Which route is more difficult? I think first attempt is same category. Like Same Brand, Shoe A, Shoe B, Shoe C.

hazwam

can you do a video tutorial, step by step, i like video more cuz it's easier

Diggy Dre

Can this be done with a 4080?

s h a r k e y

From what I've tested, it wouldn't work right now. I haven't seen my training take less than 18 GB VRAM. That will change within weeks if not days though. So probably yes, shortly.

s h a r k e y

how many repeats are you running for these tests, and how are you naming your dataset? I've had decent results naming ( 20_woman - 16 or so images - 1 epoch - then retraining the output lora about 4/5 times )

Cemil Hacimahmutoglu

Professor, is there a test you have done with a 12 GB graphics card?

Furkan Gözükara

It doesn't work on 12 GB yet. I am talking with Kohya; he says it should work, but it uses 18 GB and is extremely slow

楠 陈

Have you tried Dreambooth's comprehensive fine-tuning?

Furkan Gözükara

for a 4080, yes, if kohya fixes the bug. currently it sadly uses 18 gb minimum when it is supposed to be 12 gb. i am talking with him

Arcon Septim

Thank you for research and testing for us! I hope the video will be soon for Lora and full model fine tuning. For cloud services and PCs.

GomezBro

I downloaded the 24 GB json, loaded Kohya and loaded the json. I changed all directories and put the fluxdev tensor in documents and pointed at it. Also put images in photos folder. I hit train button and get error: \Desktop\Kohya\kohya_ss\venv\lib\site-packages\gradio\queueing.py", line 532, in process_events response = await route_utils.call_process_api( \Desktop\Kohya\kohya_ss\venv\lib\site-packages\gradio\route_utils.py", line 276, in call_process_api output = await app.get_blocks().process_api( \Desktop\Kohya\kohya_ss\venv\lib\site-packages\gradio\blocks.py", line 1928, in process_api result = await self.call_function( \Kohya\kohya_ss\venv\lib\site-packages\gradio\blocks.py", line 1514, in call_function prediction = await anyio.to_thread.run_sync( \Desktop\Kohya\kohya_ss\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 2177, in run_sync_in_worker_thread return await future \Desktop\Kohya\kohya_ss\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 859, in run result = context.run(func, *args) \Desktop\Kohya\kohya_ss\venv\lib\site-packages\gradio\utils.py", line 832, in wrapper response = f(*args, **kwargs) \Desktop\Kohya\kohya_ss\kohya_gui\lora_gui.py", line 1171, in train_model "network_module": network_module, \Desktop\Kohya\kohya_ss\kohya_gui\lora_gui.py", line 1171, in train_model "network_module": network_module, UnboundLocalError: local variable 'network_module' referenced before assignment

Vasiliy Bulanov

Constantly having this issue: "AssertionError: network for Text Encoder cannot be trained with caching Text Encoder outputs". If disabled caching: "AttributeError: 'T5EncoderModel' object has no attribute 'text_model'" can't beat this thing. any tips, guys?

Giuseppe Liguori

Hi, How many images did you use to train your face in these test? How many repetitions and epochs?

Furkan Gözükara

1 repeat and 200 epochs. save like every 20 epochs and compare. i have used 15 images. adding the training dataset screenshot to the post now, so refresh

Jeroen Van Harten

After I run the Fix (torch 2.4.0 + cu124 update) I get "Package wrong version" and it reinstalls torch 2.1.2 and cu118 when I run gui.bat. Any ideas how to bypass that?

Furkan Gözükara

yes, you need to edit the gui.bat file. remove the update-requirements part. the post has a screenshot of it, look from top to bottom

Manpreet Singh

Is 1e_04_fp8_bf16_accelerate_full_bf16_32_rank.json still the best for 24GB GPUs? Was 16_rank noticeably worse?

Manpreet Singh

A written step-by-step guide somewhere would be awesome, even without a video. Is this image of yours all I would need? https://www.patreon.com/file?h=110293257&i=20234120

Furkan Gözükara

yes this image all you need with configs :) https://www.patreon.com/file?h=110293257&i=20234120 set as training folder as 1_ohwx man - so we do repeat 1

Manpreet Singh

For 24GB GPUs, what difference do you see in generation quality and likeness between rank 16 and rank 32?

Jeroen Van Harten

Would you mind helping me out? I always used to do 20 repeats, but when I do that now with 27 images and the suggested 200 epochs I get to (540 / 1 / 1 * 200 * 1) = 108000 steps. Way more than I'm used to. Is this correct?

s h a r k e y

Sure, easy answer: change repeats to 1 (1_ohwx woman, for instance). That basically means the training runs through, in your case, the 27 images once, then does epoch 2 and so forth, and when it gets to epoch 10 it will save out a LoRA file and continue on till 20, then 30, etc. As a rule of thumb, I would recommend keeping your dataset smaller than 20 images (for now); this means less time per epoch but also forces you to only train on the very best of your training set.

Furkan Gözükara

you dont need to reduce rank anymore. just use Fix_For_FLUX_Step_2.bat and it works at as low as 17 gb with 128 rank. use the latest 24gb config. but i didnt see much difference

Furkan Gözükara

ok, with 27 images do this: 1 repeat and 200 epochs, save every 15 or 20 epochs, and compare them, because we do not use reg images. i am doing special research on reg images usage

Steve

(got this error trying to train I used the lowest Vram option) NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.

Steve

Tried to use other config "10_12_16GB_GPUs" got an error with the text encoder instead: File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\sd-scripts\flux_train_network.py", line 411, in trainer.train(args) File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\sd-scripts\train_network.py", line 330, in train self.assert_extra_args(args, train_dataset_group) File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\sd-scripts\flux_train_network.py", line 43, in assert_extra_args assert ( AssertionError: network for Text Encoder cannot be trained with caching Text Encoder outputs / Text Encoderの出力をキャッシュしながらText Encoderのネットワークを学習することはできません Traceback (most recent call last): File "C:\Users\SysOp\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "C:\Users\SysOp\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code exec(code, run_globals) File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main args.func(args) File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command simple_launcher(args) File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) subprocess.CalledProcessError: Command '['W:\\Kohya_GUI_Flux_Installer_v3\\kohya_ss\\venv\\Scripts\\python.exe', 'W:/Kohya_GUI_Flux_Installer_v3/kohya_ss/sd-scripts/flux_train_network.py', '--config_file', 'W:/NewLoraFactory/model/config_lora-20240823-043513.toml', '--network_args', 'train_blocks=single']' returned non-zero exit status 1.

CodePlug

Windows_Start_Kohya_SS.bat can't open the GUI... after installing all dependencies...

Vasiliy Bulanov

Yep, using LorA tab. Now having this issue with both configs "6e_05_best_raw_sigmoid.json" and latest "48_GB_GPUs.json": accelerator.unwrap_model(flux).move_to_device_except_swap_blocks(accelerator.device) # reduce peak memory usage File "/workspace/kohya_ss/sd-scripts/library/flux_models.py", line 973, in move_to_device_except_swap_blocks self.to(device) ... NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device. Using clear install with RunPod Pytorch 2.2.0 template. with A40 GPU

Vasiliy Bulanov

Hope your video on this topic is coming soon! :) It's a little bit too hard for me to tackle all these minor issues :)

Furkan Gözükara

probably you installed it inaccurately. enter inside kohya ss and try to run the gui.bat file and see if it works. do you have python 3.10? i will make a video hopefully very soon. almost completed the trainings

Furkan Gözükara

ye i will make a video. a40 should work actually, i tested on rtx 3090 and 4090 and both work perfectly on runpod. but i used the pytorch template: RunPod Pytorch 2.1 runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04 - you should use the same template as me

Vasiliy Bulanov

Hm. Works well with 2.1 pytorch template. but...... speed is terribly low, like 434 hrs for 40 img on 200epochs: steps: 0%| | 23/320000 [01:52<434:28:02, 4.89s/it, avr_loss=0.354]

Furkan Gözükara

nope, speed is normal. but you have more than 1 repeat. make repeats 1 and it will be done in 200 * 40 * 4.8 / 60 / 60 = 11 hours at most. to further speed up you can do 150 epochs and a lower resolution like 512, but it reduces quality

George Gostyshev

Nice research! Can I ask for a shorter / more compact version of it?

Franco Antonelli

i got the same config as in the example but geting this error once i start training 13:55:05-144169 INFO Start training Dreambooth... 13:55:05-146170 INFO Validating lr scheduler arguments... 13:55:05-149171 INFO Validating optimizer arguments... 13:55:05-151171 INFO Validating C:/Users/billi/AI/ComfyUI_windows_portable/ComfyUI/models/loras existence and writability... SUCCESS 13:55:05-152172 INFO Validating C:/Users/billi/AI/ComfyUI_windows_portable/ComfyUI/models/unet/flux1-dev.safetensors existence... SUCCESS 13:55:05-153172 INFO Validating C:/Users/billi/AI/flux training/training_imgs existence... SUCCESS 13:55:05-157173 INFO Error: 'ok_messi' does not contain an underscore, skipping... 13:55:05-159173 INFO Regulatization factor: 1 13:55:05-160173 INFO Total steps: 0 13:55:05-163174 INFO Train batch size: 1 13:55:05-164174 INFO Gradient accumulation steps: 1 13:55:05-165174 INFO Epoch: 200 13:55:05-166175 INFO max_train_steps (0 / 1 / 1 * 200 * 1) = 0 13:55:05-167175 INFO lr_warmup_steps = 0 13:55:05-170176 INFO Saving training config to C:/Users/billi/AI/ComfyUI_windows_portable/ComfyUI/models/loras\Best_v1_5e_5_max_grad_norm_0_20240823-135505.json... 13:55:05-172176 INFO Executing command: C:\Kohya_GUI_Flux_Installer_v3\kohya_ss\venv\Scripts\accelerate.EXE launch --dynamo_backend no --dynamo_mode default --mixed_precision bf16 --num_processes 1 --num_machines 1 --num_cpu_threads_per_process 2 C:/Kohya_GUI_Flux_Installer_v3/kohya_ss/sd-scripts/flux_train.py --config_file C:/Users/billi/AI/ComfyUI_windows_portable/ComfyUI/models/loras/config_dreambooth-20240823-135505.toml 2024-08-23 13:55:15 INFO Loading settings from train_util.py:4189 C:/Users/billi/AI/ComfyUI_windows_portable/ComfyUI/models/loras/config_dreambooth-20240823-135505.toml... INFO C:/Users/billi/AI/ComfyUI_windows_portable/ComfyUI/models/loras/config_dreambooth-20240823-135505 train_util.py:42082024-08-23 13:55:15 INFO Using DreamBooth method. flux_train.py:101 WARNING ignore directory without repeats / 繰り返し回数のないディレクトリを無視します: ok_messi config_util.py:589 INFO prepare images. train_util.py:1803 INFO 0 train images with repeating. train_util.py:1844 INFO 0 reg images. train_util.py:1847 WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1852 INFO [Dataset 0] config_util.py:570 batch_size: 1 resolution: (1024, 1024) enable_bucket: False network_multiplier: 1.0 INFO [Dataset 0] config_util.py:576 INFO loading image sizes. train_util.py:8760it [00:00, ?it/s] INFO prepare dataset train_util.py:884 ERROR No data found. Please verify the metadata file and train_data_dir option. / flux_train.py:155 画像がありません。メタデータおよびtrain_data_dirオプションを確認してください。 13:55:17-451793 INFO Training has ended.

Furkan Gözükara

you didnt set your training images folder path accurately: max_train_steps (0 / 1 / 1 * 200 * 1) = 0. watch this tutorial carefully to understand how kohya works until i make a video. or i give private consultations too: https://youtu.be/sBFGitIvD2A

Arcon Septim

Massed Compute is never available for training :( RunPod it is, it seems :)

Andrew Tomkins

What workflow are you using to generate your results?

Furkan Gözükara

SwarmUI grid explained here : https://youtu.be/HKX8_F1Er_w 47:13 Full guide for extremely powerful grid image generation (like X/Y/Z plot)

Max

For me it's exactly the same. All my images are 3 Dimension (RGB) .jpg 512*512 files. And the script checks if the folder exists and contains data. However, if i click on training, i get the exact same error.

ElecMat

I followed the 22 August 2024 Update guide, but you don't mention anything about the regularization images or the .txt files with trigger words that we need for SD 1.5 or SDXL trainings. Do we need them for FLUX trainings, or is just 1 folder with all the images enough for a good FLUX LoRA?

Arcon Septim

What about the dataset preparation tab? Where do I put "ohwx man" or "ohwx woman", how many repeats, no regularization folder, and which destination training directory?

Max

I just found out that even with FLUX you have to name the folders as in DreamBooth. But do you have any recommendations on that? Or even a guide on what all of that means? For me, I used 25, because I assumed it's then 25 steps per epoch, right? But what about the name and the classification? For the first LoRA I wanted to train it on myself, so I stuck to what you've used in the tutorial. But what if I want to train on a dog? Should I then name the folder "25_[nameofdog] dog"?

Furkan Gözükara

for flux 1 folder we dont use reg images. they dont improve results. make repeat 1. also ohwx man as prompt works great but i didnt test captioning with flux yet. hopefully will try

Furkan Gözükara

folder name is like 1_ohwx man dont do 25 make 1 and train up to 200 epoch. usually 150 is good. save checkpoints and compare them. dont use reg images they dont help leave that alone

Max

Do we really have to write "man" at the end? Or is it just the name/triggerword? What happens if i train it on 25? I will just let it run as i'm in my 10th epoch right now (only going to 50). And i also used captioning, i will write about the results as soon as i get some :)

Furkan Gözükara

sure let me know. man is class token helps model to understand what you train. if you train woman you write woman

Franco Antonelli

Sorry! I can't see where the error is. I set a folder "training_imgs" with a subfolder with all my images. Thanks

Pidak

you have to go to dataset preparation, set the instance prompt and class prompt, then set the training images folder with 1 repeat and set the destination training folder, then click "prepare training data", and after it's done in the cmd, click "copy info to respective folders". it worked for me

Adam Chido

The following parameters in the 10_12_16GB_GPU configuration file are incorrect / throw errors; consumer cards do not behave like commercial hardware. "additional_parameters": "--network_args train_blocks=single" - results in an error prior to training commencing; there is a GUI option for train_blocks=single. "apply_t5_attn_mask": true - only applies to double blocks. "fp8_base": true - fp8 is not available on consumer hardware. "highvram": true - why is highvram active on lowvram cards? Finally, these configurations, even when the appropriate corrections have been made, result in CUDA out-of-memory errors on 3 separate systems using 12 GB cards. I do not believe this configuration is ready for release.

shen oracle

The resulting LoRA file is more than 2 GB - that big?

Furkan Gözükara

we save as float, which doubles the size. you can save as fp16. also we use 128 rank. currently training 4, 8, 16, 32, 64 to compare

Furkan Gözükara

with that config i am training on my home RTX 3060 you can see video here : https://www.reddit.com/r/FluxAI/comments/1ey6ie3/kohya_ss_gui_flux_lora_training_on_rtx_3060_lora/ i will hopefully make a full video very soon that should help you

Furkan Gözükara

let me test latest version gui it may have broken. gui option was not working previously that is why i had to provide that

shen oracle

Is there a tool that can make the LoRA file smaller without retraining?

Furkan Gözükara

hello again. please try latest Kohya_GUI_Flux_Installer_v9.zip . don't forget Fix_For_FLUX_Step_2.bat . it works on my rtx 3060 perfect but slow

Max

For me it's exactly the same, I even get an out-of-memory issue in Forge, although I have 48GB of RAM, but I assume this error is not related to LoRA size :)

Steve

still with 8/24 update getting the same error: Traceback (most recent call last): File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\sd-scripts\flux_train_network.py", line 411, in trainer.train(args) File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\sd-scripts\train_network.py", line 342, in train model_version, text_encoder, vae, unet = self.load_target_model(args, weight_dtype, accelerator) File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\sd-scripts\flux_train_network.py", line 65, in load_target_model model = self.prepare_split_model(model, weight_dtype, accelerator) File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\sd-scripts\flux_train_network.py", line 98, in prepare_split_model flux_upper.to(accelerator.device, dtype=target_dtype) File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1174, in to return self._apply(convert) File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 780, in _apply module._apply(fn) File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 805, in _apply param_applied = fn(param) File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1167, in convert raise NotImplementedError( NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device. Traceback (most recent call last): File "C:\Users\SysOp\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "C:\Users\SysOp\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code exec(code, run_globals) File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main args.func(args) File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command simple_launcher(args) File "W:\Kohya_GUI_Flux_Installer_v3\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) subprocess.CalledProcessError: Command '['W:\\Kohya_GUI_Flux_Installer_v3\\kohya_ss\\venv\\Scripts\\python.exe', 'W:/Kohya_GUI_Flux_Installer_v3/kohya_ss/sd-scripts/flux_train_network.py', '--config_file', 'W:/NewLoraFactory/model/config_lora-20240824-051834.toml']' returned non-zero exit status 1. 05:19:07-202926 INFO Training has ended.

Furkan Gözükara

this happens when you dont select flux1 checkbox. please make sure to check it. added screenshot to the top of page. also updated configs please download latest ones and which base model are you using? i will add 1 click downloader to download all necessary base models

Furkan Gözükara

updated configs and added Windows_Download_Training_Model_Files.bat file to download necessary training model files into the bat file run directory please try newest

Max

2024-08-24 13:12:04 INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:668 /workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead. with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined] Traceback (most recent call last): File "/workspace/kohya_ss/sd-scripts/flux_train.py", line 905, in train(args) File "/workspace/kohya_ss/sd-scripts/flux_train.py", line 736, in train accelerator.backward(loss) File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 2159, in backward loss.backward(**kwargs) File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/_tensor.py", line 521, in backward torch.autograd.backward( File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 289, in backward _engine_run_backward( File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/autograd/graph.py", line 768, in _engine_run_backward return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 1116, in unpack_hook frame.recompute_fn(*args) File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 1400, in recompute_fn fn(*args, **kwargs) File "/workspace/kohya_ss/sd-scripts/library/flux_models.py", line 720, in _forward attn = attention(q, k, v, pe=pe, attn_mask=attn_mask) File "/workspace/kohya_ss/sd-scripts/library/flux_models.py", line 446, in attention x = torch.nn.functional.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 972.00 MiB. GPU 0 has a total capacity of 44.35 GiB of which 541.44 MiB is free. Process 1747024 has 43.81 GiB memory in use. Of the allocated memory 41.34 GiB is allocated by PyTorch, and 2.15 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables) steps: 0%| | 0/2400 [00:06 sys.exit(main()) File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main args.func(args) File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1106, in launch_command simple_launcher(args) File "/workspace/kohya_ss/venv/lib/python3.10/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) subprocess.CalledProcessError: Command '['/workspace/kohya_ss/venv/bin/python', '/workspace/kohya_ss/sd-scripts/flux_train.py', '--config_file', '/workspace/kohya_ss/outputs/config_dreambooth-20240824-131040.toml']' returned non-zero exit status 1. 13:12:12-693701 INFO Training has ended. in the todays version of 48gb and even 24 GB i get this error when training on 1024*1024 images on an A40 (runpod) with 48GB of VRAM. It seems like there's an outofmemory issue because the memory handling doesn't seem to be working correctly. Do you have any ideas on how to solve this?

Furkan Gözükara

Did you select the PyTorch CUDA 11.8 template as written? I will test and let you know if it's broken or not. It is an A40 GPU, right?

Max

I know ;) Already found out that Forge is (still after the update) trying to patch LoRAs, which results in huge amounts of VRAM, so I switched to ComfyUI and the first LoRA I trained was "ok". Way too overfitted, and quality is "ok" as I trained it on 512, not 1024 pixels

Max

One moment, i have to check. edit: Yes, it's the official runpod pytorch 2.1 with python 3.10 and Cuda 11.8.0, so exactly the one you mention in your guide :)

Adam Chido

Now i'm just frustrated. Fresh Kohya installs on the right repo, fresh python, your fix scripts run without error, but i can't get training under like 18gb even at 512 and 8/8. I am not in a hurry. Make sure you're sleeping, lol.

Arcon Septim

What are the best settings and sampler in SwarmUI to use with this flux model and lora for best generation results? Steps amount CFG etc Thank you!

Max

Have you found out something? I also tested a bit but still cannot find where it's hanging right now (as it worked yesterday on that machine with this setting - which is not the case anymore, tested an older version but i get the same error)

Steve

It's definitely checked. i used the flux.1 dev bnb 4-bit because I was using that for forge based on your benchmark. I figured once I got it working I would slowly work my way to higher models until I got OOM on my 12gb 4070. I haven't tried the update yet I'll be able to try it in a couple of hours.

Furkan Gözükara

hello just tested and it works. but A40 is slow :d https://pasteboard.co/QENsE96CfaBK.png https://pasteboard.co/jqaeX3as53C2.png probably it was broken pod

Furkan Gözükara

flux.1 dev bnb 4-bit may not be supported. by the way, the vram usage is not related to the base model you pick during training. so use the fp16 dev model, the best-handled case

Max

Hmmm... What images are you using? 1024x1024, or 512x512? Because I tried it on 6 different pods, and no pod seems to be working :D

Max

I'm right now testing on an A100 GPU to really test it out

Max

Ok, so fortunately, it does work with the A100. So probably they (RunPod) have some problems with the A40 pods in Europe, because in terms of memory the A100 isn't nearly at its limits :D But it's nice to see 1024x1024 only taking around 2.5-3 seconds per iteration :D

ElecMat

Is the LoRA supposed to work in Forge, or only in ComfyUI?

guangyu niu

Hi, do I need to caption the dataset? And I notice that "use fp8 base model" is turned on - does that mean I can use the fp8 dev model? And which model gives better results?

Steve

you were correct - using the full-size model to train fixed the issue. It took about 25 hours to train 100 epochs on a 4070; I believe total training was around 5k steps. The issue I'm running into now is that using those LoRAs with the flux.1 dev bnb 4-bit model takes forever to generate art - about one hour per 35-step image.

Arvin Flores

will you be doing a full flux tutorial with runpod kohya?

Furkan Gözükara

use the fp16 dev model, it will cast to fp8. captioning reduces likeness; i have tested and am gonna post hopefully tomorrow. for style training captions may work better though

Steve

I think the biggest issue was I was testing multiple epochs using the S/R script, and that was causing it to have to load the new model and then move it after all generations. I switched to the regular model and it works fine for the 4070 - seems like the same speed as the quantized one I was using. Testing the epochs in independent generations instead of using that search-replace script for each epoch vastly improved speed. Sacrificing the automatic grid to see it at a glance wasn't a big deal.

Max

Use Europe if possible, haven't had any problems there.

Max

I have tried both (training realistic photos of myself), and without captioning worked better for me

Franco Antonelli

can't get it to work with my RTX 3090; it runs out of memory. I followed the instructions for the update in step 2 and configured the acceleration settings as you suggested, but I'm still facing this issue. I'm not sure where I'm going wrong "You are using the default legacy behaviour of the . This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 INFO Using DreamBooth method. train_network.py:279 INFO prepare images. train_util.py:1803 INFO get image size from name of cache files train_util.py:1741 100%|████████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 3800.09it/s] INFO set image size from cache files: 0/19 train_util.py:1748 INFO found directory C:\koya\kohya_ss\outputs\prep\img\1_messi man contains train_util.py:1750 19 image files WARNING No caption file found for 19 images. Training will continue without train_util.py:1781 captions for these images. If class token exists, it will be used. / 19枚の画像にキャプションファイルが見つかりませんでした。これらの画像につ いてはキャプションなしで学習を続行します。class tokenが存在する場合はそれを使います。 WARNING C:\koya\kohya_ss\outputs\prep\img\1_messi man\2.PNG train_util.py:1788 WARNING C:\koya\kohya_ss\outputs\prep\img\1_messi man\Ca3ptura.PNG train_util.py:1788 WARNING C:\koya\kohya_ss\outputs\prep\img\1_messi man\Capt3ura.PNG train_util.py:1788 WARNING C:\koya\kohya_ss\outputs\prep\img\1_messi man\Capt5ura.PNG train_util.py:1788 WARNING C:\koya\kohya_ss\outputs\prep\img\1_messi man\Captura2.PNG train_util.py:1788 WARNING C:\koya\kohya_ss\outputs\prep\img\1_messi man\proxy-image (11).jpeg... train_util.py:1786 and 14 more INFO 19 train images with repeating. train_util.py:1844 INFO 0 reg images. train_util.py:1847 WARNING no regularization images / 正則化画像が見つかりませんでした train_util.py:1852 INFO [Dataset 0] config_util.py:570 batch_size: 1 resolution: (1024, 1024) enable_bucket: False network_multiplier: 1.0 [Subset 0 of Dataset 0] image_dir: "C:\koya\kohya_ss\outputs\prep\img\1_messi man" image_count: 19 num_repeats: 1 shuffle_caption: False keep_tokens: 0 keep_tokens_separator: caption_separator: , secondary_separator: None enable_wildcard: False caption_dropout_rate: 0.0 caption_dropout_every_n_epoches: 0 caption_tag_dropout_rate: 0.0 caption_prefix: None caption_suffix: None color_aug: False flip_aug: False face_crop_aug_range: None random_crop: False token_warmup_min: 1, token_warmup_step: 0, alpha_mask: False, is_reg: False class_tokens: messi man caption_extension: .txt INFO [Dataset 0] config_util.py:576 INFO loading image sizes. 
train_util.py:876 100%|███████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 18978.75it/s] INFO prepare dataset train_util.py:884 INFO preparing accelerator train_network.py:333 accelerator device: cuda 2024-08-26 13:37:08 INFO Building Flux model dev flux_utils.py:43 INFO Loading state dict from flux_utils.py:48 C:/Users/billi/AI/ComfyUI_windows_portable/ComfyUI/models/unet/flux1-dev.s afetensors INFO Loaded Flux: flux_utils.py:51 INFO Building CLIP flux_utils.py:70 INFO Loading state dict from flux_utils.py:163 C:/Users/billi/AI/ComfyUI_windows_portable/ComfyUI/models/clip/clip_l.saf etensors INFO Loaded CLIP: flux_utils.py:166 INFO Loading state dict from flux_utils.py:209 C:/Users/billi/AI/ComfyUI_windows_portable/ComfyUI/models/clip/t5xxl_fp16 .safetensors INFO Loaded T5xxl: flux_utils.py:212 INFO Building AutoEncoder flux_utils.py:58 INFO Loading state dict from flux_utils.py:62 C:/Users/billi/AI/ComfyUI_windows_portable/ComfyUI/models/vae/ae.sft INFO Loaded AE: flux_utils.py:65 import network module: networks.lora_flux INFO [Dataset 0] train_util.py:2326 INFO caching latents with caching strategy. train_util.py:984 INFO checking cache validity... train_util.py:994 100%|███████████████████████████████████████████████████████████████████████████████| 19/19 [00:00<00:00, 18996.85it/s] INFO caching latents... train_util.py:1038 100%|██████████████████████████████████████████████████████████████████████████████████| 19/19 [00:03<00:00, 5.16it/s] 2024-08-26 13:37:12 INFO move vae and unet to cpu to save memory flux_train_network.py:156 INFO move text encoders to gpu flux_train_network.py:164 Traceback (most recent call last): File "C:\Python3_10_11\lib\runpy.py", line 196, in _run_module_as_main return _run_code(code, main_globals, None, File "C:\Python3_10_11\lib\runpy.py", line 86, in _run_code exec(code, run_globals) File "C:\koya\kohya_ss\venv\Scripts\accelerate.EXE\__main__.py", line 7, in File "C:\koya\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 48, in main args.func(args) File "C:\koya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1106, in launch_command simple_launcher(args) File "C:\koya\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 704, in simple_launcher raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd) subprocess.CalledProcessError: Command '['C:\\koya\\kohya_ss\\venv\\Scripts\\python.exe', 'C:/koya/kohya_ss/sd-scripts/flux_train_network.py', '--config_file', 'C:/koya/kohya_ss/outputs/prep\\model/config_lora-20240826-133657.toml']' returned non-zero exit status 3221225477. 13:37:51-568657 INFO Training has ended"

stephan stinker

Hello everyone, can I find a workflow for confyUI anywhere?

Arcon Septim

What is the difference in these two regarding quality? Rank_1_28700MB_Slow.json - 16bit - 8.53 second / it Rank_2_27360MB_Fast.json - 16bit - 4.49 second / it I see its double the time for full training....

Adam Chido

the new scripts worked right out of the box for me on 3 12gb machines. Thank you Furkan!

Mike

Hello, why is this in the prompt?

Frederic Collin

Sorry to say but this is a mess ... no real conclusion, a lot of "best" everywhere but impossible to find the configs easily to use. I've tried few of them which are simply not working with my 3090.