Kohya FLUX LoRA and Fine Tuning Training Full Tutorial For Local Windows and Cloud RunPod and Massed Compute for Research & Development (Patreon)
Downloads
Content
I will keep updating this page as I progress and add new info as it becomes ready, with the newest updates at the top
27 August 2024 Update
Download the latest: Kohya_GUI_Flux_Installer_v17.zip
Batch size experiments section added and a config for 4x GPUs added
Configs are inside the Best_Configs folder in the above zip file
34 distinct prompts for testing added to the zip file
I am hopefully going to test timestep shift further, as well as CLIP-L text encoder training (which Kohya just announced)
Newest Configs :
Quality assessment: Rank_1 > Rank_2 > Rank_3 and so on (the number in the config file name is a quality ranking, not the LoRA rank)
The quality difference between Rank_1 and Rank_6 is not very big, but starting from Rank_7 quality degrades since training resolution and LoRA rank are reduced
So you can pick a faster config over a slower one if you are in a hurry
These are early-step speeds (roughly the first 50 steps; Kohya doesn't display the latest step speed), so as you train longer you will get at least around 25% faster speeds
These are raw VRAM usages, so check your current VRAM usage before starting training
The tests were made on Linux on Massed Compute on an RTX A6000 GPU - 1024x1024 resolution - 128 LoRA rank
Rank_1_28700MB_Slow.json - 16bit - 8.53 second / it
Rank_2_27360MB_Fast.json - 16bit - 4.49 second / it
Rank_3_18246MB_Slow.json - 8bit - 8.62 second / it
Rank_4_16960MB_Fast.json - 8bit - 4.61 second / it
Rank_5_11498MB_Slow.json - 8bit - Single Layers - 12.12 second / it
Rank_6_10222MB_Fast.json - 8bit - Single Layers - 9.42 second / it
Rank_7_9502MB.json - 8bit - Single Layers - 8.61 second / it - 896px
Rank_8_15406MB.json - 8bit - 3.71 second / it - 64 LoRA Rank - 896px
Rank_9_7514MB.json - 8bit - Single Layers - 5.8 second / it - 64 LoRA Rank - 512px
My suggestions for GPUs
8 GB GPUs : Rank_9_7514MB.json
10 GB GPUs : Rank_7_9502MB.json
12 GB GPUs : Rank_5_11498MB_Slow.json
16 GB GPUs : Rank_5_11498MB_Slow.json - if you need speed : Rank_8_15406MB.json
24 GB GPUs : Rank_3_18246MB_Slow.json
48 GB GPUs : Rank_1_28700MB_Slow.json
Batch Size Experiments And Multi GPU Usage
I used the Rank_2_27360MB_Fast.json config to test the impact of batch size on an RTX A6000
The speed gain from a larger batch size is almost none, so I don't suggest it since you lose quality
Lower batch size = better quality
Only batch size 2 gives you some gain, so you may use it if you wish
Effective speed per step below means seconds per iteration divided by the batch size, i.e., the per-image step time
Batch size 1 : 4.54 second / it : effective speed per step 4.54 second / it
Batch size 2 : 7.98 second / it : effective speed per step 3.99 second / it
Batch size 3 : 12.43 second / it : effective speed per step 4.14 second / it
Batch size 4 : 15.28 second / it : effective speed per step 3.82 second / it
Batch size 5 : 20.18 second / it : effective speed per step 4.03 second / it
Therefore I have added a batch size 1 but 4x A6000 GPU speed config : 4x_GPU_Batch_Size_1.json
With this config you get 5.75 second / it, and the effective speed per step is 1.4375 second / it
When you use multiple GPUs you need to divide the epoch count by the number of GPUs
So 200 epochs becomes 50 for 4x GPUs
Also increase the LR with this formula: best LR x (batch size x number of GPUs / 2). So in this case (batch size 1, 4 GPUs): 0.00005 x (1 x 4 / 2) = 0.0001
So if you use batch size 2 it becomes 0.00005 x (2 x 4 / 2) = 0.0002 (a quick sanity check of this formula is sketched below)
The zip file now has 4x_GPU_Batch_Size_1.json and 4x_GPU_Batch_Size_2.json
I suggest using 4x_GPU_Batch_Size_1.json on a 4x RTX A6000 GPU machine
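If you want to quickly sanity-check the LR scaling rule above, here is a tiny sketch (it only assumes a Python interpreter, e.g. the one inside the Kohya venv, is on your PATH):

```
:: Quick check of the rule of thumb: new LR = best LR x (batch size x number of GPUs / 2)
python -c "print(0.00005 * (1 * 4 / 2))"
python -c "print(0.00005 * (2 * 4 / 2))"
:: Prints 0.0001 (batch size 1, 4 GPUs) and 0.0002 (batch size 2, 4 GPUs)
```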
Inference Config for Evaluating Results of Experiments
I use the FP16 FLUX dev model in 16-bit precision with the UniPC sampler, 30 steps, and the SwarmUI grid system
You Don't Necessarily Need a Class Prompt With FLUX
In the grid images below:
Trained with "ohwx man" : model name > Best_v2-000150 - first column
Trained with only "ohwx" : model name > Best_v2_only_OHWX-000150 - second column
Then I generated images with:
A prompt containing "ohwx man" > More_Time_Shift_Test_and_Only_Ohwx_Token_Tested_and_Ohwx_Man_Used_in_Prompt.jpg
A prompt containing only "ohwx" > More_Time_Shift_Test_and_Only_Ohwx_Token_Tested_and_Only_Ohwx_Word.jpg
A prompt containing only "man" > More_Time_Shift_Test_and_Only_Ohwx_Token_Tested_and_Only_Man_Word.jpg
I predict that the T5 text encoder already encodes the training images' content in a way that fully tokenizes it, so even if we don't define captions, our images are fully tokenized
This also results in overtraining in our case
I will hopefully try a lower LR and CLIP-L training to solve this issue, and I will try fine-tuning as well
So if you are going to train 2 subjects of the same class, like 2 men, in the same training, you can just caption them as ohwx and bbuk (or similar) and try it
I still find "ohwx man" slightly better than "ohwx" alone
Also, full captions reduce the resemblance when training a person
Different LoRA Ranks Experiments
Full LoRA Rank impact : https://huggingface.co/MonsterMMORPG/Generative-AI/resolve/main/Full%20LoRA%20Rank%20Impact.jpg
A higher LoRA rank learns more details but overfits the model more. I like 128, but it is a personal preference. Look at the grid above for the full impact
Higher Resolution Training Impact
I have done trainings at 1024x1024 (default), 1280x1280 and 1536x1536px
Then I have generated images in 1024x1024, 1280x1280, 1536x1536 and 1920x1080px for each training
Higher resolution training slightly improves quality, but the gain is not dramatic and probably isn't worth the speed loss it causes
Full grid results are below
High Res Training Comparison 1024px.jpg , High Res Training Comparison 1280px Part 1.jpg , High Res Training Comparison 1536px Part 1.jpg , High Res Training Comparison 1536px Part 2.jpg
QKV Split Attention and JoyCaption Detailed Captioning Trainings Results
I have comprehensively tested QKV split attention and detailed-captioned dataset trainings
I don't find that split QKV brings any benefit, and it causes a huge speed loss
For person training, detailed captions reduced the resemblance and didn't improve flexibility or quality. However, the resemblance loss was far lower than with SDXL
So my conclusion is: do not use split QKV or detailed captions when training a person
For style training, detailed JoyCaption captions will probably work great - I should test that too
Full grid results are below
QKV Split and JoyCaption Result 1024px.jpg , QKV Split and JoyCaption Result 1920px.jpg
Time Step Sampling Shift Experiments
This hugely improved the stylization capability of the model but reduced resemblance and realism
I plan to do more research on this - 8 more trainings
Wait for them. Full grids below
Apply T5 Attention Mask and Regularization / Classification Images Detailed Impact Experiments
I find that Apply T5 Attention Mask improves quality, but the tradeoff is slightly increased VRAM usage and a serious speed loss
I have also compared my training workflow with the default CivitAI FLUX LoRA training, and ours is much better
Regularization / classification images didn't improve results with any of the tested configs - I tested many - check the grids to see the full detail
Full grids are shown below
Reg_T5_Attention_Mask_CivitAI_50_Epoch.jpg, Reg_T5_Attention_Mask_CivitAI_100_Epoch.jpg, Reg_T5_Attention_Mask_CivitAI_150_Epoch.jpg, T5_Attention_Mask_v1.jpg, Civit_Reg_15GB_150_Epoch_Compare.jpg
More experiments are shared chronologically below - read the entire thread
-
24 August 2024 Update V2
Windows_Download_Training_Model_Files.bat added; it downloads the necessary training model files into the directory the bat file is run from
A 1-min video published to show how to set up Accelerate for Kohya SS GUI : https://youtu.be/adVhm9aI9Gc
This setup fixes the stuck-caching problem
24 August 2024 Update
Massed Compute installer fixed and greatly simplified - just 1 step, and it auto-opens the browser
Massed Compute instructions updated and an automatic model downloader added - it downloads all necessary models automatically: Dev FP16, T5 FP16, VAE and CLIP
Fix_For_FLUX_Step_2.bat improved - don't forget to run this for lower VRAM usage and the 10 GB config
10 GB config updated for latest GUI
Solution for KeyError: 'time_embed.0.weight' error
When you have loaded the config, make sure that this option is selected
23 August 2024 Update
Massed Compute and RunPod Installers fully added (read Massed_Compute_Kohya_FLUX_Instructions and RunPod_Install_Instructions)
16_GB_Config_May_Be_Lower_Quality_In_Test added - still being tested - the changes are rank 32 and 896-pixel resolution training, but it is super fast compared to the lowest-VRAM config
Download the newest Kohya_GUI_Flux_Installer_v13.zip and extract it into any folder
48_GB_GPUs_v2 (27 GB VRAM) and 24_GB_GPUs_v2 (17 GB VRAM) yield almost the same quality
10_12_16GB_GPUs_v2.json (yields almost the same quality as the best ones but is 3-5 times slower due to VRAM optimizations) - uses 10183 MB peak, so if it spills into shared VRAM and becomes too slow, reduce the LoRA rank or the resolution to e.g. 896x896 - I am testing LoRA rank impact at the moment
16_GB_Config_Slightly_Lower_Than_10GB_But_5_Times_Faster.json yields slightly lower quality than 10_12_16GB_GPUs_v2.json but is 3-5 times faster
Newest test Grids are as below
Reg_T5_Attention_Mask_CivitAI_50_Epoch.jpg, Reg_T5_Attention_Mask_CivitAI_100_Epoch.jpg, Reg_T5_Attention_Mask_CivitAI_150_Epoch.jpg, T5_Attention_Mask_v1.jpg, Civit_Reg_15GB_150_Epoch_Compare.jpg
Tested-all-configs file updated to V6 : Tested_All_Configs_V6.zip
So what are the newest changes? I have enabled T5 Attention Mask, which I believe slightly improves results
T5 Attention Mask Tradeoffs
However, T5 Attention Mask increases VRAM usage by about 1 GB for the 16 GB, 24 GB and 48 GB configs, so I didn't enable it for the 16 GB config. Now the 48 GB config uses 28 GB and the 24 GB config uses 18 GB with attention masking
Moreover, T5 Attention Mask slows down training significantly, so that is another tradeoff
What does it do explained here : https://poe.com/s/EHIviAdVuZds5XBGDYWP
What is next? I am going to test the quality impact of different LoRA network ranks: 4, 8, 16, 32, 64, 128
The 16 GB config yielded slightly worse results than the 10 GB config, probably due to the reduced resolution and LoRA network rank
Training with CivitAI's default settings performed worse than our very best config on exactly the same dataset and captions
Regularization / classification image tests failed; they still yield low-resemblance images. I don't see their benefit at this point. I tested different prior loss weight parameters, but none yielded better results
Windows training is still significantly slower than Linux training. I am still researching a fix for this issue and have asked on the PyTorch and Diffusers GitHub repos
Full Kohya SS GUI setup screenshots added to the installer zip file
22 August 2024 Update
How to setup Kohya interface for FLUX full screenshot : Example Full Setup.jpg
FLUX LoRA training almost perfected
Use the LoRA tab to load the configs, not the DreamBooth tab
I also made a 1-click Kohya SS GUI installer for FLUX
Run Windows_Install_Step_1.bat - select option 1 and then exit with option 7 once completed - no other options are needed
Run Fix_For_FLUX_Step_2.bat and then use Windows_Start_Kohya_SS.bat to start
Step 2 will improve training speed by roughly 100% and reduce VRAM usage by roughly 25% with just a proper library upgrade
The same upgrade was added for Massed Compute - read the Massed Compute instructions
You are now ready to use FLUX training in the most performant way
Among the newest configs I feel Best_v1_5e_5_max_grad_norm_0 is the best
But Best_v1_5e_5 is also very good. There is only a slight lighting difference
When you extract the above zip file you will see the following configs:
48_GB_GPUs.json for 48 GB GPUs
24_GB_GPUs.json for 24 GB GPUs - uses around 17-18 GB with the newest libraries
16GB_Compare_Speed_With_10_12_16_Config.json - if you have 16 GB, try this and compare its speed with 10_12_16GB_GPUs.json
10_12_16GB_GPUs.json - for 10, 12, 16 GB GPUs. If you have 10 GB and it is very slow, you can reduce the LoRA rank and also the resolution to 512x512
512px training yields inferior quality and almost the same VRAM usage - but 2.5x the speed
New tested configs full quality grids : 50 epoch (750 steps) , 100 epoch (1500 steps) , 150 epoch (2250 steps)
For the 24 GB config, RTX 3090 speed should be close to 4 second / it at 1024x1024px
For the 10_12_16 GB config, speed drops by around 3 to 5 times due to the full optimizations
Training dataset: 15 images, 1 repeat; train for around 200 epochs
22 August 2024 Update
Section "FLUX Huge Sampler + Scheduler Test For a Very Hard Prompt" added to the post
Section "FLUX Guidance Scale Grid Test on LoRA" added to the post
Newest tests completed and new best configs uploaded
Surprisingly, we now have a 10 GB config (LoRA rank 128) - it will hopefully be published in 8-9 hours
If you have a 10 GB GPU, try reducing the LoRA rank until it works, and also minimize your VRAM usage before starting the training
So the current best configs are as follows:
For 48 GB GPU : 6e_05_best_raw_sigmoid.json
For 24 GB GPU : 6e_05_best_raw_sigmoid_24GB.json,
For 10, 12, 16 GB GPUs : lowest_vram.json
For the lowest-VRAM config to work you have to activate the venv of Kohya GUI and execute the commands below one by one
The upgrade below will speed up training by almost 100% and greatly reduce VRAM usage, but the quality impact is unknown yet (training at the moment). You can also do this if you have a 24 GB GPU like the RTX 4090
First update Kohya to the latest version and run it once. Then activate the venv and install the packages below. Finally, edit the gui.bat file and comment out or remove the lines `python.exe .\setup\validate_requirements.py` and `if %errorlevel% neq 0 exit /b %errorlevel%` to prevent requirements from being auto-reinstalled (see the sketch after the pip commands below)
pip install torch==2.4.0+cu124 --index-url https://download.pytorch.org/whl/cu124
pip install torchvision==0.19.0+cu124 --index-url https://download.pytorch.org/whl/cu124
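A minimal sketch of the whole sequence as a Windows batch snippet; the c:\flux_train path and the venv\Scripts\activate.bat location are assumptions based on a default Kohya SS GUI install, so adjust them to your setup:

```
:: Hedged sketch - activate the Kohya GUI venv, then upgrade the libraries (same two pip commands as above)
cd /d c:\flux_train
call venv\Scripts\activate.bat
pip install torch==2.4.0+cu124 --index-url https://download.pytorch.org/whl/cu124
pip install torchvision==0.19.0+cu124 --index-url https://download.pytorch.org/whl/cu124

:: Then open gui.bat in a text editor and comment out (REM) these two lines so requirements are not auto-reinstalled:
::   python.exe .\setup\validate_requirements.py
::   if %errorlevel% neq 0 exit /b %errorlevel%
```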
21 August 2024 Update
So far 16 different tests completed
For 48GB GPUs use 6e_05_bf16_128_rank_full_bf16.json - train up to 200 epochs and compare checkpoints
For 24GB GPUs use 6e_05_fp8_bf16_accelerate_full_bf16_32_rank.json - train up to 200 epochs and compare
I will try to get better results and reduce number of necessary epochs
I expect regularization images might help, but I haven't tested them yet
20 August 2024 Update
The configs tested so far have been added to the attachments, and more results have been added to the post
More configs are currently training
Windows Requirements
Python 3.10, FFmpeg, CUDA 11.8, C++ tools and Git
If it doesn't work, make sure to follow the tutorial below and install everything exactly as shown in it
How To Install
I am going to use the GUI version of Kohya
Currently it doesn't have FLUX support in the main branch
So clone it into a new folder like c:/flux_train
Before installing, open a cmd in that folder
Type : git checkout sd3-flux.1
Then install as usual (a minimal command sketch follows below)
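A minimal command sketch of these steps; the repository URL and the setup.bat step are my assumptions about the standard Kohya SS GUI repo, so adapt them as needed:

```
:: Hedged sketch - clone the Kohya SS GUI repo, switch to the FLUX branch, then install as usual
git clone https://github.com/kohya-ss/kohya_ss.git c:\flux_train
cd /d c:\flux_train
git checkout sd3-flux.1
:: install as usual, e.g. by running the repo's setup script and picking the install option
setup.bat
```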
FLUX Training Discussions With Lots of Info
Model Download Links
FLUX dev FP16 (23.8 GB) : https://huggingface.co/OwlMaster/realgg/resolve/main/flux1-dev.safetensors
Download Clip L (250 MB) : https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/clip_l.safetensors
T5 XXL FP16 (9.8 GB) : https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp16.safetensors
FLUX VAE (335 MB) : https://huggingface.co/OwlMaster/realgg/resolve/main/ae.safetensors
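If you want to grab these manually instead of using the downloader scripts, here is a minimal sketch using curl (it ships with Windows 10/11 and most Linux images; the target folder is just an example):

```
:: Hedged sketch - download the four model files listed above into a folder of your choice
mkdir c:\flux_models
cd /d c:\flux_models
curl -L -o flux1-dev.safetensors https://huggingface.co/OwlMaster/realgg/resolve/main/flux1-dev.safetensors
curl -L -o clip_l.safetensors https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/clip_l.safetensors
curl -L -o t5xxl_fp16.safetensors https://huggingface.co/comfyanonymous/flux_text_encoders/resolve/main/t5xxl_fp16.safetensors
curl -L -o ae.safetensors https://huggingface.co/OwlMaster/realgg/resolve/main/ae.safetensors
```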
FLUX Fine Tuning Lower VRAM Optimizations
--optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False"
--timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0
--blockwise_fused_optimizers --double_blocks_to_swap 6 --cpu_offload_checkpointing
Options are almost the same as LoRA training. The difference is `--blockwise_fused_optimizers`, `--double_blocks_to_swap` and `--cpu_offload_checkpointing`. `--single_blocks_to_swap` is also available.
`--blockwise_fused_optimizers` enables the fusing of the optimizer for each block. This is similar to `--fused_backward_pass`. Any optimizer can be used, but Adafactor is recommended for memory efficiency. `--fused_optimizer_groups` is deprecated due to the addition of this option for FLUX.1 training.
`--double_blocks_to_swap` and `--single_blocks_to_swap` are the number of double blocks and single blocks to swap. The default is None (no swap). These options must be combined with `--blockwise_fused_optimizers`.
`--cpu_offload_checkpointing` offloads gradient checkpointing to the CPU. This reduces VRAM usage by about 2 GB.
`--network_args train_blocks=single` - reduces VRAM
All these options are experimental and may change in the future.
Increasing the number of blocks to swap may reduce memory usage, but training speed will be slower. `--cpu_offload_checkpointing` also slows down training.
Swapping 6 double blocks and using CPU offload checkpointing may be a good starting point. Please try different settings according to your VRAM usage and training speed.
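For orientation only, here is roughly how these flags could be combined into a single fine-tuning launch. This is a hedged sketch, not one of my tested configs: the flux_train.py script name, the model-path flags (--pretrained_model_name_or_path, --clip_l, --t5xxl, --ae) and the dataset/output arguments are my assumptions based on the sd3-flux.1 branch of sd-scripts, so verify them against the current Kohya documentation:

```
:: Hedged sketch - adjust every path and verify flag names against your Kohya version
accelerate launch flux_train.py ^
  --pretrained_model_name_or_path flux1-dev.safetensors --clip_l clip_l.safetensors ^
  --t5xxl t5xxl_fp16.safetensors --ae ae.safetensors ^
  --dataset_config dataset.toml --output_dir output --mixed_precision bf16 --gradient_checkpointing ^
  --optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" ^
  --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 ^
  --blockwise_fused_optimizers --double_blocks_to_swap 6 --cpu_offload_checkpointing
```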
Previous Master Kohya Tutorials
How To Do SDXL LoRA Training On RunPod With Kohya SS GUI Trainer & Use LoRAs With Automatic1111 UI
Become A Master Of SDXL Training With Kohya SS LoRAs - Combine Power Of Automatic1111 & SDXL LoRAs
Massed Compute
Massed compute A100 discount : A100_2ybG1e9StAdA
Auxiliary Scripts and Tutorials
Learn how to use SwarmUI : https://youtu.be/HKX8_F1Er_w
Learn how to use SwarmUI on Cloud (RunPod, Massed Compute, Kaggle) : https://youtu.be/XFUZof6Skkw
Learn how to use FLUX with SwarmUI : https://youtu.be/bupRePUOA18
Learn how to download models from Hugging Face and CivitAI and how to upload Hugging Face for backup : https://youtu.be/X5WVZ0NMaTg
FLUX Guidance Scale Grid Test on LoRA
As you know, FLUX has CFG set to 1
The FLUX dev model has a separate FLUX guidance scale
You can download the full grid test from this link and check it out
I feel FLUX guidance 4.0 is best, but you can push it higher to get more accurate prompt following
FLUX Huge Sampler + Scheduler Test For a Very Hard Prompt
Prompt is : A cinematic shot featuring a photo-realistic image of an Ohwx man riding an enormous and terrifying Tyrannosaurus rex in the heart of Jurassic Park. The scene is set against a backdrop of dense, prehistoric jungle with towering trees and lush greenery, creating an intense and thrilling atmosphere. The T-rex is depicted in all its majestic and fearsome glory, with its massive jaws open wide, sharp teeth glistening, and powerful muscles rippling as it charges forward. The Ohwx man, in contrast, is shown with a determined and fearless expression as he expertly rides the gigantic predator, his posture exuding confidence and control. The image is captured in a highly cinematic style, with dramatic lighting and dynamic angles that emphasize the scale and intensity of the scene. The focus is partially directed towards the face of the Ohwx man, ensuring that his features are clearly visible and convey a sense of resolve and bravery, while maintaining the overall composition of the dynamic action scene.<segment:face,0.7>photo of ohwx man
So far, my findings on promising sampler + scheduler combinations are below
Full grid link : click here
euler + normal - default
euler + karras - cinematic look
heun + karras , heunpp2 + karras
dpm_2 + karras or exponential
dpm_fast + normal
lcm + normal or karras : cartoon look
unipc + normal or karras
I did bigger tests, and my conclusion is that, sadly, the Karras scheduler (which gives a more realistic look) is only usable with DPM_2, and even that is not great
Also, I think the best sampler is UniPC
All test results are below
LCM_Normal_vs_Karras.jpg , Eular_A_Normal_vs_UniPC_Normal.jpg , Eular_Normal_vs_Karras.jpg , UniPC_Normal_vs_Karras.jpg , DPM_2_Normal_vs_Karras.jpg
Experiments and VRAM Usages
A6000
Accelerate launch Mixed precision BF16 - highvram - 128 rank : 44.18 GB - 8.25 second / it
Accelerate launch Mixed precision BF16 - full BF16 - highvram 128 rank : 42.58 GB - 8.25 second / it
Accelerate launch Mixed precision BF16 - lowvram - 128 rank : 44.18 GB - 8.25 second / it
Accelerate launch Mixed precision BF16 - full BF16 - lowvram 128 rank : 42.58 GB - 8.25 second / it
Accelerate launch Mixed precision FP16 - highvram - 128 rank : 44.14 GB - 7.83 second / it
Accelerate launch Mixed precision FP16 - full FP16 - highvram - 128 rank : 42.55 GB - 7.83 second / it
Accelerate launch Mixed precision BF16 - highvram - 128 rank - Memory efficient attention : 44.18 GB - 8.25 second / it
Accelerate launch Mixed precision BF16 - highvram - 128 rank - Split Mode : error
Accelerate launch Mixed precision BF16 - highvram - 128 rank - Split Mode train blocks double : error
Accelerate launch Mixed precision BF16 - highvram - 128 rank - Split Mode train blocks single : error
Accelerate launch Mixed precision FP16 - fp8 base - highvram - 128 rank : 33.53 GB - 7.99 second / it
Accelerate launch Mixed precision FP8 - fp8 base - highvram - 128 rank : error
Accelerate launch Mixed precision FP8 - fp8 base - full FP16 - highvram - 128 rank : error
Accelerate launch Mixed precision FP16 - fp8 base - highvram - full FP16 - 128 rank : 24.39 GB - 7.96 second / it
Accelerate launch Mixed precision BF16 - fp8 base - highvram - 128 rank : 33.55 GB - 8.38 second / it
Accelerate launch Mixed precision BF16 - fp8 base - full BF16 - highvram - 128 rank : 24.4 GB - 8.35 second / it
Accelerate launch Mixed precision FP16 - fp8 base - highvram - full FP16 --network_args train_blocks=single - 128 rank : 18.39 GB - 18 second / it
Accelerate launch Mixed precision BF16 - fp8 base - highvram - full BF16 --network_args train_blocks=single - 128 rank : 18.39 GB - 21 second / it
Running Experiments
1e_04_bf16_128_rank : completed - 48 gb GPU
1e_04_fp16_128_rank : completed - 48 gb GPU
1e_04_bf16_128_rank_full_bf16 : completed - 48 gb GPU
1e_04_fp16_128_rank_full_fp16 : completed - 48 gb GPU
1e_04_fp8_32_rank : completed - 23.58 GB
1e_04_fp8_full_fp16_32_rank : completed - 23.28 GB
1e_04_fp8_bf16_accelerate_32_rank : completed - 23.57 GB
1e_04_fp8_bf16_accelerate_full_bf16_32_rank : completed - 23.28 GB
Early Testing Results
It looks like FP16 Accelerate (mixed precision) training yields far inferior results compared to BF16 Accelerate training
These results are from very early experiments, only 50 epochs
More results obtained
So far best working cases are 1e_04_bf16_128_rank_full_bf16 and 1e_04_fp8_bf16_accelerate_full_bf16_32_rank
FP16 training is broken
Starting 8 more tests right now
The images below are raw outputs with no face inpainting
These are early results of my extensive FLUX LoRA training; 8-bit is far inferior to 16-bit at the moment
8-bit looks perhaps undertrained, and 16-bit is overtrained with the current LR and 150 epochs
Gonna do more testing now
No face inpainting, raw 1024x1024 images
Some more full comparisons are in the attachments
New Started Trainings
48 GB Configs - uses 42.6 GB VRAM
9e_05_bf16_128_rank_full_bf16 - completed
8e_05_bf16_128_rank_full_bf16 - completed
7e_05_bf16_128_rank_full_bf16 - completed best at 150 epoch
6e_05_bf16_128_rank_full_bf16 - completed very best at 200 epoch
24 GB Configs - uses 23.30 GB VRAM
2e_04_fp8_bf16_accelerate_full_bf16_32_rank
3e_04_fp8_bf16_accelerate_full_bf16_32_rank
9e_05_fp8_bf16_accelerate_full_bf16_32_rank
8e_05_fp8_bf16_accelerate_full_bf16_32_rank
New Training Results for 48 GB Configs
So far I have tested the 48 GB config results, and the very best is 6e_05_bf16_128_rank_full_bf16; however, it requires 200 epochs of training for 15 images
At 150 epochs, 7e_05_bf16_128_rank_full_bf16 yields the best results, but it is not as good as 200 epochs of 6e_05_bf16_128_rank_full_bf16
New grids uploaded to the attachments : 50_epoch_LowerLR_48_tests.jpg , 100_epoch_LowerLR_48_tests.jpg , 150_epoch_LowerLR_48_tests.jpg , 200_epoch_LowerLR_48_tests.jpg, Face_Inpainted_48_GB_Best_Comparison.jpg
Tested configs updated to V2 - check the attachments for the zip file
Here are some examples
24 GB Experiments
I feel the same LR as the 48 GB config works
So I decided to go with single-version tests from now on
At the very top of the post I added the best configs
Also, the configs zip file has been updated
24 GB experiments files : 50_epoch_LowerLR_24_GB_config_tests.jpg , 100_epoch_LowerLR_24_GB_config_tests.jpg , 150_epoch_LowerLR_24_GB_config_tests.jpg , 200_epoch_LowerLR_24_GB_config_tests.jpg
Started 7 New Trainings
6e_05_best_with_reg_images - completed results below
6e_05_best_max_grad_norm_0 - completed results below
6e_05_best_raw_sigmoid - completed results below
6e_05_best_raw_sigma - completed results below
6e_05_best_additive_sigma - completed results below
6e_05_best_sigma_scaled_sigma - completed results below
6e_05_best_additive_uniform - completed results below
Prediction_Type_Timestep_Sampling_200_Epoch.jpg , Prediction_Type_Timestep_Sampling_150_Epoch.jpg