

Patreon exclusive posts index

Join our Discord to get help, chat, and discuss, and tell me your Discord username to get your special rank : SECourses Discord

Please also Star, Watch, and Fork our Stable Diffusion & Generative AI GitHub repository

8 July 2024 Update:

  • CogVLM updated to V8 (CogVLM_v8.zip); a bug caused by a breaking upgrade of the transformers library has been fixed

  • A bug that prevented batch processing of all input images has been fixed

26 June 2024 Update:

  • Latest zip file : LLaVA_auto_install_v4.zip

  • LLaVA installers updated

  • Massed Compute installers added with instructions

  • Tested on both RunPod and Massed Compute; both work amazingly well

  • Make sure to wait until you see the loaded model in Gradio; keep refreshing Gradio until you see it before you start using it

  • The first time you use LLaVA it will download the model, so check the CMD window to see the download status

17 April 2024 Update:

  • Kosmos scripts updated to : Kosmos-2_v6.zip

  • The bug below has been fixed; please let me know if you still get this error (a minimal sketch of the underlying fix follows this list)

  • OSError: cannot write mode RGBA as JPEG

  • Just copy and paste web_app.py into your previous installation
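
For reference, the error above comes from trying to save an image with an alpha channel directly as JPEG. A minimal sketch of the kind of fix applied (generic Pillow code, not the actual web_app.py source):

```python
from PIL import Image

def save_as_jpeg(input_path: str, output_path: str) -> None:
    """Convert any image (including RGBA PNGs) to RGB and save it as JPEG."""
    image = Image.open(input_path)
    if image.mode != "RGB":
        image = image.convert("RGB")  # JPEG cannot store an alpha channel
    image.save(output_path, "JPEG", quality=95)

# Example: save_as_jpeg("input.png", "input.jpg")
```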

20 February 2024 Update:

  • 4-bit, 8-bit, 16-bit and 32-bit loading options added to the Kosmos-2

  • I think Kosmos-2 is the best captioning model if you have a low-VRAM GPU

  • The 4-bit version of Kosmos-2 uses only 2 GB of VRAM, and even the 32-bit version uses only 7.5 GB

  • Single-image captioning speed is now displayed as well

  • Skip existing captions option added. This skips image files that already have captions (see the sketch after this list)

  • Once batch captioning is completed, it now displays a "batch caption completed" message instead of an error

  • Hopefully I will add this skip option and status message to all captioners 
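
As a rough illustration of how a skip-existing-captions batch loop works, here is a generic sketch (not the packaged script; caption_image stands in for whichever captioner you use):

```python
import time
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}

def batch_caption(folder: str, caption_image, skip_existing: bool = True) -> None:
    """Caption every image in a folder, optionally skipping already-captioned files."""
    images = [p for p in Path(folder).iterdir() if p.suffix.lower() in IMAGE_EXTENSIONS]
    for image_path in images:
        caption_path = image_path.with_suffix(".txt")
        if skip_existing and caption_path.exists():
            print(f"Skipping {image_path.name}: caption already exists")
            continue
        start = time.time()
        caption = caption_image(image_path)  # placeholder for the actual captioner call
        # utf-8 avoids Windows charmap encoding errors when captions contain special characters
        caption_path.write_text(caption, encoding="utf-8")
        print(f"Captioned {image_path.name} in {time.time() - start:.2f} seconds")
    print("batch caption completed")
```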

15 February 2024 Update:

  • Another SOTA model, Microsoft's Kosmos-2, has been added to the scripts arsenal

  • Kosmos-2 : https://github.com/microsoft/unilm/tree/master/kosmos-2

  • I have modified it and added batch processing too (a minimal usage sketch follows this list)

  • Download Kosmos-2_v3.zip and extract it into any folder where you want to install

  • All our scripts create their own separate venv, so they will never conflict with other apps

  • Double-click install_windows.bat to install

  • It will install everything fully automatically

  • Then use the run_kosmos.bat file to start the app

  • For RunPod and Linux, follow runpod_instructions_READ.txt

  • Screenshots shared here : https://www.patreon.com/posts/98499462
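
For reference, a minimal single-image Kosmos-2 captioning sketch using the public microsoft/kosmos-2-patch14-224 checkpoint with the transformers library; the packaged scripts add batch processing, quantized loading, and a Gradio UI on top of something like this:

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "microsoft/kosmos-2-patch14-224"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id).to(device)

image = Image.open("example.jpg").convert("RGB")
prompt = "<grounding>An image of"  # Kosmos-2 grounding-style caption prompt
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=128,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
caption, entities = processor.post_process_generation(generated_text)
print(caption)
```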

12 February 2024 Massive Update:

  • captioners_clip_interrogator_v2.zip is an older file, so I no longer suggest using it

  • For BLIP-2 captioning, an amazing new Gradio app has been developed; it supports 4-bit, 8-bit, and 16-bit model loading (a minimal loading sketch follows this list)

  • Download blip2_captioning_v1.zip and install it with the .bat files on Windows or the .sh files on RunPod and Linux

  • A very detailed comparison of BLIP-2 captioning models, including their speed and VRAM usage, is published in this article : https://www.patreon.com/posts/98331590
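
A minimal sketch of how BLIP-2 can be loaded in 4-bit, 8-bit, or 16-bit with transformers and bitsandbytes, using the public Salesforce/blip2-opt-2.7b checkpoint as an example (the packaged Gradio app wraps this kind of loading behind its UI options):

```python
import torch
from PIL import Image
from transformers import BitsAndBytesConfig, Blip2ForConditionalGeneration, Blip2Processor

model_id = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(model_id)

# Pick one loading mode; lower precision means lower VRAM usage
quant_config = BitsAndBytesConfig(load_in_4bit=True)  # or load_in_8bit=True
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,  # drop this and pass torch_dtype=torch.float16 for 16-bit
    device_map="auto",
)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```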

8 February 2024 Update:

  • Another SOTA vision model, Qwen-VL, added to our scripts arsenal

  • Download Qwen-VL_v3.zip and use the Windows install .bat file or follow the RunPod instructions

  • Supports half precision and 4-bit loading

6 February 2024 Update:

  • The LLaVA script was broken due to an incorrect bitsandbytes version; this is now fixed

  • Both LLaVA and CogVLM updated to the latest PyTorch, bitsandbytes, DeepSpeed, and Triton packages, along with xFormers

  • The LLaVA web UI input field now comes pre-filled with our default caption prompt

  • If you find a better prompt we can replace it 

5 February 2024 Massive Update:

  • CogVLM batch processing added with 4-bit, 8-bit, 16-bit and 32-bit

  • Supports Gradio share

  • RunPod installers and instructions are also added

  • The run process has been improved and made easier

  • Ignore the "Please 'pip install apex'" message

3 February 2024 Massive Update:

  • LLaVA captioner updated to the very latest version

  • LLaVA captioning installation and usage is now much more simplified

  • It now supports 7b, 13b, and the newest LLaVA v1.6 34b models with 4-bit, 8-bit, and 16-bit loading

  • 13b at 16-bit and 34b at 4-bit both work on an RTX 3090 (24 GB GPU)

  • LLaVA 34b requires 65 GB of disk space, 13b requires 25 GB, and 7b requires 13 GB

  • Supports batch LLaVA captioning with any model as well

  • Double-click the install .bat file to install

  • Then run the run_pt1, run_pt2, and run_pt3 files in order

  • The run_pt3 file will ask which model to load. After that, refresh the opened Gradio app and start using it

  • RunPod instructions are also updated; read runpod_instructions_READ.txt, which is now much more simplified and easier to use

  • A tutorial video for LLaVA is in production right now

  • Follow progress from the command line interface when doing batch processing

  • When doing batch processing, it will use the prompt you have entered in the prompt input textbox

  • Example caption prompt for Stable Diffusion training:

  • just caption the image with details, colors, items, objects, emotions, art style, drawing style and objects but do not add any description or comment. do not miss any item in the given image

  • If you want to change the default Hugging Face model download folder, set it as below (a quick verification sketch follows this list)

  • Start a new CMD as administrator, then execute:

  • setx HF_HOME "G:\HF_Models"
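
A quick way to verify where models will be cached after setting HF_HOME (a generic sketch; note that setx only affects newly opened terminals):

```python
import os
from pathlib import Path

# HF_HOME must be set before any Hugging Face library is imported in the process;
# setx makes it permanent for NEW terminals, while this sets it for the current run only.
os.environ.setdefault("HF_HOME", r"G:\HF_Models")

hf_home = Path(os.environ["HF_HOME"])
print("Models will be cached under:", hf_home / "hub")
```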

13 January 2024 Update:

  • CogVLM_v2 added to the attachments

  • Currently running on Windows only

  • Prerequisites are Python 3.10.x and C++ tools : https://youtu.be/-NjNy7afOQ0

  • Hopefully the CogVLM Gradio app will be improved, and a RunPod installer will be shared later

  • CogVLM is the strongest visual GPT right now : https://github.com/THUDM/CogVLM (a minimal captioning sketch follows this list)
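
For reference, a minimal CogVLM captioning sketch roughly following the THUDM/cogvlm-chat-hf model card. The helper build_conversation_input_ids comes from the model's remote code and may change between versions; the packaged Gradio app handles this plus quantized loading for you:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # CogVLM ships its own modeling code
).to("cuda").eval()

image = Image.open("example.jpg").convert("RGB")
query = "Describe the image in detail."
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])
inputs = {
    "input_ids": inputs["input_ids"].unsqueeze(0).to("cuda"),
    "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to("cuda"),
    "attention_mask": inputs["attention_mask"].unsqueeze(0).to("cuda"),
    "images": [[inputs["images"][0].to("cuda").to(torch.bfloat16)]],
}
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```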

26 November 2023 Update:

  • CoCa ViT-L captioning, BLIP-2 captioning, and a Clip_Interrogator Gradio Web UI supporting 115 CLIP models and 5 caption models are now attached as a single file

  • Download newest captioners_clip_interrogator_v2.zip

  • Follow progress from the command line interface when doing batch processing

  • If you want to change the default Hugging Face model download folder, set it as below

  • Start a new CMD as administrator, then execute:

  • setx HF_HOME "G:\HF_Models"

25 November 2023 Update:

20 October 2023 Update:

  • Fixed caption writing encoding error

  • Please redownload the all_files.zip file

13 October 2023 Huge Update:

New tutorial video > https://youtu.be/PNA9p94JmtY

Please also upvote this Reddit thread; I would appreciate it very much

  • Added new CLIP Interrogator which supports 90 CLIP vision models and 5 caption models

  • The Gradio app is improved. It supports batch processing and automatic generation of image captions

  • It also clears VRAM whenever you change the CLIP model or caption model selection (see the sketch after this list)

  • Recording a new tutorial video right now

  • Check below to see its power
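
Clearing VRAM on model switch essentially means dropping every reference to the old model and emptying the CUDA cache; a generic sketch (not the exact app code):

```python
import gc
import torch

current_model = None

def switch_model(load_fn):
    """Drop the currently loaded model, free its VRAM, then load a new one via load_fn()."""
    global current_model
    if current_model is not None:
        current_model = None       # drop the reference to the old model
        gc.collect()               # let Python actually free it
        torch.cuda.empty_cache()   # hand the freed VRAM back to the driver
    current_model = load_fn()
    return current_model
```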

Old Video tutorial > https://youtu.be/V8iDW8iprqU

If you also upvote this Reddit thread, I would appreciate it very much

Requirements:

  • Make sure that you have git and Python 3.10.x installed. I used Python 3.10.11

  • Here is a tutorial video : https://youtu.be/B5U7LJOvH6g

  • If you encounter any network-related problems during install or model download, use Cloudflare's WARP VPN (https://1.1.1.1/), which I use and which is totally free.

How To Install And Use

Use the .bat installer files for Windows and the .sh installer files for RunPod. Each zip file has instructions for how to use it on RunPod. Windows usage is very easy: just run the .bat files.

How To Use Caption Scripts On RunPod

RunPod Tutorial Starts At Min 14 : https://www.youtube.com/watch?v=PNA9p94JmtY

RunPod referral link : https://bit.ly/RunPodIO

Select RunPod Fast Stable Diffusion template

Edit the pod, expose HTTP ports, and add 7861

  • If you wish to delete the auto-downloaded models, run the command below first (optional)

  • rm -r auto-models

  • Upload runpod_install.sh into workspace folder

  • Open a new terminal and execute the commands below to install:

  • export HF_HOME="/workspace"

  • chmod +x runpod_install.sh

  • ./runpod_install.sh

After Install How To Run On RunPod

How to use RunPod and RunPodCTL tutorial >

https://youtu.be/QN1vdGhjcRc


  • Open a new terminal

  • Execute the commands below:

  • export HF_HOME="/workspace"

  • source venv/bin/activate

  • The code above activates the installed venv; now you can use the scripts

  • To start the Clip Interrogator Gradio Web UI, execute the command below

  • python Clip_Interrogator.py --share

  • Use --share on RunPod. It is not mandatory anymore; you can also use the RunPod proxy connection

  • The first time you run it, it will download the model and you may get an error message

  • After the model download completes, refresh and try again

  • You can watch the terminal you started it in to follow the download progress

  • To run the other captioners, edit their folder path for the RunPod version

  • E.g. edit half_precision_17_GB_VRAM.py and change the path as below (see the sketch after this list)

  • /workspace/test1
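
What that edit typically looks like (the variable name below is illustrative, not necessarily the one used in the script):

```python
# Illustrative edit inside e.g. half_precision_17_GB_VRAM.py (variable name assumed):
# point the batch input folder at your uploaded images on the RunPod volume
# instead of a Windows path.
batch_images_folder = "/workspace/test1"
```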

Download the files from the attachments below. captioners_clip_interrogator_v2.zip contains all files as a single zip.

Comments

diffusers

possible to run this on runpod?

So Sha

Do you believe these models are better than Blip or WD14 ?

Furkan Gözükara

yes they are definitely better. moreover i am working on adding ViT-bigG-14/laion2b_s39b_b160k to the scripts too; it is on a whole different level. both can be compared, used and tested

John Dopamine

Hi! This is "JohnDopamine" - Regarding that last question/answer: Would there be anyway (or benefit?) of trying to incorporate "llava" as a method to help caption? Hacksman had mentioned this online demo site: https://llava.hliu.cc/ - which works pretty well if you say "describe this image" and can follow up w/ questions etc. I'm not a coder so don't fully understand what is needed to get anything out of it (if it's even better than the models you have implemented). But I did see there was a commit an hour ago that said training code and dataset have been added. Maybe good news? Thanks for this either way ! code: https://github.com/haotian-liu/LLaVA

Furkan Gözükara

i am working on to it as well : https://github.com/haotian-liu/LLaVA/issues/521 also this is coming hopefully tomorrow : https://twitter.com/GozukaraFurkan/status/1711933282529452115

Anonymous

I tried the 8bit_precision_10_GB_VRAM script, but it's done a poor job: "a woman in a pink skirt and jacket posing"

Furkan Gözükara

hopefully this one coming today much stronger : https://twitter.com/GozukaraFurkan/status/1712093261249097924 i will update this post

Anonymous

Do you have any evidence of improvements in Dreambooth training of subjects or styles with captioning? Any comparisons?

Anonymous

please could you create instructions for Runpod, thanks!

mypatreonemailacc

Is it possible to run blip2 captioning on my laptop (CPU) ? I can run WD14 from the kohya repository on the cpu, and it's very fast.

Furkan Gözükara

it uses a lot of VRAM. i think it may run on RAM. how much RAM do you have? by the way, I added another captioning gradio which is superb. it also has lightweight models

San Milano

I tested blip2-flan-t5-xxl and it is amazing! It did a great job with images that were very complex

mypatreonemailacc

Do you have the link to the gradio with lightweight models? I think anything less than 10GB should be runnable on a modern laptop CPU.

mypatreonemailacc

BLIP isn't that great, WD14 is already a massive improvement over BLIP. But WD14 doesn't really create full sentences, just a bucket list of what it sees.

mypatreonemailacc

If you have a Automatic1111 UI in runpod, you can use these plugins: https://github.com/Tps-F/sd-webui-blip2.git

mypatreonemailacc

I use WD14 for captioning my datasets for Lora training. If you observe that BLIP2 is better, then let us know!

mypatreonemailacc

Llava is amazing when it comes to describing a picture: https://llava.hliu.cc It would be great to have a docker image or an installation guide for running it in batch mode. This could be a major game changer when captioning images for training SD models.

Anonymous

Works great with ViT-bigG-14/laion2b_s39b_b160k on 3090 Runpod.

Anonymous

Would it be possible to use llava on a runpod, maybe in batch processing?

Furkan Gözükara

i am waiting them to add windows support. after that hopefully will bring it to the runpod and windows both

mypatreonemailacc

I found this docker image that should be one-click for LLava in runpod: https://github.com/ashleykleynhans/llava-docker I have not tried it yet, but I will try it soon.

Anonymous

Is there a youtube video or something how to use this kind of caption for Lora?

Anonymous

When training a model through Lora, I understood, just as you explained in the video, that there are two methods: instance prompts, class prompts, and the use of captions. Is my understanding correct? If it is, then if I do understand correctly, which of the two methods do you think is better, and could you explain the reasons for your preference?

Furkan Gözükara

i prefer currently rare token + class token and not using captions. so like ohwx man or ohwx car or ohwx woman etc. rare token is ohwx and class is the thing you are training.

Đạt Nguyễn

Is there any suitable for analyzing images with facial expressions? For example, the character is opening his mouth, smiling, closing his mouth, being surprised...

Furkan Gözükara

i would try blip 2 + vit bigG 14 but that may not be sufficient. i plan to add llava to the collection as well hopefully soon

So Sha

How we can install it on ubuntu ?

Furkan Gözükara

so easy. bat files are just pip commands. make your venv and execute them 1 by 1. also you can use runpod_install.sh file which is designed for linux

Nenad Kuzmanovic

It would be soo helpfull if u make tutorial of how to change cache folder... Im struggling with free space on my system drive... There is explanation on Hugging face and i tried but something i did wrong and didt manage to make that work.

mypatreonemailacc

Which captioning do you recommend for running on the CPU on a laptop? I'm using WD14 right now, which is really fast. Is there anything similar?

So Sha

1. What's the reason we should use the CLIP model alongside the caption model? 2. Do you still believe that ViT-Big-GAN-14 combined with BLIP2-FLAN-T5 offers the best performance for captioning and training a face? 3. I used this combination, but I wasn't satisfied with the results. It added some unrelated and meaningless keywords.

Furkan Gözükara

1 : if you are doing fine tuning I would compare Blip 2 alone vs LLaVA 2 : for face i don't use any captions. just rare token + class. like ohwx man 3 : try answer 1

Anonymous

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. llava 1.1.3 requires bitsandbytes==0.41.0, but you have bitsandbytes 0.41.1 which is incompatible.

Anonymous

LLaVA Can it be batch processed?

Furkan Gözükara

i am not sure. i think it could be, but it already uses a decent amount of VRAM. if you mean batch processing a folder, my web app already has that. there is a batch process folder input textbox, use it. hopefully will make a tutorial soon

Anonymous

Hello. When trying to run the URL (http://127.0.0.1:40000) for LLaVA, I get this message from Chrome and Edge. I launched Part 1 then Part 2 then the Model (13b model - 8bit - 15470 MB VRAM). I have a 3090TI. { "detail": "Not Found" } Thanks...

Furkan Gözükara

can you checkout this video? i have shown there quickly : https://youtu.be/ZiUXf_idIR4 13:41 How to use LLaVA for captioning and obtaining prompt ideas and generating more amazing images

Furkan Gözükara

what have you installed so far? i have shown llava in this video : https://youtu.be/ZiUXf_idIR4 13:41 How to use LLaVA for captioning and obtaining prompt ideas and generating more amazing images hopefully i will make a full tutorial soon

Anonymous

It has been automatically installed and enabled, but there is no batch folder input text box. Is it because the automatically installed file is not the latest version, or have you not updated it yet? I have the same problem after reinstalling it several times.

Anonymous

I encountered the same problem; http://127.0.0.1:7860/ does work, but there is no place for batch operations or for entering folder addresses.

Anonymous

There should be a path batch processing box at the bottom of the page. I can confirm that batch processing works.

mypatreonemailacc

Do you have a Linux batch processing script for LLava?

Furkan Gözükara

hello you were right. i found the error. now added download.py. it will download the gradio app instead of using .bat which was failing in some cases. you can directly run it with python download.py . redownload installer zip file

Anonymous

Hi. Two questions: 1. I'm trying to use ViT-bigG-14/laion2b_s39b_b160k and blip2-2.7b but it's taking up all GPU vram on my 3090 and taking 2hrs for a single image. Is this right? 2. What is the best CLIP model for sd 1.5?

Furkan Gözükara

it shouldn't be that slow. i also have a 3090 and it works really well. i suggest you use LLaVA or Blip2; i think those 2 are the best right now for captioning

Anonymous

I'm trying to use BLIP captioning to read an image and then use the prompt later in a custom workflow in ComfyUI. As far as I've seen, using the WAS node suite I can use BLIP Model Loader and BLIP Analyse Image, so apparently I just need to add a BLIP model to the correct folder. I'm curious if you have any insight into this or tips.

Anonymous

Hi, I've had no problem using the app on Linux, but for some reason why I try using blip2-2.7b I get this: Loading caption model blip2-2.7b... Loading checkpoint shards: 0%| | 0/2 [00:00

Meito

how do we setup llava, i'm having trouble running it. which bat files do we run?

Meito

OSError: cannot write mode RGBA as JPEG when doing batch

Furkan Gözükara

how to setup is shown here : https://youtu.be/ZiUXf_idIR4 13:41 How to use LLaVA for captioning and obtaining prompt ideas and generating more amazing images

Meito

it seems png is affected, i converted them to jpg and it works now

Meito

Yep i just used fast stone to batch convert them to jpg

Anonymous

Batch Processing doesn't work with png files, I need to convert them to jpg and lose a bit of quality.

Anonymous

Hi! Great work and fantastic tutorial. I'm running into an issue however with the LLaVA chatbot. I installed it using your auto installer, but after loading the run bats and the model bat, I get an error "NETWORK ERROR DUE TO HIGH TRAFFIC, PLEASE REGENERATE OR REFRESH THIS PAGE." It seems to work for a second, and then fails. I try refreshing it a number of times and restarting, but no luck. Do you know how to fix this? Thanks again.

Furkan Gözükara

this happens when you skip 1 step. have you seen this video? i have shown there https://youtu.be/ZiUXf_idIR4 13:41 How to use LLaVA for captioning and obtaining prompt ideas and generating more amazing images

Thomas

Hello! I can't use this tool because bitsandbytes throw an error : "CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...". I do have the cuda toolkit installed.

Furkan Gözükara

did it still work? i think it should still work even with these warnings. i also get a lot of warnings. this usually happens since we use Windows and the libraries are optimized only for linux

Thomas

It worked only after I downgraded bitsandbyte to 0.41.1

Thomas

I only used the files inside your captioners_clip_interrogator.zip, it download the latest bitsandbyte which is currently 0.41.2.post2

Anonymous

CogVLM_v2: can it support batch labeling?

Thomas

I tried to run "13b model - 8bit - 15470 MB VRAM.bat". It said "don't forget to run all run_pt1.bat run_pt2.bat run_pt3.bat" but there's no "run_pt3.bat". Is it a typo?

Furkan Gözükara

Hello. I updated the file names to be clearer, you can redownload. The part is the model starting files, such as 13b model - 8bit - 15470 MB VRAM.bat

Anonymous

Hello! Why do I always get an error when starting the model in the third step of LLaVA captioning, e.g. part 3 - 7b model - 8bit - 8600 MB VRAM.bat? My graphics card is a 3090 and there is enough video memory, but my Win11 system directly reports an error and restarts. Can LLaVA captioning fail because of the Windows system?

Furkan Gözükara

Hello. as I replied your private message it is system error. error of windows. Need more debugging to figure out the reason

Hassan Alhassan

any way to speed up the LLava captioning? it takes a very long time to caption

Anonymous

After I restarted the machine and turned CogVLM back on, it says half an hour of loading time! Is this normal? 07:48 - 25:24, 254.01s/it. Loading the model is very slow on a cold start. Once it has started once, it is fine; then I can turn it off and on and it starts fast (I guess from cache).

Anonymous

3TB Toshiba, 7200 RPM with 64 MB cache, 6 Gbit/s, DT01ACA300. The Python cache is on this HDD (too large for the system SSD) and symlinked. And I get this message when starting CogVLM: Please 'pip install apex'. Apex is installed in the venv...

Furkan Gözükara

ye i also get apex message ignore it. what reading speed do you see on task manager when reading from the disk?

Anonymous

On average, it reads at 10-12 MB/s speed instead of 100. Sometimes it goes up to 20. As if the model loader reads the model slowly if it is not in the cache. I wrote a PM on Discord.

Nenad Kuzmanovic

Is it possible to add batch process for CogVLM_v2?

Nenad Kuzmanovic

Error processing scorn_ (99).png: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

Furkan Gözükara

do you get this error with llava or CogVLM? CogVLM 8-bit load is not working; I am trying to fix it. 4-bit load is working fine

Furkan Gözükara

Hello. Download latest V5. it works with 4bit 8bit and 16bit. i have tested. 16 bit uses more than 30GB VRAM

Nenad Kuzmanovic

I have installed CUDA 11.8 but installer wants to install 12.1. Is that maybe a problem here?

Furkan Gözükara

i would do this. uninstall all python cuda. restart computer. install exactly as shown in this video into C drive directly : https://youtu.be/-NjNy7afOQ0 also if you have a antivirus that could be preventing the installation

Nenad Kuzmanovic

Enter your choice (1-2): 1 WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for: PyTorch 2.2.0+cu121 with CUDA 1201 (you have 2.2.0+cu118) Python 3.10.11 (you have 3.10.9) Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers) Memory-efficient attention, SwiGLU, sparse and more won't be available. Set XFORMERS_MORE_DETAILS=1 for more details Traceback (most recent call last): File "H:\CogVLM\web_app_CogVLM.py", line 43, in model = AutoModelForCausalLM.from_pretrained( File "H:\CogVLM\CogVLM\venv\lib\site-packages\transformers\models\auto\auto_factory.py", line 561, in from_pretrained return model_class.from_pretrained( File "H:\CogVLM\CogVLM\venv\lib\site-packages\transformers\modeling_utils.py", line 3032, in from_pretrained raise ImportError( ImportError: Using `load_in_8bit=True` requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes `pip install -i https://test.pypi.org/simple/ bitsandbytes` or `pip install bitsandbytes`. Press any key to continue . . .

Nenad Kuzmanovic

Python is installed in root: C. I dont have antivirus, only Win 10 defender. But i have checked, Kohya installler has installed everything correctly, xformers, bitsandbytes etc....

Nenad Kuzmanovic

This last time, i went step by step and installed everything manualy. I edited links with cuda 118 (instead 121)...

Furkan Gözükara

this is not supposed to happen because it is supposed to install torch 2.2.0 with cu121, but i see it installed 2.2.0 cu118. my bat file contains this line: pip3 install torch==2.2.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 so you see it is installing cu121. also you should install python 3.10.11, but i see you have installed 3.10.9; i have shown this in the video. if you want older cuda use these: pip3 install torch==2.0.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 and pip install xformers==0.0.22

Joosheen

For me llava doesn't work unfortunately :/ I tried to run 6gb vram model. I can open part 1 and part 2 but part 3 is crashing. Gradio is opening but I have several errors: "The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable. warn("The installed version of bitsandbytes was compiled without GPU support. " and "No GPU found. A GPU is needed for quantization". Something is wrong with bitsandbytes?

Joosheen

It is working now, thanks ;) Sometimes there is a problem with a network error. For others: if you have a lower GPU, you need to edit gradio_web_server.py and change the timeout in "response = requests.post(worker_addr + "/worker_generate_stream", headers=headers, json=pload, stream=True, timeout=10)" from 10 to, for example, 500, and the error will not occur anymore :)

Nenad Kuzmanovic

It is working and yes, IT IS the best captioning model. Amazing how good it is

Nenad Kuzmanovic

I've noticed that the sentence construction of all of these VLMs is not very suitable for Stable Diffusion training, because it contains a lot of unnecessary words. Example:

The image showcases a futuristic or sci-fi setting with a central mechanical structure illuminated in green. Within this structure, there's a humanoid figure with a bald head, seemingly in a state of distress or unconsciousness. The environment appears to be a dimly lit room with large windows, possibly suggesting an industrial or laboratory setting. The art style leans towards realism with a touch of surrealism, given the juxtaposition of the mechanical structure and the human figure.

Total number of tokens: 99 - 101 (depends on which model is used; gpt-4 says 101). So, the undesirable words in this example are: "The image showcases", "or", "Within this structure", "possibly suggesting", "The art style leans towards realism with a touch of surrealism, given the juxtaposition of the mechanical structure and the human figure."

When those words are tokenized, multiple problems potentially arise:

1. Describing the style is not desirable when training a style, but also a character, because the model becomes inflexible and the transfer of the style present in the dataset is not achieved, and a character (person) is best trained only with a rare token trigger word. This may be good for concept training, but for that we already have LoRA and block weights.

2. The number of tokens is unnecessarily inflated, which can be a problem when generating images.

But this has YET to be tested; I am writing this from the experience I have with training models using filewords.

Nenad Kuzmanovic

I edited the command and the results are quite a bit better:

Question: just caption the image with details, colors, items and objects but do not add any description or comment. do not miss any item in the given image

Answer: The image showcases a long hallway with white walls and doors. At the end of the hallway, there's a large, ominous portal or gateway. This gateway is surrounded by red, grotesque, and tentacle-like structures. Within the gateway, a massive, menacing face with hollow eyes and sharp teeth is visible. A woman in a long white dress stands in front of the gateway, seemingly confronting or observing the face. The floor is wet, possibly from a recent flood or rain, and there are puddles of water scattered around.

Nenad Kuzmanovic

I think that can be fixed with custom python script, like in Kohyass/Finetune/clean captions and tags. In that way we can have separate script for style, concept etc.

Nenad Kuzmanovic

If you need any help for formating captions, just let me know. It would be great if that script for caption pruning can be integrated in your gradio UI, so we can easily change type of cleaning captions. I am constantly searching for the best caption tools and i am not satisfied with any of them, and if you manage to make it as i suggested, that WILL be gamechanger my friend...

Nenad Kuzmanovic

I'm testing CogVLM intensively and I already have some patterns for cleaning captions. For example, something that occurs in almost every caption: change "there's a" to "with a". That change mainly concerns adapting the sentence, in the sense of making it as concise as possible and matching the format that SD understands best. Dots "." should be deleted also.

Nenad Kuzmanovic

You are right, but nothing is stopping you from adding as many "terms" as you wish... I can do that, but could you write a short instruction on which lines in the script have to be changed...

leem0nchu

Will there be similar options (I understand not so powerful) for lowvram?

leem0nchu

I haven't. I guess I just didn't notice; the sample scripts in the post a few days ago showed 10.5 GB VRAM as the lowest.

Furkan Gözükara

I have added some newer stuff try them. especially blip + clip vision models uses really low VRAM if you have low VRAM.

leem0nchu

Awesome

Nenad Kuzmanovic

I'm getting this error with batch captioning Qwen: Caption for callisto_ (1).png generated in 1.59 seconds. Estimated time to complete: 348.22 seconds. Processed 14/93. Caption for callisto_ (11).png generated in 6.52 seconds. Estimated time to complete: 356.07 seconds. Processed 15/93. Caption for callisto_ (13).png generated in 3.75 seconds. Estimated time to complete: 347.66 seconds. Traceback (most recent call last): File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\queueing.py", line 495, in call_prediction output = await route_utils.call_process_api( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\route_utils.py", line 230, in call_process_api output = await app.get_blocks().process_api( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\blocks.py", line 1590, in process_api result = await self.call_function( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\blocks.py", line 1176, in call_function prediction = await anyio.to_thread.run_sync( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\anyio\to_thread.py", line 56, in run_sync return await get_async_backend().run_sync_in_worker_thread( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 2134, in run_sync_in_worker_thread return await future File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 851, in run result = context.run(func, *args) File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\utils.py", line 678, in wrapper response = f(*args, **kwargs) File "H:\Qwen_captioning\Qwen-VL-Chat.py", line 88, in batch_caption_images caption_file.write(response) File "C:\python3109\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode characters in position 215-217: character maps to Traceback (most recent call last): File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\queueing.py", line 495, in call_prediction output = await route_utils.call_process_api( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\route_utils.py", line 230, in call_process_api output = await app.get_blocks().process_api( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\blocks.py", line 1590, in process_api result = await self.call_function( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\blocks.py", line 1176, in call_function prediction = await anyio.to_thread.run_sync( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\anyio\to_thread.py", line 56, in run_sync return await get_async_backend().run_sync_in_worker_thread( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 2134, in run_sync_in_worker_thread return await future File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 851, in run result = context.run(func, *args) File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\utils.py", line 678, in wrapper response = f(*args, **kwargs) File "H:\Qwen_captioning\Qwen-VL-Chat.py", line 88, in batch_caption_images caption_file.write(response) File "C:\python3109\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode characters in position 571-573: character maps to

Nenad Kuzmanovic

hahha Qwen is so funny and ridiculous, here are some of it's captions: - Sorry, but I can't assist with that. - As an AI language model, I don't have access to images, but I can describe the scene you might be referring to. - As an AI language model, I don't have access to the image you are referring to, so I cannot provide a detailed description of it. However, if you could provide me with more information or context about the image, I would be happy to help you with your query.

leem0nchu

Getting this warning when using 4bit. Is this normal behavior? UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_type=torch.float32 (default). This will lead to slow inference or training speed. warnings.warn(f'Input type into Linear4bit is torch.float16, but bnb_4bit_compute_type=torch.float32 (default). This will lead to slow inference or training speed.') Over 2 minutes and it never generated a caption. I'm on a 4070 8GB CogLVMv6 I should have pointed out.

Anonymous

How do I use gpu1 instead of gpu0. I change set CUDA_VISIBLE_DEVICES=1 in run_pt3 but no luck. Thanks.

Anonymous

I was able to resolve setting the gpu. I have a 3080ti as my primary and a 3090ti as my second. For some reason when selecting #6 (13b Model - Load in 16 bit - 24 GB VRAM) and running LLaVa, I receive a NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE. (error_code: 1). I ran pt1, pt2 then pt3 several times with the same error. When I choose any of the 7B models, it works fine. The 3090ti is at (0 usage) by the way. My monitors are connected to my 3080ti at (0.3 usage).

Anonymous

im a little confused, which one do i run for windows? i installed like it said in the clip interrogator folder, but i downloaded lava and qwen ect, does each do the same thing? running a 3090 and primarily use SDXL for my lora training with Kohya ss

Furkan Gözükara

interesting. i have rtx 3060 as my second gpu and when i set cuda visible devices to 1 it works directly and load model into that GPU.

Furkan Gözükara

no they are all different captioners. you can test all and compare them. hopefully i am working on a massive tutorial for all

Nenad Kuzmanovic

Kosmos is IMHO the best captioner so far... generated captions requires just a bit of editing...

Anonymous

CogVLM_v6: after batch labeling, the captions cannot be used in DreamBooth; it prompts: Exception training model: ''NoneType' object is not subscriptable'.

Anonymous

Hi Doc, I had successfully installed CogVLM_v6 separately (2 days ago; I just needed CogVLM. As a visual designer, lengthy video tutorials are difficult for me, so I followed "runpod_instructions_READ" to perform the installation), but today it no longer works; I can neither open the 7861 port nor get the link after the installation is complete. Error message: (KeyError: 'inv_freq')

Furkan Gözükara

updated v7 and fixed the issue. you can either manually downgrade the transformers library or do a fresh install

Erik

Which one of these is the best? Also, I installed LLaVA but the program would crash when I get to run_pt3

Furkan Gözükara

each one has strengths and weaknesses. i plan to compare them all at a later time. kosmos 2 works pretty fast with low VRAM, you can try it. llava works perfectly. if you can show me the entire process of how you are running it, i can point out your error. hopefully i will make a tutorial for it

Additional Contributions

Doc, I downloaded the LLaVA installer separately and there is no window .bat file, can I only run it on the runpod?

Furkan Gözükara

it runs like this : run_pt1.bat, run_pt2.bat, run_pt3.bat. run all 3 of these in order and wait for each one

Walker4k

Following error when attempting to use clip interrogator (ViT-L-14/openai + blip-large) - Traceback (most recent call last): File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\transformers\feature_extraction_utils.py", line 182, in convert_to_tensors tensor = as_tensor(value) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\transformers\feature_extraction_utils.py", line 141, in as_tensor return torch.tensor(value) RuntimeError: Could not infer dtype of numpy.float32 During handling of the above exception, another exception occurred: Traceback (most recent call last): File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\gradio\queueing.py", line 541, in process_events response = await route_utils.call_process_api( File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\gradio\route_utils.py", line 276, in call_process_api output = await app.get_blocks().process_api( File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\gradio\blocks.py", line 1928, in process_api result = await self.call_function( File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\gradio\blocks.py", line 1514, in call_function prediction = await anyio.to_thread.run_sync( File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\anyio\to_thread.py", line 56, in run_sync return await get_async_backend().run_sync_in_worker_thread( File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 2177, in run_sync_in_worker_thread return await future File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 859, in run result = context.run(func, *args) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\gradio\utils.py", line 833, in wrapper response = f(*args, **kwargs) File "O:\captioning\captioners_clip_interrogator_v2\Clip_Interrogator.py", line 118, in image_to_prompt return ci.interrogate(image) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\clip_interrogator\clip_interrogator.py", line 244, in interrogate caption = caption or self.generate_caption(image) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\clip_interrogator\clip_interrogator.py", line 191, in generate_caption inputs = self.caption_processor(images=pil_image, return_tensors="pt").to(self.device) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\transformers\models\blip\processing_blip.py", line 103, in __call__ encoding_image_processor = self.image_processor(images, return_tensors=return_tensors) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\transformers\image_processing_utils.py", line 551, in __call__ return self.preprocess(images, **kwargs) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\transformers\models\blip\image_processing_blip.py", line 310, in preprocess encoded_outputs = BatchFeature(data={"pixel_values": images}, tensor_type=return_tensors) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\transformers\feature_extraction_utils.py", line 78, in __init__ self.convert_to_tensors(tensor_type=tensor_type) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\transformers\feature_extraction_utils.py", line 188, in convert_to_tensors raise ValueError( ValueError: Unable to create tensor, you should probably 
activate padding with 'padding=True' to have batched tensors with the same length.

Walker4k

I'll have a look at the YT vid. GPU is NVIDIA GeForce RTX 3090