

Patreon exclusive posts index

Join our Discord to get help, chat, and discuss, and tell me your Discord username to get your special rank : SECourses Discord

Please also Star, Watch, and Fork our Stable Diffusion & Generative AI GitHub repository

8 July 2024 Update:

  • CogVLM updated to V8 (CogVLM_v8.zip); a bug caused by a breaking upgrade of the transformers library has been fixed

  • A bug that prevented batch processing of all input images has been fixed

26 June 2024 Update:

  • Latest zip file : LLaVA_auto_install_v4.zip

  • LLaVA installers updated

  • Massed Compute installers added with instructions

  • Tested on both RunPod and Massed Compute; both work amazingly well

  • Make sure to wait until you see the loaded model in Gradio; keep refreshing Gradio until you see it before you start using it

  • The first time you use LLaVA it will download the model, so check the CMD window to see the download status

17 April 2024 Update:

  • Kosmos scripts updated to : Kosmos-2_v6.zip

  • The bug below has been fixed; please let me know if you still get this error (a minimal sketch of the underlying fix follows this list)

  • OSError: cannot write mode RGBA as JPEG

  • Just copy and paste web_app.py into your previous installation
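
For reference, the error above comes from trying to save an image with an alpha channel directly as JPEG. A minimal sketch of the kind of fix applied (generic Pillow code, not the actual web_app.py source):

```python
from PIL import Image

def save_as_jpeg(input_path: str, output_path: str) -> None:
    """Convert any image (including RGBA PNGs) to RGB and save it as JPEG."""
    image = Image.open(input_path)
    if image.mode != "RGB":
        image = image.convert("RGB")  # JPEG cannot store an alpha channel
    image.save(output_path, "JPEG", quality=95)

# Example: save_as_jpeg("input.png", "input.jpg")
```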

20 February 2024 Update:

  • 4-bit, 8-bit, 16-bit and 32-bit loading options added to the Kosmos-2

  • I think Kosmos-2 is the best captioning model if you have a low-VRAM GPU

  • The 4-bit version of Kosmos-2 uses only 2 GB of VRAM, and even the 32-bit version uses only 7.5 GB

  • Single-image captioning speed is now displayed as well

  • Skip existing captions option added. This skips image files that already have captions (see the sketch after this list)

  • Once batch captioning is completed, it now displays a "batch caption completed" message instead of an error

  • Hopefully I will add this skip option and status message to all captioners 
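
As a rough illustration of how a skip-existing-captions batch loop works, here is a generic sketch (not the packaged script; caption_image stands in for whichever captioner you use):

```python
import time
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}

def batch_caption(folder: str, caption_image, skip_existing: bool = True) -> None:
    """Caption every image in a folder, optionally skipping already-captioned files."""
    images = [p for p in Path(folder).iterdir() if p.suffix.lower() in IMAGE_EXTENSIONS]
    for image_path in images:
        caption_path = image_path.with_suffix(".txt")
        if skip_existing and caption_path.exists():
            print(f"Skipping {image_path.name}: caption already exists")
            continue
        start = time.time()
        caption = caption_image(image_path)  # placeholder for the actual captioner call
        # utf-8 avoids Windows charmap encoding errors when captions contain special characters
        caption_path.write_text(caption, encoding="utf-8")
        print(f"Captioned {image_path.name} in {time.time() - start:.2f} seconds")
    print("batch caption completed")
```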

15 February 2024 Update:

  • Another SOTA model, Microsoft's Kosmos-2, has been added to the scripts arsenal

  • Kosmos-2 : https://github.com/microsoft/unilm/tree/master/kosmos-2

  • I have modified it and added batch processing too (a minimal usage sketch follows this list)

  • Download Kosmos-2_v3.zip and extract it into any folder where you want to install

  • All our scripts create their own separate venv, so they will never conflict with other apps

  • Double-click install_windows.bat to install

  • It will install everything fully automatically

  • Then use the run_kosmos.bat file to start the app

  • For RunPod and Linux, follow runpod_instructions_READ.txt

  • Screenshots shared here : https://www.patreon.com/posts/98499462
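
For reference, a minimal single-image Kosmos-2 captioning sketch using the public microsoft/kosmos-2-patch14-224 checkpoint with the transformers library; the packaged scripts add batch processing, quantized loading, and a Gradio UI on top of something like this:

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "microsoft/kosmos-2-patch14-224"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id).to(device)

image = Image.open("example.jpg").convert("RGB")
prompt = "<grounding>An image of"  # Kosmos-2 grounding-style caption prompt
inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)

generated_ids = model.generate(
    pixel_values=inputs["pixel_values"],
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    image_embeds_position_mask=inputs["image_embeds_position_mask"],
    max_new_tokens=128,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
caption, entities = processor.post_process_generation(generated_text)
print(caption)
```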

12 February 2024 Massive Update:

  • captioners_clip_interrogator_v2.zip is an older file, so I no longer suggest using it

  • For BLIP-2 captioning, an amazing new Gradio app has been developed; it supports 4-bit, 8-bit, and 16-bit model loading (a minimal loading sketch follows this list)

  • Download blip2_captioning_v1.zip and install it with the .bat files on Windows or the .sh files on RunPod and Linux

  • A very detailed comparison of BLIP-2 captioning models, including their speed and VRAM usage, is published in this article : https://www.patreon.com/posts/98331590
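
A minimal sketch of how BLIP-2 can be loaded in 4-bit, 8-bit, or 16-bit with transformers and bitsandbytes, using the public Salesforce/blip2-opt-2.7b checkpoint as an example (the packaged Gradio app wraps this kind of loading behind its UI options):

```python
import torch
from PIL import Image
from transformers import BitsAndBytesConfig, Blip2ForConditionalGeneration, Blip2Processor

model_id = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(model_id)

# Pick one loading mode; lower precision means lower VRAM usage
quant_config = BitsAndBytesConfig(load_in_4bit=True)  # or load_in_8bit=True
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=quant_config,  # drop this and pass torch_dtype=torch.float16 for 16-bit
    device_map="auto",
)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to(model.device, torch.float16)
generated_ids = model.generate(**inputs, max_new_tokens=60)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```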

8 February 2024 Update:

  • Another SOTA vision model, Qwen-VL, added to our scripts arsenal

  • Download Qwen-VL_v3.zip and use the Windows install .bat file or follow the RunPod instructions

  • Supports half precision and 4-bit loading

6 February 2024 Update:

  • The LLaVA script was broken due to an incorrect bitsandbytes version; this is now fixed

  • Both LLaVA and CogVLM updated to the latest PyTorch, bitsandbytes, DeepSpeed, and Triton packages, along with xFormers

  • The LLaVA web UI input field now comes pre-filled with our default caption prompt

  • If you find a better prompt we can replace it 

5 February 2024 Massive Update:

  • CogVLM batch processing added with 4-bit, 8-bit, 16-bit and 32-bit

  • Supports Gradio share

  • RunPod installers and instructions are also added

  • The run process has been improved and made easier

  • Ignore the "Please 'pip install apex'" message

3 February 2024 Massive Update:

  • LLaVA captioner updated to the very latest version

  • LLaVA captioning installation and usage is now much more simplified

  • It now supports 7b, 13b, and the newest LLaVA v1.6 34b models with 4-bit, 8-bit, and 16-bit loading

  • 13b at 16-bit and 34b at 4-bit both work on an RTX 3090 (24 GB GPU)

  • LLaVA 34b requires 65 GB of disk space, 13b requires 25 GB, and 7b requires 13 GB

  • Supports batch LLaVA captioning with any model as well

  • Double-click the install .bat file to install

  • Then run the run_pt1, run_pt2, and run_pt3 files in order

  • The run_pt3 file will ask which model to load. After that, refresh the opened Gradio app and start using it

  • RunPod instructions are also updated; read runpod_instructions_READ.txt, which is now much more simplified and easier to use

  • A tutorial video for LLaVA is in production right now

  • Follow progress from the command line interface when doing batch processing

  • When doing batch processing, it will use the prompt you have entered in the prompt input textbox

  • Example caption prompt for Stable Diffusion training:

  • just caption the image with details, colors, items, objects, emotions, art style, drawing style and objects but do not add any description or comment. do not miss any item in the given image

  • If you want to change the default Hugging Face model download folder, set it as below (a quick verification sketch follows this list)

  • Start a new CMD as administrator, then execute:

  • setx HF_HOME "G:\HF_Models"
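
A quick way to verify where models will be cached after setting HF_HOME (a generic sketch; note that setx only affects newly opened terminals):

```python
import os
from pathlib import Path

# HF_HOME must be set before any Hugging Face library is imported in the process;
# setx makes it permanent for NEW terminals, while this sets it for the current run only.
os.environ.setdefault("HF_HOME", r"G:\HF_Models")

hf_home = Path(os.environ["HF_HOME"])
print("Models will be cached under:", hf_home / "hub")
```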

13 January 2024 Update:

  • CogVLM_v2 added to the attachments

  • Currently running on Windows only

  • Prerequisites are Python 3.10.x and C++ tools : https://youtu.be/-NjNy7afOQ0

  • Hopefully the CogVLM Gradio app will be improved, and a RunPod installer will be shared later

  • CogVLM is the strongest visual GPT right now : https://github.com/THUDM/CogVLM (a minimal captioning sketch follows this list)
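
For reference, a minimal CogVLM captioning sketch roughly following the THUDM/cogvlm-chat-hf model card. The helper build_conversation_input_ids comes from the model's remote code and may change between versions; the packaged Gradio app handles this plus quantized loading for you:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("lmsys/vicuna-7b-v1.5")
model = AutoModelForCausalLM.from_pretrained(
    "THUDM/cogvlm-chat-hf",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,  # CogVLM ships its own modeling code
).to("cuda").eval()

image = Image.open("example.jpg").convert("RGB")
query = "Describe the image in detail."
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])
inputs = {
    "input_ids": inputs["input_ids"].unsqueeze(0).to("cuda"),
    "token_type_ids": inputs["token_type_ids"].unsqueeze(0).to("cuda"),
    "attention_mask": inputs["attention_mask"].unsqueeze(0).to("cuda"),
    "images": [[inputs["images"][0].to("cuda").to(torch.bfloat16)]],
}
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```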

26 November 2023 Update:

  • CoCa ViT-L captioning, BLIP-2 captioning, and a Clip_Interrogator Gradio Web UI supporting 115 CLIP models and 5 caption models are now attached as a single file

  • Download newest captioners_clip_interrogator_v2.zip

  • Follow progress from the command line interface when doing batch processing

  • If you want to change the default Hugging Face model download folder, set it as below

  • Start a new CMD as administrator, then execute:

  • setx HF_HOME "G:\HF_Models"

25 November 2023 Update:

20 October 2023 Update:

  • Fixed caption writing encoding error

  • Please redownload the all_files.zip file

13 October 2023 Huge Update:

New tutorial video > https://youtu.be/PNA9p94JmtY

Please also upvote this Reddit thread; I would appreciate it very much

  • Added new CLIP Interrogator which supports 90 CLIP vision models and 5 caption models

  • The Gradio app is improved. It supports batch processing and automatic generation of image captions

  • It also clears VRAM whenever you change the CLIP model or caption model selection (see the sketch after this list)

  • Recording a new tutorial video right now

  • Check below to see its power
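
Clearing VRAM on model switch essentially means dropping every reference to the old model and emptying the CUDA cache; a generic sketch (not the exact app code):

```python
import gc
import torch

current_model = None

def switch_model(load_fn):
    """Drop the currently loaded model, free its VRAM, then load a new one via load_fn()."""
    global current_model
    if current_model is not None:
        current_model = None       # drop the reference to the old model
        gc.collect()               # let Python actually free it
        torch.cuda.empty_cache()   # hand the freed VRAM back to the driver
    current_model = load_fn()
    return current_model
```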

Old Video tutorial > https://youtu.be/V8iDW8iprqU

If you also upvote this Reddit thread, I would appreciate it very much

Requirements:

  • Make sure that you have git and Python 3.10.x installed. I used Python 3.10.11

  • Here is a tutorial video : https://youtu.be/B5U7LJOvH6g

  • If you encounter any network-related problems during install or model download, use Cloudflare's WARP VPN (https://1.1.1.1/), which I use and which is totally free.

How To Install And Use

Use the .bat installer files for Windows and the .sh installer files for RunPod. Each zip file has instructions for how to use it on RunPod. Windows usage is very easy: just run the .bat files.

How To Use Caption Scripts On RunPod

RunPod Tutorial Starts At Min 14 : https://www.youtube.com/watch?v=PNA9p94JmtY

RunPod referral link : https://bit.ly/RunPodIO

Select RunPod Fast Stable Diffusion template

Edit the pod, expose HTTP ports, and add 7861

  • If you wish to delete the auto-downloaded models, run the command below first (optional)

  • rm -r auto-models

  • Upload runpod_install.sh into workspace folder

  • Open a new terminal and execute the commands below to install:

  • export HF_HOME="/workspace"

  • chmod +x runpod_install.sh

  • ./runpod_install.sh

After Install How To Run On RunPod

How to use RunPod and RunPodCTL tutorial >

https://youtu.be/QN1vdGhjcRc


  • Open a new terminal

  • Execute the commands below:

  • export HF_HOME="/workspace"

  • source venv/bin/activate

  • The code above activates the installed venv; now you can use the scripts

  • To start the Clip Interrogator Gradio Web UI, execute the command below

  • python Clip_Interrogator.py --share

  • Use --share on RunPod. It is not mandatory anymore; you can also use the RunPod proxy connection

  • The first time you run it, it will download the model and you may get an error message

  • After the model download completes, refresh and try again

  • You can watch the terminal you started it in to follow the download progress

  • To run the other captioners, edit their folder path for the RunPod version

  • E.g. edit half_precision_17_GB_VRAM.py and change the path as below (see the sketch after this list)

  • /workspace/test1
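
What that edit typically looks like (the variable name below is illustrative, not necessarily the one used in the script):

```python
# Illustrative edit inside e.g. half_precision_17_GB_VRAM.py (variable name assumed):
# point the batch input folder at your uploaded images on the RunPod volume
# instead of a Windows path.
batch_images_folder = "/workspace/test1"
```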

Download the files from the attachments below. captioners_clip_interrogator_v2.zip contains all files as a single zip.

Comments

diffusers

possible to run this on runpod?

So Sha

Do you believe these models are better than Blip or WD14 ?

Furkan Gözükara

yes they are definitely better. moreover i am working on adding ViT-bigG-14/laion2b_s39b_b160k to the scripts too; it is on a whole different level. both can be compared, used and tested

John Dopamine

Hi! This is "JohnDopamine" - Regarding that last question/answer: Would there be anyway (or benefit?) of trying to incorporate "llava" as a method to help caption? Hacksman had mentioned this online demo site: https://llava.hliu.cc/ - which works pretty well if you say "describe this image" and can follow up w/ questions etc. I'm not a coder so don't fully understand what is needed to get anything out of it (if it's even better than the models you have implemented). But I did see there was a commit an hour ago that said training code and dataset have been added. Maybe good news? Thanks for this either way ! code: https://github.com/haotian-liu/LLaVA

Furkan Gözükara

i am working on to it as well : https://github.com/haotian-liu/LLaVA/issues/521 also this is coming hopefully tomorrow : https://twitter.com/GozukaraFurkan/status/1711933282529452115

Anonymous

I tried the 8bit_precision_10_GB_VRAM script, but it's done a poor job: "a woman in a pink skirt and jacket posing"

Furkan Gözükara

hopefully this one coming today much stronger : https://twitter.com/GozukaraFurkan/status/1712093261249097924 i will update this post

Anonymous

Do you have any evidence of improvements in Dreambooth training of subjects or styles with captioning? Any comparisons?

Anonymous

please could you create instructions for Runpod, thanks!

mypatreonemailacc

Is it possible to run blip2 captioning on my laptop (CPU) ? I can run WD14 from the kohya repository on the cpu, and it's very fast.

Furkan Gözükara

it uses a lot of VRAM. i think it may run on RAM. how much RAM do you have? by the way, I added another captioning gradio which is superb. it also has lightweight models

San Milano

I tested blip2-flan-t5-xxl and it is amazing! It did a great job with images that were very complex

mypatreonemailacc

Do you have the link to the gradio with lightweight models? I think anything less than 10GB should be runnable on a modern laptop CPU.

mypatreonemailacc

BLIP isn't that great, WD14 is already a massive improvement over BLIP. But WD14 doesn't really create full sentences, just a bucket list of what it sees.

mypatreonemailacc

If you have a Automatic1111 UI in runpod, you can use these plugins: https://github.com/Tps-F/sd-webui-blip2.git

mypatreonemailacc

I use WD14 for captioning my datasets for Lora training. If you observe that BLIP2 is better, then let us know!

mypatreonemailacc

Llava is amazing when it comes to describing a picture: https://llava.hliu.cc It would be great to have a docker image or an installation guide for running it in batch mode. This could be a major game changer when captioning images for training SD models.

Anonymous

Works great with ViT-bigG-14/laion2b_s39b_b160k on 3090 Runpod.

Anonymous

Would it be possible to use llava on a runpod, maybe in batch processing?

Furkan Gözükara

i am waiting them to add windows support. after that hopefully will bring it to the runpod and windows both

mypatreonemailacc

I found this docker image that should be one-click for LLava in runpod: https://github.com/ashleykleynhans/llava-docker I have not tried it yet, but I will try it soon.

Anonymous

Is there a youtube video or something how to use this kind of caption for Lora?

Anonymous

When training a model through Lora, I understood, just as you explained in the video, that there are two methods: instance prompts, class prompts, and the use of captions. Is my understanding correct? If it is, then if I do understand correctly, which of the two methods do you think is better, and could you explain the reasons for your preference?

Furkan Gözükara

i prefer currently rare token + class token and not using captions. so like ohwx man or ohwx car or ohwx woman etc. rare token is ohwx and class is the thing you are training.

Đạt Nguyễn

Is there any suitable for analyzing images with facial expressions? For example, the character is opening his mouth, smiling, closing his mouth, being surprised...

Furkan Gözükara

i would try blip 2 + vit bigG 14 but that may not be sufficient. i plan to add llava to the collection as well hopefully soon

So Sha

How we can install it on ubuntu ?

Furkan Gözükara

so easy. bat files are just pip commands. make your venv and execute them 1 by 1. also you can use runpod_install.sh file which is designed for linux

Nenad Kuzmanovic

It would be soo helpfull if u make tutorial of how to change cache folder... Im struggling with free space on my system drive... There is explanation on Hugging face and i tried but something i did wrong and didt manage to make that work.

mypatreonemailacc

Which captioning do you recommend for running on the CPU on a laptop? I'm using WD14 right now, which is really fast. Is there anything similar?

So Sha

1. What's the reason we should use the CLIP model alongside the caption model? 2. Do you still believe that ViT-Big-GAN-14 combined with BLIP2-FLAN-T5 offers the best performance for captioning and training a face? 3. I used this combination, but I wasn't satisfied with the results. It added some unrelated and meaningless keywords.

Furkan Gözükara

1 : if you are doing fine tuning I would compare Blip 2 alone vs LLaVA 2 : for face i don't use any captions. just rare token + class. like ohwx man 3 : try answer 1

Anonymous

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. llava 1.1.3 requires bitsandbytes==0.41.0, but you have bitsandbytes 0.41.1 which is incompatible.

Anonymous

LLaVA Can it be batch processed?

Furkan Gözükara

i am not sure. i think it could be, but it already uses a decent amount of VRAM. if you mean batch processing a folder, my web app already has that. there is a batch process folder input textbox, use it. hopefully will make a tutorial soon

Anonymous

Hello. When trying to run the URL (http://127.0.0.1:40000) for LLaVA, I get this message from Chrome and Edge. I launched Part 1 then Part 2 then the Model (13b model - 8bit - 15470 MB VRAM). I have a 3090TI. { "detail": "Not Found" } Thanks...

Furkan Gözükara

can you checkout this video? i have shown there quickly : https://youtu.be/ZiUXf_idIR4 13:41 How to use LLaVA for captioning and obtaining prompt ideas and generating more amazing images

Furkan Gözükara

what have you installed so far? i have shown llava in this video : https://youtu.be/ZiUXf_idIR4 13:41 How to use LLaVA for captioning and obtaining prompt ideas and generating more amazing images hopefully i will make a full tutorial soon

Anonymous

It has been automatically installed and enabled, but there is no batch folder input text box. Is it because the automatically installed file is not the latest version, or have you not updated it yet? I have the same problem after reinstalling it several times.

Anonymous

I encountered the same problem; http://127.0.0.1:7860/ does work, but there is no place for batch operations or for entering folder addresses.

Anonymous

There should be a path batch processing box at the bottom of the page. I can confirm that batch processing works.

mypatreonemailacc

Do you have a Linux batch processing script for LLava?

Furkan Gözükara

hello you were right. i found the error. now added download.py. it will download the gradio app instead of using .bat which was failing in some cases. you can directly run it with python download.py . redownload installer zip file

Anonymous

Hi. Two questions: 1. I'm trying to use ViT-bigG-14/laion2b_s39b_b160k and blip2-2.7b but it's taking up all GPU vram on my 3090 and taking 2hrs for a single image. Is this right? 2. What is the best CLIP model for sd 1.5?

Furkan Gözükara

it shouldn't be that slow. i also have a 3090 and it works really well. i suggest you use LLaVA or Blip2; i think those 2 are the best right now for captioning

Anonymous

I'm trying to use BLIP captioning to read an image and then use the prompt later in a custom workflow in ComfyUI. As far as I've seen, using the WAS node suite I can use BLIP Model Loader and BLIP Analyse Image, so apparently I just need to add a BLIP model to the correct folder. I'm curious if you have any insight into this or tips.

Anonymous

Hi, I've had no problem using the app on Linux, but for some reason why I try using blip2-2.7b I get this: Loading caption model blip2-2.7b... Loading checkpoint shards: 0%| | 0/2 [00:00

Meito

how do we setup llava, i'm having trouble running it. which bat files do we run?

Meito

OSError: cannot write mode RGBA as JPEG when doing batch

Furkan Gözükara

how to setup is shown here : https://youtu.be/ZiUXf_idIR4 13:41 How to use LLaVA for captioning and obtaining prompt ideas and generating more amazing images

Meito

it seems png is affected, i converted them to jpg and it works now

Meito

Yep i just used fast stone to batch convert them to jpg

Anonymous

Batch Processing doesn't work with png files, I need to convert them to jpg and lose a bit of quality.

Anonymous

Hi! Great work and fantastic tutorial. I'm running into an issue however with the LLaVA chatbot. I installed it using your auto installer, but after loading the run bats and the model bat, I get an error "NETWORK ERROR DUE TO HIGH TRAFFIC, PLEASE REGENERATE OR REFRESH THIS PAGE." It seems to work for a second, and then fails. I try refreshing it a number of times and restarting, but no luck. Do you know how to fix this? Thanks again.

Furkan Gözükara

this happens when you skip 1 step. have you seen this video? i have shown there https://youtu.be/ZiUXf_idIR4 13:41 How to use LLaVA for captioning and obtaining prompt ideas and generating more amazing images

Thomas

Hello! I can't use this tool because bitsandbytes throw an error : "CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching in backup paths...". I do have the cuda toolkit installed.

Furkan Gözükara

did it still work? i think it should still work even with these warnings. i also get a lot of warnings. this usually happens since we use Windows and the libraries are optimized only for linux

Thomas

It worked only after I downgraded bitsandbyte to 0.41.1

Thomas

I only used the files inside your captioners_clip_interrogator.zip, it download the latest bitsandbyte which is currently 0.41.2.post2

Anonymous

CogVLM_v2: can it support batch labeling?

Thomas

I tried to run "13b model - 8bit - 15470 MB VRAM.bat". It said "don't forget to run all run_pt1.bat run_pt2.bat run_pt3.bat" but there's no "run_pt3.bat". Is it a typo?

Furkan Gözükara

Hello. I updated the file names to be clearer, you can redownload. The part is the model starting files, such as 13b model - 8bit - 15470 MB VRAM.bat

Anonymous

Hello! Why do I always get an error when starting the model in the third step of LLaVA captioning, e.g. part 3 - 7b model - 8bit - 8600 MB VRAM.bat? My graphics card is a 3090 and there is enough video memory, but my Win11 system directly reports an error and restarts. Can LLaVA captioning fail because of the Windows system?

Furkan Gözükara

Hello. as I replied your private message it is system error. error of windows. Need more debugging to figure out the reason

Hassan Alhassan

any way to speed up the LLava captioning? it takes a very long time to caption

Anonymous

After I restarted the machine and turned CogVLM back on, it says half an hour of loading time! Is this normal? 07:48 - 25:24, 254.01s/it. Loading the model is very slow on a cold start. Once it has started once, it is fine; then I can turn it off and on and it starts fast (I guess from cache).

Anonymous

3TB Toshiba, 7200 RPM with 64 MB cache, 6 Gbit/s, DT01ACA300. The Python cache is on this HDD (too large for the system SSD) and symlinked. And I get this message when starting CogVLM: Please 'pip install apex'. Apex is installed in the venv...

Furkan Gözükara

ye i also get apex message ignore it. what reading speed do you see on task manager when reading from the disk?

Anonymous

On average, it reads at 10-12 MB/s speed instead of 100. Sometimes it goes up to 20. As if the model loader reads the model slowly if it is not in the cache. I wrote a PM on Discord.

Nenad Kuzmanovic

Is it possible to add batch process for CogVLM_v2?

Nenad Kuzmanovic

Error processing scorn_ (99).png: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

Furkan Gözükara

do you get this error with llava or CogVLM? CogVLM 8-bit load is not working; I am trying to fix it. 4-bit load is working fine

Furkan Gözükara

Hello. Download latest V5. it works with 4bit 8bit and 16bit. i have tested. 16 bit uses more than 30GB VRAM

Nenad Kuzmanovic

I have installed CUDA 11.8 but installer wants to install 12.1. Is that maybe a problem here?

Furkan Gözükara

i would do this. uninstall all python cuda. restart computer. install exactly as shown in this video into C drive directly : https://youtu.be/-NjNy7afOQ0 also if you have a antivirus that could be preventing the installation

Nenad Kuzmanovic

Enter your choice (1-2): 1 WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for: PyTorch 2.2.0+cu121 with CUDA 1201 (you have 2.2.0+cu118) Python 3.10.11 (you have 3.10.9) Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers) Memory-efficient attention, SwiGLU, sparse and more won't be available. Set XFORMERS_MORE_DETAILS=1 for more details Traceback (most recent call last): File "H:\CogVLM\web_app_CogVLM.py", line 43, in model = AutoModelForCausalLM.from_pretrained( File "H:\CogVLM\CogVLM\venv\lib\site-packages\transformers\models\auto\auto_factory.py", line 561, in from_pretrained return model_class.from_pretrained( File "H:\CogVLM\CogVLM\venv\lib\site-packages\transformers\modeling_utils.py", line 3032, in from_pretrained raise ImportError( ImportError: Using `load_in_8bit=True` requires Accelerate: `pip install accelerate` and the latest version of bitsandbytes `pip install -i https://test.pypi.org/simple/ bitsandbytes` or `pip install bitsandbytes`. Press any key to continue . . .

Nenad Kuzmanovic

Python is installed in root: C. I dont have antivirus, only Win 10 defender. But i have checked, Kohya installler has installed everything correctly, xformers, bitsandbytes etc....

Nenad Kuzmanovic

This last time, i went step by step and installed everything manualy. I edited links with cuda 118 (instead 121)...

Furkan Gözükara

this is not supposed to happen because it is supposed to install torch 2.2.0 with cu121, but i see it installed 2.2.0 cu118. my bat file contains this line: pip3 install torch==2.2.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 so you see it is installing cu121. also you should install python 3.10.11, but i see you have installed 3.10.9; i have shown this in the video. if you want older cuda use these: pip3 install torch==2.0.1 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 and pip install xformers==0.0.22

Joosheen

For me llava doesn't work unfortunately :/ I tried to run 6gb vram model. I can open part 1 and part 2 but part 3 is crashing. Gradio is opening but I have several errors: "The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable. warn("The installed version of bitsandbytes was compiled without GPU support. " and "No GPU found. A GPU is needed for quantization". Something is wrong with bitsandbytes?

Joosheen

It is working now, thanks ;) Sometimes there is a problem with a network error. For others: if you have a lower GPU, you need to edit gradio_web_server.py and change the timeout in "response = requests.post(worker_addr + "/worker_generate_stream", headers=headers, json=pload, stream=True, timeout=10)" from 10 to, for example, 500, and the error will not occur anymore :)

Nenad Kuzmanovic

It is working and yes, IT IS the best captioning model. Amazing how good it is

Nenad Kuzmanovic

I've noticed that the sentence construction of all of these VLMs is not very suitable for Stable Diffusion training, because it contains a lot of unnecessary words. Example:

The image showcases a futuristic or sci-fi setting with a central mechanical structure illuminated in green. Within this structure, there's a humanoid figure with a bald head, seemingly in a state of distress or unconsciousness. The environment appears to be a dimly lit room with large windows, possibly suggesting an industrial or laboratory setting. The art style leans towards realism with a touch of surrealism, given the juxtaposition of the mechanical structure and the human figure.

Total number of tokens: 99 - 101 (depends on which model is used; gpt-4 says 101). So, the undesirable words in this example are: "The image showcases", "or", "Within this structure", "possibly suggesting", "The art style leans towards realism with a touch of surrealism, given the juxtaposition of the mechanical structure and the human figure."

When those words are tokenized, multiple problems potentially arise:

1. Describing the style is not desirable when training a style, but also a character, because the model becomes inflexible and the transfer of the style present in the dataset is not achieved, and a character (person) is best trained only with a rare token trigger word. This may be good for concept training, but for that we already have LoRA and block weights.

2. The number of tokens is unnecessarily inflated, which can be a problem when generating images.

But this has YET to be tested; I am writing this from the experience I have with training models using filewords.

Nenad Kuzmanovic

I edited the command and the results are quite a bit better:

Question: just caption the image with details, colors, items and objects but do not add any description or comment. do not miss any item in the given image

Answer: The image showcases a long hallway with white walls and doors. At the end of the hallway, there's a large, ominous portal or gateway. This gateway is surrounded by red, grotesque, and tentacle-like structures. Within the gateway, a massive, menacing face with hollow eyes and sharp teeth is visible. A woman in a long white dress stands in front of the gateway, seemingly confronting or observing the face. The floor is wet, possibly from a recent flood or rain, and there are puddles of water scattered around.

Nenad Kuzmanovic

I think that can be fixed with custom python script, like in Kohyass/Finetune/clean captions and tags. In that way we can have separate script for style, concept etc.

Nenad Kuzmanovic

If you need any help for formating captions, just let me know. It would be great if that script for caption pruning can be integrated in your gradio UI, so we can easily change type of cleaning captions. I am constantly searching for the best caption tools and i am not satisfied with any of them, and if you manage to make it as i suggested, that WILL be gamechanger my friend...

Nenad Kuzmanovic

I'm testing CogVLM intensively and I already have some patterns for cleaning captions. For example, something that occurs in almost every caption: change "there's a" to "with a". That change mainly concerns adapting the sentence, in the sense of making it as concise as possible and matching the format that SD understands best. Dots "." should be deleted also.

Nenad Kuzmanovic

You are right, but nothing is stopping you from adding as many "terms" as you wish... I can do that, but could you write a short instruction on which lines in the script have to be changed...

leem0nchu

Will there be similar options (I understand not so powerful) for lowvram?

leem0nchu

I haven't. I guess I just didn't notice; the sample scripts in the post a few days ago showed 10.5 GB VRAM as the lowest.

Furkan Gözükara

I have added some newer stuff try them. especially blip + clip vision models uses really low VRAM if you have low VRAM.

leem0nchu

Awesome

Nenad Kuzmanovic

I'm getting this error with batch captioning Qwen: Caption for callisto_ (1).png generated in 1.59 seconds. Estimated time to complete: 348.22 seconds. Processed 14/93. Caption for callisto_ (11).png generated in 6.52 seconds. Estimated time to complete: 356.07 seconds. Processed 15/93. Caption for callisto_ (13).png generated in 3.75 seconds. Estimated time to complete: 347.66 seconds. Traceback (most recent call last): File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\queueing.py", line 495, in call_prediction output = await route_utils.call_process_api( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\route_utils.py", line 230, in call_process_api output = await app.get_blocks().process_api( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\blocks.py", line 1590, in process_api result = await self.call_function( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\blocks.py", line 1176, in call_function prediction = await anyio.to_thread.run_sync( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\anyio\to_thread.py", line 56, in run_sync return await get_async_backend().run_sync_in_worker_thread( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 2134, in run_sync_in_worker_thread return await future File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 851, in run result = context.run(func, *args) File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\utils.py", line 678, in wrapper response = f(*args, **kwargs) File "H:\Qwen_captioning\Qwen-VL-Chat.py", line 88, in batch_caption_images caption_file.write(response) File "C:\python3109\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode characters in position 215-217: character maps to Traceback (most recent call last): File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\queueing.py", line 495, in call_prediction output = await route_utils.call_process_api( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\route_utils.py", line 230, in call_process_api output = await app.get_blocks().process_api( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\blocks.py", line 1590, in process_api result = await self.call_function( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\blocks.py", line 1176, in call_function prediction = await anyio.to_thread.run_sync( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\anyio\to_thread.py", line 56, in run_sync return await get_async_backend().run_sync_in_worker_thread( File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 2134, in run_sync_in_worker_thread return await future File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 851, in run result = context.run(func, *args) File "H:\Qwen_captioning\Qwen-VL\venv\lib\site-packages\gradio\utils.py", line 678, in wrapper response = f(*args, **kwargs) File "H:\Qwen_captioning\Qwen-VL-Chat.py", line 88, in batch_caption_images caption_file.write(response) File "C:\python3109\lib\encodings\cp1252.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode characters in position 571-573: character maps to

Nenad Kuzmanovic

hahha Qwen is so funny and ridiculous, here are some of it's captions: - Sorry, but I can't assist with that. - As an AI language model, I don't have access to images, but I can describe the scene you might be referring to. - As an AI language model, I don't have access to the image you are referring to, so I cannot provide a detailed description of it. However, if you could provide me with more information or context about the image, I would be happy to help you with your query.

leem0nchu

Getting this warning when using 4bit. Is this normal behavior? UserWarning: Input type into Linear4bit is torch.float16, but bnb_4bit_compute_type=torch.float32 (default). This will lead to slow inference or training speed. warnings.warn(f'Input type into Linear4bit is torch.float16, but bnb_4bit_compute_type=torch.float32 (default). This will lead to slow inference or training speed.') Over 2 minutes and it never generated a caption. I'm on a 4070 8GB CogLVMv6 I should have pointed out.

Anonymous

How do I use gpu1 instead of gpu0. I change set CUDA_VISIBLE_DEVICES=1 in run_pt3 but no luck. Thanks.

Anonymous

I was able to resolve setting the gpu. I have a 3080ti as my primary and a 3090ti as my second. For some reason when selecting #6 (13b Model - Load in 16 bit - 24 GB VRAM) and running LLaVa, I receive a NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE. (error_code: 1). I ran pt1, pt2 then pt3 several times with the same error. When I choose any of the 7B models, it works fine. The 3090ti is at (0 usage) by the way. My monitors are connected to my 3080ti at (0.3 usage).

Anonymous

im a little confused, which one do i run for windows? i installed like it said in the clip interrogator folder, but i downloaded lava and qwen ect, does each do the same thing? running a 3090 and primarily use SDXL for my lora training with Kohya ss

Furkan Gözükara

interesting. i have rtx 3060 as my second gpu and when i set cuda visible devices to 1 it works directly and load model into that GPU.

Furkan Gözükara

no they are all different captioners. you can test all and compare them. hopefully i am working on a massive tutorial for all

Nenad Kuzmanovic

Kosmos is IMHO the best captioner so far... generated captions requires just a bit of editing...

Anonymous

CogVLM_v6: after batch labeling, the captions cannot be used in DreamBooth; it prompts: Exception training model: ''NoneType' object is not subscriptable'.

Anonymous

Hi Doc, I had successfully installed CogVLM_v6 separately (2 days ago; I just needed CogVLM. As a visual designer, lengthy video tutorials are difficult for me, so I followed "runpod_instructions_READ" to perform the installation), but today it no longer works; I can neither open the 7861 port nor get the link after the installation is complete. Error message: (KeyError: 'inv_freq')

Furkan Gözükara

updated v7 and fixed the issue. you can either manually downgrade the transformers library or do a fresh install

Erik

Which one of these is the best? Also, I installed LLaVA but the program would crash when I get to run_pt3

Furkan Gözükara

each one has strengths and weaknesses. i plan to compare them all at a later time. kosmos 2 works pretty fast with low VRAM, you can try it. llava works perfectly. if you can show me the entire process of how you are running it, i can point out your error. hopefully i will make a tutorial for it

Additional Contributions

Doc, I downloaded the LLaVA installer separately and there is no window .bat file, can I only run it on the runpod?

Furkan Gözükara

it runs like this : run_pt1.bat, run_pt2.bat, run_pt3.bat. run all 3 of these in order and wait for each one

Walker4k

Following error when attempting to use clip interrogator (ViT-L-14/openai + blip-large) - Traceback (most recent call last): File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\transformers\feature_extraction_utils.py", line 182, in convert_to_tensors tensor = as_tensor(value) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\transformers\feature_extraction_utils.py", line 141, in as_tensor return torch.tensor(value) RuntimeError: Could not infer dtype of numpy.float32 During handling of the above exception, another exception occurred: Traceback (most recent call last): File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\gradio\queueing.py", line 541, in process_events response = await route_utils.call_process_api( File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\gradio\route_utils.py", line 276, in call_process_api output = await app.get_blocks().process_api( File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\gradio\blocks.py", line 1928, in process_api result = await self.call_function( File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\gradio\blocks.py", line 1514, in call_function prediction = await anyio.to_thread.run_sync( File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\anyio\to_thread.py", line 56, in run_sync return await get_async_backend().run_sync_in_worker_thread( File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 2177, in run_sync_in_worker_thread return await future File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\anyio\_backends\_asyncio.py", line 859, in run result = context.run(func, *args) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\gradio\utils.py", line 833, in wrapper response = f(*args, **kwargs) File "O:\captioning\captioners_clip_interrogator_v2\Clip_Interrogator.py", line 118, in image_to_prompt return ci.interrogate(image) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\clip_interrogator\clip_interrogator.py", line 244, in interrogate caption = caption or self.generate_caption(image) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\clip_interrogator\clip_interrogator.py", line 191, in generate_caption inputs = self.caption_processor(images=pil_image, return_tensors="pt").to(self.device) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\transformers\models\blip\processing_blip.py", line 103, in __call__ encoding_image_processor = self.image_processor(images, return_tensors=return_tensors) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\transformers\image_processing_utils.py", line 551, in __call__ return self.preprocess(images, **kwargs) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\transformers\models\blip\image_processing_blip.py", line 310, in preprocess encoded_outputs = BatchFeature(data={"pixel_values": images}, tensor_type=return_tensors) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\transformers\feature_extraction_utils.py", line 78, in __init__ self.convert_to_tensors(tensor_type=tensor_type) File "O:\captioning\captioners_clip_interrogator_v2\venv\lib\site-packages\transformers\feature_extraction_utils.py", line 188, in convert_to_tensors raise ValueError( ValueError: Unable to create tensor, you should probably 
activate padding with 'padding=True' to have batched tensors with the same length.

Walker4k

I'll have a look at the YT vid. GPU is NVIDIA GeForce RTX 3090