This post is for video: https://youtu.be/svzd5d1LXGk

If you're looking to install Unstructured for LangChain's document loaders, then you're in the right place. In this guide, we'll walk you through the step-by-step process of installing Unstructured and its dependencies, including LangChain and OpenAI.

  1. Create a new environment. To begin, create a new environment with a Python version lower than 3.11. You can do this by running the following command:
conda create -n unstructured python=3.10

Once the environment is created, activate it using:

conda activate unstructured
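As a quick sanity check once the environment is active, a short Python script can confirm that the interpreter is older than 3.11. This is only a sketch: the helper name `is_supported` is made up for illustration and is not part of any package mentioned in this guide.

```python
import sys

# Upper bound taken from the step above: use a Python version lower than 3.11.
MAX_EXCLUSIVE = (3, 11)


def is_supported(version=None):
    """Return True if the given (major, minor) version is below 3.11."""
    if version is None:
        version = sys.version_info[:2]
    return tuple(version[:2]) < MAX_EXCLUSIVE


current = ".".join(str(part) for part in sys.version_info[:3])
status = "OK" if is_supported() else "too new for this guide"
print(f"Python {current}: {status}")
```

Run it with the `python` from the activated environment; if it reports a version that is too new, the wrong interpreter is probably active.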
  2. Upgrade pip and setuptools. Before installing any packages, it's a good idea to upgrade pip and setuptools. To do this, run the following commands:
pip install --upgrade setuptools
python.exe -m pip install --upgrade pip

If you encounter a permission error during the pip upgrade, try this command instead:

python.exe -m pip install --upgrade pip --user
  3. Install LangChain and OpenAI. Next, install LangChain and OpenAI by running:
pip install langchain
pip install openai

Be sure to read the installation instructions for these two packages carefully before beginning.

  4. Install Git. To install Git, visit the following website and download the appropriate version for your operating system: https://git-scm.com/download/win
  5. Install unstructured[local-inference]. To install Unstructured, run the following command:
pip install unstructured[local-inference]

This command will take some time to complete. If you encounter any numpy errors, then upgrade numpy by running:

pip install numpy --upgrade

You'll also need to install Cython and PyTorch:

pip install cython
pip3 install torch torchvision torchaudio
  6. Install Detectron2. For the Detectron2 installation, you can follow the instructions at this link: https://haroonshakeel.medium.com/detectron2-setup-on-windows-10-and-linux-407e5382df1
Alternatively, you can clone the Detectron2 repository and install it from source as follows:
git clone https://github.com/facebookresearch/detectron2.git
cd detectron2
pip install -e .
cd ..
pip install opencv-python
  7. Install layoutparser. To install layoutparser, run:
pip install layoutparser[layoutmodels,tesseract]
  8. Install other dependencies. Install the other dependencies required by Unstructured by running the following commands:
pip install python-magic
pip install python-magic-bin

You'll also need to download and install Poppler and Tesseract. Download 7-Zip (https://www.7-zip.org/), use it to extract the Poppler archive, and place the extracted folder in your working directory. Then add Poppler's bin folder to your PATH. Next, download and install Tesseract from this page: https://github.com/UB-Mannheim/tesseract/wiki
Note the Tesseract installation path, add it to your PATH environment variable, and install the pytesseract package by running:

pip install pytesseract
NOTE: You must add both the path to Poppler's bin folder and the Tesseract installation path to your system PATH environment variable. The video explains this much more clearly.
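To check whether the PATH changes took effect, a small script like the following can help. It is a sketch under assumptions: the executable names `tesseract` and `pdftoppm` (one of the command-line tools shipped in Poppler's bin folder) are what a typical install provides, and the helper `find_tools` is a hypothetical name used only for this example.

```python
import shutil

# Assumed executable names: "tesseract" is the Tesseract CLI and "pdftoppm"
# is one of the tools found in Poppler's bin folder.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_tools(names=REQUIRED_TOOLS):
    """Map each executable name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


for tool, location in find_tools().items():
    print(f"{tool}: {location or 'NOT FOUND on PATH'}")
```

Open a new terminal before running it, since changes to environment variables are usually not picked up by terminals that were already open.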
  9. Install NLTK dependencies. Finally, run the following commands to install the NLTK dependencies:
python -c "import nltk; nltk.download('punkt')"
python -c "import nltk; nltk.download('averaged_perceptron_tagger')"
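If you want to confirm that both downloads succeeded, the sketch below looks the resources up with `nltk.data.find`, which raises `LookupError` when a resource is missing. The function name `missing_nltk_resources` is hypothetical; the two resource paths are assumed to be where NLTK stores the packages downloaded above.

```python
# Resource paths assumed to correspond to the 'punkt' and
# 'averaged_perceptron_tagger' downloads from the step above.
NLTK_RESOURCES = (
    "tokenizers/punkt",
    "taggers/averaged_perceptron_tagger",
)


def missing_nltk_resources(resources=NLTK_RESOURCES):
    """Return the resources NLTK cannot find, or None if NLTK is not installed."""
    try:
        import nltk
    except ImportError:
        return None
    missing = []
    for path in resources:
        try:
            nltk.data.find(path)
        except LookupError:
            missing.append(path)
    return missing


result = missing_nltk_resources()
if result is None:
    print("NLTK is not installed in this environment.")
elif result:
    print("Missing NLTK resources:", ", ".join(result))
else:
    print("All NLTK resources found.")
```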
  10. Restart VS Code. Restart VS Code and make sure the unstructured environment is active by checking the Python interpreter (press Ctrl + Shift + P and run "Python: Select Interpreter").

By following these instructions, you should be able to successfully install Unstructured for LangChain's document loaders. If you encounter any issues, be sure to refer to the documentation and installation instructions of the individual packages.
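One way to see at a glance which pieces are still missing is to ask Python whether each package can be located in the active environment, without actually importing it. The following is a minimal sketch, not an official check: the helper `missing_modules` is a made-up name, and the list uses import names, which sometimes differ from the pip names (for example, opencv-python is imported as `cv2` and python-magic as `magic`).

```python
import importlib.util

# Import names for the packages installed in this guide. Some differ from the
# pip names: opencv-python imports as cv2, python-magic as magic.
MODULES = (
    "langchain", "openai", "unstructured", "torch", "detectron2",
    "layoutparser", "cv2", "magic", "pytesseract", "nltk",
)


def missing_modules(names=MODULES):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules()
print("All packages found." if not missing else f"Missing: {', '.join(missing)}")
```

A package that is found can still fail at import time (for example, because of a version mismatch), so treat an empty result as a first check rather than proof that everything works.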

Comments

Vipin Kasarla

Thanks for this post, because I want to use this with your previous post on ChatGPT with documents. Q: Would this also work with MS Word documents? (If not, how can I make it work?)

echohive42

It probably won't work with MS Word documents because, as far as I could tell, you need LibreOffice installed for that, which seemed complicated to me.

Vipin Kasarla

Do you think I can use https://llamahub.ai/ functions as a loader and use the result as input to LangChain? I find using LlamaHub loaders and GPTIndex much simpler, and it has PDF, DOCX, and many other loaders (the website says it can be used with LangChain too!).

echohive42

You can definitely try although I am not sure if they will work. My hunch is that they should.

Kris Wilkinson

This just may be the holy grail for self-taught developers like me as this is the part that takes the most skill, and most time to understand!

Kris Wilkinson

I'm coming across the error below whilst attempting to pip install unstructured[local-inference]:

Kris Wilkinson

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for pycocotools
Successfully built unstructured-inference python-docx python-pptx unstructured iopath antlr4-python3-runtime
Failed to build pycocotools
ERROR: Could not build wheels for pycocotools, which is required to install pyproject.toml-based projects

Kris Wilkinson

I'm about to attempt to tackle the problem with ChatGPT, but thought I'd ask this awesome community alongside.

echohive42

Make sure to upgrade setuptools and pip before running pip install:
pip install --upgrade setuptools
python.exe -m pip install --upgrade pip

Patrick Young (edited)

You need to install Microsoft Build Tools. I installed the Visual Studio C++ development module to get them, and then pycocotools built.

Patrick Young (edited)

Bingo. Works like a charm! Joined Patreon. Thx Echohive! NB: the instruction post from Patreon wouldn't work for me. Most problems are caused by package incompatibility. NB my system: Windows 11 Home running on an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz with an NVIDIA RTX 3050 GPU. As others have found, I also ran into problems pip installing unstructured[local-inference], pycocotools errors, and ninja build compilation errors when trying to get detectron2 up and running. (You need C++ build tools; install them from Visual Studio.) Also, I installed detectron2 *first* using a .yml I found on Stack Overflow, because I realised that using the latest versions of Python, detectron2, and pytorch wouldn't work. I could get it to work by using python=3.8, cudatoolkit=11.0, pytorch==1.7.1 and quite an old version of detectron2 from the facebookresearch/detectron2 git repository. If Echohive or anyone else wants the .yml file, see step 2 on https://stackoverflow.com/questions/60631933/install-detectron2-on-windows-10 in the answer posted by DV82XL (user:5752730).

Patrick Young

P.S. Echohive, do you use it a lot? What sort of use cases benefit?

echohive42

I haven't used Unstructured much. But it is the default loader for almost all document loaders in LangChain, so I wanted to make a video about how to install it.

Patrick Young

Hi everyone, I followed this tutorial and now I have a program that runs in VS Code on Windows 11. It uses a document loader to read text from URLs, images, and all regular files like txt, md, py, etc. It mostly works, but images can sometimes be problematic. E.g., if I load a two-page infographic .jpg with UnstructuredFileLoader(), it only extracts some of the text from the image. I don't know whether UnstructuredFileLoader is causing the issue, or whether it is an image segmentation problem (Detectron2). Other possibilities: LayoutParser? Tesseract? PyTesseract? opencv-python? Torchvision? Trawling for suggestions. Here, there, everywhere. Anyone?

echohive42

Hi Patrick. I think the problem can result from two things: Tesseract, which does OCR (optical character recognition), and/or Detectron, which detects the layout in an image, like object detection. Unless those libraries are updated and Unstructured adopts their latest versions, this problem will persist. You can take a look at other PDF readers and see if they can extract the text better. If there is such a package, this approach should work. You would just have to write some logic to deal with multiple PDF files. I hope this is helpful.