There are many existing solutions for the text-chunking stage of RAG, one of the best known being unstructured-io, which is integrated into the LangChain project. Unstructured's advantage lies in its comprehensive suite of OCR, layout analysis, and related features, which produce rich text chunks. However, it falls short when it comes to parsing images and charts within documents.
Recently, a trending project called gptpdf has emerged. It uses PyMuPDF to parse the PDF layout, merges text areas according to specific rules, marks image and chart areas, and then sends all of this information to a multimodal model such as GPT-4o or Qwen-VL for recognition, ultimately generating a complete markdown-formatted document.
The project is notably simple, with fewer than 300 lines of code in total.
After reading through it, I realized that my real goal is to directly build text chunks that can be used for RAG indexing, so whether the final output is in markdown format doesn't actually matter much. Based on this approach, I made some modifications and developed a new PDF parsing solution called llmdocparser.
Let me introduce the entire solution below.
Introduction
First, we still need to perform layout analysis. While gptpdf uses rules for layout analysis, I use the PPStructure model from PaddleOCR.
Its analysis can generate information about the category, position, and reading order of each area on each page. An example is as follows:
[{'header': ((101, 66, 436, 102), 0)},
{'header': ((1038, 81, 1088, 95), 1)},
{'title': ((106, 215, 947, 284), 2)},
{'text': ((101, 319, 835, 390), 3)},
{'text': ((100, 565, 579, 933), 4)},
{'text': ((100, 967, 573, 1025), 5)},
{'text': ((121, 1055, 276, 1091), 6)},
{'reference': ((101, 1124, 562, 1429), 7)},
{'text': ((610, 565, 1089, 930), 8)},
{'text': ((613, 976, 1006, 1045), 9)},
{'title': ((612, 1114, 726, 1129), 10)},
{'text': ((611, 1165, 1089, 1431), 11)},
{'title': ((1011, 1471, 1084, 1492), 12)}]
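For reference, a minimal sketch of obtaining this kind of layout information with PaddleOCR's PPStructure might look like the following (this is not llmdocparser's exact code, and the PaddleOCR API can vary slightly between versions):

import cv2
from paddleocr import PPStructure

# Layout-only analysis: skip table recognition and OCR for speed.
engine = PPStructure(table=False, ocr=False, show_log=False)

img = cv2.imread("page_1.png")  # a rendered PDF page image
result = engine(img)

# Each detected region carries a category ("type") and a bounding box ("bbox").
# Reading order can then be derived by sorting the boxes (e.g. by column and y-coordinate).
layout = [{item["type"]: tuple(item["bbox"])} for item in result]
print(layout)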
Based on this information, more comprehensive rules can be set to merge areas. For example, the image below shows the result of a layout analysis on a sample page.
Based on real-world scenarios, we can set rules to merge overlapping areas, such as merging a “title” type area with the subsequent “text” type area.
After merging, the positions of the areas are updated, and each area is saved as an image for further analysis by large models.
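As a rough illustration, such a merge rule could be sketched like this (the helper below is hypothetical and not the project's actual implementation; boxes are (x0, y0, x1, y1) tuples):

def union_box(a, b):
    # Smallest rectangle covering both boxes; this is also the updated position after merging.
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def merge_title_with_text(regions):
    """regions: list of (category, box) tuples sorted by reading order."""
    merged = []
    i = 0
    while i < len(regions):
        category, box = regions[i]
        if category == "title" and i + 1 < len(regions) and regions[i + 1][0] == "text":
            # Fold the title into the text block that follows it.
            merged.append(("text", union_box(box, regions[i + 1][1])))
            i += 2
        else:
            merged.append((category, box))
            i += 1
    return merged

regions = [("title", (106, 215, 947, 284)), ("text", (101, 319, 835, 390))]
print(merge_title_with_text(regions))  # -> [('text', (101, 215, 947, 390))]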
Of course, many edge cases need to be handled. For instance, layout analysis may fail to detect some images. In such cases, PyMuPDF is also used to parse the page, and its results are compared with the layout model's output to supplement any undetected areas.
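As an example, PyMuPDF can report the image rectangles embedded in a page, which can then be compared against the regions found by layout analysis (a hedged sketch; the overlap threshold is an assumption, and the PDF-point coordinates may need scaling to match the rendered page image):

import fitz  # PyMuPDF

def area(r):
    return (r[2] - r[0]) * (r[3] - r[1])

def iou(a, b):
    # Intersection-over-union of two (x0, y0, x1, y1) rectangles.
    x0, y0 = max(a[0], b[0]), max(a[1], b[1])
    x1, y1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    return inter / (area(a) + area(b) - inter + 1e-6)

doc = fitz.open("example.pdf")
page = doc[0]
detected = [(101, 319, 835, 390)]  # figure/image boxes from layout analysis

for info in page.get_image_info():  # rectangles of images embedded in the PDF page
    box = tuple(info["bbox"])
    if all(iou(box, d) < 0.5 for d in detected):  # no sufficient overlap -> likely missed
        detected.append(box)  # supplement the undetected image area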
Finally, all images are sent one by one to the multimodal large model for analysis, resulting in a table of text chunks.
This table includes the location of the area screenshot, its type, page number, file name, and the corresponding parsed text chunk.
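As an illustration, a single row of this table might look roughly like the following (the field names are assumptions for illustration, not necessarily the project's exact column names):

chunk = {
    "image_path": "output/attention/page_1_area_3.png",  # screenshot of the merged area
    "type": "figure",
    "page_no": 1,
    "filename": "attention_is_all_you_need.pdf",
    "content": "Figure 1: The Transformer model architecture ...",  # text returned by the multimodal model
}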
This output can be used in versatile ways. For example, if a retrieved text chunk is of the image type, the response can include the path to the corresponding area screenshot; rendered on the front end, this produces a visually enriched answer that combines text and images.
See the main function in llm_parser.py for more details.
Installation
pip install llmdocparser
Installation from Source
To install this project from source, follow these steps:
Clone the Repository:
First, clone the repository to your local machine. Open your terminal and run the following commands:
git clone https://github.com/lazyFrogLOL/llmdocparser.git
cd llmdocparser
Install Dependencies:
This project uses Poetry for dependency management. Make sure you have Poetry installed. If not, you can follow the instructions in the Poetry Installation Guide.
Once Poetry is installed, run the following command in the project's root directory to install the dependencies:
poetry install
This will read the `pyproject.toml` file and install all the required dependencies for the project.
Usage
from llmdocparser.llm_parser import get_image_content
content, cost = get_image_content(
llm_type="azure",
pdf_path="path/to/your/pdf",
output_dir="path/to/output/directory",
max_concurrency=5,
azure_deployment="azure-gpt-4o",
azure_endpoint="your_azure_endpoint",
api_key="your_api_key",
api_version="your_api_version"
)
print(content)
print(cost)
Parameters
llm_type: str
The options are azure, openai, dashscope.
pdf_path: str
Path to the PDF file.
output_dir: str
Output directory to store all parsed images.
max_concurrency: int
Number of GPT parsing worker threads. Batch calling details: Batch Support
If using Azure, the azure_deployment and azure_endpoint parameters need to be passed; otherwise, only the API key needs to be provided.
base_url: str
URL of an OpenAI-compatible server. Details: OpenAI-Compatible Server
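For instance, a call against an OpenAI-compatible server could look like the sketch below (based on the parameters listed above; the URL is a placeholder, and depending on the version a model name may also need to be configured):

from llmdocparser.llm_parser import get_image_content

content, cost = get_image_content(
    llm_type="openai",
    pdf_path="path/to/your/pdf",
    output_dir="path/to/output/directory",
    max_concurrency=5,
    base_url="http://localhost:8000/v1",  # your OpenAI-compatible server
    api_key="your_api_key",
)
print(content)
print(cost)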
Cost
Using the 'Attention Is All You Need' paper as a test document and GPT-4o as the model, the costs are as follows:
Total Tokens: 44063
Prompt Tokens: 33812
Completion Tokens: 10251
Total Cost (USD): $0.322825
Average cost per page: $0.0215
Project Link
https://github.com/lazyFrogLOL/llmdocparser
Feel free to create an issue if you encounter any problems.