15 changes: 15 additions & 0 deletions CHANGES.md
@@ -1,5 +1,19 @@
# Change Log

## Changes in version 0.2.5

### Fixes:

* [341](https://github.com/pymupdf/RAG/issues/341) - Broken markdown parsing for new line directly followed by 'o'...

### Other Changes:

* New parameter `table_format` in method `to_text()` (PyMuPDF-Layout only). This allows selecting the appearance of tables in plain text outputs. The possible values are defined in the list `tabulate.tabulate_formats`. Default is "grid".
* Installing PyMuPDF4LLM now supports including all optional dependencies in the `pip` command: `pip install --upgrade pymupdf4llm[ocr,layout]`. This will install pymupdf4llm, pymupdf, and pymupdf-layout. The `ocr` extra, when requested, installs opencv-python for automatic OCR support in PyMuPDF-Layout mode. Combine this with the options `--upgrade`, `--force-reinstall` or `--no-cache-dir` as necessary.
* Major rework of the heuristics that determine whether a page should be OCR'd.

------

## Changes in version 0.2.4

### Fixes:
@@ -10,6 +24,7 @@


------

## Changes in version 0.2.3

### Fixes:
88 changes: 76 additions & 12 deletions pymupdf4llm/README.md
@@ -1,4 +1,4 @@
# Using PyMuPDF as Data Feeder in LLM / RAG Applications
# Using PyMuPDF as a Data Feeder in LLM / RAG Applications

This package converts the pages of a PDF to text in Markdown format using [PyMuPDF](https://pypi.org/project/PyMuPDF/).

@@ -8,42 +8,105 @@ Header lines are identified via the font size and appropriately prefixed with on

Bold, italic, mono-spaced text and code blocks are detected and formatted accordingly. The same applies to ordered and unordered lists.

By default, all document pages are processed. If desired, a subset of pages can be specified by providing a list of 0-based page numbers.
By default, all document pages are processed. If desired, a subset of pages can be specified by providing a sequence of 0-based page numbers.

-----

[PyMuPDF-Layout](https://pypi.org/project/pymupdf-layout/) is an optional extension of PyMuPDF. It offers improved, AI-based page layout analysis which, for instance, yields much better table recognition.

Since version 0.2.0, pymupdf4llm fully supports pymupdf-layout. As part of this, output as plain text or a JSON string is also possible. In addition, every page is automatically checked against a number of criteria and OCR'd where needed, provided that the package [opencv-python](https://pypi.org/project/opencv-python/) is installed and Tesseract is available on the platform.

Layout mode is activated with a simple modification of the import statements - for details, please see below.

# Installation

```bash
$ pip install -U pymupdf4llm
```

> This command will automatically install [PyMuPDF](https://github.com/pymupdf/PyMuPDF) if required.
> This command will automatically install or upgrade [PyMuPDF](https://github.com/pymupdf/PyMuPDF) as required.

To install all Python packages for full support of the layout feature and automatic OCR, use the following variant of the command:

```bash
$ pip install -U pymupdf4llm[ocr,layout]
```

This will install opencv-python and pymupdf-layout in addition to pymupdf4llm and pymupdf.

# Execution
## Legacy Mode
For **_standard (legacy) markdown extraction_**, use the following simple script:

```python
import pymupdf4llm

md_text = pymupdf4llm.to_markdown("input.pdf")

# now work with the markdown text, e.g. store as a UTF8-encoded file
import pathlib
pathlib.Path("output.md").write_bytes(md_text.encode())
```

Instead of the filename string as above, one can also provide a PyMuPDF `Document`.

Then in your script do:
By default, all pages in the PDF will be processed. If desired, the parameter `pages=<sequence>` can be used to provide a sequence of zero-based page numbers to consider.
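
To illustrate the `pages` parameter, here is a minimal sketch that converts a human-friendly, 1-based page specification into the 0-based sequence described above. The helper `zero_based_pages` and its input syntax are hypothetical and not part of pymupdf4llm; only the `pages=<sequence>` parameter itself comes from this README.

```python
def zero_based_pages(spec: str) -> list[int]:
    """Convert a 1-based spec such as "1-3,5" into sorted 0-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # inclusive 1-based range, e.g. "1-3" -> 0, 1, 2
            first, last = (int(x) for x in part.split("-", 1))
            pages.update(range(first - 1, last))
        else:
            pages.add(int(part) - 1)
    return sorted(pages)


pages = zero_based_pages("1-3,5")
print(pages)

# With pymupdf4llm installed, the result could then be passed on, e.g.:
# md_text = pymupdf4llm.to_markdown("input.pdf", pages=pages)
```

Keeping the conversion in one place avoids off-by-one mistakes when page numbers come from users, who usually count from 1.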

## Layout Mode
To **_activate layout mode_**, use the following:

```python
import pymupdf.layout # activate PyMuPDF-Layout in pymupdf
import pymupdf4llm

# The remainder of the script is unchanged
md_text = pymupdf4llm.to_markdown("input.pdf")

# now work with the markdown text, e.g. store as a UTF8-encoded file
import pathlib
pathlib.Path("output.md").write_bytes(md_text.encode())
```

Instead of the filename string as above, one can also provide a PyMuPDF `Document`. By default, all pages in the PDF will be processed. If desired, the parameter `pages=[...]` can be used to provide a list of zero-based page numbers to consider.
Here are the JSON and plain text output versions.

### JSON

```python
import pymupdf.layout # activate PyMuPDF-Layout in pymupdf
import pymupdf4llm

json_text = pymupdf4llm.to_json("input.pdf")

# now work with the JSON text, e.g. store as a UTF8-encoded file
import pathlib
pathlib.Path("output.json").write_text(json_text, encoding="utf-8")
```

### Plain Text

```python
import pymupdf.layout # activate PyMuPDF-Layout in pymupdf
import pymupdf4llm

plain_text = pymupdf4llm.to_text("input.pdf")

# now work with the plain text, e.g. store as a UTF8-encoded file
import pathlib
pathlib.Path("output.txt").write_bytes(plain_text.encode())
```


**Feature Overview:**

* Support for pages with **_multiple text columns_**.
* Support for **_image and vector graphics extraction_**:

1. Specify `pymupdf4llm.to_markdown("input.pdf", write_images=True)`. Default is `False`.
2. Each image or vector graphic on the page will be extracted and stored as an image named `"input.pdf-pno-index.extension"` in a folder of your choice. The image `extension` can be chosen to represent a PyMuPDF-supported image format (for instance "png" or "jpg"), `pno` is the 0-based page number and `index` is some sequence number.
3. The image files will have width and height equal to the values on the page. The desired resolution can be chosen via parameter `dpi` (default: `dpi=150`).
4. Any text contained in the images or graphics will be extracted and **also become visible as part of the generated image**. This behavior can be changed via `force_text=False` (text only appears as part of the image).
1. Specify either `write_images=True` or `embed_images=True`. Default is `False`.
2. Images and vector graphics on the page will be stored as images named `"input.pdf-pno-index.extension"` in a folder of your choice or be embedded in the markdown text as base64-encoded strings. The image `extension` can be chosen to represent a PyMuPDF-supported image format (for instance "png" or "jpg"), `pno` is the 0-based page number and `index` is some sequence number.
3. The image files will have width and height equal to the values on the page. The desired resolution can be chosen via parameter `dpi` (default: `dpi=150`). Strictly speaking, this is not an **_extraction_** but a rendering of the respective page area.
4. Any standard text written in image areas will become a visible part of the generated image and otherwise be ignored. This behavior can be changed via `force_text=True` which causes the text to also become part of the output.

* Support for **page chunks**: Instead of returning one large string for the whole document, a list of dictionaries can be generated: one for each page. Specify `data = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)`. Then, for instance the first item, `data[0]` will contain a dictionary for the first page with the text and some metadata.
* Support for **page chunks**: Instead of returning one large string for the whole document, a list of dictionaries can be generated: one for each page. Specify `data = pymupdf4llm.to_markdown("input.pdf", page_chunks=True)`. Then, for instance the first item, `data[0]` will contain a dictionary for the first page with its text and some metadata.

* As a first example for directly supporting LLM / RAG consumers, this version can output **LlamaIndex documents**:

@@ -57,6 +120,7 @@ Instead of the filename string as above, one can also provide a PyMuPDF `Documen
# Every list item contains metadata and the markdown text of 1 page.
```

* A LlamaIndex document essentially corresponds to Python dictionary, where the markdown text of the page is one of the dictionary values. For instance the text of the first page is the the value of `data[0].to_dict().["text"]`.
* A LlamaIndex document essentially corresponds to a Python dictionary, where the markdown text of the page is one of the dictionary values. For instance, the text of the first page is the value of `data[0].to_dict()["text"]`.
* For details, please consult LlamaIndex documentation.
* Upon creation of the `LlamaMarkdownReader` all necessary LlamaIndex-related imports are executed. Required related package installations must have been done independently and will not be checked during pymupdf4llm installation.
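
As a sketch of how the **page chunks** output might be consumed, the following helper writes the text of each chunk to its own Markdown file. The function `write_page_files`, the file naming and the stand-in data are hypothetical; the `"text"` key is an assumption based on the description above ("a dictionary for the first page with its text and some metadata").

```python
import tempfile
from pathlib import Path


def write_page_files(chunks, out_dir):
    """Write each chunk's text to its own Markdown file and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, chunk in enumerate(chunks):
        # "index" is the position in the list, not necessarily the page number
        target = out / f"page-{index:04d}.md"
        target.write_text(chunk["text"], encoding="utf-8")  # assumed key name
        paths.append(target)
    return paths


# Stand-in data shaped like the documented result: one dictionary per page.
# With pymupdf4llm installed, this would instead come from
# pymupdf4llm.to_markdown("input.pdf", page_chunks=True).
sample_chunks = [
    {"text": "# Title\n\nFirst page.", "metadata": {}},
    {"text": "Second page.", "metadata": {}},
]

with tempfile.TemporaryDirectory() as tmp:
    written = write_page_files(sample_chunks, tmp)
    print([p.name for p in written])
```

One file per page keeps chunk boundaries visible to downstream RAG tooling, which is the main reason to prefer `page_chunks=True` over a single large string.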

2 changes: 2 additions & 0 deletions pymupdf4llm/pymupdf4llm/__init__.py
@@ -146,6 +146,7 @@ def to_text(
force_text=True,
ocr_dpi=400,
use_ocr=True,
table_format="grid",
# unsupported options for pymupdf layout:
**kwargs,
):
@@ -164,6 +165,7 @@ def to_text(
footer=footer,
ignore_code=ignore_code,
show_progress=show_progress,
table_format=table_format,
)


106 changes: 81 additions & 25 deletions pymupdf4llm/pymupdf4llm/helpers/check_ocr.py
@@ -107,8 +107,48 @@
--------------------------------------------------------------------------
"""

"""
Functions detecting general photos versus text-heavy images.
"""


def entropy_check(img_gray, threshold=4.5):
    """Compute Shannon entropy of grayscale image."""
    hist = cv2.calcHist([img_gray], [0], None, [256], [0, 256])
    hist = hist.ravel() / hist.sum()
    hist = hist[hist > 0]
    entropy = -np.sum(hist * np.log2(hist))
    return entropy < threshold, entropy


def fft_check(img_gray, threshold=0.15):
    """Check ratio of high-frequency energy in FFT spectrum."""
    # Downsample for speed
    small = cv2.resize(img_gray, (128, 128))
    f = np.fft.fft2(small)
    fshift = np.fft.fftshift(f)
    magnitude = np.abs(fshift)
    h, w = magnitude.shape
    center = magnitude[h // 4 : 3 * h // 4, w // 4 : 3 * w // 4]
    ratio = center.sum() / magnitude.sum()
    return ratio < threshold, ratio

def get_span_ocr(page, bbox, dpi=300):

def components_check(img_gray, min_components=50):
    """Count connected components after thresholding."""
    _, bw = cv2.threshold(img_gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    num_labels, _ = cv2.connectedComponents(bw)
    return num_labels < min_components, num_labels


def edge_density_check(img_gray, threshold=0.01):
    """Compute edge density using Canny."""
    edges = cv2.Canny(img_gray, 100, 200)
    density = edges.sum() / 255.0 / edges.size
    return density < threshold, density
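
For orientation, the value that `entropy_check` compares against `threshold` is the Shannon entropy of the gray-level distribution, $H = -\sum_i p_i \log_2 p_i$. The sketch below recomputes it in pure Python on synthetic pixel lists to show the range such a threshold operates in. The function `shannon_entropy` and the sample data are hypothetical and not part of this PR, which uses OpenCV and NumPy instead.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy in bits of a sequence of gray values (0..255)."""
    counts = Counter(pixels)
    total = len(pixels)
    # sum of p * log2(1/p), which equals -sum of p * log2(p)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


flat = [255] * 64                  # a single gray level
two_tone = [0] * 32 + [255] * 32   # black and white in equal shares
ramp = list(range(256))            # every gray level exactly once

for name, data in (("flat", flat), ("two_tone", two_tone), ("ramp", ramp)):
    print(name, shannon_entropy(data))
```

The three samples evaluate to 0, 1 and 8 bits. An 8-bit grayscale image cannot exceed 8 bits, so the default `threshold=4.5` sits roughly in the middle of the possible range.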


def get_span_ocr(page, bbox, dpi=400):
"""Return OCR'd span text using Tesseract.

Args:
@@ -127,7 +167,7 @@ def get_span_ocr(page, bbox, dpi=300):
return text


def repair_blocks(input_blocks, page):
def repair_blocks(input_blocks, page, dpi=400):
"""Repair text blocks with missing glyphs using OCR.

TODO: Support non-linear block structure.
@@ -148,7 +188,7 @@ def repair_blocks(input_blocks, page):
if not REPLACEMENT_CHARACTER in span_text:
continue
span_text_len = len(span_text)
new_text = get_span_ocr(page, span["bbox"])[:span_text_len]
new_text = get_span_ocr(page, span["bbox"], dpi=dpi)[:span_text_len]
if "chars" in span:
# rebuild chars array
new_chars = []
@@ -177,25 +217,48 @@ def get_page_image(page, dpi=150, covered=None):
if covered is None:
covered = page.rect
covered = covered.irect
pix = page.get_pixmap(dpi=dpi)
matrix = pymupdf.Rect(pix.irect).torect(page.rect)

# make a sub-pixmap of the covered area
pix_covered = pymupdf.Pixmap(pymupdf.csRGB, covered)
pix_covered.copy(pix, covered) # copy over covered area
# make a gray pixmap of the covered area
pix_covered = page.get_pixmap(colorspace=pymupdf.csGRAY, clip=covered)
# convert to numpy array
img = np.frombuffer(pix_covered.samples, dtype=np.uint8).reshape(
gray = np.frombuffer(pix_covered.samples, dtype=np.uint8).reshape(
pix_covered.height, pix_covered.width, pix_covered.n
)
# cv2 needs the gray image version of this
gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
return gray, matrix, pix
photo_entropy, entropy_val = entropy_check(gray)
photo_fft, fft_val = fft_check(gray)
photo_components, comp_val = components_check(gray)
photo_edges, edge_val = edge_density_check(gray)

# print(f"Entropy: {entropy_val:.3f} → {photo_entropy}")
# print(f"FFT ratio: {fft_val:.3f} → {photo_fft}")
# print(f"Components: {comp_val} → {photo_components}")
# print(f"Edge density: {edge_val:.6f} → {photo_edges}")

# Weighted decision logic
score = 0
if photo_components:
score += 2
if photo_edges:
score += 2
if photo_entropy:
score += 1
if photo_fft:
score += 1
# print(f"{score=}")
if score >= 3:
pix = None
matrix = pymupdf.Identity
photo = True
else:
pix = page.get_pixmap(dpi=dpi)
matrix = pymupdf.Rect(pix.irect).torect(page.rect)
photo = False

return matrix, pix, photo


def should_ocr_page(
page,
dpi=150,
edge_thresh=0.02,
vector_thresh=0.9,
image_coverage_thresh=0.9,
text_readability_thresh=0.9,
@@ -207,7 +270,6 @@ def should_ocr_page(
Parameters:
page: PyMuPDF page object
dpi: DPI used for rasterization
edge_thresh: minimum edge density to suggest text presence
vector_thresh: minimum number of vector paths to suggest glyph simulation
image_coverage_thresh: fraction of page area covered by images to trigger OCR
text_readability_thresh: fraction of readable characters to skip OCR
@@ -225,7 +287,6 @@ def should_ocr_page(
"has_vector_chars": False,
"transform": pymupdf.Identity,
"pixmap": None,
"edge_density": 0.0,
}
page_rect = page.rect
page_area = abs(page_rect) # size of the full page
@@ -279,21 +340,16 @@ def should_ocr_page(
assert decision["should_ocr"] is True

if not decision["has_text"]:
# Rasterize and analyze edge density
img, matrix, pix = get_page_image(page, dpi=dpi, covered=analysis["covered"])
# Rasterize and check for photo versus text-heaviness
matrix, pix, photo = get_page_image(page, dpi=dpi, covered=analysis["covered"])

# Analyze edge density
edges = cv2.Canny(img, 100, 200)
decision["edge_density"] = float(np.sum(edges > 0) / edges.size)
if decision["edge_density"] <= edge_thresh:
if photo:
# this seems to be a non-text picture page
decision["should_ocr"] = False
decision["pixmap"] = None
else:
decision["should_ocr"] = True
decision["transform"] = matrix
decision["pixmap"] = pix

if decision["should_ocr"]:
decision["transform"] = matrix
decision["pixmap"] = pix
return decision
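
The weighted decision in `get_page_image` (components and edge density count 2 each, entropy and FFT count 1 each, "photo" at a score of 3 or more) can be tabulated without OpenCV. The following is a minimal sketch of that scoring only; the names `photo_score`, `is_photo` and `min_score` are hypothetical and do not appear in this PR.

```python
from itertools import product


def photo_score(components, edges, entropy, fft):
    """Sum the weights of all checks that returned True."""
    weights = ((components, 2), (edges, 2), (entropy, 1), (fft, 1))
    return sum(weight for flag, weight in weights if flag)


def is_photo(components, edges, entropy, fft, min_score=3):
    """Apply the same cut-off as the diff: a score of 3 or more means photo."""
    return photo_score(components, edges, entropy, fft) >= min_score


# Enumerate all 16 combinations of the four boolean check results.
photo_cases = [flags for flags in product((False, True), repeat=4) if is_photo(*flags)]
print(len(photo_cases), "of 16 combinations are classified as photo")
for flags in photo_cases:
    print(flags, photo_score(*flags))
```

Under this weighting, the two light checks (entropy and FFT) can never trigger the photo classification on their own: at least one of the two heavy checks must agree, and a single heavy check needs support from at least one other signal.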