fix: remove duplicate characters caused by fake bold rendering in PDFs #4215

bittoby · 2026-01-28T12:23:22Z

Summary

- Fixes an issue where bold text in PDFs is extracted with duplicate characters (e.g., "BOLD" → "BBOOLLDD")
- Some PDF generators simulate bold by rendering each character twice at slightly offset positions
- Added character-level deduplication based on position proximity to detect and remove these duplicates

Problem

When extracting text from certain PDFs, bold text appears duplicated:

# Before fix
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60>60" instead of ">60"

Solution

Added character-level deduplication that:

- Compares consecutive characters' text content and position
- Removes duplicates where the same character appears within 3 pixels (configurable)
- Preserves spaces and other non-character elements (LTAnno objects)

# After fix
elements = partition_pdf("document.pdf", strategy="fast")
print(elements[0].text)  # Output: ">60" ✓
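
The position-proximity idea described above can be sketched in a few lines. This is only an illustration, not the PR's actual code: `Char`, `dedupe_chars`, and `fake_bold` are hypothetical names, and the real change works on pdfminer layout objects rather than this stand-in dataclass.

```python
from dataclasses import dataclass


@dataclass
class Char:
    """Hypothetical stand-in for a positioned PDF character."""
    text: str
    x0: float
    y0: float


def dedupe_chars(chars, threshold=3.0):
    """Drop a character that repeats the previously kept one within `threshold` units."""
    if threshold <= 0:
        # A non-positive threshold disables deduplication entirely.
        return list(chars)
    kept = []
    for ch in chars:
        prev = kept[-1] if kept else None
        if (
            prev is not None
            and ch.text == prev.text
            and abs(ch.x0 - prev.x0) < threshold
            and abs(ch.y0 - prev.y0) < threshold
        ):
            continue  # same glyph drawn again almost on top of the previous one
        kept.append(ch)
    return kept


def fake_bold(word, advance=10.0, offset=0.5):
    """Simulate a generator that draws every glyph twice, slightly shifted."""
    out = []
    for i, c in enumerate(word):
        out.append(Char(c, i * advance, 0.0))
        out.append(Char(c, i * advance + offset, 0.0))
    return out


raw = fake_bold("BOLD")
raw_text = "".join(c.text for c in raw)
clean_text = "".join(c.text for c in dedupe_chars(raw))
print(raw_text, "->", clean_text)
```

Because the check compares each character only with the last kept one, a repeated letter that sits a full glyph width away is kept.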

Configuration

# Default: 3.0 pixels (enabled)
export PDF_CHAR_DUPLICATE_THRESHOLD=3.0

# Disable deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=0

# More aggressive deduplication
export PDF_CHAR_DUPLICATE_THRESHOLD=5.0
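
For readers who want to see how such a setting might be consumed, here is a minimal sketch of reading the threshold from the environment. The thread does not show how the library parses the variable, so `read_threshold` is a hypothetical helper, and falling back to the default on an invalid value is an assumption.

```python
import os


def read_threshold(env=None, name="PDF_CHAR_DUPLICATE_THRESHOLD", default=3.0):
    """Return the threshold as a float; 0 means deduplication is disabled."""
    env = os.environ if env is None else env
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        value = float(raw)
    except ValueError:
        # Assumption: an unparsable value falls back to the default.
        return default
    # Assumption: negative values are treated the same as 0 (disabled).
    return max(value, 0.0)


print(read_threshold({"PDF_CHAR_DUPLICATE_THRESHOLD": "5.0"}))
```

Passing a plain dict as `env` keeps the helper easy to test without touching the real process environment.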

bittoby · 2026-01-28T12:33:07Z

@badGarnet Could you please review this PR? Thanks!

badGarnet · 2026-01-30T02:53:50Z

Thanks for contributing! I would suggest finding an example PDF that has this kind of issue and adding a test that uses it. The code reads fine to me, but it would be good to test on an actual file.

bittoby · 2026-01-30T18:00:24Z

@badGarnet
I added an example PDF (example-docs/pdf/fake-bold-sample.pdf) and a test script (diagnose_fake_bold.py) for diagnosing fake bold.
Please review and test again. Thank you.

badGarnet · 2026-02-02T16:23:34Z

test_unstructured/partition/pdf_image/test_pdfminer_utils.py

+        assert len(text_with_dedup) <= len(text_no_dedup), (
+            f"Deduplicated text ({len(text_with_dedup)} chars) should not be longer "
+            f"than non-deduplicated text ({len(text_no_dedup)} chars)"
+        )


A better assert would:

- check the exact expected text length
- check that there are duplicated characters in text_no_dedup (like "bboolldd") and normal text in text_with_dedup (like "bold")
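
The stricter assertions suggested here could look roughly like the sketch below. The two strings are hard-coded stand-ins so the snippet runs on its own. In the real test they would come from partitioning the sample PDF with and without deduplication. `looks_doubled` and `undouble` are hypothetical helpers, not functions from the repository.

```python
def looks_doubled(text):
    """True when the text consists entirely of adjacent identical pairs, e.g. 'BBOOLLDD'."""
    if not text or len(text) % 2 != 0:
        return False
    return all(text[i] == text[i + 1] for i in range(0, len(text), 2))


def undouble(text):
    """Keep every other character, turning 'BBOOLLDD' into 'BOLD'."""
    return text[::2]


# Stand-in values; a real test would extract these from the fake-bold sample file.
text_no_dedup = "BBOOLLDD"
text_with_dedup = "BOLD"

assert looks_doubled(text_no_dedup)            # duplicates are present without the fix
assert not looks_doubled(text_with_dedup)      # and gone with it
assert text_with_dedup == undouble(text_no_dedup)
assert len(text_with_dedup) == 4               # exact expected length, not just "<="
```

Checking exact content and length catches both under-deduplication and the over-eager case where legitimate characters are dropped.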

badGarnet · 2026-02-02T16:24:15Z

diagnose_fake_bold.py

@@ -0,0 +1,69 @@
+"""Diagnostic script to verify fake-bold PDF deduplication is working."""


A test against the new file is good enough; we don't need to add a script to the root dir for this case.

bittoby · 2026-02-02T17:20:05Z

@badGarnet Thanks for your feedback. I've updated the PR. Could you please review again and confirm that it matches your requirements? Thanks again!

bittoby · 2026-02-04T03:49:21Z

Hi @badGarnet, I've updated everything. I hope you can merge this when you have a moment.

bittoby · 2026-02-05T17:40:58Z

@badGarnet Thanks for the approval. Could you merge the PR?

bittoby · 2026-02-05T20:19:29Z

Sorry for tagging you again, @badGarnet. I ran into a linting test error, so I updated the code and pushed a new commit. Could you please review it again and merge? Thanks.

badGarnet

please update the changelog and move your entry to the appropriate section; please also bump the version number

bittoby · 2026-02-06T00:29:31Z

I updated the changelog and bumped the version number.

badGarnet · 2026-02-06T15:35:29Z

CHANGELOG.md

 - **Add `group_elements_by_parent_id` utility function**: Groups elements by their `parent_id` metadata field for easier document hierarchy traversal (fixes #1489)

 ### Fixes
+- **Fix duplicate characters in PDF bold text extraction**: Some PDFs render bold text by drawing each character twice at slightly offset positions, causing text like "BOLD" to be extracted as "BBOOLLDD". Added character-level deduplication based on position proximity. Configurable via `PDF_CHAR_DUPLICATE_THRESHOLD` environment variable (default: 3.0 pixels, set to 0 to disable)(fixes #3864).


badGarnet · 2026-02-06T15:35:44Z

CHANGELOG.md

please bump here as well

bittoby · 2026-02-06T16:24:38Z

Sorry, @badGarnet - I misunderstood. I’ve now updated CHANGELOG.md and bumped the version correctly. Could you please check again? Thanks for taking a look.

bittoby · 2026-02-06T17:19:47Z

@badGarnet Thanks for approving. 👍 Could you merge this PR?

bittoby · 2026-02-09T09:30:25Z

@badGarnet Sorry for tagging you again! Let me know if anything else needs updating. If not, I would appreciate it if you could merge this PR. Thanks!

badGarnet · 2026-02-09T20:55:52Z

@bittoby It seems the default duplicate detection is too sensitive and produces too many false positives, which caused ingest test failures. E.g., in this one the double "ll" in "all" is now just a single "l".
I would suggest tweaking the default settings so that detection is not applied.

bittoby · 2026-02-09T21:15:11Z

Thanks @badGarnet . Fixed!

badGarnet · 2026-02-10T15:06:26Z

unstructured/partition/pdf_image/pdfminer_utils.py

+    # Fake-bold duplicates typically have >70% overlap
+    # Legitimate consecutive letters have <30% overlap (or none)
+    # Use 50% as threshold to be conservative
+    return overlap_ratio > 0.5


let's make this also a config variable so it can be changed via an env variable
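
A rough sketch of what an environment-configurable overlap check might look like follows. Several things here are assumptions rather than facts from the PR: the overlap ratio is defined as intersection area over the smaller box's area, the function names are hypothetical, and the variable name `PDF_CHAR_OVERLAP_RATIO_THRESHOLD` is pieced together from the truncated commit title in this thread.

```python
import os


def overlap_ratio(a, b):
    """Intersection area divided by the smaller box's area; boxes are (x0, y0, x1, y1)."""
    ix = min(a[2], b[2]) - max(a[0], b[0])
    iy = min(a[3], b[3]) - max(a[1], b[1])
    if ix <= 0 or iy <= 0:
        return 0.0  # boxes do not intersect
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    smaller = min(area_a, area_b)
    return (ix * iy) / smaller if smaller > 0 else 0.0


def is_fake_bold_pair(a, b, env=None):
    """Treat two same-text glyph boxes as duplicates when they overlap heavily."""
    env = os.environ if env is None else env
    # Assumed variable name and default; the 0.5 default mirrors the diff above.
    threshold = float(env.get("PDF_CHAR_OVERLAP_RATIO_THRESHOLD", "0.5"))
    return overlap_ratio(a, b) > threshold


first_l = (0.0, 0.0, 3.0, 10.0)
second_l = (3.2, 0.0, 6.2, 10.0)   # legitimate "ll" in "all": side by side
bold_copy = (0.4, 0.0, 3.4, 10.0)  # fake-bold redraw, shifted slightly

print(is_fake_bold_pair(first_l, second_l, env={}))
print(is_fake_bold_pair(first_l, bold_copy, env={}))
```

Side-by-side letters have no overlap, so the legitimate double "l" survives, while the shifted redraw overlaps most of the original glyph and is flagged.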

bittoby · 2026-02-10T15:19:19Z

@badGarnet Updated!

bittoby · 2026-02-11T12:07:37Z

@badGarnet Sorry for tagging you again. could you please check again? thanks

bittoby · 2026-02-11T16:46:50Z

@badGarnet The Slack notification test is failing because of the SLACK_BOT token. Could you take a look?
My code changes shouldn’t be causing it - it seems more like a GitHub repo/settings issue. Please check this. Thanks.

fix: remove duplicate characters caused by fake bold rendering in PDFs

f8af84b

bittoby added 2 commits January 30, 2026 17:25

fix: solve merge conflict

8d80a34

fix: apply character deduplication to fast strategy for fake-bold PDFs

92c02d6

bittoby added 2 commits January 30, 2026 19:10

fix: define imports at the top

8377398

test: simplify fake-bold integration test assertions

d817d42

badGarnet reviewed Feb 2, 2026

bittoby added 2 commits February 2, 2026 18:15

Merge branch 'main' of https://github.com/bittoby/unstructured into f…

e0803a3

…ix/remove-pdf-bold-text-duplication

fix: improve fake-bold deduplication tests with specific assertions

3d11da7

bittoby added 2 commits February 3, 2026 18:47

Merge branch 'main' of https://github.com/bittoby/unstructured into f…

90a82c2

…ix/remove-pdf-bold-text-duplication

fix: remove unused pytest import to pass ruff linter

355e925

bittoby force-pushed the fix/remove-pdf-bold-text-duplication branch from 29d32e5 to 355e925 February 3, 2026 17:48

badGarnet approved these changes Feb 5, 2026

badGarnet enabled auto-merge February 5, 2026 16:50

fix: black formatting violations in PDF test files for CI/CD compliance

14d1231

auto-merge was automatically disabled February 5, 2026 17:39
Head branch was pushed to by a user without write access

badGarnet requested changes Feb 5, 2026

fix: Update code formatting and element ID to match new deterministri…

80e2774

…c ID generation

bittoby requested a review from badGarnet February 6, 2026 00:40

badGarnet reviewed Feb 6, 2026

fix: Update CHANGELOG

68fc61c

fix: recover origin 0.18.35

0728ec0

bittoby requested a review from badGarnet February 6, 2026 16:30

badGarnet approved these changes Feb 6, 2026

fix: improve PDF fake-bold deduplication by adding bounding box overl…

fb1067d

…ap analysis to prevent false positives on legitimate double letters

fix: solve merge conflict

1894ed1

badGarnet reviewed Feb 10, 2026

fix: make pdf character overlap ratio threshold configurable via PDF_…

5f7c2e6

…CHAR_OVERLAP_RATIO_THRESHOLD environment variable

fix: resolve merge conflict

9100347

badGarnet enabled auto-merge February 11, 2026 16:08

fix: resolve Lint style error

66beb1e

auto-merge was automatically disabled February 11, 2026 16:43
Head branch was pushed to by a user without write access
