-
Notifications
You must be signed in to change notification settings - Fork 1.2k
fix: remove duplicate characters caused by fake bold rendering in PDFs #4215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: remove duplicate characters caused by fake bold rendering in PDFs #4215
Conversation
|
@badGarnet Could you please review this PR? Thanks! |
Thanks for contributing! I would suggest finding an example pdf that has this kind of issue and add a test using it. The code reads fine to me but it would be good to test on an actual file. |
|
@badGarnet |
| assert len(text_with_dedup) <= len(text_no_dedup), ( | ||
| f"Deduplicated text ({len(text_with_dedup)} chars) should not be longer " | ||
| f"than non-deduplicated text ({len(text_no_dedup)} chars)" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a better assert would be:
- checking the exact expected text length
- check there is duplicated characters in the
text_no_dedup(likebboolldd) and normal text intext_with_dedupe(likebold)
diagnose_fake_bold.py
Outdated
| @@ -0,0 +1,69 @@ | |||
| """Diagnostic script to verify fake-bold PDF deduplication is working.""" | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a test against the new file is good enough; we don't need to add a script to root dir for this case
…ix/remove-pdf-bold-text-duplication
|
@badGarnet Thanks for your feedback. I've updated. Could you please review again and confirm that it’s configured correctly according to your req? thanks again! |
…ix/remove-pdf-bold-text-duplication
29d32e5 to
355e925
Compare
|
Hi, @badGarnet . I updated all. Hope you merge this when you have a sec |
Head branch was pushed to by a user without write access
|
@badGarnet Thanks for approval. Can you merge the PR! |
|
Sorry for tagging you again, @badGarnet. I faced linting test error, so I updated the code and pushed a new commit. Could you please review it again and merge? Thanks. |
badGarnet
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please update the changelog and move your entry to the appropriate section; please also bump the version number
|
I updated changelog and bumped version number |
CHANGELOG.md
Outdated
| - **Add `group_elements_by_parent_id` utility function**: Groups elements by their `parent_id` metadata field for easier document hierarchy traversal (fixes #1489) | ||
|
|
||
| ### Fixes | ||
| - **Fix duplicate characters in PDF bold text extraction**: Some PDFs render bold text by drawing each character twice at slightly offset positions, causing text like "BOLD" to be extracted as "BBOOLLDD". Added character-level deduplication based on position proximity. Configurable via `PDF_CHAR_DUPLICATE_THRESHOLD` environment variable (default: 3.0 pixels, set to 0 to disable)(fixes #3864). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
delete?
CHANGELOG.md
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please bump here as well
|
Sorry, @badGarnet - I misunderstood. I’ve now updated CHANGELOG.md and bumped the version correctly. Could you please check again? Thanks for taking a look. |
|
@badGarnet Thanks for approving. 👍 Could you merge this PR? |
|
@badGarnet Sorry for tagging you again! Let me know if it needs to update more. If not, I would appreciate to merge this PR! thanks |
@bittoby it seems the default duplication detection is too sensitive and have too many false positives that resulted in ingest test failure. E.g., this one shows double |
…ap analysis to prevent false positives on legitimate double letters
|
Thanks @badGarnet . Fixed! |
| # Fake-bold duplicates typically have >70% overlap | ||
| # Legitimate consecutive letters have <30% overlap (or none) | ||
| # Use 50% as threshold to be conservative | ||
| return overlap_ratio > 0.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's make this also a config variable so it can be changed via an env variable
…CHAR_OVERLAP_RATIO_THRESHOLD environment variable
|
@badGarnet Updated! |
|
@badGarnet Sorry for tagging you again. could you please check again? thanks |
Head branch was pushed to by a user without write access
|
@badGarnet The Slack notification test is failing because of the SLACK_BOT token. Could you take a look? |
Closes #3864
Summary
Problem
When extracting text from certain PDFs, bold text appears duplicated:
Solution
Added character-level deduplication that:
Configuration