this post was submitted on 05 Jan 2024
18 points (100.0% liked)

Python

Welcome to the Python community on the programming.dev Lemmy instance!


I'm not a software developer, but I like to use Python to help speed up some of my office work. One of my regular tasks is to print a stack of ~40 sheets of paper, highlight key information for each entry (about 3 entries per page), and fill out a spreadsheet with that information that then gets loaded into our software.

This is time-consuming, and I'd like to write a program that can scan the OCR-ed PDFs and pull the relevant information into a CSV.

I'm confident I could handle it from there, but I know that PDFs are tricky files to work with. Are there any Python modules that might be a good fit for the approach I'm hoping to take here? Thanks!

top 4 comments
[email protected] 9 points 8 months ago* (last edited 8 months ago)

PyMuPDF is excellent for extracting 'structured' text from a PDF page, though I believe 'pulling out relevant information' will still be a manual task unless the text you're working with can be parsed into meaningful units.

That's because the 'textual' content of a PDF is nothing more than a set of instructions to draw glyphs inside a rectangle that represents a page. The utilities that come with MuPDF or Poppler arrange those glyphs (not always perfectly) into 'blocks', 'lines', and 'words' based solely on whitespace separation. The programmer who uses those utilities in an end-user-facing application then has to figure out how to create the illusion, so to speak, that the user is selecting, copying, or searching for paragraphs, sentences, and so on, in proper reading order.
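To make the 'blocks', 'lines', and 'words' idea concrete, here is a rough sketch of how it might look with PyMuPDF. It assumes the package is installed and importable as `fitz`; the helper names `extract_words` and `words_to_lines` are made up for illustration. As far as I know, `page.get_text("words")` returns tuples of the form `(x0, y0, x1, y1, word, block_no, line_no, word_no)`, and the second helper only relies on that shape.

```python
from collections import defaultdict


def extract_words(pdf_path, page_number=0):
    """Return PyMuPDF word tuples for one page of a PDF.

    Each tuple is (x0, y0, x1, y1, word, block_no, line_no, word_no).
    """
    import fitz  # PyMuPDF; imported lazily so words_to_lines works without it

    with fitz.open(pdf_path) as doc:
        return doc[page_number].get_text("words")


def words_to_lines(words):
    """Rebuild text lines from word tuples using their block/line/word numbers."""
    lines = defaultdict(list)
    for x0, y0, x1, y1, word, block_no, line_no, word_no in words:
        lines[(block_no, line_no)].append((word_no, word))
    # Sort lines by (block, line) and the words inside each line by word number.
    return [
        " ".join(word for _, word in sorted(items))
        for _, items in sorted(lines.items())
    ]
```

Note that this follows the block and line numbering the library assigns, which is not guaranteed to match human reading order on every layout.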

PyMuPDF comes with a rich collection of convenience functions that make all of this less painful, such as dehyphenation and removal of superfluous whitespace. Even so, you will need some further processing to pick out the information a human would consider relevant.

Python's built-in regex support (the re module) may be enough for that parsing; if not, you might want to look into NLTK, which applies more sophisticated methods to tokenize words and sentences.
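As a sketch of the regex route, assuming each entry in the extracted text looks roughly like three labelled lines (Name, Account, Amount). That layout, the field names, and the helper names are all invented for illustration, so the pattern would need adjusting to whatever the real documents contain. Only the standard library is used.

```python
import csv
import io
import re

# Hypothetical entry layout: adapt this pattern to the real documents.
ENTRY_PATTERN = re.compile(
    r"Name:\s*(?P<name>.+?)\s*\n"
    r"Account:\s*(?P<account>\d+)\s*\n"
    r"Amount:\s*\$?(?P<amount>[\d,]+\.\d{2})"
)


def parse_entries(text):
    """Return one dict per entry found in the plain text of a page."""
    return [match.groupdict() for match in ENTRY_PATTERN.finditer(text)]


def entries_to_csv(entries, fieldnames=("name", "account", "amount")):
    """Serialize parsed entries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()
```

Writing to a file instead of a string would just mean passing an open file (opened with `newline=""`) to `csv.DictWriter` in place of the `StringIO` buffer.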

EDIT: I really should've mentioned some proper full-text search tools. Once you have a good plaintext representation of a PDF page, you might want to feed it into tools like the following to index it for the information you care about:

https://lunr.readthedocs.io/en/latest/ -- this is easy to use and set up, especially in a Python project.

... it's based on principles that are put to use in this full-scale, 'industrial strength' full-text search engine: https://solr.apache.org/ -- it's a bit of a pain to set up, but Python can interface with it through any HTTP client. Once you set up some kind of mapping between search tokens/keywords/tags, the plaintext page, and the actual PDF, you can get from a phrase search, for example, to a bunch of vector graphics (i.e. the PDF) relatively painlessly.
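To illustrate the kind of mapping meant here, below is a toy inverted index in plain Python. To be clear, this is neither lunr nor Solr, just a stand-in for the idea they build on: each token points back to the (PDF file, page number) pairs whose text contains it. The function names and the input shape are assumptions, and the search is a simple all-words-must-match lookup, not a true phrase search.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each token to the set of (pdf_name, page_number) it appears on.

    pages is an iterable of (pdf_name, page_number, plain_text) tuples.
    """
    index = defaultdict(set)
    for pdf_name, page_number, text in pages:
        for token in tokenize(text):
            index[token].add((pdf_name, page_number))
    return index


def search(index, query):
    """Return the sorted pages that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    hits = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(hits)
```

A real search library adds stemming, ranking, and phrase matching on top of this, which is the reason to reach for one once the toy version stops being enough.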

[email protected] 4 points 8 months ago* (last edited 8 months ago)

pypdf, which was recently updated to version 3. It sometimes takes a bit of wrangling for more specific use cases: I've used it in conjunction with reportlab when I needed to add text and other elements with a bit more flexibility.

[email protected] 2 points 8 months ago

From what I understand, PyPDF3 and PyPDF4 are separate from pypdf, which is the modern version of PyPDF2 as of last year.

[email protected] 2 points 8 months ago

That's correct afaik. The maintainers of PyPDF2 merged it back into the original pypdf for version 3, I believe.