this post was submitted on 05 Jan 2024
18 points (100.0% liked)

Welcome to the Python community on the programming.dev Lemmy instance!


I’m not a software developer, but I like to use Python to help speed up some of my office work. One of my regular tasks is to print a stack of ~40 sheets of paper, highlight key information for each entry (about 3 entries per page), and fill out a spreadsheet with that information that then gets loaded into our software.

This is time-consuming, and I’d like to write a program that can scan the OCR-ed PDFs and pull the relevant information into a CSV.

I’m confident I could handle it from there, but I know that PDFs are tricky files to work with. Are there any Python modules that might be a good fit for the approach I’m hoping to take here? Thanks!
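
For reference, the workflow described above could be sketched roughly as follows. This is only a minimal sketch, assuming the PDFs already carry an OCR text layer and that pypdf is installed (`pip install pypdf`). The `Invoice:` / `Amount:` pattern and the helper names are hypothetical placeholders, since the post does not say what the entries look like; the regular expression would need to be adapted to the real documents.

```python
import csv
import io
import re

# Hypothetical entry layout: assumes each entry contains text such as
# "Invoice: 12345  Amount: 67.89". Replace with a pattern for the real documents.
ENTRY_PATTERN = re.compile(
    r"Invoice:\s*(?P<invoice>\d+)\s+Amount:\s*(?P<amount>[\d.]+)"
)
FIELDNAMES = ["invoice", "amount"]


def extract_page_texts(pdf_path):
    """Return the text layer of every page in the PDF, one string per page."""
    # Imported lazily so the parsing helpers below work without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can come back empty for pages without a text layer.
    return [page.extract_text() or "" for page in reader.pages]


def find_entries(text):
    """Return one dict per entry found in the given text."""
    return [match.groupdict() for match in ENTRY_PATTERN.finditer(text)]


def entries_to_csv_text(entries):
    """Serialize the entries as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(entries)
    return buffer.getvalue()


def pdf_to_csv(pdf_path, csv_path):
    """Read every page of the PDF and write all matched entries to a CSV file."""
    entries = []
    for page_text in extract_page_texts(pdf_path):
        entries.extend(find_entries(page_text))
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        handle.write(entries_to_csv_text(entries))
```

Calling `pdf_to_csv("scan.pdf", "entries.csv")` (both file names are placeholders) would then produce the spreadsheet input. Keeping the text extraction separate from the pattern matching makes it easier to print the raw page text first and see what the OCR layer actually contains before tuning the pattern.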

[–] [email protected] 4 points 8 months ago* (last edited 8 months ago) (2 children)

pypdf, which was recently updated to version 3. It sometimes takes a bit of wrangling for more specific use cases: I've used it in conjunction with reportlab when I needed to add text and other bits with a bit more flexibility.
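
As a rough illustration of that pypdf plus reportlab combination (not necessarily how it was done in the case above): reportlab draws the new text onto a blank in-memory page, and pypdf merges that page over an existing one. The function names, coordinates, and page size here are assumptions chosen for the example, and both packages need to be installed.

```python
import io


def make_text_overlay(text, x=72, y=720):
    """Build a one-page PDF in memory containing only `text` at (x, y) in points."""
    # Imported lazily; requires the third-party reportlab package.
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    buffer = io.BytesIO()
    page_canvas = canvas.Canvas(buffer, pagesize=letter)
    page_canvas.drawString(x, y, text)
    page_canvas.save()
    buffer.seek(0)
    return buffer


def stamp_first_page(src_pdf, overlay_pdf, dst_pdf):
    """Merge the overlay's first page onto the first page of src_pdf.

    Each argument may be a file path or a binary file-like object.
    """
    # Imported lazily; requires the third-party pypdf package.
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(src_pdf)
    overlay_page = PdfReader(overlay_pdf).pages[0]
    writer = PdfWriter()
    for index, page in enumerate(reader.pages):
        if index == 0:
            # Draws the overlay's content on top of the existing page content.
            page.merge_page(overlay_page)
        writer.add_page(page)
    writer.write(dst_pdf)
```

Something like `stamp_first_page("input.pdf", make_text_overlay("Reviewed"), "output.pdf")` would write a stamped copy, with the file names and the stamp text being placeholders.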

[–] [email protected] 2 points 8 months ago (1 children)

From what I understand, PyPDF3 and PyPDF4 are separate from pypdf, which is the modern version of PyPDF2 as of last year.

source link

[–] [email protected] 2 points 8 months ago

That's correct afaik. The maintainers of PyPDF2 merged it back into the original pypdf for version 3 I believe.
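
Since the similar names are easy to mix up, a quick way to see which of these packages is actually installed in an environment is to ask the standard library for their versions. This is just a small sketch; the function name is made up for the example, and the list of distribution names is taken from the ones mentioned in this thread.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names mentioned in this thread.
CANDIDATES = ("pypdf", "PyPDF2", "PyPDF3", "PyPDF4")


def installed_pdf_packages(names=CANDIDATES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, installed in installed_pdf_packages().items():
        print(f"{name}: {installed or 'not installed'}")
```

If only `pypdf` shows a version, then `from pypdf import PdfReader` is the import to use.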