this post was submitted on 20 Jun 2025
A PDF is (or at least can be) similar to an HTML document on the inside. A long time ago we used that at my company to edit PDFs through Java code.
Is it possible for you to share the document so we can take a closer look at it? Or if you don't want it on the internet, is there a way to share it privately?
I'm really sorry, but the PDF contains some personal information. May I ask how you process a PDF as HTML?
It's not processed as HTML. It's just that PDF is a structured file format (as is HTML, though the two are very different). There are libraries for most programming languages that allow you to edit this structure.
To add to what Elvith wrote:
You can read the HTML-like structures inside a PDF, find a property that the elements you want to remove have in common, and then remove every element that has that property.
This is very hard to do by hand. But if you are curious you can download https://file-examples.com/wp-content/storage/2017/10/file-sample_150kB.pdf
and open it with a text editor like Kate. You will see a lot of encoded content data, but also the "HTML-like" structure in plain text (in between the encoded parts, and more of it at the bottom).
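If you'd rather poke at that structure from code than from a text editor, here is a rough sketch using only the Python standard library. `SAMPLE` is a tiny hand-written stand-in, not the demo file, and `list_objects` is a name made up for this example. It only sees objects stored as plain text; newer PDFs often pack objects into compressed streams, which this will miss.

```python
import re

# Tiny hand-written stand-in for a real PDF (hypothetical sample data).
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"
    b"trailer\n<< /Root 1 0 R /Size 4 >>\n%%EOF\n"
)

# "N G obj ... endobj" is how a PDF wraps each object in plain text.
OBJ_RE = re.compile(rb"(\d+) (\d+) obj\b(.*?)endobj", re.DOTALL)
TYPE_RE = re.compile(rb"/Type\s*/(\w+)")


def list_objects(data: bytes) -> list[tuple[int, str]]:
    """Return (object number, /Type name or '?') for each plain-text object."""
    found = []
    for match in OBJ_RE.finditer(data):
        type_match = TYPE_RE.search(match.group(3))
        kind = type_match.group(1).decode("ascii") if type_match else "?"
        found.append((int(match.group(1)), kind))
    return found


if __name__ == "__main__":
    for number, kind in list_objects(SAMPLE):
        print(number, kind)
```

To try it on a real file, read the file with `open(path, "rb").read()` and pass the bytes to `list_objects`.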
You might find that editing the PDF by hand breaks it completely, because the format is complicated. Iirc you'd need to fix the index (the cross-reference table near the end of the file, which stores the byte offset of every object) and the stream lengths, or do some other magic bullshit. But that is usually taken care of by the library.
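To make the "index" problem concrete, here is a small standard-library sketch. It assumes a classic, uncompressed cross-reference table and generation number 0 for every object; `build_pdf` and `stale_objects` are hypothetical helpers written for this illustration, and the generated file is a toy, not a complete valid PDF.

```python
import re


def build_pdf(bodies: list[bytes]) -> bytes:
    """Assemble a toy PDF-like file with a classic cross-reference table."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(bodies, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(bodies) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(bodies) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


XREF_RE = re.compile(rb"(?<!start)xref\s+(\d+) (\d+)\s+")
ENTRY_RE = re.compile(rb"(\d{10}) (\d{5}) ([nf])")


def stale_objects(data: bytes) -> list[int]:
    """Return object numbers whose table offset no longer points at 'N 0 obj'."""
    header = XREF_RE.search(data)
    first, count = int(header.group(1)), int(header.group(2))
    entries = ENTRY_RE.findall(data, header.end())[:count]
    stale = []
    for number, (offset, _gen, kind) in enumerate(entries, start=first):
        in_use = kind == b"n"
        if in_use and not data.startswith(b"%d 0 obj" % number, int(offset)):
            stale.append(number)
    return stale


if __name__ == "__main__":
    pdf = build_pdf([b"<< /Type /Catalog >>", b"<< /Note (hi) >>", b"<< /Note (bye) >>"])
    print(stale_objects(pdf))  # prints []
    # A "hand edit" that makes object 2 longer shifts everything after it.
    edited = pdf.replace(b"(hi)", b"(hello there)")
    print(stale_objects(edited))  # prints [3]
```

The edit only touched object 2, yet object 3 is the one the table can no longer find. That is the kind of bookkeeping a PDF library redoes for you on save.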
Here is a shitty Python example for that demo PDF that redacts the image on the last page by drawing a white rectangle over it. Keep in mind that painting over an image only hides it; the data is still in the file. PyMuPDF's redaction annotations (add_redact_annot followed by apply_redactions) actually remove the covered content, and other libraries can delete elements too (the one I used a decade ago in Java could). I just wanted to point you in the general direction; I hope you can see from here how iterating over all the pages, picking the right element and redacting it would work.
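A rough sketch of that idea with PyMuPDF, written as an illustration rather than as the exact snippet from this comment. The function names, the `remove` switch and the file names are made up for the example; the page methods used (`get_images`, `get_image_rects`, `draw_rect`, `add_redact_annot`, `apply_redactions`) are from the PyMuPDF docs linked below. It assumes a recent PyMuPDF that can be imported as `pymupdf` (older releases use `fitz`).

```python
WHITE = (1, 1, 1)  # RGB as floats in the 0..1 range


def cover_images(page, remove: bool = False) -> int:
    """Hide every image on a PyMuPDF page; return the number of rectangles hit.

    remove=False paints a white rectangle on top (the image data stays in the file).
    remove=True uses redaction annotations, which blank out the covered content.
    """
    count = 0
    for image in page.get_images(full=True):
        xref = image[0]  # first item of each entry is the image's xref number
        for rect in page.get_image_rects(xref):
            if remove:
                page.add_redact_annot(rect, fill=WHITE)
            else:
                page.draw_rect(rect, color=WHITE, fill=WHITE)
            count += 1
    if remove and count:
        page.apply_redactions()
    return count


def cover_last_page_images(src: str, dst: str, remove: bool = False) -> int:
    """Open src, hide the images on its last page, and write the result to dst."""
    import pymupdf  # imported lazily so cover_images can be reused without it

    with pymupdf.open(src) as doc:
        count = cover_images(doc[-1], remove=remove)
        doc.save(dst)
    return count


if __name__ == "__main__":
    # Hypothetical file names; point these at your own copy of the demo PDF.
    print(cover_last_page_images("file-sample_150kB.pdf", "redacted.pdf", remove=True))
```

To handle every page instead of only the last one, loop with `for page in doc:` and call `cover_images(page)` on each.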
A way simpler approach might be to crop all pages at the bottom.
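A minimal sketch of the cropping idea, again with PyMuPDF. It assumes unrotated pages and that crop box coordinates start at the top-left corner with y growing downward, so the bottom edge is `y1`. `cropped_box`, `crop_bottom`, the 72-point default and the file names are made up for the example. Cropping only changes the visible area; the cut-off content is still stored in the file.

```python
def cropped_box(box: tuple[float, float, float, float], cut: float):
    """Return (x0, y0, x1, y1) with `cut` points removed from the bottom edge."""
    x0, y0, x1, y1 = box
    if not 0 <= cut < y1 - y0:
        raise ValueError("cut must be non-negative and smaller than the page height")
    return (x0, y0, x1, y1 - cut)


def crop_bottom(src: str, dst: str, cut: float = 72.0) -> None:
    """Shrink the visible area of every page in src by `cut` points at the bottom."""
    import pymupdf  # imported lazily so cropped_box works without PyMuPDF

    with pymupdf.open(src) as doc:
        for page in doc:
            # Start from the current crop box so an existing crop is respected.
            new_box = cropped_box(tuple(page.cropbox), cut)
            page.set_cropbox(pymupdf.Rect(new_box))
        doc.save(dst)


if __name__ == "__main__":
    # Hypothetical file names; 72 points is one inch.
    crop_bottom("input.pdf", "cropped.pdf", cut=72.0)
```

If the personal information has to be gone for real, not just hidden, redaction is the safer route than cropping.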
Here are the docs: https://pymupdf.readthedocs.io/en/latest/the-basics.html
Thanks a lot!