[-] HelloRoot@lemy.lol 4 points 1 month ago* (last edited 1 month ago)

A PDF is (or at least can be) similar to an HTML document on the inside. A long time ago we used that at my company to edit PDFs through Java code.

Is it possible for you to share the document so we can take a closer look at it? Or if you don't want it on the internet, is there a way to share it privately?

[-] happeningtofry99158@lemmy.world 2 points 1 month ago

I'm really sorry, the PDF contains some personal information. May I ask how you process a PDF as HTML?

[-] elvith@feddit.org 3 points 1 month ago

It's not HTML. It's just that PDF is a structured file format (as is HTML, though a very different one). There are libraries for most programming languages that allow you to edit this structure.

[-] HelloRoot@lemy.lol 2 points 1 month ago* (last edited 1 month ago)

To add to what Elvith wrote:

You can read the HTML-like structures inside a PDF, find out details about the elements you want to remove, and then remove them using a property they have in common.

This is very hard to do by hand. But if you are curious you can download https://file-examples.com/wp-content/storage/2017/10/file-sample_150kB.pdf

and open it with a text editor like Kate. You will see a lot of encoded content data, but also the "HTML-like" structure in plain text (in between the encoded parts, and more of it at the bottom).

You might find that editing the PDF by hand breaks it completely, because the format is complicated. Iirc you'd need to fix the index (the cross-reference table of byte offsets near the end of the file), the stream lengths, or do some other magic bullshit. But that is usually taken care of by the library.
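To illustrate why hand edits tend to break the file, here is a minimal sketch using only the Python standard library. It assumes a classic (non-stream) cross-reference table, where the `startxref` entry near the end of the file stores the byte offset of the `xref` keyword. The helper names and the tiny hand-built sample are made up for illustration; real PDFs are more involved.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the byte offset recorded after the last 'startxref' keyword."""
    matches = re.findall(rb"startxref\s+(\d+)", data)
    if not matches:
        raise ValueError("no startxref keyword found")
    return int(matches[-1])


def xref_is_consistent(data: bytes) -> bool:
    """Check that the recorded offset really points at a classic 'xref' table.

    Assumption: files using cross-reference streams (PDF 1.5+) are not handled.
    """
    offset = find_startxref(data)
    return data[offset:offset + 4] == b"xref"


# A tiny hand-built sample, just enough structure to show the idea.
body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
sample = (
    body
    + b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    + b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n"
    + str(len(body)).encode()
    + b"\n%%EOF\n"
)

# Inserting two bytes by hand shifts everything after them,
# so the stored offset no longer points at the table.
edited = sample.replace(b"/Catalog", b"/CatalogXX")

print(xref_is_consistent(sample))
print(xref_is_consistent(edited))
```

A library rewrites these offsets on save, which is why going through one is usually less painful than editing the bytes directly.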

Here is a shitty Python example for that demo PDF that redacts the image on the last page by covering it with a white rectangle. As far as I could tell, pymupdf has no obvious way to delete an image or a text block outright (newer versions may offer one), but this is just an example. Other libraries might be able to do it (the one I used a decade ago in Java could). I just wanted to point you in the general direction; I hope you can see from here how iterating over all the pages, picking the right element and redacting it would work.

import pymupdf  # PyMuPDF

# Open the PDF
doc = pymupdf.open("./file-sample_150kB.pdf")

# Get the last page
page = doc[-1]

# Get all images on the page
images = page.get_images(full=True)

if images:
    # Get the xref of the first image
    xref = images[0][0]

    # Find all instances of the image and redact their bounding boxes
    for info in page.get_image_info(xrefs=True):
        if info["xref"] == xref:
            rect = pymupdf.Rect(info["bbox"])
            page.add_redact_annot(rect, fill=(1, 1, 1))  # white fill

    page.apply_redactions()

# Save the modified PDF
doc.save("./modified.pdf")
doc.close()
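Extending that to every page could look roughly like the sketch below. It assumes the unwanted element is an image sitting in the bottom band of each page; `in_footer`, `redact_footer_images` and the 100-point band are made-up names and values, not anything from the original document. PyMuPDF is imported inside the function so the geometry helper can be tried without it.

```python
from typing import Sequence


def in_footer(bbox: Sequence[float], page_height: float, footer_height: float) -> bool:
    """True if the box lies entirely within the bottom `footer_height` points.

    PyMuPDF coordinates start at the top-left corner and y grows downward,
    so the footer band spans page_height - footer_height .. page_height.
    """
    x0, y0, x1, y1 = bbox
    return y0 >= page_height - footer_height and y1 <= page_height


def redact_footer_images(src: str, dst: str, footer_height: float = 100.0) -> int:
    """Redact every image inside the footer band on every page.

    Returns the number of redaction annotations that were applied.
    """
    import pymupdf  # lazy import: only needed when a PDF is actually processed

    count = 0
    doc = pymupdf.open(src)
    for page in doc:
        hits = 0
        for info in page.get_image_info(xrefs=True):
            if in_footer(info["bbox"], page.rect.height, footer_height):
                page.add_redact_annot(pymupdf.Rect(info["bbox"]), fill=(1, 1, 1))
                hits += 1
        if hits:
            page.apply_redactions()
            count += hits
    doc.save(dst)
    doc.close()
    return count
```

Calling `redact_footer_images("input.pdf", "output.pdf")` would then process the whole file; both file names are placeholders. Try it on a copy first and check a few pages by eye.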

A way simpler approach might be to crop all pages at the bottom.

import pymupdf  # PyMuPDF

doc = pymupdf.open("input.pdf")  # open the PDF

for page in doc:
    rect = page.rect  # original page size
    new_rect = pymupdf.Rect(rect.x0, rect.y0, rect.x1, rect.y1 - 100)  # crop 100 pt off the bottom (y grows downward)
    page.set_cropbox(new_rect)

doc.save("output.pdf")  # save the cropped PDF
doc.close()

Here are the docs: https://pymupdf.readthedocs.io/en/latest/the-basics.html

[-] happeningtofry99158@lemmy.world 2 points 1 month ago

Thanks a lot!

[+] lol@discuss.tchncs.de 4 points 1 month ago* (last edited 1 month ago)

[deleted]

[-] happeningtofry99158@lemmy.world 3 points 1 month ago

Much appreciated. Is there an open-source alternative for this?

[-] Flagstaff@programming.dev 2 points 1 month ago* (last edited 1 month ago)

The freemium, offline-capable software Stirling-PDF is king of open-source PDF-editing. Nothing else comes even close, despite how counterintuitive its UI can be. Ironically enough, the creator said it was originally a ChatGPT experiment, but once it went viral, he ended up hiring others and they all regularly improve it as its staff.

If you are on Windows, head over to @ahk@programming.dev and we'll see what we can do about automating the removal per pg. via AutoHotkey.

[-] happeningtofry99158@lemmy.world 1 points 1 month ago

Stirling-PDF

Thanks a lot! Stirling-PDF is indeed very powerful!

[-] thevoidzero@lemmy.world 3 points 1 month ago

I don't know how comfortable you are writing your own, but a PDF stores its components with coordinates, bounding boxes, etc., so you should be able to automate it with a small script that reads the PDF components directly.

Also try qpdf to convert the PDF into QDF format; then you can open it in a text editor and find the element you want to remove. Look at a few example pages, find the pattern and do a regex replace. Make sure to keep a copy and check the diff before accepting it.
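A rough sketch of that regex-and-diff step, using only the Python standard library. It assumes the file was first converted with something like `qpdf --qdf --object-streams=disable input.pdf editable.qdf` and would be repaired afterwards with `fix-qdf`. The content-stream snippet, the footer text and the pattern are hypothetical; the real pattern depends on what your pages actually contain.

```python
import difflib
import re


def remove_pattern(text: str, pattern: str) -> str:
    """Delete every match of `pattern` from the QDF text."""
    return re.sub(pattern, "", text)


def show_diff(before: str, after: str) -> str:
    """Return a unified diff so the change can be reviewed before saving."""
    return "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile="original.qdf",
            tofile="edited.qdf",
        )
    )


# Hypothetical excerpt of a content stream: one body text block, one footer block.
sample = (
    "BT\n/F1 12 Tf\n72 720 Td\n(Chapter one) Tj\nET\n"
    "BT\n/F1 8 Tf\n72 30 Td\n(https://example.com/footer) Tj\nET\n"
)

# Hypothetical pattern matching only the footer block.
pattern = r"BT\n/F1 8 Tf\n72 30 Td\n\(https://example\.com/footer\) Tj\nET\n"

edited = remove_pattern(sample, pattern)
print(show_diff(sample, edited))
```

For a real file you would read the QDF with `open(path, encoding="latin-1")` so that binary stream data survives the round trip, and only write the result once the diff shows nothing but the lines you meant to remove.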

[-] beepbooprobot@lemmy.world 2 points 1 month ago

PDFGear might be able to handle this if your link is available in the page footer.

Open the file, click edit, header & footer then update and make your changes globally across all 500 pages.

[-] happeningtofry99158@lemmy.world 2 points 1 month ago

thanks. May I also recommend Stirling-PDF

[-] j5906@feddit.org 2 points 1 month ago

Found this on stackoverflow: https://stackoverflow.com/questions/68903835/how-to-draw-a-shape-inside-a-pdf-with-python

import fitz

# Open the pdf
doc = fitz.open('./test.pdf')
for page in doc:
    # For every page, draw a rectangle on coordinates (1,1)(100,100)
    page.draw_rect([1, 1, 100, 100], color=(0, 1, 0), width=2)

# Save pdf
doc.save('./your-route/name.pdf')

Seems like it has a solid chance of working.

[-] happeningtofry99158@lemmy.world 1 points 1 month ago

does it simply place an image of that size on top of every page at the desired position or does it redact the element underneath it as well?

[-] VoxAliorum@lemmy.ml 1 points 1 month ago* (last edited 1 month ago)

I tend towards: No. I can't know for sure, but given how pdfs are structured this sounds very difficult.

A workaround might be to automatically place white boxes over those, but you can probably still select the text underneath afterwards.

[-] SkavarSharraddas@gehirneimer.de 1 points 1 month ago

Something like https://flathub.org/apps/com.github.jeromerobert.pdfarranger to crop pages might work, or https://flathub.org/apps/net.sourceforge.Pdfedit (old and possibly insecure) for more options.

[-] happeningtofry99158@lemmy.world 1 points 1 month ago

Sincere thanks I will look into that

[-] nexguy@lemmy.world 1 points 1 month ago

Just a thought, put a white square over the desired position on every page... quicker? (Not sure how to do it though)

[-] gi1242@lemmy.world -1 points 1 month ago

pdftk can split it into pages and then recombine the pages. Not sure how to automatically remove an element from each page, unfortunately.

[-] Flagstaff@programming.dev 3 points 1 month ago

Stirling-PDF can do 10x as many kinds of tasks as pdftk can. I used to use pdftk until I discovered Stirling-PDF and have never looked back.

[-] happeningtofry99158@lemmy.world 1 points 1 month ago

yes just found Stirling-PDF. It is indeed amazing

[-] Cyberflunk@lemmy.world -1 points 1 month ago

Claude Code wrote an OpenCV Python app to remove every other word from the Declaration of Independence.

Not really, but you wondered

[-] Anon518@sh.itjust.works -2 points 1 month ago* (last edited 1 month ago)

Why did you link to reddit? Can a mod/admin do something about this? People keep advertising reddit here.

[-] happeningtofry99158@lemmy.world 2 points 1 month ago

because it is the source

[-] Anon518@sh.itjust.works 1 points 1 month ago

Use an archive site. The OP asked a question. He's linking to someone on reddit asking the same question??

[-] Vendetta9076@sh.itjust.works 1 points 1 month ago* (last edited 1 month ago)

Like it or not, reddit isn't an illegal site or anything. Asking mods to do something about linking to reddit is like asking them to do something about linking to twitter. It's not even an "advertisement".

[-] B0rax@feddit.org -5 points 1 month ago

Ask chatgpt for a python script to do exactly that task. Maybe you get what you want, maybe not.

Helped me out quite a few times with niche tasks like this.

[-] Solumbran@lemmy.world 4 points 1 month ago

Look at how to do it with python, you'll learn interesting stuff, get a working result, and not destroy your brain using a chat simulator as a programming help.

I don't get why people are fine with comments that are as absurd as saying "to hang a painting, first stab a screwdriver in the wall then attach the painting to it, sometimes it's not too bad"

[-] B0rax@feddit.org -2 points 1 month ago

How is asking chatgpt to do it any different from using a different pre-made tool? Neither requires programming.

[-] Solumbran@lemmy.world 3 points 1 month ago

How is using a spoon different from using a military tank? They're both tools after all.

[-] golden_zealot@lemmy.ml 3 points 1 month ago

Pre-made tools have reproducible, known functionality that has been tested, whereas LLMs generating this for 100 different users may come up with 100 different untested results, and someone who doesn't know programming won't really know what to expect from the generated code.

In short, pre-made tools don't require programming knowledge because someone has handled all of this for you previously, but LLMs do require programming knowledge to make sure that what they produced is going to work safely and correctly.

[-] B0rax@feddit.org -1 points 1 month ago

Sure, but the OP has a singular task to fulfill which he can easily check if it is done correctly. The premade tools don’t have the functionality, now what? Tell him to learn how to code?

[-] golden_zealot@lemmy.ml 2 points 1 month ago* (last edited 1 month ago)

Tell him to learn how to code?

Optimally yes, the OP should learn some code before doing so - this task doesn't seem that difficult to do with a script if you wrote it yourself, and it's even less work to learn enough to just verify what the script is doing.

I have no idea how a car works at a deep level. However I know enough to know how to drive, and if I see its mirrors are broken off, the seat belts are missing, or there is gas leaking out of it onto the ground, I probably shouldn't get in and drive it.

If you don't understand code and run generated code, the problem is that you are stuck with a result that you may or may not have wanted. You may also just think it worked correctly when in fact it might have done other stuff as well that can't be seen plainly. This is the inherent risk of running generated code when you can't actually verify what it's doing.

Maybe it performs the requested function correctly but is sourcing the original code from a use case where someone also wanted to delete every other kind of file that wasn't a pdf in that directory. Maybe not. But this is a difference of one line of code which can have major ramifications if it gets left in.

The point is that if you aren't certain what something does before you use it, you should at a minimum go through the necessary steps to be able to make an informed decision, otherwise it's just reckless.

[-] happeningtofry99158@lemmy.world 2 points 1 month ago

thanks I will try that
