submitted 10 months ago by dead@hexbear.net to c/news@hexbear.net
[-] dRLY@hexbear.net 16 points 10 months ago

I think the only issue with a txt version would come down to not being able to view the items that aren't actually regular text (scans of originals and the ability to see stuff that was handwritten or whatever) or images. Of course most of the docs will be just text, but it would be easy to lose information. What is your main issue with them being in PDF?

[-] frauddogg@lemmygrad.ml 6 points 10 months ago

can't control-f names we feel WOULD turn up in this doc, which means about 1200 pages now have to be trawled through entirely manually

[-] MayoPete@hexbear.net 2 points 10 months ago

Are there OCR tools that can convert the PDF to a searchable file?

[-] frauddogg@lemmygrad.ml 3 points 10 months ago

If there are, I certainly don't know about 'em-- stuff I could've used for my textbook epubs last semester

[-] IzyaKatzmann@hexbear.net 3 points 10 months ago* (last edited 10 months ago)

pdf2text and tesseract, i believe pdf2text uses tesseract. i have them installed on an apple silicon mac with homebrew (e.g. brew install tesseract or brew install pdf2text)
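For anyone who wants to try this, here is a rough sketch of one way to do the OCR step. It assumes Poppler's `pdftoppm` and the `tesseract` CLI are both installed and on the PATH; that is a stand-in setup, not necessarily the pdf2text install described above. The function names and file names are made up for illustration. Scanned pages have no text layer to extract, so the sketch turns each page into an image and OCRs it.

```python
import subprocess
import tempfile
from pathlib import Path


def rasterize_cmd(pdf, prefix, dpi=300):
    # pdftoppm (Poppler) writes one PNG per page: prefix-1.png, prefix-2.png, ...
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image, lang="eng"):
    # Using "stdout" as the output base makes tesseract print the text
    # instead of writing a .txt file next to the image.
    return ["tesseract", str(image), "stdout", "-l", lang]


def ocr_pdf(pdf, out_txt, dpi=300, lang="eng"):
    """OCR every page of `pdf` into a single plain-text file."""
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(rasterize_cmd(pdf, Path(tmp) / "page", dpi), check=True)
        # pdftoppm zero-pads page numbers, so sorting by name keeps page order.
        images = sorted(Path(tmp).glob("page-*.png"))
        with open(out_txt, "w", encoding="utf-8") as out:
            for number, image in enumerate(images, start=1):
                result = subprocess.run(
                    ocr_cmd(image, lang),
                    check=True,
                    capture_output=True,
                    text=True,
                )
                # Page markers make it possible to tell where a hit came from.
                out.write(f"\n===== page {number} =====\n")
                out.write(result.stdout)
```

Calling something like `ocr_pdf("release.pdf", "release.txt")` (hypothetical file names) should leave a text file that can be control-f'd. Expect it to be slow on ~1200 pages and to do poorly on handwriting.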

could probably use some ai computer vision package (i haven't checked, i remember looking around before settling on pdf2text) like opencv.

when i used pdf2text it was with pdf slides my prof provided, they ONLY gave pdfs. something about copyright and IP. super interesting prof, great scientist, great researcher, actually a member of some cool orgs like the Linnean Society, and annoying with her lecture files.

EDIT: if anyone wants it enough i can try to do a proof-of-concept for like ~15 random pages of a random doc and see how well it goes
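A proof-of-concept like that could be checked with something along these lines. This is only a sketch: it assumes the OCR text for each page is already sitting in a dict keyed by page number, and the helper names and the names being searched for are placeholders, not anything from the actual docs.

```python
import random
import re


def sample_pages(total_pages, k=15, seed=None):
    """Pick up to k distinct page numbers (1-based), returned in order."""
    rng = random.Random(seed)
    k = min(k, total_pages)
    return sorted(rng.sample(range(1, total_pages + 1), k))


def find_names(pages, names):
    """pages: {page_number: ocr_text}. Returns {name: [page numbers with a hit]}."""
    hits = {name: [] for name in names}
    for number in sorted(pages):
        # OCR output often breaks a line in the middle of a name,
        # so collapse all whitespace before matching.
        text = re.sub(r"\s+", " ", pages[number]).lower()
        for name in names:
            if name.lower() in text:
                hits[name].append(number)
    return hits


if __name__ == "__main__":
    # Placeholder text standing in for real OCR output.
    demo = {3: "met with Jane\nDoe at noon", 7: "JANE DOE signed"}
    print(sample_pages(1200, 15))
    print(find_names(demo, ["Jane Doe"]))
```

Plain substring matching will miss names the OCR garbled, so a clean run on the sample pages would not prove a name is absent.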

this post was submitted on 05 Jan 2024
68 points (100.0% liked)