submitted 7 months ago* (last edited 7 months ago) by muntedcrocodile@lemm.ee to c/piracy@lemmy.dbzer0.com

Not sure if this is the right community but seems close enough.

Ideally I want a URL that I can put any paywalled news article into and get the unpaywalled version back.

For example: https://somedomain/https://somenewssite/somenewsarticle

I need it to work with https://pypi.org/project/newspaper4k/

Alternatively, if someone knows of another Python library that can automatically extract article text and images from just a link, that would also solve my problem.

top 9 comments

Generally, 12ft.io works pretty well for me.

[-] BlackEco@lemmy.blackeco.com 10 points 7 months ago* (last edited 7 months ago)

Most of the time archive.today gets the job done.

It also offers a URL pattern that returns the newest snapshot of a given page: http://archive.is/newest/http://lemmy.dbzer0.com/c/piracy
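A minimal sketch of that prefix pattern in Python, in case it helps with the "one URL in, one URL out" requirement from the post. The function name `newest_snapshot_url` is made up for illustration; the only thing taken from above is the `http://archive.is/newest/` prefix. It just builds the string and does not check whether archive.today will serve a script without a CAPTCHA.

```python
# Prefix copied from the example URL in the comment above.
ARCHIVE_NEWEST = "http://archive.is/newest/"


def newest_snapshot_url(page_url: str) -> str:
    """Build the archive.today address for the newest snapshot of page_url.

    Hypothetical helper: it only concatenates strings, no request is made.
    """
    return ARCHIVE_NEWEST + page_url.strip()


print(newest_snapshot_url("http://lemmy.dbzer0.com/c/piracy"))
# prints http://archive.is/newest/http://lemmy.dbzer0.com/c/piracy
```

The resulting string could then be handed to whatever fetches the article, though whether a given parser copes with the archive page's markup is something to test per site.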

[-] Byter@lemmy.one 7 points 7 months ago

Looks like newspaper4k uses headless Chrome. You could try loading the Bypass Paywalls Clean extension and browsing the pages directly.

I regularly use it (in Firefox) without even thinking about it. I only notice when I send someone an article they can't access.

[-] muntedcrocodile@lemm.ee 1 points 7 months ago

It doesn't use headless Chrome; it just uses the Python requests library. Did you get got by an AI hallucination?

Source: I went digging in the source code.

[-] Byter@lemmy.one 2 points 7 months ago

No, just this example code from their site:

browser = p.chromium.launch(headless=True)

My mistake was not knowing where newspaper4k fits in the stack. They're wrapping it with Playwright, which it seems you could do here.

[-] muntedcrocodile@lemm.ee 1 points 7 months ago

Ahh, I see. I'm using newspaper4k to fetch articles directly; it seems the example you found just uses it as a parser, with Playwright as the HTML fetcher. I might try that approach.
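For anyone landing here later, a rough sketch of that split: Playwright fetches the rendered HTML, and newspaper4k only parses it. It assumes both packages are installed (plus `playwright install chromium`). The function names `fetch_html`, `parse_html` and `extract` are made up for this example, and it relies on `Article.download(input_html=...)` accepting pre-fetched HTML. Swapping the fetcher does not get past a paywall by itself; it only lets a real browser (possibly one with an extension loaded) do the download.

```python
def fetch_html(url: str) -> str:
    """Load the page in headless Chromium and return the rendered HTML."""
    # Imported lazily so the rest of the module works without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            return page.content()
        finally:
            browser.close()


def parse_html(url: str, html: str):
    """Give already-fetched HTML to newspaper4k so it skips its own download."""
    from newspaper import Article

    article = Article(url)
    article.download(input_html=html)  # uses the supplied HTML, no request
    article.parse()
    return article


def extract(url: str, fetch=fetch_html, parse=parse_html) -> dict:
    """Fetch with one tool, parse with another, return the useful fields."""
    article = parse(url, fetch(url))
    return {
        "title": article.title,
        "text": article.text,
        "top_image": article.top_image,
    }
```

Calling `extract("https://somenewssite/somenewsarticle")` would run both steps. The `fetch` and `parse` arguments are there so either half can be replaced, for example with a fetcher that goes through an archive snapshot instead.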

[-] Darkassassin07@lemmy.ca 3 points 7 months ago
[-] muntedcrocodile@lemm.ee 1 points 7 months ago

Yeah, I've tried that. Only some of them work in a way that's easy to implement, but if the one I'm currently using goes down then I guess I'll have to bodge something together.

[-] _cryptagion@lemmy.dbzer0.com 1 points 7 months ago

12ft works, if you really need it. But in general, I just don’t read any publications that paywall their content. Mass media is all owned by one or two billionaires; if they need money they can get it from them.

this post was submitted on 30 Dec 2024
32 points (97.1% liked)
