Self-host Reddit – 2.38B posts, works offline, yours forever (github.com)

submitted 1 month ago by 19_84@lemmy.dbzer0.com to c/selfhosted@lemmy.world

Reddit's API is effectively dead for archival. Third-party apps are gone. Reddit has threatened to cut off access to the Pushshift dataset multiple times. But 3.28TB of Reddit history exists as a torrent right now, and I built a tool to turn it into something you can browse on your own hardware.

The key point: This doesn't touch Reddit's servers. Ever. Download the Pushshift dataset, run my tool locally, get a fully browsable archive. Works on an air-gapped machine. Works on a Raspberry Pi serving your LAN. Works on a USB drive you hand to someone.

What it does: Takes compressed data dumps from Reddit (.zst), Voat (SQL), and Ruqqus (.7z) and generates static HTML. No JavaScript, no external requests, no tracking. Open index.html and browse. Want search? Run the optional Docker stack with PostgreSQL – still entirely on your machine.
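To make the "static HTML, no JavaScript" idea concrete, here is a minimal sketch of turning one record into a standalone page. It is not the project's code: the post says redd-archiver uses Jinja2 templates, while this sketch swaps in plain `str.format` to stay stdlib-only. The `PAGE` template and `render_post` helper are hypothetical names; the record fields (`title`, `author`, `subreddit`, `selftext`) follow the usual shape of a Reddit submission in the dumps.

```python
import html
import json

# Hypothetical page template: plain HTML, no scripts, no external requests.
PAGE = """<!doctype html>
<html><head><meta charset="utf-8"><title>{title}</title></head>
<body>
<h1>{title}</h1>
<p>by {author} in r/{subreddit}</p>
<div>{body}</div>
</body></html>
"""


def render_post(record: dict) -> str:
    """Render one post record as a standalone HTML page with no scripts."""
    # Escape every user-supplied field so archived text cannot inject markup.
    fields = {
        key: html.escape(str(record.get(key, "")))
        for key in ("title", "author", "subreddit")
    }
    fields["body"] = html.escape(str(record.get("selftext", "")))
    return PAGE.format(**fields)


# One line of newline-delimited JSON, shaped like a submission record.
line = (
    '{"title": "Hello <world>", "author": "alice", '
    '"subreddit": "selfhosted", "selftext": "First post & more"}'
)
page = render_post(json.loads(line))
```

Writing `page` to a file gives something a browser can open directly from a USB drive or local folder, which is the property the post is describing.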

API & AI Integration: Full REST API with 30+ endpoints – posts, comments, users, subreddits, full-text search, aggregations. Also ships with an MCP server (29 tools) so you can query your archive directly from AI tools.

Self-hosting options:

USB drive / local folder (just open the HTML files)
Home server on your LAN
Tor hidden service (2 commands, no port forwarding needed)
VPS with HTTPS
GitHub Pages for small archives

Why this matters: Once you have the data, you own it. No API keys, no rate limits, no ToS changes can take it away.

Scale: Tens of millions of posts per instance. PostgreSQL backend keeps memory constant regardless of dataset size. For the full 2.38B post dataset, run multiple instances by topic.
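One common way to keep memory flat while processing a dataset of any size is to stream rows in fixed-size batches instead of loading everything at once. The sketch below illustrates that general technique only; it is an assumption about how such a pipeline could work, not a description of redd-archiver's internals. `batched` and `fake_rows` are hypothetical helpers, and `fake_rows` stands in for a database cursor or a decompressed dump stream.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(rows: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield lists of at most `size` rows without materializing the whole input."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_rows(count: int) -> Iterator[dict]:
    """Stand-in for a database cursor or a decompressed dump stream."""
    for i in range(count):
        yield {"id": i}


# Only one batch is held in memory at a time, however large `count` is.
batch_sizes = [len(batch) for batch in batched(fake_rows(10), 4)]
```

With this pattern, peak memory depends on the batch size rather than on the number of posts.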

How I built it: Python, PostgreSQL, Jinja2 templates, Docker. Used Claude Code throughout as an experiment in AI-assisted development. Learned that the workflow is "trust but verify" – it accelerates the boring parts but you still own the architecture.

Live demo: https://online-archives.github.io/redd-archiver-example/
GitHub: https://github.com/19-84/redd-archiver (Public Domain)

Pushshift torrent: https://academictorrents.com/details/1614740ac8c94505e4ecb9d88be8bed7b6afddd4

[-] a1studmuffin@aussie.zone 9 points 1 month ago

This seems especially handy for anyone who wants a snapshot of Reddit from the pre-enshittification, pre-AI era, when content was more authentic and less driven by bots and commercial manipulation of opinion. Just choose the cutoff date you want and stick with that dataset.
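A rough sketch of what applying a cutoff date could look like on the raw records, assuming newline-delimited JSON with the standard `created_utc` epoch-seconds field. The `before_cutoff` helper and the sample records are made up for illustration and are not part of redd-archiver.

```python
import json
from datetime import datetime, timezone
from typing import Iterable, Iterator


def before_cutoff(lines: Iterable[str], cutoff: datetime) -> Iterator[dict]:
    """Yield records whose created_utc timestamp falls before the cutoff."""
    cutoff_ts = cutoff.timestamp()
    for line in lines:
        record = json.loads(line)
        # created_utc is epoch seconds; float() also tolerates a numeric string.
        if float(record["created_utc"]) < cutoff_ts:
            yield record


# Made-up sample records, one JSON object per line.
sample = [
    '{"id": "a", "created_utc": 1262304000}',    # 2010-01-01 UTC
    '{"id": "b", "created_utc": "1704067200"}',  # 2024-01-01 UTC
]
cutoff = datetime(2020, 1, 1, tzinfo=timezone.utc)
kept = [record["id"] for record in before_cutoff(sample, cutoff)]
```

Because the function is a generator, it can be pointed at a streamed dump file without loading the whole thing into memory.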

[-] breakingcups@lemmy.world 9 points 1 month ago

Just so you're aware, it is very noticeable that you also used AI to help write this post and its use of language can throw a lot of people off.

Not to detract from your project, which looks cool!

[-] 19_84@lemmy.dbzer0.com 12 points 1 month ago

Yes I used AI, English is not my first language. Thank you for the kind words!

[-] tanisnikana@lemmy.world 4 points 1 month ago

Reddit is hot stinky garbage but can be useful for stuff like technical support and home maintenance.

Voat and Ruqqus are straight-up misinformation and fascist propaganda, and if you excise them from your data set, your data will dramatically improve.

[-] 19_84@lemmy.dbzer0.com 2 points 1 month ago

the great part is that since everything is built, it is easy to support additional data! there is even an issue template for submitting a new data source! https://github.com/19-84/redd-archiver/blob/main/.github/ISSUE_TEMPLATE/submit-data-source.yml

[-] frongt@lemmy.zip 3 points 1 month ago

And only a 3.28 TB database? Oh, because it's compressed. Includes comments too, though.

[-] douglasg14b@lemmy.world 2 points 1 month ago* (last edited 1 month ago)

Yeah, it should inflate to 15TB or more I think
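For a sense of what that guess implies, here is the arithmetic as a tiny sketch. The 15 TB figure is the rough estimate from this comment, not a measured value, so the ratio is only as good as that guess.

```python
compressed_tb = 3.28  # torrent size quoted in the post
inflated_tb = 15.0    # rough guess from the comment above, not measured

# Compression ratio implied by the two figures.
ratio = inflated_tb / compressed_tb

# Extra disk needed if the compressed files are kept next to the inflated data.
total_tb = compressed_tb + inflated_tb
```

That works out to a ratio of roughly 4.6 to 1, and a bit over 18 TB of disk if both copies are kept.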

[-] muusemuuse@sh.itjust.works 2 points 1 month ago

If only I had the space and bandwidth. I would host a mirror via Lemmy and drag the traffic away.

Actually, isn’t there a way to decentralize this so that it can be accessed from regular browsers on the internet? Live content here, archive everywhere.

[-] 19_84@lemmy.dbzer0.com 2 points 1 month ago

Yes! Too many comments to count in a reasonable amount of time!

[-] usernameusername@sh.itjust.works 3 points 1 month ago

so kinda like kiwix but for reddit. That is so cool

[-] offspec@lemmy.world 3 points 1 month ago

It would be neat for someone to migrate this data set to a Lemmy instance

[-] TeddE@lemmy.world 3 points 1 month ago

It would be inviting a lawsuit for sure. I like the essence of the idea, but it's probably more trouble than it's worth for all but the most fanatic.

[-] floquant@lemmy.dbzer0.com 1 points 1 month ago* (last edited 1 month ago)

Is it though? That is (or was, and should be again) publicly accessible information that was created over the years by random internet users. I refuse the notion that an American company can "own it" just because they ran the servers. Sure they can hold copyright for their frontend and backend code, name and whatever. But posts and comments, no way.

Of course it would be dumb for someone under US jurisdiction but we'll see how much an international DMCA claim is worth considering the current relations anyway.

[-] JackbyDev@programming.dev 1 points 4 weeks ago

Lemmit already existed and was annoying as hell. It was the first account I remember blocking.

[-] BigDiction@lemmy.world 2 points 1 month ago

You should be very proud of this project!! Thank you for sharing.

[-] Butterphinger@lemmy.zip 2 points 4 weeks ago

grabs external

[-] 19_84@lemmy.dbzer0.com 2 points 1 month ago

PLEASE SHARE ON REDDIT!!! I have never had a reddit account and they will NOT let me post about this!!

[-] elbarto777@lemmy.world 2 points 1 month ago* (last edited 1 month ago)

Anyone doing this will be banned on that platform.

[-] MedicPigBabySaver@lemmy.world 1 points 1 month ago

Fuck Reddit and Fuck Spez.

[-] muusemuuse@sh.itjust.works 0 points 1 month ago

You know what would be a good way to do it? Take all that content and throw it on a federated service like ours. Publicly visible. No bullshit. And no reason to visit Reddit to get that content. Take their traffic away.

[-] elbarto777@lemmy.world 1 points 1 month ago

Where would it be hosted so that Conde Nast lawyers can't touch it?

[-] muusemuuse@sh.itjust.works 0 points 1 month ago

What would they say? It’s information that’s freely available: no payment required, no account needed to simply read it, no copyrights. Where’s the legal problem in hosting a duplicate of the content?

[-] limelight79@lemmy.world 0 points 1 month ago

It might fall under the same concept that recipes do - you can't copyright a recipe, but a collection of recipes (such as a book) is copyrightable.

In any case, they have a lot more money to pay lawyers than you or I do, I'll bet, so even if you are right, that doesn't mean you'll have the money to actually win.

[-] Tiger@sh.itjust.works 1 points 1 month ago

What is the timing of the dataset, up through which date in time?

[-] 19_84@lemmy.dbzer0.com 6 points 1 month ago

2005-06 to 2024-12

however, the data from 2025-12 has already been released; it just needs to be split and reprocessed for 2025 by watchful1. once that happens, you can host an archive up to the end of 2025. i will probably add support for importing data from the arctic shift dumps instead, so that archives can be updated monthly.

[-] douglasg14b@lemmy.world 1 points 1 month ago* (last edited 1 month ago)

It literally says so in the link. Go to the link and it's in the title.

[-] SteveCC@lemmy.world 1 points 1 month ago

Wow, great idea. So much useful information and discussion that users have contributed. Looking forward to checking this out.

[-] 19_84@lemmy.dbzer0.com 1 points 1 month ago

thank you!!! i built on great ideas from others! i can't take all the credit 😋

[-] avidamoeba@lemmy.ca 1 points 1 month ago

How does this compare to redarc? It seems to be similar.

[-] 19_84@lemmy.dbzer0.com 1 points 1 month ago

redarc uses reactjs to serve the web app; redd-archiver uses a hybrid architecture that combines static page generation with postgres search via flask. it is more like a hybrid static site generator with web app capabilities through docker and flask. the static pages with sorted indexes can be viewed offline and served on hosts like github and codeberg pages.

[-] Howlinghowler110th@kbin.earth 1 points 1 month ago

I think this is a good use case for AI, and I'm impressed with it. I wish the instructions on how to set it up were clearer, though.

[-] 19_84@lemmy.dbzer0.com 1 points 1 month ago

thank you! the instructions are a little overwhelming, check out the quickstart if you haven't yet! https://github.com/19-84/redd-archiver/blob/main/QUICKSTART.md

[-] vane@lemmy.world 1 points 1 month ago* (last edited 1 month ago)

How long does it take to download this 3TB torrent?

[-] 19_84@lemmy.dbzer0.com 1 points 1 month ago

week(s)
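A back-of-the-envelope sketch of where "week(s)" comes from, assuming a sustained link speed and the 3.28 TB figure from the post (decimal terabytes). `download_days` is a hypothetical helper, and real torrent throughput depends on seeders and usually falls short of the line rate, so treat these as lower bounds.

```python
def download_days(size_tb: float, mbit_per_s: float) -> float:
    """Days needed to move size_tb decimal terabytes at a sustained link speed."""
    bits = size_tb * 1e12 * 8
    seconds = bits / (mbit_per_s * 1e6)
    return seconds / 86_400


# Sustained speeds in Mbit/s mapped to days for the 3.28 TB torrent.
estimates = {speed: round(download_days(3.28, speed), 1) for speed in (20, 50, 100)}
```

At a sustained 20, 50, and 100 Mbit/s that comes to about 15.2, 6.1, and 3.0 days respectively.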

[-] UnderpantsWeevil@lemmy.world 0 points 1 month ago

I would sooner download a tire fire.

[-] 19_84@lemmy.dbzer0.com 3 points 1 month ago

thanks anyway for looking at my project 🙂

[-] Gerudo@lemmy.zip 2 points 1 month ago

Say what you will about Reddit, but there is tons of information on that platform that's not available anywhere else.

[-] UnderpantsWeevil@lemmy.world 0 points 1 month ago

:-/

You can definitely mine a bit of gold out of that pile of turds. But you could also go to the library and receive a much higher ratio of signal to noise.

[-] Clbull@lemmy.world 0 points 1 month ago

Eww, Voat and Ruqqus.

[-] 19_84@lemmy.dbzer0.com 0 points 1 month ago

i will always take more data sources, including lemmy!

[-] polarity_inverter@startrek.website 0 points 1 month ago

... for building your personal Grok?

[-] 19_84@lemmy.dbzer0.com 2 points 1 month ago

if you didn't notice, this project was released into the public domain

this post was submitted on 13 Jan 2026

59 points (100.0% liked)
