submitted 3 days ago* (last edited 3 days ago) by xXPoisonFoxXx@sh.itjust.works to c/datahoarder@lemmy.ml

I was able to get a list of the most recent anime from Aniwave by using this Reddit thread that Goofhey made: https://old.reddit.com/r/animepiracy/comments/1f2xbg7/archived_aniwaves_12000_anime_pages_on_wayback/ and scraping all 411 pages archived in the Wayback Machine. Back in March I built a web scraper using the Python libraries Requests and Beautiful Soup and got a list of all of Aniwave's anime at the time, sorted in alphabetical order. I compared that list to what Goofhey had most recently saved in the Wayback Machine and discovered that some anime were missing. I guess that's because the pages Goofhey saved were sorted by recently updated, and since that ordering is constantly changing, some anime were excluded. I think I got all or most of them by combining both lists. Then, using a Disqus scraper I made, I fed it links from the combined list and downloaded the comments. I tested the scraper on various sites (MyAsianTV, Gogoanime, Aniwave), and it can most likely work on most websites that use Disqus with a bit of tweaking.
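For anyone who wants to do the same kind of list comparison, here is a minimal sketch of the idea, not my exact script. The function name `merge_title_lists` and the whitespace/case normalization are assumptions made for illustration; real titles may need extra cleanup before they match.

```python
def merge_title_lists(alphabetical, wayback):
    """Combine two title lists and report titles absent from the Wayback list.

    Titles are matched case-insensitively after trimming whitespace
    (an assumption; the real lists may need more normalization).
    """
    def key(title):
        return title.strip().casefold()

    # Keep the first spelling seen for each normalized title.
    seen = {}
    for title in list(alphabetical) + list(wayback):
        seen.setdefault(key(title), title.strip())

    wayback_keys = {key(title) for title in wayback}
    missing = sorted(
        (t.strip() for t in alphabetical if key(t) not in wayback_keys),
        key=str.casefold,
    )
    combined = sorted(seen.values(), key=str.casefold)
    return combined, missing


if __name__ == "__main__":
    # Made-up sample data, not taken from the real lists.
    combined, missing = merge_title_lists(
        ["Bleach", "Naruto", "One Piece"],
        ["naruto ", "Attack on Titan"],
    )
    print(combined)
    print(missing)
```

The `missing` list is what you would go back and fetch separately, and `combined` is what gets fed to the comment scraper.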

I also managed to get all of Gogoanime's old comments from before 2021, going all the way back to 2014/2015. Something interesting I found is that a few copycat websites (6anime, gogoanimes) still have all of Gogoanime's old comments from before 2021. I have a few questions about this and would appreciate it if anyone could answer them.

  1. What happened to the old Gogoanime comments, and why couldn't the Gogoanime admins get them back if a copycat site was able to?
  2. New Disqus threads for new anime are still being made with the same Disqus link structure as the old comment threads. How are these new threads being made?

The Aniwave(9anime) comments currently have a few problems that I will fix later:

Some glitched/merged comment threads are currently missing

Imgur images didn't download properly

Some images were downloaded twice (I changed how images were named while the scraper was downloading and then ran it again)

Most-commented page on each site, sorted from most comments (Aniwave) to fewest (Anitaku):

Aniwave(9anime): Attack on Titan The Final Season Part 3 Episode 1

Gogoanime Old comments: Yuri on Ice Category page

Anitaku(Gogoanime): Kimetsu no Yaiba Yuukaku Hen Episode 10

Folders were compressed into tarballs with zstd level 9 compression:

Aniwave(9anime): 69.2 GiB uncompressed, 17.4 GiB compressed

Gogoanime (before 2021): 84.8 GiB uncompressed, 48.2 GiB compressed

Anitaku(Gogoanime): 16.6 GiB uncompressed, 1 GiB compressed
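To put those sizes in perspective, here is a quick sketch that works out the compression ratio and space saved for each archive. The numbers are the ones listed above; the helper name `compression_summary` is made up for this example, and since the 1 GiB figure is clearly rounded, the Anitaku result is only approximate.

```python
# Sizes in GiB as (uncompressed, compressed), copied from the list above.
sizes_gib = {
    "Aniwave(9anime)": (69.2, 17.4),
    "Gogoanime": (84.8, 48.2),
    "Anitaku(Gogoanime)": (16.6, 1.0),
}


def compression_summary(uncompressed, compressed):
    """Return (ratio, percent_saved), each rounded to one decimal place."""
    ratio = uncompressed / compressed
    saved_pct = (1 - compressed / uncompressed) * 100
    return round(ratio, 1), round(saved_pct, 1)


if __name__ == "__main__":
    for site, (raw, packed) in sizes_gib.items():
        ratio, saved = compression_summary(raw, packed)
        print(f"{site}: {ratio}x smaller, {saved}% saved")
```

The old Gogoanime archive shrinks far less than the other two, which would be consistent with it holding a larger share of already-compressed image files, though I haven't verified that.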

Inside each of the anime folders, you will find three types of files, ending in 'part X.json', 'full.json', and 'simple.json':

Part files - downloaded from Disqus unmodified; each contains a maximum of 100 comments

Full - all part files concatenated

Simple - the Full file with info stripped out to make it easier for humans to read
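If you want to rebuild the Full and Simple files yourself from the part files, something like the sketch below should be close. It is not the exact code I used, and it rests on a few assumptions: that each part file is a Disqus API response whose comments sit in a top-level "response" list, that file names end in "part X.json", and that the fields kept in the simple version (`id`, `parent`, `author.name`, `createdAt`, `raw_message`) are the ones you care about. Adjust the field names if your files differ.

```python
import json
import re
from pathlib import Path


def part_number(path):
    """Extract X from a name ending in 'part X.json' (0 if it does not match)."""
    match = re.search(r"part (\d+)\.json$", path.name)
    return int(match.group(1)) if match else 0


def build_full(folder):
    """Concatenate the comments of every part file, in numeric part order."""
    comments = []
    for part in sorted(Path(folder).glob("*part *.json"), key=part_number):
        data = json.loads(part.read_text(encoding="utf-8"))
        # Assumption: comments live in a top-level "response" list.
        comments.extend(data.get("response", []))
    return comments


def simplify(comments):
    """Keep only a few human-readable fields from each comment."""
    return [
        {
            "id": c.get("id"),
            "parent": c.get("parent"),
            "author": (c.get("author") or {}).get("name"),
            "date": c.get("createdAt"),
            "message": c.get("raw_message"),
        }
        for c in comments
    ]
```

Sorting by the extracted part number rather than by file name matters once a thread has ten or more parts, because a plain string sort would put "part 10" before "part 2".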

DOWNLOADS:

Aniwave(9anime) Comments: https://mega.nz/file/RfgliKJR#kV9MXkEYC-5tqS9A4ZenOMoQKKxpj_ujNadzKeu--qs

Anitaku(Gogoanime) March 2024: https://mega.nz/file/FDBngTQB#p3GMrhPpBY893GLBUJfBePwDOYsKFWmpRyarFlGWCZs

Gogoanime Comments Before 2021: Unfortunately, the compressed file size for Gogoanime is 48.2 GiB, and I don't know how to share it since I ran out of free storage space. I will make another post when I figure out how to set up a torrent, and I will also add the link here.

this post was submitted on 22 Nov 2024