Update: Downloading all archive.org metadata (lemmy.dbzer0.com)

submitted 1 year ago by BermudaHighball@lemmy.dbzer0.com to c/datahoarder@lemmy.ml

I used the API at https://archive.org/developers/changes.html to enumerate all the item names in the archive. Currently there are over 256 million item names. However, when I went through a sample of them, I noted the following:

Many items do not have the .torrent available because some of their files are locked due to copyright concerns, as in the music collection. Ex: https://archive.org/details/lp_le-sonate-per-pianoforte-vol-1_carl-maria-von-weber-dino-ciani_0
A lot of items have been removed from public access completely, and have possibly been deleted even on the storage backend. Ex: https://archive.org/details/0-5-1-0-hernan-hernandez
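
For anyone who wants to repeat the enumerate-then-sample step described above, here is a rough sketch of the loop in Python. It is not the exact script used for this post. `fetch_page` is a hypothetical callable you would write yourself to call the changes API, and the `changes`, `identifier` and `next_token` field names are my assumption about the response shape, so check them against the documentation linked above before relying on them.

```python
import random
from typing import Callable, Iterator, List, Optional


def iter_identifiers(fetch_page: Callable[[Optional[str]], dict]) -> Iterator[str]:
    """Yield item identifiers page by page until no continuation token is returned.

    fetch_page(token) is a hypothetical helper: it takes the previous page's
    token (None for the first page) and returns the decoded JSON response.
    The field names below are assumptions, not confirmed API details.
    """
    token = None
    while True:
        page = fetch_page(token)
        for change in page.get("changes", []):
            yield change["identifier"]
        token = page.get("next_token")
        if not token:
            break


def pick_sample(identifiers: List[str], k: int, seed: int = 0) -> List[str]:
    """Pick k identifiers at random; the fixed seed makes the sample repeatable."""
    rng = random.Random(seed)
    return rng.sample(identifiers, min(k, len(identifiers)))


if __name__ == "__main__":
    # Two fake pages standing in for real API responses.
    fake_pages = {
        None: {"changes": [{"identifier": "item-a"}, {"identifier": "item-b"}],
               "next_token": "t1"},
        "t1": {"changes": [{"identifier": "item-c"}], "next_token": None},
    }
    names = list(iter_identifiers(lambda token: fake_pages[token]))
    print(names)
    print(pick_sample(names, 2))
```

Keeping the network call behind `fetch_page` means the loop can be tried against fake pages first, which is worth doing before starting a run over hundreds of millions of names.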

A great many items have been removed from the archive, far more than I expected. If you have critical data, Internet Archive should of course never be your only backup.
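
To put a number on how much has been removed, one approach is to request the metadata and .torrent for each sampled item and count what comes back. The sketch below only builds the URLs and tallies the results; it does no network I/O. The URL patterns are the ones I understand archive.org to use (`/metadata/<identifier>` and `/download/<identifier>/<identifier>_archive.torrent`), so treat them as assumptions and verify them on a known item. The function names are made up for illustration.

```python
from typing import Iterable
from urllib.parse import quote


def metadata_url(identifier: str) -> str:
    """Assumed URL of an item's JSON metadata."""
    return f"https://archive.org/metadata/{quote(identifier)}"


def torrent_url(identifier: str) -> str:
    """Assumed URL of an item's .torrent file."""
    name = quote(identifier)
    return f"https://archive.org/download/{name}/{name}_archive.torrent"


def surviving_fraction(still_available: Iterable[bool]) -> float:
    """Fraction of sampled items still available (True means available)."""
    flags = list(still_available)
    if not flags:
        raise ValueError("empty sample")
    return sum(flags) / len(flags)


if __name__ == "__main__":
    item = "0-5-1-0-hernan-hernandez"
    print(metadata_url(item))
    print(torrent_url(item))
    # Made-up outcome for a sample of four items, two of them gone.
    print(surviving_fraction([True, False, True, False]))
```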

I don't know the distribution of metadata and .torrent file sizes, since I have not tried downloading them yet. It could require a lot of storage if items have many files or the content is huge: if only 50% of the items remain and the average .torrent plus metadata is 20 KB, it would be over 2.5 TB to store. On the other hand, the archive has a lot of random one-off uploads that are not very big, so in those cases the metadata is about 800 bytes and the torrent about 3 KB. If the combined average is 5 KB, the same 50% of items would need only 640 GB.
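
The arithmetic behind those two figures, as a small Python sketch. It uses decimal units (1 KB = 1000 bytes) and the numbers from this post: 256 million item names, 50% surviving, and 20 KB or 5 KB per item. The function name is just for illustration, and the per-item sizes are guesses rather than measurements.

```python
def estimate_storage_bytes(total_items: int,
                           surviving_fraction: float,
                           avg_bytes_per_item: int) -> float:
    """Storage needed if a fraction of items survive at a given average size."""
    return total_items * surviving_fraction * avg_bytes_per_item


TOTAL_ITEMS = 256_000_000  # "over 256 million item names"

# Pessimistic case: 20 KB of .torrent plus metadata per item.
high = estimate_storage_bytes(TOTAL_ITEMS, 0.5, 20_000)
# Optimistic case: 5 KB per item.
low = estimate_storage_bytes(TOTAL_ITEMS, 0.5, 5_000)

print(f"high estimate: {high / 1e12:.2f} TB")
print(f"low estimate:  {low / 1e9:.0f} GB")
```

Once a real sample has been downloaded, the surviving fraction and the average size can be swapped for measured values.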

this post was submitted on 17 Nov 2024