
The idea of making the Internet Archive decentralized is a very good one. It would be a real tragedy if the IA burned down like the Library of Alexandria did. I know Archive Team/Jason Scott attempted this a few years back with the IA.bak project but it seems like that died, so I hope this time turns out better.

That said, I don't actually understand how cryptocurrency or blockchains help with distributed file storage. Are these "filecoins" actually useful for anything, like incentivizing people to make copies of files? I'm skeptical by default when I hear about cryptocurrencies, because the majority of ICOs seem to be get-rich-quick schemes riding the Bitcoin buzz wave that then fizzle out over time.



From what I understand, Filecoin contracts are fixed-term. This means you need to read data back from the Filecoin network at the end of the term and negotiate another fixed-term contract to store those files again. Though Filecoin storage fees are low, isn't this back-and-forth a big pain for real archivists? I suppose you only need 1/n of the storage on your side if you stagger the data "withdrawal" and "deposit" actions. Does anyone have insight into this?
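To make the staggering concrete, here's a back-of-the-envelope sketch (all numbers hypothetical) of how splitting the archive into n renewal batches caps the local buffer at 1/n of the total:

```python
# Hypothetical numbers: a 100 TB archive split into 10 staggered
# renewal batches under an 18-month (1.5 year) maximum deal term.
TOTAL_TB = 100
N_BATCHES = 10
TERM_MONTHS = 18

# Only one batch is ever "in flight" (withdrawn, re-negotiated,
# re-deposited) at a time, so the local buffer is 1/n of the archive.
batch_tb = TOTAL_TB / N_BATCHES
renewal_interval_months = TERM_MONTHS / N_BATCHES

print(f"Local buffer: {batch_tb} TB; renew one batch every "
      f"{renewal_interval_months} months")
```

So for these made-up numbers you'd hold 10 TB locally and cycle one batch every ~1.8 months, rather than re-downloading the whole archive at once.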


For long-term archiving, it's essential practice to move data from machine to machine over time anyway, as machines get old and obsolete and can't be repaired after a certain age.

Right now there’s a maximum deal term of 1.5 years. Potentially that could be extended as the system matures and miners become capable of taking on longer commitments.


> Are these "filecoins" actually useful for anything, like incentivizing people to make copies of files?

Yes, that's exactly what they are. They incentivize making copies, then periodically proving you still have the copies.
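As a toy illustration of the "periodically proving" part (this is just the general challenge-response idea, not Filecoin's actual Proof-of-Spacetime construction): a verifier asks for a hash of a randomly chosen block mixed with a fresh nonce, which the prover can only answer by actually holding that block:

```python
import hashlib

def prove(data_blocks: list, index: int, nonce: bytes) -> str:
    # The prover hashes the challenged block together with the nonce;
    # precomputing answers is useless because each nonce is fresh.
    return hashlib.sha256(nonce + data_blocks[index]).hexdigest()

def verify(expected_blocks: list, index: int, nonce: bytes,
           proof: str) -> bool:
    # The verifier recomputes the same hash from its own reference copy.
    return proof == hashlib.sha256(nonce + expected_blocks[index]).hexdigest()
```

In a real system the verifier keeps only small per-block commitments rather than a full copy, but the incentive structure is the same: you can't answer fresh challenges without the data.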


People have to keep renewing the contract to pay for those copies though. Afaik there's nothing that guarantees people will maintain them indefinitely after the contract expires.


Yeah, that's right. I'm not sure how you could possibly value or incentivize indefinite storage. All storage media are consumable and require upkeep, so a "one time" pricing model doesn't really make sense.


Arweave (https://www.arweave.org/) is attempting it. Of course, we'll have to wait and see if they're successful.


I would assume that eventually the cost of "old storage" will be too much and new payments won't be able to subsidize it anymore (sort of like a failing Ponzi scheme), but to be fair I hadn't heard of this before. Thanks for the link!


You're welcome! I have some questions about how the network is going to play out over time wrt mining centralization and access incentivization, but their overall funding model seems sound to me. It's based on an endowment that pays out over time to fund storage mining. Their yellow paper (https://www.arweave.org/yellow-paper.pdf) is worth a read if you want to learn about all the details.


And it's still not clear to me how you avoid the situation where a datacenter with 3 network connections masquerades as three different filecoin storage providers with a very similar set of files.


I got curious and looked it up. I found this[0]:

> As the storage miner receives each piece of client data, they place it into a sector. Sectors are the fundamental units of storage in Filecoin, and can contain pieces from multiple deals and clients.

> Next, a process called sealing takes place. During sealing, the sector data is encoded through a sequence of graph and hashing processes to create a unique replica. The encoding process is designed to be slow and computationally heavy, making it difficult to spoof.

[0] https://blog.coinlist.co/deep-dive-into-filecoin/
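The "slow, hard-to-spoof encoding" idea can be caricatured in a few lines (a deliberately simplified sketch, nothing like Filecoin's actual PoRep construction): derive a replica-specific key through a long sequential hash chain, then use it to encode the sector, so regenerating a replica on demand takes real time:

```python
import hashlib

def seal(sector_data: bytes, replica_id: bytes, rounds: int = 100_000) -> bytes:
    # Sequential hash chain: each step depends on the previous one,
    # so the work can't be parallelized away (the "slow" part).
    key = replica_id
    for _ in range(rounds):
        key = hashlib.sha256(key).digest()
    # XOR the data with a toy keystream (repeats every 32 bytes) derived
    # from the slow key; applying seal() again with the same replica_id
    # inverts the encoding.
    block = hashlib.sha256(key).digest()
    stream = block * (len(sector_data) // len(block) + 1)
    return bytes(a ^ b for a, b in zip(sector_data, stream))
```

Because each replica_id yields a different encoding, two "copies" sealed under different IDs are provably distinct blobs, which is what makes a claim of storing multiple copies checkable at all.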


Oh, so it's secretly proof of work.


I think it's worse than that; it looks like Filecoin's PoRep proves that multiple copies of the data are stored on the same server. I don't understand why this is useful.


There's no incentive to store multiple copies of the same data vs. a single copy of each of several datasets. It's a possible attack vector: for example, you could store all of the copies of some data and then lose it. But there's no economic incentive to do so, from what I understand.


An IPFS dev commented on IA.bak back then, saying he would write a proposal to store IA.bak on IPFS: https://news.ycombinator.com/item?id=9148576

AFAICT, there was no follow-up at the time, according to https://wiki.archiveteam.org/index.php/INTERNETARCHIVE.BAK/i... ; so IA.bak went with git-annex, but that had issues given the scale of the project.


It's stunning to me that IA.BAK is sort of dead and that there are no immediate efforts to revive it - losing IA in an earthquake would be unimaginable.

I'd love to donate some of my storage, if there was a project to support it.


I agree, a better route would probably be exposing the archive via IPFS. I wonder if they could split the archive like this:

- Sites are archived to IPFS.

- The URL -> Archive mapping is published in a merkle dag.

This way people could help mirror the archive (or subsets of it) by replicating the IPFS data. You could also separately mirror the merkle dag, which would be relatively small. If sites are fairly predictable, you could even crawl them yourself and verify that the result matches the data archive.org reported (many sites change on every request, so the hash won't match exactly, but you could at least do some analysis or manual inspection of the diffs to check that archive.org appears to be reporting correct data).
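A minimal sketch of that two-part split (the `cid` helper here is just SHA-256 hex as a stand-in for real IPFS CIDs, and all the records are hypothetical):

```python
import hashlib
import json

def cid(data: bytes) -> str:
    # Stand-in for a real IPFS CID: a content hash of the bytes.
    return hashlib.sha256(data).hexdigest()

# 1. Each archived snapshot is stored as a content-addressed object.
snapshots = {
    "https://example.com/":      cid(b"<html>snapshot of front page</html>"),
    "https://example.com/about": cid(b"<html>snapshot of about page</html>"),
}

# 2. The URL -> archive mapping is itself published as a content-addressed
#    node, so mirrors can fetch and verify the (small) mapping separately
#    from the (large) snapshot data.
mapping_node = json.dumps(snapshots, sort_keys=True).encode()
mapping_cid = cid(mapping_node)
```

Anyone holding `mapping_cid` can then check both the mapping and any snapshot they choose to mirror against their hashes, without trusting the mirror they fetched from.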


They effectively are doing this, including with IPFS: https://dweb.archive.org/details/home

I'm not sure what the current state is, though. And IPFS is basically just a protocol for storage; it doesn't ensure or encourage storage in any way - that's the point of Filecoin.


It's also worth adding that the general idea of using distributed proof-of-work things in Internet protocols goes back quite a long way, to 1997 at least https://en.wikipedia.org/wiki/Hashcash : it's not just a post-Bitcoin/ICO-era notion.


Yeah, I sincerely wish hashcash had succeeded, it seems like a viable route for nigh-eliminating spam.

Bitcoin's achievement isn't a radically new core algorithm of any kind, it's using existing stuff in a clever way, to make it self-reinforcing even when thrown at humans. (I mean, yea, that's an algorithm, but you get my meaning)


You hit the nail on the head. IPFS URLs point to content-addressable data, and these URLs can be pinned by anyone running an IPFS node. However, what's missing is an easy way to "seed" lists of IPFS files with whatever storage you have. Ideally, there would be a way for me to contribute - say - 30 GB of space to a particular project, and the system would take care of pinning the most-needed files, up to that storage limit. This would be useful for any number of public archival projects.

I've seen efforts where people bring up a web page that tells you which torrents to seed, based on how many seeds are active. But this is manual work, and not too robust.
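The allocation policy being described could be as simple as rarest-first within a space budget. A sketch, where the file sizes and replica counts are made-up inputs that a coordinator would publish:

```python
def choose_pins(files, budget_bytes):
    """files: list of (cid, size_bytes, replica_count) tuples.

    Pin the least-replicated files that fit within the budget."""
    picked, used = [], 0
    # Sort by current replica count, ascending: rarest first.
    for file_cid, size, _count in sorted(files, key=lambda f: f[2]):
        if used + size <= budget_bytes:
            picked.append(file_cid)
            used += size
    return picked

# Made-up catalog: (cid, size in bytes, current replica count).
catalog = [("a", 10, 5), ("b", 15, 1), ("c", 20, 2), ("d", 30, 1)]
print(choose_pins(catalog, budget_bytes=50))  # pins the two 1-replica files
```

A real version would re-run this periodically against fresh replica counts and unpin anything that has since become well-replicated, which is exactly the robustness the manual "which torrents to seed" pages lack.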


I was thinking that most people would just pin the content that is most interesting to them. But it would also be good to pin the rarest content. There's no reason both approaches can't exist to capture multiple motivations.


Decentralizing it would be far simpler if they provided something like an application you could download to partially mirror some parts of the archive.

Eventually, given enough interest, they'd mirror everything. Maybe they could even build this on top of BitTorrent.


Or even straight on top of IPFS, since there are existing HTTP gateways for accessing IPFS content and most of the archive is static in nature. Provide a client that coordinates the numerous volunteers to ensure a wide (ideally full) backup of IA distributed over IPFS, paired with volunteer pinning of specific portions by interest groups and individuals. The Archive's self-hosted IPFS node(s) would become the permanent seeds for this system, and broader use of IPFS would ensure wider (though not guaranteed, sans pinning) availability.



