With the rapid growth of data - when you add up everything generated by all labs, whether EST banks or microarray data - access to this data has become a problem. A primary concern is that storing data in a single location is not a good idea. Worse still, most data (an estimated 90%) never even gets uploaded. Sometimes there are strategic reasons (intellectual property, patents), but in many cases a researcher simply has no convenient way of sharing his/her data.
A shared data resource prevents web services from sending large SOAP attachments around - which may block services. Sending a referral ID (like an LSID) that points to a shared web resource delays fetching the file until the last moment and prevents denial-of-service problems.
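This pass-by-reference pattern can be sketched in a few lines of Ruby. The function name, authority and namespace below are illustrative assumptions, not part of any existing API:

```ruby
# Sketch of passing data by reference: a service hands out an
# LSID-style referral URI instead of a large SOAP attachment.
# All names (authority, namespace, object id) are illustrative.
def referral_for(authority, namespace, object_id, revision = 1)
  "urn:lsid:#{authority}:#{namespace}:#{object_id}:#{revision}"
end

lsid = referral_for('biodata.example.org', 'est', 'AB012345')
# The client resolves this URI to a download location only at the
# last moment, when it actually needs the bytes.
```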
Moreover, shared web storage may provide resilience and reliability - provided care is taken to replicate data files across geographically spread servers.
Simple Use Cases
The Sharer wants to put data in the public domain without hassle. We can give him two options: host the data himself, or have it uploaded to some shared storage. Most important is that the procedure be hassle free. We may ask him for a login, a limited number of metadata fields (like copyright type) and a free-flow description. After that his data is registered and in the public domain (irreversibly so).
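A minimal registration record might look like the following Ruby sketch. The field names and default copyright value are assumptions, not a fixed schema:

```ruby
# Hypothetical registration record: a login, a few optional metadata
# fields (e.g. copyright type) and a free-flow description.
def register_share(login, description, copyright = 'public domain')
  {
    login:       login,
    description: description,
    copyright:   copyright,
    public:      true   # irreversible: once registered, always public
  }
end

record = register_share('jdoe', 'EST bank, maize')
```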
Registrars handle the registration and synchronise data with other registrars (or use a DNS-type resolver). We should take care not to end up with a single point of failure. Once a Sharer shares his data, a BitTorrent descriptor is made available where the original data is the first seed. A URI (an LSID?) pointing to the torrent is made available. The registrar contacts some 'storage managers' to see whether they want to copy the original.
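The registrar's side of this step could be sketched as follows. The descriptor layout and names are assumptions, and the MD5 digest merely stands in for a real torrent info hash; an actual implementation would build a proper .torrent file:

```ruby
require 'digest/md5'

# Sketch: once a Sharer registers data, the registrar publishes a
# descriptor with the sharer's host as the first seed, plus a URI
# under which the torrent can be resolved.
def publish_torrent(sharer_host, data)
  info = Digest::MD5.hexdigest(data)  # stand-in for the torrent info hash
  {
    seeds: [sharer_host],             # the original data is the first seed
    uri:   "urn:lsid:registry.example.org:torrent:#{info}"
  }
end

desc = publish_torrent('lab42.example.org', 'ACGTACGT')
```

The registrar would then offer `desc[:uri]` to the storage managers it knows about.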
The Storage Manager
Storage Managers make storage available for shared use. This is software anyone can install, allowing the owner to allocate, say, 50 GB of disk to the pool. The Storage Manager registers with the Registrar and allows mirroring of seeds. This gives us free storage space in a redundant fashion. The big players can even create storage that mirrors all available seeds on one subject.
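The Storage Manager's bookkeeping could be as simple as this sketch; the class and method names are made up for illustration:

```ruby
# Sketch: a Storage Manager pools a fixed amount of local disk and
# mirrors seeds on the Registrar's request, refusing when full.
class StorageManager
  def initialize(capacity_gb)
    @capacity = capacity_gb * 1024**3   # bytes
    @used     = 0
    @mirrors  = {}
  end

  # Mirror a seed if there is room in the pool; returns true on success.
  def mirror(uri, size_bytes)
    return false if @used + size_bytes > @capacity
    @used += size_bytes
    @mirrors[uri] = size_bytes
    true
  end
end

sm = StorageManager.new(50)   # owner allocates 50 GB to the pool
sm.mirror('urn:lsid:registry.example.org:torrent:abc', 2 * 1024**3)
```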
The end user or service
Uses a BitTorrent client to fetch data based on a referral URI/LSID.
A login is a good idea - as users are supposed to be qualified members of the scientific (biology) community. Registration for sharing may require some form of curation; as the community is probably small, this should not be a problem.
A user should not be able to remove data - as other people may use it and publish results. Versioned data may be stored as diffs. As the data is immutable, it can be stored with an MD5 checksum.
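Because the data is immutable, an MD5 digest registered at upload time can serve as a permanent integrity check. A minimal sketch using Ruby's standard Digest library (the data and function name are illustrative):

```ruby
require 'digest/md5'

# Register the MD5 of immutable data once; verify every mirrored
# copy against it on retrieval.
original       = 'ACGTACGTACGT'   # stand-in for a data file
registered_md5 = Digest::MD5.hexdigest(original)

def intact?(copy, md5)
  Digest::MD5.hexdigest(copy) == md5
end

intact?(original, registered_md5)   # true for an unmodified copy
```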
Every piece of data comes with a metadata description, which the user may change later. Almost all metadata is optional, as we don't want to bother users - it is up to them to provide detail.
In the case of illegal files, the Registrar should be able to mark data as 'tainted'; tainted data does get removed by all Storage Managers.
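Taint propagation could work as in this sketch; the class name and the polling approach are assumptions:

```ruby
# Sketch: the Registrar keeps a taint list; Storage Managers consult
# it and purge any local copy of tainted data.
class TaintList
  def initialize
    @tainted = {}
  end

  def taint!(uri)
    @tainted[uri] = true
  end

  def tainted?(uri)
    @tainted.fetch(uri, false)
  end
end

taints = TaintList.new
taints.taint!('urn:lsid:registry.example.org:torrent:badfile')
# Each Storage Manager periodically removes every mirror whose URI
# the Registrar reports as tainted.
```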
First versions should be simple to implement, so we do not implement security (all data is public). Later versions may add security through encryption or by marking specific storage for specific groups of users.
- Bioshare is a publicly available repository for hosting and publishing biodiversity datasets and images. Usage appears to be minimal.
- Ara Howard noticed that the functionality is also covered by the Amazon S3 service, which allows for web-based storage with very similar features, including security and BitTorrent support. A nice overview is given for the Ruby implementation (it reads easily for everyone). Now if only we had an open source implementation...