Changes between Version 1 and Version 2 of BioShare

2008/02/12 16:45:29 (12 years ago)



== Sharing Data ==

With the rapid growth of data generated by all labs combined, whether
it is EST banks or microarray data, access to this data has become a
problem. A primary concern is that storing data in a single location
is not a good idea. Worse, most data (90%) never even gets uploaded.
Sometimes there may be strategic reasons (intellectual property,
patents), but in many cases the researcher simply has no convenient
way of sharing his/her data.

A shared data resource saves web services from sending large SOAP
attachments around, which may block services. Sending a referral ID
(like an LSID) that points to a shared web resource would delay
fetching the file until the last moment and prevent denial-of-service
problems.

Moreover, shared web storage may provide resilience and reliability,
provided care is taken that data files are replicated across
different, geographically spread servers.
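
The referral idea above can be sketched in a few lines. This is only an illustration, not a design: the class name `DataReference`, the example LSID and the `fake_fetch` stand-in are all made up here, and a real resolver would go over the network to the shared storage.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class DataReference:
    """What a service passes around instead of the data itself."""
    uri: str                          # hypothetical referral ID (LSID style)
    fetcher: Callable[[str], bytes]   # resolves the URI to the actual bytes
    _cache: Optional[bytes] = field(default=None, repr=False)

    def data(self) -> bytes:
        # Fetch on first use only; later calls reuse the cached bytes.
        if self._cache is None:
            self._cache = self.fetcher(self.uri)
        return self._cache


calls = []  # records every fetch, so we can see when it happens


def fake_fetch(uri: str) -> bytes:
    # Stand-in for the shared web resource (assumption for this sketch).
    calls.append(uri)
    return b"ACGT" * 4


ref = DataReference("urn:lsid:example.org:bioshare:42", fake_fetch)
fetched_before_use = len(calls)   # nothing has been transferred yet
payload = ref.data()              # the file is fetched at the last moment
```

Passing `ref` between services moves a short identifier rather than a large attachment; only the service that finally calls `data()` pays for the transfer.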

== Simple Use Cases ==

=== The Sharer ===

The sharer wants to put data in the public domain without hassle. We
can give him two options: host the data himself, or have it uploaded
to some storage. Most important is that the procedure is hassle free.
We may ask him for a login, a limited number of metadata fields (like
copyright type) and a free-form description. After that his data is
registered and in the public domain (irreversibly so).

=== The Registrar ===

Registrars handle the registration and synchronise data with other
registrars (or use a DNS type resolver). We should take care not to
end up with a single point of failure. Once a Sharer shares his data,
a bittorrent descriptor is made available, with the original data as
the first seed. A URI (LSID?) pointing to the torrent is made
available. The registrar contacts some 'storage managers' to see
whether they want to copy the original.
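
As a rough sketch of the registration and synchronisation steps above, assuming an in-memory registrar: the names `Registrar`, `Descriptor`, `register` and `sync_from`, the URI layout and the host names are hypothetical. The descriptor here is a plain record, not a real bencoded .torrent file.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    """Simplified stand-in for the torrent descriptor (assumption)."""
    uri: str          # identifier handed out to end users
    md5: str          # checksum of the registered data
    size: int         # size in bytes
    first_seed: str   # host holding the sharer's original copy


class Registrar:
    def __init__(self, authority: str):
        self.authority = authority
        self.entries = {}   # uri -> Descriptor

    def register(self, name: str, data: bytes, seed_host: str) -> Descriptor:
        uri = f"urn:lsid:{self.authority}:bioshare:{name}"
        if uri in self.entries:
            # Registered data is never replaced or removed.
            raise ValueError(f"{uri} is already registered")
        descriptor = Descriptor(uri, hashlib.md5(data).hexdigest(),
                                len(data), seed_host)
        self.entries[uri] = descriptor
        return descriptor

    def sync_from(self, other: "Registrar") -> None:
        # Naive synchronisation: copy every entry we do not know yet.
        for uri, descriptor in other.entries.items():
            self.entries.setdefault(uri, descriptor)


primary = Registrar("example.org")
backup = Registrar("example.org")
desc = primary.register("est-bank-1", b"ACGT", "sharer.example.org")
backup.sync_from(primary)   # a second registrar avoids a single point of failure
```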

=== The Storage Manager ===

Storage managers make storage available for shared use. This will be
software that anyone can install, where the owner can allocate, say,
50 GB of HDD to the pool. The storage manager registers with the
Registrar and allows for mirroring of seeds. This will give us our
free storage space in a redundant fashion. The big players can even
create storage that mirrors all available seeds in one subject.
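
The quota idea above might look as follows. This is a minimal sketch under assumptions: `StorageManager`, `offer` and `replicate` are hypothetical names, and the policy (accept a seed whenever it still fits in the quota, stop after two copies) is just one simple choice.

```python
GB = 10**9


class StorageManager:
    def __init__(self, name: str, quota_bytes: int):
        self.name = name
        self.quota_bytes = quota_bytes   # space the owner gives to the pool
        self.mirrored = {}               # uri -> size in bytes

    def used(self) -> int:
        return sum(self.mirrored.values())

    def offer(self, uri: str, size: int) -> bool:
        """Accept a seed for mirroring if it still fits in the quota."""
        if uri in self.mirrored:
            return True
        if self.used() + size > self.quota_bytes:
            return False
        self.mirrored[uri] = size
        return True


def replicate(uri: str, size: int, managers, copies: int = 2):
    """Ask managers in turn until `copies` of them hold the seed."""
    accepted = []
    for manager in managers:
        if len(accepted) == copies:
            break
        if manager.offer(uri, size):
            accepted.append(manager.name)
    return accepted


pool = [
    StorageManager("lab-a", 50 * GB),
    StorageManager("lab-b", 10 * GB),       # too small for a 20 GB seed
    StorageManager("institute", 500 * GB),  # a 'big player'
]
holders = replicate("urn:lsid:example.org:bioshare:42", 20 * GB, pool)
```

Here `holders` ends up as `["lab-a", "institute"]`: the small manager declines, and the registrar moves on to the next one.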

=== The end user or service ===

The end user or service uses a bittorrent client to fetch data based
on a referral URI/LSID.

== Notes ==

A login is a good idea, as users are supposed to be qualified members
of the scientific (biology) community. Registration for sharing may
require some form of curation; as the community is probably small,
this should not be a problem.

A user should not be able to remove data, as other people may use it
and publish results. Versioned data may be stored as diffs. As the
data is immutable, it can be stored with an MD5 checksum.
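
Computing and checking such a checksum is straightforward with Python's standard `hashlib`. The helper names `md5_of_stream` and `verify` are made up for this sketch; reading in chunks keeps memory use flat for large data files.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify(stream, expected_md5: str) -> bool:
    """Check a fetched copy against the checksum stored at registration."""
    return md5_of_stream(stream) == expected_md5


original = b"ACGT" * 1000
stored_md5 = md5_of_stream(io.BytesIO(original))        # done once, at registration
good_copy = verify(io.BytesIO(original), stored_md5)    # an intact mirror
bad_copy = verify(io.BytesIO(original[:-1]), stored_md5)  # a truncated mirror
```

Note that MD5 catches accidental corruption of a mirrored copy, but it is not collision resistant, so it does not protect against deliberate tampering.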

Every piece of data comes with a metadata description. This may be
changed later by the user. Almost all metadata is optional, as we
don't want to bother users; it is up to them to provide detail.

In the case of illegal files the Registrar should be able to mark
data as 'tainted', after which it gets removed by all storage
managers.
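
One way the 'tainted' mark could propagate, sketched under assumptions: the names `TaintRegistry`, `Mirror`, `attach` and `mark_tainted` are hypothetical, and the registrar is assumed to notify every attached storage manager directly (a real system might have managers poll for a taint list instead).

```python
class Mirror:
    """Stand-in for a storage manager holding mirrored files."""

    def __init__(self, name: str):
        self.name = name
        self.files = {}   # uri -> data

    def store(self, uri: str, data: bytes) -> None:
        self.files[uri] = data

    def drop(self, uri: str) -> None:
        # Removing a file we never mirrored is not an error.
        self.files.pop(uri, None)


class TaintRegistry:
    """Registrar-side bookkeeping of tainted identifiers."""

    def __init__(self):
        self.tainted = set()
        self.mirrors = []

    def attach(self, mirror: Mirror) -> None:
        self.mirrors.append(mirror)

    def mark_tainted(self, uri: str) -> None:
        self.tainted.add(uri)
        for mirror in self.mirrors:
            mirror.drop(uri)

    def may_store(self, uri: str) -> bool:
        # Mirrors should refuse to pick up tainted data again.
        return uri not in self.tainted


registry = TaintRegistry()
a, b = Mirror("lab-a"), Mirror("institute")
for mirror in (a, b):
    registry.attach(mirror)
    mirror.store("urn:lsid:example.org:bioshare:42", b"ok")
    mirror.store("urn:lsid:example.org:bioshare:99", b"illegal")

registry.mark_tainted("urn:lsid:example.org:bioshare:99")
```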

First versions should be simple to implement, so we do not implement
security (all data is public). Later versions may allow for security
through encryption, or by marking specific storage for specific
groups of users.

 * Bioshare is a publicly available repository for hosting and publishing biodiversity datasets and images.

Pjotr Prins