| 1 | |
| 2 | == Sharing Data == |
| 3 | |
| 4 | With the rapid growth of data (add up everything generated by all |
| 5 | labs, whether EST banks or microarray data), access to this data has |
| 6 | become a problem. That is a primary concern: storing data in a single |
| 7 | location is not a good idea. Worse, most data (90%) never gets |
| 8 | uploaded at all. Sometimes there may be strategic reasons |
| 9 | (intellectual property, patents), but in many cases the researcher |
| 10 | simply has no convenient way of sharing his/her data. |
| 11 | |
| 12 | A shared data resource saves web services from sending large SOAP |
| 13 | attachments around, which may block services. Sending a referral ID |
| 14 | (like an LSID) that points to a shared web resource delays fetching |
| 15 | the file until the last moment and prevents denial-of-service problems. |
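
The "pass a reference, fetch late" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: `DataRef`, its `fetcher` callback, and the example identifier are hypothetical names, and the fetch is a stand-in function rather than a real network call.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class DataRef:
    """A small reference that travels between services instead of the data."""
    ref_id: str                      # LSID-style identifier (made-up example)
    fetcher: Callable[[str], bytes]  # how to resolve the ID (assumed interface)
    _cache: Optional[bytes] = field(default=None, repr=False)

    def data(self) -> bytes:
        # Fetch only on first use, so intermediate services never move the payload.
        if self._cache is None:
            self._cache = self.fetcher(self.ref_id)
        return self._cache


# Stand-in for the shared web resource; records each fetch for inspection.
fetch_log = []


def fake_fetch(ref_id: str) -> bytes:
    fetch_log.append(ref_id)
    return b"ACGT" * 4


ref = DataRef("urn:lsid:example.org:est:12345", fake_fetch)
```

Creating and forwarding `ref` costs nothing; the payload is only pulled when some service finally calls `ref.data()`, and repeated calls reuse the cached bytes.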
| 16 | |
| 17 | Moreover, shared web storage may provide resilience and reliability, |
| 18 | provided care is taken that data files are replicated across |
| 19 | geographically spread servers. |
| 20 | |
| 21 | == Simple Use Cases == |
| 22 | |
| 23 | === The Sharer === |
| 24 | |
| 25 | The sharer wants to put data in the public domain without hassle. We |
| 26 | can give him two options: have him host the data, or have it uploaded |
| 27 | to some storage. Most important is that the procedure has to be hassle |
| 28 | free. We may ask him for a login, a limited number of metadata fields |
| 29 | (like copyright type) and a free-form description. After that his data |
| 30 | is registered and in the public domain (irreversibly so). |
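
What the Sharer has to fill in could be as small as the record below. This is a sketch, not a fixed schema: the class name, the field names, and the rule that only login and description are mandatory are all assumptions, and the sample values are made up.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class Registration:
    login: str                      # who shares the data
    description: str                # free-form text
    copyright_type: str = ""        # optional metadata field
    hosted_by_sharer: bool = True   # False: uploaded to shared storage instead


def to_record(reg: Registration) -> str:
    """Serialise a registration; login and description are assumed mandatory."""
    if not reg.login or not reg.description:
        raise ValueError("login and description are required")
    return json.dumps(asdict(reg), sort_keys=True)


record = to_record(Registration("jdoe", "EST reads, example data set"))
```

The record is frozen to mirror the "irreversibly so" rule: once registered, the entry itself is not edited in place.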
| 31 | |
| 32 | === The Registrar === |
| 33 | |
| 34 | Registrars handle the registration and synchronise data with other |
| 35 | registrars (or use a DNS-type resolver). We should take care not to |
| 36 | end up with a single point of failure. |
| 37 | |
| 38 | Once a Sharer shares his data, a BitTorrent descriptor is made |
| 39 | available with the original data as the first seed, along with a URI |
| 40 | (LSID?) that points to the torrent. The registrar then contacts some |
| 41 | 'storage managers' to see whether they want to copy the original. |
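
The registration flow described above might look roughly like this. It is an in-memory sketch under assumptions: the dictionary merely stands in for a real BitTorrent descriptor, the `urn:example:` URI scheme is a placeholder (not a real LSID), and storage managers are modelled as plain callables.

```python
import hashlib


class Registrar:
    def __init__(self, managers):
        self.managers = managers   # name -> callable deciding whether to mirror
        self.entries = {}          # URI -> descriptor

    def register(self, data: bytes, origin: str) -> str:
        digest = hashlib.md5(data).hexdigest()
        uri = f"urn:example:shared:{digest}"   # placeholder URI scheme
        # The original location is the first seed.
        descriptor = {"size": len(data), "md5": digest, "seeds": [origin]}
        self.entries[uri] = descriptor
        # Ask each storage manager whether it wants to copy the original.
        for name, wants_copy in self.managers.items():
            if wants_copy(descriptor):
                descriptor["seeds"].append(name)
        return uri


managers = {
    "mirror-a": lambda d: d["size"] <= 1024,  # accepts small files only
    "mirror-b": lambda d: False,              # declines everything
}
registrar = Registrar(managers)
uri = registrar.register(b"example payload", origin="sharer-host")
```

After the call, `registrar.entries[uri]["seeds"]` lists the sharer's host followed by every manager that agreed to mirror.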
| 42 | |
| 43 | === The Storage Manager === |
| 44 | |
| 45 | Storage managers make storage available for shared use. This will be |
| 46 | software that anyone can install, where the owner allocates, say, 50 GB |
| 47 | of HDD to the pool. The storage manager registers with the Registrar |
| 48 | and allows for mirroring of seeds. This will give us our free storage |
| 49 | space in a redundant fashion. The big players can even create storage |
| 50 | that mirrors all available seeds in one subject. |
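
The quota side of a storage manager is simple bookkeeping. A minimal sketch, assuming a hypothetical `StorageManager` class with an `offer` method (neither name comes from an existing tool):

```python
class StorageManager:
    def __init__(self, quota_bytes: int):
        self.quota_bytes = quota_bytes
        self.mirrored = {}   # seed id -> size in bytes

    def used(self) -> int:
        return sum(self.mirrored.values())

    def offer(self, seed_id: str, size: int) -> bool:
        """Accept a seed for mirroring only if it still fits in the quota."""
        if seed_id in self.mirrored:
            return True
        if self.used() + size > self.quota_bytes:
            return False
        self.mirrored[seed_id] = size
        return True


GB = 10**9
manager = StorageManager(quota_bytes=50 * GB)
accepted = manager.offer("seed-1", 30 * GB)   # fits in the 50 GB pool
rejected = manager.offer("seed-2", 30 * GB)   # would exceed the pool
```

A "big player" mirroring a whole subject would simply run the same logic with a much larger quota.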
| 51 | |
| 52 | === The End User or Service === |
| 53 | |
| 54 | The end user or service uses a BitTorrent client to fetch data based on a referral URI/LSID. |
| 55 | |
| 56 | == Notes == |
| 57 | |
| 58 | A login is a good idea, as users are supposed to be qualified members |
| 59 | of the scientific (biology) community. Registration for sharing may |
| 60 | require some form of curation; as the community is probably small, |
| 61 | this should not be a problem. |
| 62 | |
| 63 | A user should not be able to remove data, as other people may |
| 64 | use it and publish results. Versioned data may be stored as diffs. As |
| 65 | the data is immutable, it can be stored with an MD5 checksum. |
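
Both ideas, a checksum for immutable files and diffs between versions, can be tried with Python's standard library. This is a sketch only: `md5_of` is a hypothetical helper, the two sample versions are made up, and `difflib.ndiff` is used because it is easy to restore from, not because it is compact (its delta contains both versions in full).

```python
import difflib
import hashlib
import io


def md5_of(stream, chunk_size=1 << 20) -> str:
    """Hash a binary stream in chunks so large files need not fit in memory."""
    h = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()


# Two versions of a small tab-separated data file (made-up content).
v1 = ["gene_a\t1.0\n", "gene_b\t2.0\n"]
v2 = ["gene_a\t1.0\n", "gene_b\t2.5\n", "gene_c\t0.3\n"]

# A line-based delta from which either version can be rebuilt.
delta = list(difflib.ndiff(v1, v2))
restored_v1 = list(difflib.restore(delta, 1))
restored_v2 = list(difflib.restore(delta, 2))

checksum = md5_of(io.BytesIO(b"hello"), chunk_size=2)
```

Since the data never changes after registration, the checksum can be computed once and reused by every mirror to verify its copy.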
| 66 | |
| 67 | Every piece of data comes with a metadata description. This may be |
| 68 | changed later by the user. Almost all metadata is optional, as we |
| 69 | don't want to bother users; it is up to them to provide detail. |
| 70 | |
| 71 | In the case of illegal files, the Registrar should be able to mark |
| 72 | data as 'tainted'; tainted data is then removed by all storage managers. |
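
On the storage manager side, honouring the 'tainted' mark could be a periodic sync against the Registrar's list. A minimal sketch, assuming a hypothetical `Mirror` class and that the Registrar publishes the tainted seed IDs as a plain set:

```python
class Mirror:
    def __init__(self):
        self.files = {}   # seed id -> local path or payload

    def sync(self, tainted: set) -> list:
        """Drop every locally mirrored seed the Registrar has marked tainted."""
        removed = sorted(set(self.files) & tainted)
        for seed_id in removed:
            del self.files[seed_id]
        return removed


mirror = Mirror()
mirror.files = {"seed-1": b"ok", "seed-2": b"bad", "seed-3": b"ok"}
removed = mirror.sync(tainted={"seed-2", "seed-9"})
```

Seeds that are tainted but not held locally (here `seed-9`) are simply ignored.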
| 73 | |
| 74 | First versions should be simple to implement, so we do not implement |
| 75 | security (all data is public). Later versions may add security through |
| 76 | encryption or by marking specific storage for specific groups of users. |