| | 1 | |
| | 2 | == Sharing Data == |
| | 3 | |
| | 4 | With the rapid growth of data generated across all labs, whether EST |
| | 5 | banks or microarray data, access to this data has become a problem. |
| | 6 | A primary concern is that storing data in a single location is not a |
| | 7 | good idea. Worse, most data (90%) never gets uploaded at all. |
| | 8 | Sometimes there are strategic reasons for this (intellectual |
| | 9 | property, patents), but in many cases the researcher simply has no |
| | 10 | convenient way of sharing his/her data. |
| | 11 | |
| | 12 | A shared data resource saves web services from sending large SOAP |
| | 13 | attachments around, which may block services. Passing a referral ID |
| | 14 | (like an LSID) that points to a shared web resource delays fetching |
| | 15 | the file until the last moment and prevents denial-of-service problems. |
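The idea of passing a reference instead of the payload can be sketched as follows. This is only an illustration: the `DataRef` class, the `urn:example:` identifier and the in-memory `SHARED_STORE` are assumptions made for the example, not part of any LSID library.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DataRef:
    """A small handle passed between services instead of the file itself."""
    ref_id: str                        # hypothetical identifier, LSID-like
    resolver: Callable[[str], bytes]   # assumed lookup against the shared resource
    _cache: Optional[bytes] = None

    def fetch(self) -> bytes:
        # The payload is only retrieved when a consumer actually needs it.
        if self._cache is None:
            self._cache = self.resolver(self.ref_id)
        return self._cache


# In-memory stand-in for the shared web resource (illustration only).
SHARED_STORE = {"urn:example:dataset-1": b"ACGT" * 4}
calls = []


def resolve(ref_id: str) -> bytes:
    calls.append(ref_id)
    return SHARED_STORE[ref_id]


ref = DataRef("urn:example:dataset-1", resolve)
fetched_before_use = list(calls)   # still empty: nothing has been transferred
data = ref.fetch()                 # the transfer happens here, once
data_again = ref.fetch()           # served from the local cache
```

Intermediate services only ever handle the short identifier, so a slow or overloaded service never has to buffer the file.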
| | 16 | |
| | 17 | Moreover, shared web storage may provide resilience and reliability, |
| | 18 | provided care is taken that data files are replicated across |
| | 19 | geographically distributed servers. |
| | 20 | |
| | 21 | == Simple Use Cases == |
| | 22 | |
| | 23 | === The Sharer === |
| | 24 | |
| | 25 | The sharer wants to put data in the public domain without hassle. We |
| | 26 | can give him two options: host the data himself, or upload it to |
| | 27 | some storage. Most important is that the procedure is hassle-free. |
| | 28 | We may ask him for a login, a limited number of metadata fields |
| | 29 | (like copyright type) and a free-form description. After that his data |
| | 30 | is registered and in the public domain (irreversibly so). |
| | 31 | |
| | 32 | === The Registrar === |
| | 33 | |
| | 34 | Registrars handle the registration and synchronise data with other |
| | 35 | registrars (or use a DNS-type resolver). We should take care not to |
| | 36 | end up with a single point of failure. |
| | 37 | Once a Sharer shares his data, a BitTorrent descriptor is made |
| | 38 | available, with the original data as the first seed. A URI (LSID?) |
| | 39 | pointing to the torrent is made available. The registrar then |
| | 40 | contacts some 'storage managers' to see whether they want to copy |
| | 41 | the original. |
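The registration flow described above might look roughly like this. It is a sketch under stated assumptions: the descriptor is a plain dictionary rather than a real torrent file, the `urn:example:` URI scheme is invented for the example, and storage managers are modelled as simple callables.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Registration:
    uri: str          # identifier handed to end users (format is an assumption)
    descriptor: dict  # stand-in for a torrent descriptor, not a real .torrent
    mirrors: list = field(default_factory=list)


class Registrar:
    def __init__(self, storage_managers):
        # Each storage manager is modelled as a callable: (uri, size) -> bool
        self.storage_managers = storage_managers
        self.registry = {}

    def register(self, name: str, data: bytes, origin: str) -> Registration:
        digest = hashlib.md5(data).hexdigest()
        uri = f"urn:example:{digest}"  # hypothetical URI scheme
        descriptor = {
            "name": name,
            "length": len(data),
            "md5": digest,
            "seeds": [origin],  # the original data is the first seed
        }
        reg = Registration(uri, descriptor)
        # Ask storage managers whether they want to copy the original.
        for index, wants_copy in enumerate(self.storage_managers):
            if wants_copy(uri, len(data)):
                reg.mirrors.append(f"manager-{index}")
        self.registry[uri] = reg
        return reg


# One manager accepts small files, the other declines everything.
registrar = Registrar([lambda uri, size: size < 100, lambda uri, size: False])
reg = registrar.register("est-bank.fa", b"ACGTACGT", origin="sharer-host")
```

A real deployment would also need the synchronisation between registrars mentioned above; that part is left out here.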
| | 42 | |
| | 43 | === The Storage Manager === |
| | 44 | |
| | 45 | Storage managers make storage available for shared use. This will be |
| | 46 | software that anyone can install, where the owner allocates, say, 50 GB |
| | 47 | of disk to the pool. The storage manager registers with the Registrar |
| | 48 | and allows for mirroring of seeds. This gives us free storage |
| | 49 | space in a redundant fashion. The big players can even create storage |
| | 50 | that mirrors all available seeds in one subject area. |
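A storage manager with an owner-defined quota could be sketched like this. The class and method names are hypothetical, and the seeds are held in memory purely to keep the example short.

```python
class StorageManager:
    """Holds mirrored seeds up to a quota chosen by the owner."""

    def __init__(self, quota_bytes: int):
        self.quota_bytes = quota_bytes
        self.seeds = {}  # uri -> data

    @property
    def used_bytes(self) -> int:
        return sum(len(data) for data in self.seeds.values())

    def offer(self, uri: str, data: bytes) -> bool:
        """Accept a seed for mirroring if it fits in the remaining quota."""
        if uri in self.seeds:
            return True  # already mirrored
        if self.used_bytes + len(data) > self.quota_bytes:
            return False  # would exceed what the owner allocated
        self.seeds[uri] = data
        return True


manager = StorageManager(quota_bytes=10)
first = manager.offer("urn:example:a", b"12345678")   # fits: 8 of 10 bytes
second = manager.offer("urn:example:b", b"12345")     # declined: 13 > 10
```

Because each manager may decline, redundancy depends on the Registrar offering every seed to several managers.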
| | 51 | |
| | 52 | === The End User or Service === |
| | 53 | |
| | 54 | The end user or service uses a BitTorrent client to fetch data based on a referral URI/LSID. |
| | 55 | |
| | 56 | == Notes == |
| | 57 | |
| | 58 | A login is a good idea, as users are supposed to be qualified members |
| | 59 | of the scientific (biology) community. Registration for sharing may |
| | 60 | require some form of curation; as the community is probably small, |
| | 61 | this should not be a problem. |
| | 62 | |
| | 63 | A user should not be able to remove data, as other people may |
| | 64 | use it and publish results. Versioned data may be stored as diffs. As |
| | 65 | the data is immutable, it can be stored with an MD5 checksum. |
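Since the data never changes, a checksum recorded at registration time can be used to verify any later copy. A minimal sketch using Python's standard `hashlib` (the helper names `md5_of` and `verify` are made up for the example):

```python
import hashlib


def md5_of(data: bytes) -> str:
    """Return the hex MD5 digest recorded alongside the data."""
    return hashlib.md5(data).hexdigest()


def verify(data: bytes, expected_md5: str) -> bool:
    """Check a downloaded copy against the digest stored at registration."""
    return md5_of(data) == expected_md5


original = b"ACGTACGT"
checksum = md5_of(original)

intact = verify(original, checksum)            # True: copy matches
corrupted = verify(original + b"N", checksum)  # False: copy was altered
```

MD5 is adequate for spotting accidental corruption of a mirror; it is not collision-resistant, so it does not protect against deliberate tampering.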
| | 66 | |
| | 67 | Every piece of data comes with a metadata description. This may be |
| | 68 | changed later by the user. Almost all metadata is optional, as we |
| | 69 | don't want to bother users; it is up to them to provide detail. |
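A metadata record along these lines might be modelled as below. Apart from the copyright type and the description mentioned earlier, the field names (`organism`, `data_type`) are hypothetical, as is the choice to make the copyright type the only required field.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Metadata:
    copyright_type: str              # assumed to be the one required field
    description: str = ""            # free-form text
    organism: Optional[str] = None   # hypothetical optional field
    data_type: Optional[str] = None  # hypothetical optional field

    def update(self, **changes) -> None:
        """Metadata stays editable even though the data itself is immutable."""
        known = {f.name for f in fields(self)}
        for key, value in changes.items():
            if key not in known:
                raise KeyError(f"unknown metadata field: {key}")
            setattr(self, key, value)


meta = Metadata(copyright_type="public-domain")
meta.update(description="Microarray time series", data_type="microarray")
```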
| | 70 | |
| | 71 | In the case of illegal files, the Registrar should be able to |
| | 72 | mark data as 'tainted', after which it is removed by all storage managers. |
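One way the 'tainted' mechanism could work is for each storage manager to periodically compare its seeds against the Registrar's list. The sketch below assumes such a pull model; the class and method names are illustrative only.

```python
class Registrar:
    def __init__(self):
        self.tainted = set()

    def mark_tainted(self, uri: str) -> None:
        self.tainted.add(uri)


class StorageManager:
    def __init__(self):
        self.seeds = {}  # uri -> data

    def sync(self, registrar: Registrar) -> list:
        """Drop every mirrored seed the registrar has marked as tainted."""
        removed = sorted(uri for uri in self.seeds if uri in registrar.tainted)
        for uri in removed:
            del self.seeds[uri]
        return removed


registrar = Registrar()
manager = StorageManager()
manager.seeds = {"urn:example:ok": b"data", "urn:example:bad": b"illegal"}

registrar.mark_tainted("urn:example:bad")
removed = manager.sync(registrar)
```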
| | 73 | |
| | 74 | First versions should be simple to implement, so we do not implement security (all |
| | 75 | data is public). Later versions may allow for security through |
| | 76 | encryption or by marking specific storage for specific groups of users. |