Version 31 (modified by holland, 17 years ago) |
---|
The Problem
One of the goals of BioSQL was to provide an interchange platform for the Bio* objects. This has not yet succeeded due to differences in the way the Bio* projects interpret an individual sequence record and how they persist it to the database.
Common sequence semantics and object/ format handling would probably be of great benefit to many other WS providers and consumers. If Bio* can agree on semantics it would be a good reference for many other projects.
Possible Tasks
- Choose some 'reference' sequences to see how the Bio* projects 'round-trip' them.
- Where are the differences and why are there differences?
- Find out where each Bio* project persists it's data into BioSQL during ORM.
- Why are there differences?
- Establish guidelines for where things should go in BioSQL, eg given a Genbank file, what bits should go where.
- UML diagrams?
- Define an interchange format for the Bio* projects. Probably XML, probably borrow something already existing (XEMBL etc).
- Decide on a restricted vocab for annotations and feature types. Probably use SO.
- Define a middleware API for uniform I/O access to sequence database.
- Intially backed by BioSQL.
- Could be backed by any DB.
- Derby version of the BioSQL schema (Derby is the Java reference database).
- A BioSQL release.
Participants
- Mark Schreiber
- Jan Aerts
- Richard Holland
- Hilmar Lapp
- Heikki Lehvaslaiho
- Richard Bruskiewich
- Jan Byrne
- Raoul J.P. Bonnal
References
Formats
Random ideas from Jan Aerts
Mind that this is *very* incomplete. Just to help my really bad memory. As the issue is interoperability of the Bio* toolkits, we don't have to synchronize the toolkits at the object level, but rather at the interface level.
First thing to check: what types of objects do we want to synchronize? Of course sequence objects; but what else? The results of a BLAST parsing?
For sequences
Check if each toolkit reads and writes a GenBank/Fasta/... serialization in the same way. Input can either be an original GenBank/Fasta/... file or a dbfetch from any database.
- What should be conserved:
- Tags
- for a sequence: lower/uppercase
- Within a project it is desirable to mask an alphabet, for transfer between bio* projects this is not a good idea.
- What not necessarily should be conserved:
- for a sequence over multiple lines: length of each line
- Proposal for a common default value, 60bp ?
- for a sequence over multiple lines: length of each line
Task achieved
Tuesday
- Initial planning.
- Approved a BioSQL logo.
- Hilmar initiated BioSQL release discussion
- Selected sequence files to roundtrip
- BioRuby has no way of exporting Bio::Sequence objects to GenBank, EMBL, ...
- Began roundtrips
- Started UML diagram to describe object model with Richard Bruskiewich.
Wednesday
- Continue roundtrips
- BioRuby: started work on creating export filters to GenBank? and EMBL
- BioJava: Fixing issues with UniprotXML format and updating EMBLxml to new xsd.
- Adding BioSQL schema documentation to BioSQL wiki page.
Thursday
- Continue roundtrips
- Translation of BioSQL to Derby RDBMS.
- -- Would like to know why BioSQL can't add multi dbxrefs to one docref.
- -- Would like to work out how best to store EMBL assembly data in BioSQL.
Back to ListOfTopics
Attachments
-
dag1.fa
(5.6 KB) - added by jan.aerts
17 years ago.
DAG1 gene: FASTA formatted sequence
-
dag1.gb
(17.9 KB) - added by jan.aerts
17 years ago.
DAG1 gene: genbank formatted file exported from NCBI
-
dag1.insd
(38.6 KB) - added by jan.aerts
17 years ago.
DAG1 gene: INSD XML formatted file exported from NCBI
-
dag1.asn1
(58.0 KB) - added by jan.aerts
17 years ago.
DAG1 gene: ASN.1 formatted file exported from NCBI
-
dag1.gb_xml
(227.9 KB) - added by jan.aerts
17 years ago.
DAG1 gene: genbank XML formatted file exported from NCBI (probably?)
-
aj224122.asn1
(17.9 KB) - added by jan.aerts
17 years ago.
AJ224122 in ASN1 format as downloaded from NCBI
-
aj224122.fa
(3.9 KB) - added by jan.aerts
17 years ago.
AJ224122 in FASTA format as downloaded from NCBI
-
aj224122.gb
(8.9 KB) - added by jan.aerts
17 years ago.
AJ224122 in genbank format as downloaded from NCBI
-
aj224122.gb_xml
(68.3 KB) - added by jan.aerts
17 years ago.
AJ224122 in genbank XML format as downloaded from NCBI
-
aj224122.insd
(19.5 KB) - added by jan.aerts
17 years ago.
AJ224122 in INSD format as downloaded from NCBI
-
aj224122.swiss
(7.8 KB) - added by heikki
17 years ago.
AJ224122 translation (DOF37_ARATH/Q43385) in swiss-prot format as downloaded from UniProt?
-
dag1-biojava.fa
(5.7 KB) - added by markjschreiber
17 years ago.
biojava roundtrip of fasta
-
aj224122-biojava.fa
(3.9 KB) - added by markjschreiber
17 years ago.
biojava roundtrip of fasta
-
aj224122-biojava.gb
(9.1 KB) - added by markjschreiber
17 years ago.
biojava roundtrip of genbank
-
aj224122-biojava.insd
(22.6 KB) - added by markjschreiber
17 years ago.
biojava roundtrip of ISNDseq
-
aj224122.embl
(9.1 KB) - added by markjschreiber
17 years ago.
Correct version of EMBL file
-
aj224122.EMBL.xml
(14.9 KB) - added by markjschreiber
17 years ago.
EMBLxml format
-
aj224122.uniprot.xml
(17.2 KB) - added by markjschreiber
17 years ago.
uniprot xml
-
aj224122-biojava.insd.xml
(23.6 KB) - added by markjschreiber
17 years ago.
biojava round trip of INSDseq
-
Main.java
(4.7 KB) - added by markjschreiber
17 years ago.
Roundtrip program for biojava
-
derby-biosql.sql
(27.7 KB) - added by markjschreiber
17 years ago.
DERBY schema for BioSQL
-
aj224122-biojava.embl
(9.2 KB) - added by holland
17 years ago.
biojava embl round-trip
-
aj224122-biojava.swiss
(7.8 KB) - added by holland
17 years ago.
biojava uniprot round-trip
-
aj224122-biojava.uniprot.xml
(18.0 KB) - added by holland
17 years ago.
biojava uniprotxml round trip
-
aj224122-bioruby.embl
(8.9 KB) - added by jan.aerts
17 years ago.
bioruby roundtrip for EMBL format
-
roundtrip_bioruby.rb
(161 bytes) - added by jan.aerts
17 years ago.
Ruby script for roundtrip (i.c. EMBL)
-
bioruby_aav50056_fasta_annotated.png
(108.3 KB) - added by jan.aerts
17 years ago.
Annotated picture of a FASTA entry
-
bioruby_ab09071_embl_annotated.png
(92.0 KB) - added by jan.aerts
17 years ago.
Annotated picture of an EMBL entry
-
bioruby_ab09071_gb_annotated.png
(124.9 KB) - added by jan.aerts
17 years ago.
Annotated picture of a GenBank? entry
-
bioperl_convert.pl
(1.6 KB) - added by heikki
17 years ago.
Bioperl script fr roundtriping EMBL, GenBank? and Swiss-Prot files
-
aj224122-bioperl.fa
(3.9 KB) - added by heikki
17 years ago.
bioperl processed fasta file
-
aj224122-bioperl.gb
(8.9 KB) - added by heikki
17 years ago.
bioperl processed Genbank file
-
aj224122-bioperl.swiss
(7.8 KB) - added by heikki
17 years ago.
bioperl processed swissprot file
-
aj224122-bioperl.embl
(8.9 KB) - added by heikki
17 years ago.
bioperl processed EMBL file