Version 36 (modified by markjschreiber, 17 years ago) |
---|
The Problem
One of the goals of BioSQL was to provide an interchange platform for the Bio* objects. This has not yet succeeded due to differences in the way the Bio* projects interpret an individual sequence record and how they persist it to the database.
Common sequence semantics and object/ format handling would probably be of great benefit to many other WS providers and consumers. If Bio* can agree on semantics it would be a good reference for many other projects.
Possible Tasks
- Choose some 'reference' sequences to see how the Bio* projects 'round-trip' them.
- Where are the differences and why are there differences?
- Find out where each Bio* project persists it's data into BioSQL during ORM.
- Why are there differences?
- Establish guidelines for where things should go in BioSQL, eg given a Genbank file, what bits should go where.
- Define an interchange format for the Bio* projects. Probably XML, probably borrow something already existing (XEMBL etc).
- Decide on a restricted vocab for annotations and feature types. Probably use SO.
- Define a middleware API for uniform I/O access to sequence database.
- Intially backed by BioSQL.
- Could be backed by any DB.
- Derby version of the BioSQL schema (Derby is the Java reference database).
- A BioSQL release.
Participants
- Mark Schreiber
- Jan Aerts
- Richard Holland
- Hilmar Lapp
- Heikki Lehvaslaiho
- Richard Bruskiewich
- Jan Byrne
- Raoul J.P. Bonnal
References
Formats
Random ideas from Jan Aerts
Mind that this is *very* incomplete. Just to help my really bad memory. As the issue is interoperability of the Bio* toolkits, we don't have to synchronize the toolkits at the object level, but rather at the interface level.
First thing to check: what types of objects do we want to synchronize? Of course sequence objects; but what else? The results of a BLAST parsing?
For sequences
Check if each toolkit reads and writes a GenBank/Fasta/... serialization in the same way. Input can either be an original GenBank/Fasta/... file or a dbfetch from any database.
- What should be conserved:
- Tags
- for a sequence: lower/uppercase
- Within a project it is desirable to mask an alphabet, for transfer between bio* projects this is not a good idea.
- What not necessarily should be conserved:
- for a sequence over multiple lines: length of each line
- Proposal for a common default value, 60bp ?
- for a sequence over multiple lines: length of each line
Task achieved
Tuesday
- Initial planning.
- Approved a BioSQL logo.
- Hilmar initiated BioSQL release discussion
- Selected sequence files to roundtrip
- BioRuby has no way of exporting Bio::Sequence objects to GenBank, EMBL, ...
- Began roundtrips
- Started UML diagram to describe object model with Richard Bruskiewich.
Wednesday
- Continue roundtrips
- BioRuby: started work on creating export filters to GenBank? and EMBL
- BioJava: Fixing issues with UniprotXML format and updating EMBLxml to new xsd.
- Adding BioSQL schema documentation to BioSQL wiki page.
Thursday
- Continue roundtrips
- Translation of BioSQL to Derby RDBMS.
- -- Would like to know why BioSQL can't add multi dbxrefs to one docref.
- -- Would like to work out how best to store EMBL AS lines (EMBLxml 'assemblyElement') in BioSQL beyond simply storing as unparsed qualifier values. Hard also because in some records some columns are missing meaning that XML representation is not possible as XML does not allow for missing values (e.g. primary begin/end).
- -- Would like to work out how best to represent EMBL CO (EMBLxml 'contig') lines in BioSQL. These are extra hard as they are in place of actual sequence data - sequences that have CO lines have NO SQ lines - meaning that sequence length has to be computed as a function of the CO lines rather than being provided. CO lines look like GenBank? locations but have a simpler syntax plus one extra keyword for gaps, which is either numeric or a string 'unk100' indicating unknown gap size.
- Generation of EJB entity beans for BioSQL schema.
- EJB's must be manually generated for BIOENTRY_PATH (done), BIOENTRY_QUALIFIER_VALUE, SEQFEATURE_PATH and TAXON_NAME
- Discussion about consistent use of Unique Keys rather than Primary Keys for compound but mutable instances.
Back to ListOfTopics
Attachments
-
dag1.fa
(5.6 KB) - added by jan.aerts
17 years ago.
DAG1 gene: FASTA formatted sequence
-
dag1.gb
(17.9 KB) - added by jan.aerts
17 years ago.
DAG1 gene: genbank formatted file exported from NCBI
-
dag1.insd
(38.6 KB) - added by jan.aerts
17 years ago.
DAG1 gene: INSD XML formatted file exported from NCBI
-
dag1.asn1
(58.0 KB) - added by jan.aerts
17 years ago.
DAG1 gene: ASN.1 formatted file exported from NCBI
-
dag1.gb_xml
(227.9 KB) - added by jan.aerts
17 years ago.
DAG1 gene: genbank XML formatted file exported from NCBI (probably?)
-
aj224122.asn1
(17.9 KB) - added by jan.aerts
17 years ago.
AJ224122 in ASN1 format as downloaded from NCBI
-
aj224122.fa
(3.9 KB) - added by jan.aerts
17 years ago.
AJ224122 in FASTA format as downloaded from NCBI
-
aj224122.gb
(8.9 KB) - added by jan.aerts
17 years ago.
AJ224122 in genbank format as downloaded from NCBI
-
aj224122.gb_xml
(68.3 KB) - added by jan.aerts
17 years ago.
AJ224122 in genbank XML format as downloaded from NCBI
-
aj224122.insd
(19.5 KB) - added by jan.aerts
17 years ago.
AJ224122 in INSD format as downloaded from NCBI
-
aj224122.swiss
(7.8 KB) - added by heikki
17 years ago.
AJ224122 translation (DOF37_ARATH/Q43385) in swiss-prot format as downloaded from UniProt?
-
dag1-biojava.fa
(5.7 KB) - added by markjschreiber
17 years ago.
biojava roundtrip of fasta
-
aj224122-biojava.fa
(3.9 KB) - added by markjschreiber
17 years ago.
biojava roundtrip of fasta
-
aj224122-biojava.gb
(9.1 KB) - added by markjschreiber
17 years ago.
biojava roundtrip of genbank
-
aj224122-biojava.insd
(22.6 KB) - added by markjschreiber
17 years ago.
biojava roundtrip of ISNDseq
-
aj224122.embl
(9.1 KB) - added by markjschreiber
17 years ago.
Correct version of EMBL file
-
aj224122.EMBL.xml
(14.9 KB) - added by markjschreiber
17 years ago.
EMBLxml format
-
aj224122.uniprot.xml
(17.2 KB) - added by markjschreiber
17 years ago.
uniprot xml
-
aj224122-biojava.insd.xml
(23.6 KB) - added by markjschreiber
17 years ago.
biojava round trip of INSDseq
-
Main.java
(4.7 KB) - added by markjschreiber
17 years ago.
Roundtrip program for biojava
-
derby-biosql.sql
(27.7 KB) - added by markjschreiber
17 years ago.
DERBY schema for BioSQL
-
aj224122-biojava.embl
(9.2 KB) - added by holland
17 years ago.
biojava embl round-trip
-
aj224122-biojava.swiss
(7.8 KB) - added by holland
17 years ago.
biojava uniprot round-trip
-
aj224122-biojava.uniprot.xml
(18.0 KB) - added by holland
17 years ago.
biojava uniprotxml round trip
-
aj224122-bioruby.embl
(8.9 KB) - added by jan.aerts
17 years ago.
bioruby roundtrip for EMBL format
-
roundtrip_bioruby.rb
(161 bytes) - added by jan.aerts
17 years ago.
Ruby script for roundtrip (i.c. EMBL)
-
bioruby_aav50056_fasta_annotated.png
(108.3 KB) - added by jan.aerts
17 years ago.
Annotated picture of a FASTA entry
-
bioruby_ab09071_embl_annotated.png
(92.0 KB) - added by jan.aerts
17 years ago.
Annotated picture of an EMBL entry
-
bioruby_ab09071_gb_annotated.png
(124.9 KB) - added by jan.aerts
17 years ago.
Annotated picture of a GenBank? entry
-
bioperl_convert.pl
(1.6 KB) - added by heikki
17 years ago.
Bioperl script fr roundtriping EMBL, GenBank? and Swiss-Prot files
-
aj224122-bioperl.fa
(3.9 KB) - added by heikki
17 years ago.
bioperl processed fasta file
-
aj224122-bioperl.gb
(8.9 KB) - added by heikki
17 years ago.
bioperl processed Genbank file
-
aj224122-bioperl.swiss
(7.8 KB) - added by heikki
17 years ago.
bioperl processed swissprot file
-
aj224122-bioperl.embl
(8.9 KB) - added by heikki
17 years ago.
bioperl processed EMBL file