Version 37 (modified by markjschreiber, 16 years ago)
--

The Problem

One of the goals of BioSQL was to provide an interchange platform for the Bio* objects. This has not yet succeeded due to differences in the way the Bio* projects interpret an individual sequence record and how they persist it to the database.

Common sequence semantics and object/format handling would probably be of great benefit to many other WS providers and consumers. If the Bio* projects can agree on semantics, the result would be a good reference for many other projects.

Possible Tasks

Choose some 'reference' sequences to see how the Bio* projects 'round-trip' them.
- Where are the differences and why are there differences?
Find out where each Bio* project persists its data into BioSQL during ORM.
- Why are there differences?
Establish guidelines for where things should go in BioSQL, e.g. given a GenBank file, which bits should go where.
- UML diagrams?
Define an interchange format for the Bio* projects. Probably XML, probably borrow something already existing (XEMBL etc).
Decide on a restricted vocab for annotations and feature types. Probably use SO.
Define a middleware API for uniform I/O access to sequence database.
- Initially backed by BioSQL.
- Could be backed by any DB.
Derby version of the BioSQL schema (Derby is the Java reference database).
A BioSQL release.
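
As an illustration of the middleware API task above, here is a minimal sketch of a storage-agnostic interface. All names (`SequenceRecord`, `SequenceStore`, `InMemoryStore`) are hypothetical and were not agreed by the participants. The idea is only that client code talks to one interface, while the backend could be BioSQL or any other database.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class SequenceRecord:
    """Hypothetical minimal record shared across toolkits."""
    accession: str
    sequence: str
    annotations: dict = field(default_factory=dict)


class SequenceStore(ABC):
    """Hypothetical uniform I/O interface; a backend could be BioSQL or any DB."""

    @abstractmethod
    def store(self, record: SequenceRecord) -> None:
        ...

    @abstractmethod
    def fetch(self, accession: str) -> SequenceRecord:
        ...


class InMemoryStore(SequenceStore):
    """Stand-in backend for illustration; a real one would issue SQL."""

    def __init__(self):
        self._records = {}

    def store(self, record):
        self._records[record.accession] = record

    def fetch(self, accession):
        return self._records[accession]
```

Code written against `SequenceStore` would not need to change if `InMemoryStore` were swapped for a BioSQL-backed implementation.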

Participants

Mark Schreiber
Jan Aerts
Richard Holland
Hilmar Lapp
Heikki Lehvaslaiho
Richard Bruskiewich
Jan Byrne
Raoul J.P. Bonnal

References

Formats

GenBank sample

Random ideas from Jan Aerts

Mind that this is *very* incomplete. Just to help my really bad memory. As the issue is interoperability of the Bio* toolkits, we don't have to synchronize the toolkits at the object level, but rather at the interface level.

First thing to check: what types of objects do we want to synchronize? Of course sequence objects; but what else? The results of a BLAST parsing?

For sequences

Check if each toolkit reads and writes a GenBank/Fasta/... serialization in the same way. Input can either be an original GenBank/Fasta/... file or a dbfetch from any database.

What should be conserved:
- Tags
- for a sequence: lower/uppercase
  - Within a project it can be desirable to mask an alphabet; for transfer between Bio* projects this is not a good idea.

What need not necessarily be conserved:
- for a sequence over multiple lines: length of each line
  - Proposal for a common default value, 60 bp?
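
The rules above can be sketched for FASTA as follows. This is a minimal sketch, not code from any of the Bio* toolkits: the function names are made up for illustration, and the 60-character default is only the value proposed above. Headers and letter case are compared exactly, while line length is ignored.

```python
def parse_fasta(text):
    """Return (header, sequence) pairs; line breaks inside a sequence are dropped."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def write_fasta(records, width=60):
    """Write records with a fixed line width (60 is the proposed default)."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def same_content(a, b):
    """True if two FASTA texts agree on headers and case-sensitive sequence."""
    return parse_fasta(a) == parse_fasta(b)
```

With this comparison, a file rewritten at a different line width still counts as a clean round-trip, but one whose lowercase residues were upper-cased does not.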

Tasks achieved

Tuesday

Initial planning.
Approved a BioSQL logo.
Hilmar initiated BioSQL release discussion
Selected sequence files to roundtrip
BioRuby has no way of exporting Bio::Sequence objects to GenBank, EMBL, ...
Began roundtrips
- BioPerlRoundTripFirstPass
- BioJavaRoundTripFirstPass
Started UML diagram to describe object model with Richard Bruskiewich.

Wednesday

Continue roundtrips
- BioPerlRoundTripSecondPass
- BioJavaRoundTripSecondPass
BioRuby: started work on creating export filters to GenBank and EMBL
BioJava: Fixing issues with UniprotXML format and updating EMBLxml to new xsd.
Adding BioSQL schema documentation to BioSQL wiki page.

Thursday

Continue roundtrips
- BioJavaRoundTripThirdPass
Translation of BioSQL to Derby RDBMS.
-- Would like to know why BioSQL can't add multi dbxrefs to one docref.
-- Would like to work out how best to store EMBL AS lines (EMBLxml 'assemblyElement') in BioSQL beyond simply storing them as unparsed qualifier values. This is also hard because some records are missing some columns (e.g. primary begin/end), and the XML representation does not allow for missing values, so those records cannot be represented in XML.
-- Would like to work out how best to represent EMBL CO (EMBLxml 'contig') lines in BioSQL. These are extra hard because they take the place of actual sequence data: sequences that have CO lines have no SQ lines, so sequence length has to be computed from the CO lines rather than being provided. CO lines look like GenBank locations but have a simpler syntax plus one extra keyword for gaps, whose argument is either numeric or a string such as 'unk100' indicating unknown gap size.
Generation of EJB entity beans for BioSQL schema.
- EJBs must be manually generated for BIOENTRY_PATH (done), BIOENTRY_QUALIFIER_VALUE (done), SEQFEATURE_PATH and TAXON_NAME
Discussion about consistent use of Unique Keys rather than Primary Keys for compound but mutable instances.
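
To illustrate the CO line issue noted above, here is a rough sketch of computing sequence length from a CO expression. It rests on several assumptions: the input is the CO content already joined into one string, segments have the form `accession:start..end`, and a gap written as `unk100` is counted as 100 positions unless told otherwise. The accessions in the usage note are invented, and real CO lines may use forms this sketch does not handle.

```python
import re

# Segment such as "AAAA01.1:1..500" (optionally wrapped in complement(...)).
SEGMENT = re.compile(r"[A-Za-z0-9_.]+:(\d+)\.\.(\d+)")
# Gap such as "gap(20)" or "gap(unk100)".
GAP = re.compile(r"gap\((unk)?(\d+)\)")


def contig_length(co_text, count_unknown_gaps=True):
    """Sum segment spans and gap sizes found in a CO expression.

    Assumption: an 'unkN' gap contributes N positions when
    count_unknown_gaps is True and nothing otherwise.
    """
    total = 0
    for start, end in SEGMENT.findall(co_text):
        total += int(end) - int(start) + 1
    for unknown, size in GAP.findall(co_text):
        if unknown and not count_unknown_gaps:
            continue
        total += int(size)
    return total
```

For example, `contig_length("join(AAAA01.1:1..500,gap(unk100),complement(BBBB02.1:11..210),gap(20))")` adds 500 + 100 + 200 + 20. Whether an unknown-size gap should count toward the stored length is exactly the kind of decision the Bio* projects would need to agree on.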

Attachments

dag1.fa (5.6 KB) - added by jan.aerts 16 years ago. DAG1 gene: FASTA formatted sequence
dag1.gb (17.9 KB) - added by jan.aerts 16 years ago. DAG1 gene: genbank formatted file exported from NCBI
dag1.insd (38.6 KB) - added by jan.aerts 16 years ago. DAG1 gene: INSD XML formatted file exported from NCBI
dag1.asn1 (58.0 KB) - added by jan.aerts 16 years ago. DAG1 gene: ASN.1 formatted file exported from NCBI
dag1.gb_xml (227.9 KB) - added by jan.aerts 16 years ago. DAG1 gene: genbank XML formatted file exported from NCBI (probably?)
aj224122.asn1 (17.9 KB) - added by jan.aerts 16 years ago. AJ224122 in ASN1 format as downloaded from NCBI
aj224122.fa (3.9 KB) - added by jan.aerts 16 years ago. AJ224122 in FASTA format as downloaded from NCBI
aj224122.gb (8.9 KB) - added by jan.aerts 16 years ago. AJ224122 in genbank format as downloaded from NCBI
aj224122.gb_xml (68.3 KB) - added by jan.aerts 16 years ago. AJ224122 in genbank XML format as downloaded from NCBI
aj224122.insd (19.5 KB) - added by jan.aerts 16 years ago. AJ224122 in INSD format as downloaded from NCBI
aj224122.swiss (7.8 KB) - added by heikki 16 years ago. AJ224122 translation (DOF37_ARATH/Q43385) in swiss-prot format as downloaded from UniProt
dag1-biojava.fa (5.7 KB) - added by markjschreiber 16 years ago. biojava roundtrip of fasta
aj224122-biojava.fa (3.9 KB) - added by markjschreiber 16 years ago. biojava roundtrip of fasta
aj224122-biojava.gb (9.1 KB) - added by markjschreiber 16 years ago. biojava roundtrip of genbank
aj224122-biojava.insd (22.6 KB) - added by markjschreiber 16 years ago. biojava roundtrip of INSDseq
aj224122.embl (9.1 KB) - added by markjschreiber 16 years ago. Correct version of EMBL file
aj224122.EMBL.xml (14.9 KB) - added by markjschreiber 16 years ago. EMBLxml format
aj224122.uniprot.xml (17.2 KB) - added by markjschreiber 16 years ago. uniprot xml
aj224122-biojava.insd.xml (23.6 KB) - added by markjschreiber 16 years ago. biojava round trip of INSDseq
Main.java (4.7 KB) - added by markjschreiber 16 years ago. Roundtrip program for biojava
derby-biosql.sql (27.7 KB) - added by markjschreiber 16 years ago. DERBY schema for BioSQL
aj224122-biojava.embl (9.2 KB) - added by holland 16 years ago. biojava embl round-trip
aj224122-biojava.swiss (7.8 KB) - added by holland 16 years ago. biojava uniprot round-trip
aj224122-biojava.uniprot.xml (18.0 KB) - added by holland 16 years ago. biojava uniprotxml round trip
aj224122-bioruby.embl (8.9 KB) - added by jan.aerts 16 years ago. bioruby roundtrip for EMBL format
roundtrip_bioruby.rb (161 bytes) - added by jan.aerts 16 years ago. Ruby script for roundtrip (in this case, EMBL)
bioruby_aav50056_fasta_annotated.png (108.3 KB) - added by jan.aerts 16 years ago. Annotated picture of a FASTA entry
bioruby_ab09071_embl_annotated.png (92.0 KB) - added by jan.aerts 16 years ago. Annotated picture of an EMBL entry
bioruby_ab09071_gb_annotated.png (124.9 KB) - added by jan.aerts 16 years ago. Annotated picture of a GenBank entry
bioperl_convert.pl (1.6 KB) - added by heikki 16 years ago. Bioperl script for round-tripping EMBL, GenBank and Swiss-Prot files
aj224122-bioperl.fa (3.9 KB) - added by heikki 16 years ago. bioperl processed fasta file
aj224122-bioperl.gb (8.9 KB) - added by heikki 16 years ago. bioperl processed Genbank file
aj224122-bioperl.swiss (7.8 KB) - added by heikki 16 years ago. bioperl processed swissprot file
aj224122-bioperl.embl (8.9 KB) - added by heikki 16 years ago. bioperl processed EMBL file
