Version 10 (modified by markjschreiber, 11 years ago)


The Problem

One of the goals of BioSQL was to provide an interchange platform for the Bio* objects. This has not yet succeeded due to differences in the way the Bio* projects interpret an individual sequence record and how they persist it to the database.

Common sequence semantics and object/format handling would probably benefit many other WS providers and consumers. If the Bio* projects can agree on semantics, the result would be a good reference for many other projects.

Possible Tasks

  • Choose some 'reference' sequences to see how the Bio* projects 'round-trip' them. Where are the differences, and why do they exist?
  • Find out where each Bio* project persists its data in BioSQL during ORM. Why are there differences?
  • Establish guidelines for where things should go in BioSQL, e.g. given a GenBank file, which parts should go where.
  • Define an interchange format for the Bio* projects. Probably XML, probably borrowing something that already exists (XEMBL etc.).
  • Decide on a restricted vocabulary for annotations and feature types. Probably use SO.


Random ideas from Jan Aerts

Mind that this is *very* incomplete. Just to help my really bad memory. As the issue is interoperability of the Bio* toolkits, we don't have to synchronize the toolkits at the object level, but rather at the interface level.

First thing to check: what types of objects do we want to synchronize? Sequence objects, of course; but what else? The results of parsing BLAST output?

For sequences

Check if each toolkit reads and writes a GenBank/Fasta/... serialization in the same way. Input can either be an original GenBank/Fasta/... file or a dbfetch from any database.

  • What should be conserved:
    • Tags
    • for a sequence: lower/uppercase
  • What need not be conserved:
    • for a sequence over multiple lines: length of each line
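The check described above could be sketched roughly as follows for FASTA. This is only an illustration, not part of any Bio* toolkit: the function names `parse_fasta` and `round_trip_differences` are hypothetical, and the parser is deliberately minimal. It treats the header line and the case of the sequence as things that must be conserved, and ignores how the sequence is wrapped over lines.

```python
from typing import List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Return (header, sequence) pairs; line breaks inside a sequence are dropped."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def round_trip_differences(original: str, rewritten: str) -> List[str]:
    """Report differences that matter: headers and case-sensitive sequence.

    Line length is not compared, because parse_fasta joins wrapped lines.
    """
    a = parse_fasta(original)
    b = parse_fasta(rewritten)
    problems: List[str] = []
    if len(a) != len(b):
        problems.append(f"record count differs: {len(a)} vs {len(b)}")
    for i, ((h1, s1), (h2, s2)) in enumerate(zip(a, b)):
        if h1 != h2:
            problems.append(f"record {i}: header differs")
        if s1 != s2:
            if s1.lower() == s2.lower():
                problems.append(f"record {i}: case changed")
            else:
                problems.append(f"record {i}: sequence differs")
    return problems
```

In use, `original` would be the reference file and `rewritten` the output of one toolkit after reading and writing that file; an empty list would mean the toolkit round-trips the record under these assumptions. The same idea would need a richer parser for GenBank, where tags also have to be compared.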

Tasks achieved


  • Initial planning.
  • Approved a BioSQL logo.
