Version 22 (modified by raoul.bonnal, 16 years ago)

--

The Problem

One of the goals of BioSQL was to provide an interchange platform for the Bio* objects. This has not yet succeeded due to differences in the way the Bio* projects interpret an individual sequence record and how they persist it to the database.

Common sequence semantics and object/format handling would probably be of great benefit to many other WS providers and consumers. If the Bio* projects can agree on semantics, the result would be a good reference for many other projects.

Possible Tasks

  • Choose some 'reference' sequences to see how the Bio* projects 'round-trip' them.
    • Where are the differences and why are there differences?
  • Find out where each Bio* project persists its data into BioSQL during ORM.
    • Why are there differences?
  • Establish guidelines for where things should go in BioSQL, e.g. given a GenBank file, which parts should go where.
    • UML diagrams?
  • Define an interchange format for the Bio* projects. Probably XML, probably borrowing something that already exists (XEMBL etc.).
  • Decide on a restricted vocab for annotations and feature types. Probably use SO.
  • Define a middleware API for uniform I/O access to sequence database.
    • Initially backed by BioSQL.
    • Could be backed by any DB.
  • Derby version of the BioSQL schema (Derby is the Java reference database).
  • A BioSQL release.
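
The middleware API task above could start from something as small as the following sketch. It is only an illustration of the idea of one interface with interchangeable backends: the names SequenceRecord, SequenceStore and InMemoryStore are hypothetical and do not come from any Bio* project or from BioSQL. A BioSQL-backed class would implement the same two methods.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SequenceRecord:
    """Hypothetical minimal record: accession, sequence, free-form annotations."""
    accession: str
    sequence: str
    annotations: Dict[str, str] = field(default_factory=dict)


class SequenceStore(ABC):
    """Hypothetical uniform interface; each backend (BioSQL, flat files, ...) implements it."""

    @abstractmethod
    def store(self, record: SequenceRecord) -> None:
        """Persist a record, replacing any record with the same accession."""

    @abstractmethod
    def fetch(self, accession: str) -> Optional[SequenceRecord]:
        """Return the record for an accession, or None if it is unknown."""


class InMemoryStore(SequenceStore):
    """Stand-in backend used only to show that the interface is backend-neutral."""

    def __init__(self) -> None:
        self._records: Dict[str, SequenceRecord] = {}

    def store(self, record: SequenceRecord) -> None:
        self._records[record.accession] = record

    def fetch(self, accession: str) -> Optional[SequenceRecord]:
        return self._records.get(accession)


db: SequenceStore = InMemoryStore()
db.store(SequenceRecord("X00001", "acgtACGT", {"organism": "example"}))
print(db.fetch("X00001"))
```

Client code written against SequenceStore would not need to change when the backend does, which is the point of keeping the synchronization at the interface level.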

Participants

  • Mark Schreiber
  • Jan Aerts
  • Richard Holland
  • Hilmar Lapp
  • Heikki Lehvaslaiho
  • Richard Bruskiewich
  • Jan Byrne
  • Raoul J.P. Bonnal

Random ideas from Jan Aerts

Mind that this is *very* incomplete. Just to help my really bad memory. As the issue is interoperability of the Bio* toolkits, we don't have to synchronize the toolkits at the object level, but rather at the interface level.

First thing to check: what types of objects do we want to synchronize? Of course sequence objects; but what else? The results of a BLAST parsing?

For sequences

Check if each toolkit reads and writes a GenBank/Fasta/... serialization in the same way. Input can either be an original GenBank/Fasta/... file or a dbfetch from any database.

  • What should be conserved:
    • Tags
    • for a sequence: lower/uppercase
      • Within a project it is desirable to mask an alphabet, but for transfer between Bio* projects this is not a good idea.
  • What need not be conserved:
    • for a sequence over multiple lines: length of each line
      • Proposal for a default value, 60 bp?
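
A round-trip check along these lines could look like the sketch below. It assumes plain FASTA only and uses a deliberately minimal parser and writer written for this example (parse_fasta, write_fasta and same_records are hypothetical helpers, not functions from any Bio* toolkit). Headers and the case of each residue are compared exactly, line length is ignored on input, and output is wrapped at the proposed default of 60.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str]  # (header without '>', sequence with original case)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs; line wrapping is discarded, case is kept."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def write_fasta(records: Iterable[Record], width: int = 60) -> str:
    """Serialize records, wrapping sequence lines at `width` characters."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


def same_records(a: str, b: str) -> bool:
    """True if both texts hold the same headers and case-sensitive sequences."""
    return list(parse_fasta(a)) == list(parse_fasta(b))


# Input wrapped at 70 characters, with a lowercase (masked) stretch.
original = ">seq1 example record\n" + "ACGTacgtNN" * 7 + "\n" + "ACGTacgtNN" * 3 + "\n"
roundtrip = write_fasta(parse_fasta(original))

print(same_records(original, roundtrip))                 # wrapping differs, content does not
print(same_records(original, roundtrip.replace("acgt", "ACGT")))  # case was changed
```

The same comparison could be run on the output of each toolkit for the chosen reference sequences; a difference reported here would then be a difference in content and not in layout.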

Tasks achieved

Tuesday

  • Initial planning.
  • Approved a BioSQL logo.
  • Hilmar initiated BioSQL release discussion.
  • Selected sequence files to round-trip.
  • Began round-trips.
  • Started UML diagram to describe object model with Richard Bruskiewich.
