Version 26 (modified by czmasek, 10 years ago)

At present there is no standard web-service API for phylogenetic data that would allow phylogenetic data and service providers to be integrated into the programmable web. Hence, current approaches to integrating data and services into workflows are highly specific to the integration platform (CIPRES, BioPerl, Bio::Phylo, Kepler) and nearly unusable in other environments. This work group was formed to remedy this (to the extent that this is possible in a week).

Here are several ideas for tasks we can work on at the hackathon:

  • Defining scope
    • Issue of identifiers and OTUs
  • Accumulating use-cases
  • Formulating a task-oriented description of the API requirements
  • Identifying "method signatures" for the API (e.g. input: a tree, output: a number, for methods that return tree scores)
  • Proposing a concrete REST- or SOAP-based API
  • Proposing input/output formats (e.g. http://$resource?view=$format, where $resource is an opaque URL to some data resource (matrix, tree, etc.) and $format is something like nexus, phylip, nexml, phyloxml, json, etc.)
  • Starting a reference implementation, for example based on data in BioSQL
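
The http://$resource?view=$format idea from the list above can be illustrated with a minimal sketch. It assumes that the format is selected through a query parameter named `view`, as in the example; the function name `view_url`, the `KNOWN_FORMATS` set, and the example.org URL are hypothetical and not part of any agreed API.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of formats, copied from the examples named in the task list.
KNOWN_FORMATS = {"nexus", "phylip", "nexml", "phyloxml", "json"}


def view_url(resource: str, fmt: str) -> str:
    """Return `resource` with a `view=<fmt>` query parameter added or replaced."""
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    parts = urlsplit(resource)
    # Keep existing query parameters, but drop any earlier `view` value.
    query = [(k, v) for k, v in parse_qsl(parts.query) if k != "view"]
    query.append(("view", fmt))
    return urlunsplit(parts._replace(query=urlencode(query)))


# The resource URL stays opaque: the client only appends the requested view.
print(view_url("http://example.org/tree/42", "nexml"))
```

Treating the resource URL as opaque means a client never has to parse it; only the `view` parameter carries meaning, so the same tree or matrix can be requested in any supported format.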

The gathering of use-cases and task-oriented requirements has started at http://evoinfo.nescent.org/PhyloWS.

The Open Space discussion centered on the following issues:

  • The OTU (Operational Taxonomic Unit) perspective is an important use-case.
    • Species tree hypothesis testing: splitting a given set of trees into subsets as a function of their compatibility with a given (set of) species tree(s). The degree of compatibility can be expressed as the minimal number of duplications needed to reconcile the gene tree with a species tree. An example is measuring the percentage of gene trees that support an Ecdysozoa versus a Coelomata hypothesis.
      • Problem: the query topology will be given with either gene name labels, or species name labels, but the labels of the trees will be OTUs.
      • Hence, each OTU needs to be linked to its gene name(s) and taxon name, and it must be possible to specify that tree nodes be matched by their linked taxon or gene names.
    • The analysis mentioned above could be extended by asking questions about the (majority of) functional categories supporting a given species tree. These examples require that the following data be associated with gene tree nodes: a taxonomy identifier and a sequence identifier (which then, ideally, allows retrieval of functional data, such as GO).
    • Gene tree analysis: similar to the Zmasek et al. (2007) paper, one may want to build alignments and phylogenetic trees for all members of each protein (family) of a biological network (e.g. apoptosis). After loading the trees into a database, one could then query the database for those gene trees that exhibit a given pattern (e.g. lineage-specific gene expansion or gene loss).
    • In molecular and comparative genomics applications, one may want to find all trees that have been built for a certain sequence.
      • Problem: As above, querying by sequence will give the gene name or the sequence accession number to match by, but tree nodes will have OTUs as labels.
  • We discussed whether we need identifiers for OTUs.
    • Pros: Rather than many individual idiosyncratic schemes for encoding sequence ID and taxon (and possibly additional information) into an OTU label, a single identifier could be resolved to the metadata using a common mechanism (such as LSID). Alternatively, one could standardize on a common encoding scheme for labels, which could then be parsed by shared tools.
    • Cons: If using a (presumably opaque) identifier for OTUs, one ought to be able to expect that the same combination of sequence ID, taxon name (where one often implies the other, unless the sequence ID is really an ambiguous gene name), and additional metadata (such as allele, population sample, etc.) results in the same identifier. This in essence necessitates an OTU identifier registry, or a common algorithm for constructing the identifier (which would then no longer be opaque). A standardized encoding mechanism would need to be widely supported and adopted.
  • We also need to be able to associate (typed) data with tree branches
    • The obvious example is branch lengths
    • But we usually also have (possibly multiple) support values associated with tree branches
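
The two data requirements raised in the discussion above (OTU labels linked to taxon and gene names, and typed data on branches) can be sketched as a small data model. This is only an illustration under stated assumptions: the class names `OTU` and `Branch`, the function `match_labels`, and all labels and names in the example are invented here and do not come from any existing schema.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class OTU:
    """Hypothetical record linking an OTU label to its metadata."""
    label: str                   # label as it appears on the tree node
    taxon: str                   # linked taxon name
    genes: Tuple[str, ...] = ()  # linked gene name(s)


@dataclass
class Branch:
    """Hypothetical typed data attached to the branch leading to `child`."""
    child: str                                                # node the branch leads to
    length: Optional[float] = None                            # branch length, if known
    support: Dict[str, float] = field(default_factory=dict)   # support values keyed by type


def match_labels(otus: Iterable[OTU], names: Iterable[str], by: str = "taxon") -> List[str]:
    """Return the labels of OTUs whose linked taxon or gene names occur in `names`."""
    wanted = set(names)
    if by == "taxon":
        return [o.label for o in otus if o.taxon in wanted]
    if by == "gene":
        return [o.label for o in otus if wanted.intersection(o.genes)]
    raise ValueError(f"cannot match by {by!r}")


# Invented example data: a query given as species or gene names is translated
# into the OTU labels that actually appear on the stored trees.
otus = [
    OTU("otu1", "Taxon A", ("geneX",)),
    OTU("otu2", "Taxon B", ("geneX", "geneY")),
]
print(match_labels(otus, ["Taxon B"]))           # match by linked taxon name
print(match_labels(otus, ["geneX"], by="gene"))  # match by linked gene name

# A branch carrying a length and more than one kind of support value.
branch = Branch("otu2", length=0.12, support={"bootstrap": 98.0, "posterior": 0.99})
```

Keeping the label separate from the linked names leaves open whether the label is an opaque identifier resolved through a registry or a structured string decoded by a shared parser; the matching step is the same in both cases.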
