Version 32 (modified by rvos, 16 years ago)


At present there is no standard web-service API for phylogenetic data that would allow data and service providers to be integrated into the programmable web. Hence, current approaches to integrating data and services into workflows are highly specific to the integration platform (CIPRES, BioPerl, Bio::Phylo, Kepler) and nearly unusable in other environments. This work group was formed to remedy this (to the extent that that's possible in a week).

Here are several ideas for tasks we can work on at the hackathon:

  • Defining scope
    • Issue of identifiers and OTUs
  • Accumulating use-cases
  • Formulating a task-oriented API requirements description
  • Proposing a concrete REST or SOAP-based API
  • Proposing input/output formats (e.g. http://$resource?view=$format, where $resource is an opaque URL to some data resource (matrix, tree, etc.) and $format is something like nexus, phylip, nexml, phyloxml, json, etc.)
  • Start a reference implementation, for example based on data in BioSQL
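
As a rough illustration of the http://$resource?view=$format idea above, here is a minimal Python sketch of how a client might build such URLs. The function name view_url, the host example.org and the fixed set of format names are assumptions made for this example only; nothing here is an agreed API.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Format names taken from the list above; a real service might support others.
KNOWN_FORMATS = {"nexus", "phylip", "nexml", "phyloxml", "json"}


def view_url(resource: str, fmt: str) -> str:
    """Return the (opaque) resource URL with a view=<fmt> query parameter.

    Hypothetical helper: any existing 'view' parameter is replaced,
    all other query parameters are kept as they are.
    """
    if fmt not in KNOWN_FORMATS:
        raise ValueError(f"unknown format: {fmt}")
    scheme, netloc, path, query, fragment = urlsplit(resource)
    params = [(k, v) for k, v in parse_qsl(query) if k != "view"]
    params.append(("view", fmt))
    return urlunsplit((scheme, netloc, path, urlencode(params), fragment))


# example.org is a placeholder host, not a real phylogenetic service.
print(view_url("http://example.org/tree/123", "nexml"))
```

The resource URL is treated as opaque: the sketch only touches the query string, so the same tree or matrix can be requested in different serializations without the client knowing how the path is structured.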

Gathering of use-cases and task-oriented requirements has started at

Open Space discussion

The Open Space discussion centered on the following issues:

  • The OTU (Operational Taxonomic Unit) perspective is an important use-case.
    • Species tree hypothesis testing: splitting a given set of trees into subsets according to their compatibility with a given species tree (or set of species trees). The degree of compatibility can be expressed as the minimal number of duplications needed to reconcile the gene tree with a species tree. For example, one could measure the percentage of gene trees supporting an Ecdysozoa versus a Coelomata hypothesis.
      • Problem: the query topology will be given with either gene name labels, or species name labels, but the labels of the trees will be OTUs.
      • Hence, each OTU needs to be linked to its gene name(s) and taxon name(s), and it must be possible to specify whether tree nodes are matched by their linked taxon names or by their linked gene names.
    • The analysis mentioned above could be extended by asking questions about the (majority of) functional categories supporting a given species tree. These examples require association of the following data with gene tree nodes: taxonomy identifier and sequence identifier (which then, ideally, allows retrieval of functional data, such as GO).
    • Gene tree analysis: similar to the Zmasek et al (2007) paper, one may want to build alignments and phylogenetic trees for all members of each protein (family) of a biological network (e.g. apoptosis). After loading the trees into a database, one could then query the database for those gene trees that exhibit a given pattern (e.g. lineage-specific gene expansion or gene loss).
    • In molecular and comparative genomics applications, one may want to find all trees that have been built for a certain sequence.
      • Problem: As above, querying by sequence will give the gene name or the sequence accession number to match by, but tree nodes will have OTUs as labels.
  • We discussed whether we need identifiers for OTUs.
    • Pros: Rather than many individual idiosyncratic schemes for encoding sequence ID and taxon (and possibly additional information) into an OTU label, a single identifier could be resolved to the metadata using a common mechanism (such as LSID). Alternatively, one could standardize on a common encoding scheme for labels, which could then be parsed by a common mechanism.
    • Cons: If using a (presumably opaque) identifier for OTUs, one ought to be able to expect that the same combination of sequence ID, taxon name (where one often implies the other, unless the sequence ID is really an ambiguous gene name), and additional metadata (such as allele, population sample, etc.) results in the same identifier. In essence this necessitates an OTU identifier registry, or a common algorithm for constructing the identifier (which would then no longer be opaque). A standardized encoding mechanism would need to be widely supported and adopted.
  • We also need to be able to associate (typed) data with tree branches
    • The obvious example is branch lengths
    • But we usually also have (possibly multiple) support values associated with tree branches
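
To make the OTU linking and branch data points above more concrete, here is a minimal Python sketch of one possible data model. All class, field and function names (OTU, BranchData, resolve), the labels otu1 to otu3 and the sample values are assumptions invented for this example; the discussion did not settle on any particular representation.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Set


@dataclass
class OTU:
    """A tree tip label plus the names it is linked to (hypothetical model)."""
    label: str
    taxon_names: Set[str] = field(default_factory=set)
    gene_names: Set[str] = field(default_factory=set)


@dataclass
class BranchData:
    """Typed data attached to one tree branch (hypothetical model)."""
    length: Optional[float] = None
    # Possibly several support values, keyed by type.
    support: Dict[str, float] = field(default_factory=dict)


def resolve(query_labels: Iterable[str], otus: Iterable[OTU], by: str) -> Dict[str, List[str]]:
    """Map each query label to the OTU labels linked to it.

    `by` selects whether query labels are taxon names or gene names.
    """
    if by not in ("taxon", "gene"):
        raise ValueError("by must be 'taxon' or 'gene'")
    otu_list = list(otus)
    result: Dict[str, List[str]] = {}
    for q in query_labels:
        result[q] = [
            o.label
            for o in otu_list
            if q in (o.taxon_names if by == "taxon" else o.gene_names)
        ]
    return result


# Illustrative data only.
otus = [
    OTU("otu1", {"Homo sapiens"}, {"BCL2"}),
    OTU("otu2", {"Mus musculus"}, {"Bcl2"}),
    OTU("otu3", {"Homo sapiens"}, {"BAX"}),
]
print(resolve(["Homo sapiens"], otus, by="taxon"))  # {'Homo sapiens': ['otu1', 'otu3']}

branch = BranchData(length=0.12, support={"bootstrap": 98.0, "posterior": 0.99})
```

With such a mapping, a query topology given in species names or gene names can be translated into the OTU labels used on the stored trees before any matching is attempted.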

Tuesday session

In a whiteboard exercise, we identified plausible input and output data types for phyloinformatic web services. Plausibility is defined by our being able to imagine use cases (no timeline for implementation is implied; the goal here is to come up with interfaces).


  • One Tree - exactly one tree, which might function as a query topology, as an input for topology metric calculations, or as something for which associated data (matrices) and metadata might be retrieved
  • Pair of Trees - exactly two trees, which function as inputs for tree-to-tree distance calculations
  • Set of Trees - input for consensus calculations, or as query topologies
  • One OTU - exactly one OTU for which associated data (trees or matrices that contain it) and metadata might be retrieved
  • Pair of OTUs - exactly two OTUs, as input for topological queries (MRCA) and calculations (patristic distance)
  • Set of OTUs - input for topological queries (MRCA) and for which data (trees or matrices that contain them) and metadata might be retrieved
  • One Node - input for tree traversal operations (parent, children) and for which metadata might be retrieved
  • Pair of Nodes - input for topological queries (MRCA) and calculations (patristic distance)
  • Set of Nodes - input for topological queries (MRCA)
  • One Character - exactly one character (matrix column) for which calculations are performed (variability) and metadata is retrieved
  • Set of Characters - input as filter predicate, to retrieve OTUs that contain recorded states for the characters
  • One Character State Sequence - for which metadata is retrieved
  • Pair of Character State Sequences - as input for pairwise alignments, as input to calculate pairwise divergence
  • Set of Character State Sequences - as input for multiple sequence alignment
  • Character State Matrix - as input for inference (of one tree or set of trees), as input for calculations (average sequence divergence) and for which metadata is retrieved
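
Two of the operations named above for pairs of nodes or OTUs, the MRCA query and the patristic distance calculation, can be sketched in a few lines of Python. The tree representation (a dictionary mapping each node to its parent and the branch length leading to it), the function names and the toy tree are assumptions for this example, not a proposed interface.

```python
from typing import Dict, List, Optional, Tuple

# node -> (parent, length of the branch leading to the node); the root has parent None.
Tree = Dict[str, Tuple[Optional[str], float]]


def ancestors(tree: Tree, node: str) -> List[str]:
    """Return the path from node up to the root, starting with node itself."""
    path = [node]
    while tree[node][0] is not None:
        node = tree[node][0]
        path.append(node)
    return path


def mrca(tree: Tree, a: str, b: str) -> str:
    """Most recent common ancestor of two nodes."""
    seen = set(ancestors(tree, a))
    for n in ancestors(tree, b):
        if n in seen:
            return n
    raise ValueError("nodes do not share a root")


def distance_to_ancestor(tree: Tree, node: str, ancestor: str) -> float:
    """Sum of branch lengths from node up to one of its ancestors."""
    total = 0.0
    while node != ancestor:
        parent, length = tree[node]
        total += length
        node = parent
    return total


def patristic_distance(tree: Tree, a: str, b: str) -> float:
    """Path length between two nodes, going through their MRCA."""
    m = mrca(tree, a, b)
    return distance_to_ancestor(tree, a, m) + distance_to_ancestor(tree, b, m)


# Toy tree ((A:2,B:3)n1:1,C:4)root; the values are illustrative only.
tree: Tree = {
    "root": (None, 0.0),
    "n1": ("root", 1.0),
    "A": ("n1", 2.0),
    "B": ("n1", 3.0),
    "C": ("root", 4.0),
}
print(mrca(tree, "A", "B"), patristic_distance(tree, "A", "B"))  # n1 5.0
print(mrca(tree, "A", "C"), patristic_distance(tree, "A", "C"))  # root 7.0
```

A web service would take the two node or OTU identifiers as input and return the MRCA node or the distance; the MRCA of a set of nodes could be computed by applying the pairwise function repeatedly.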