BioPerl Round Trip, Second Pass

This pass looks into the problems identified in the first pass and solves the major ones where possible. The tag MAJOR marks a problem that should be solved if possible; minor comments are for logging only.

Based on bioperl-live SVN revision 14501.

I noticed that it is not enough to read in a database entry and write it out. You must also be able to read the first output back in, write out a second output file, and show that it is identical (or near enough) to the original entry. The similarities of this process to Koch's postulates are clear. :)
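The two-pass check described above can be sketched as follows. This is an illustrative Python sketch, not BioPerl code: `convert` is a hypothetical callable standing in for "parse the entry, then write it out again", and the trailing-whitespace normalisation is only one possible reading of "near enough".

```python
import difflib


def normalize(text):
    """Split into lines and drop trailing whitespace (cosmetic differences)."""
    return [line.rstrip() for line in text.strip().splitlines()]


def round_trip_diff(original, convert):
    """Run two read/write passes and diff the second output against the original.

    `convert` is a hypothetical callable: it takes the text of one entry,
    parses it, and returns the text written back out.
    """
    first = convert(original)   # first pass: original -> first output
    second = convert(first)     # second pass: first output -> second output
    return list(
        difflib.unified_diff(
            normalize(original),
            normalize(second),
            "original",
            "second pass",
            lineterm="",
        )
    )
```

An empty list means the entry survived both passes; any other result lists the lines that were lost or changed.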

BioPerl does not have parsers for these formats:

  • ASN.1
  • genbank XML
  • INSD XML

fasta

  • minor: the length of the sequence line can vary (settable using method Bio::SeqIO::fasta::width())
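Because the line width is only a layout choice, two FASTA records can be compared after joining their sequence lines. A minimal Python sketch, assuming a single-record FASTA string; the function names are made up for illustration and are not part of BioPerl:

```python
def wrap_sequence(seq, width=60):
    """Cut a sequence string into lines of at most `width` characters."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def fasta_residues(record):
    """Return the sequence of a single-record FASTA string, ignoring line width."""
    lines = record.strip().splitlines()
    # lines[0] is the '>' header; everything after it is sequence data
    return "".join(line.strip() for line in lines[1:])
```

Two outputs written with different widths then compare equal on `fasta_residues()`, even though a line-by-line diff would report every sequence line as changed.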

embl

  • MAJOR: sequence name and accession lost in conversion
    • The downloaded sequence file was mysteriously mangled. New file uploaded. Parser works.
    • Note: EMBL format does not have a separate name any more. The primary accession number is now the name on the ID line.
  • MAJOR: OX line for TaxId is lost
    • Another error caused by the mangled entry
    • The FT key source contains qualifier db_xref to the taxon
  • minor: only the actual date on the DT (date) line is kept; release information and document version are lost.
    • Note: We now track only the sequence version from the ID line, not the document version from the second DT line.
    • Note: BioSQL can store both versions. Should we update the BioPerl Bio::Seq::RichSeqI API to have document version, too?
      DT   27-FEB-1998 (Rel. 54, Created)
      DT   14-NOV-2006 (Rel. 89, Last updated, Version 6)
      ->
      DT   27-FEB-1998
      DT   14-NOV-2006
      
  • minor: The RC (Reference Comment) lines in the Reference section are ignored.
    RC   revised by [4]
    
  • minor: Word wrapping differences in free text lines, especially in author lists
  • minor: the feature key/value pairs (FT) are not returned in order
  • minor: SQ line does not contain CRC32 value
    • the current EMBL format does not use CRC32 any more!
    • note: there is a method for CRC64 in Bio::SeqIO::swiss::_crc64
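To make the DT line loss concrete, the sketch below parses a DT line into all of its parts, so that release, event and document version could be kept alongside the date. It is written in Python for illustration only and is not how Bio::SeqIO::embl works; the pattern is inferred from the two sample lines above and may not cover every DT variant.

```python
import re

# Pattern inferred from the sample DT lines; the parenthesised part is optional
# so that date-only lines (as currently written out) still parse.
DT_LINE = re.compile(
    r"^DT\s+(?P<date>\d{2}-[A-Z]{3}-\d{4})"
    r"(?:\s+\(Rel\. (?P<release>\d+), (?P<event>Created|Last updated)"
    r"(?:, Version (?P<version>\d+))?\))?\s*$"
)


def parse_dt(line):
    """Return date, release, event and document version from one DT line."""
    match = DT_LINE.match(line)
    if match is None:
        raise ValueError("not a recognised DT line: %r" % line)
    fields = match.groupdict()
    return {
        "date": fields["date"],
        "release": int(fields["release"]) if fields["release"] else None,
        "event": fields["event"],
        "version": int(fields["version"]) if fields["version"] else None,
    }
```

For the second sample line this yields the date 14-NOV-2006, release 89, event "Last updated" and document version 6; for a date-only line the last three fields are None.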

genbank

  • MAJOR: SOURCE line adds full stop to the end of the line (following old genbank convention?)
    • fixed: SVN revision 14502
  • MAJOR: line BASE not present in recent genbank format, but it is still generated by BioPerl
    • This should be a safe line to drop, because GenPept has never had it and all (most?) parsers can deal with both.
    • fixed: SVN revision 14503.
  • minor: features are not returned in order

swiss-prot

  • MAJOR: No full stop at the end of the DT lines
    • This is actually important to have. The parser will not grab the sequence version without the stop character.
    • Changed the parser to work regardless of the end stop character
    • write_seq now writes out the stops
    • fixed: SVN revision 14504.
  • MAJOR: PE (evidence) line returned between CC and DR lines when it should be between DR and KW lines
    • fixed: SVN revision 14505.
  • MAJOR: RX line:DOI key/value pair lost
    • The parser regexps depend on the order of the references. The parser will be easier to maintain without this restriction.
    • The parser does not take into account that any of the refs can be missing
    • Rewrote RX line parsing and writing
    • fixed: SVN revision 14505.
  • MAJOR: Extra spaces and a stop added to FT HELIX and STRAND lines:
    FT   STRAND      910    913
    ->
    FT   STRAND      910    913       .
    
    • fixed: SVN revision 14509.
  • MAJOR: word wrapping differences and extra spaces
    • minor: OC line word wrapping differences
    • minor: extra spaces at the end of the first RT line when there is more than one of them
    • minor: extra space after first FT line
    • minor: extra space written to the end of the sequence line
      • Note: All these extra spaces at the end of the line come from _write_line_swissprot_regex(). Check if this can be fixed!
      • SW max line length seems to be 76, not 80
    • Note: Not all issues are solved, see below
    • fixed: SVN revision 14509.
  • MAJOR: GN line returns only the values from key/value pairs, e.g.:
    GN   Name=DOF3.7; Synonyms=BBFA, DAG1;...   
    ->  
    GN   DOF3.7 OR BBFA OR DAG1 ...
    
    • fixed: SVN revision 14510.
  • minor: FTid not written on its own line
  • minor: SW does not word wrap between author surname and initials, but does at hyphen '-'
  • Molecular weight differs by a few daltons
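The RX and GN problems above share one idea: treat the line as a set of `Key=Value;` pairs rather than matching the keys in a fixed order. The Python sketch below illustrates that idea only. It is not the code committed to BioPerl, the function names are hypothetical, and it assumes that values never contain a semicolon.

```python
def parse_pairs(line):
    """Parse the 'Key=Value; Key=Value;' body of an RX or GN line into a dict.

    Keys may appear in any order and any of them may be missing.
    Assumption: values contain no ';' character.
    """
    body = line[5:]  # skip the two-letter line code and three padding spaces
    pairs = {}
    for chunk in body.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        key, sep, value = chunk.partition("=")
        if not sep:
            raise ValueError("not a key=value pair: %r" % chunk)
        pairs[key.strip()] = value.strip()
    return pairs


def format_pairs(code, pairs):
    """Write the pairs back out with their keys, in the order they were stored."""
    return code + "   " + " ".join("%s=%s;" % (k, v) for k, v in pairs.items())
```

With this approach a GN line such as `GN   Name=DOF3.7; Synonyms=BBFA, DAG1;` keeps its `Name` and `Synonyms` keys on output, and an RX line keeps a `DOI` pair wherever it appears on the line.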