Version 3 (modified by heikki, 17 years ago)

--

BioPerl Round Trip, Second Pass

This pass looks into the problems identified in the first pass and solves the major ones where possible. Items tagged MAJOR should be solved if possible; items tagged minor are recorded for logging only.

Based on bioperl-live SVN revision 14501.
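A round-trip check of this kind can be sketched as a line-by-line comparison of the original file with the regenerated one. The sketch below is standalone Python, not part of BioPerl. The function name `compare_round_trip` is hypothetical, and treating whitespace-only differences as a separate group is an assumption made for illustration, not the rule used for the MAJOR/minor tags on this page.

```python
import difflib


def _squash(line):
    """Collapse runs of whitespace so spacing-only changes compare equal."""
    return " ".join(line.split())


def compare_round_trip(original, regenerated):
    """Return (content_diffs, whitespace_diffs) between two flat-file texts.

    Each entry is an (old, new) pair. Blocks of changed lines with equal
    length are paired line by line; anything else is reported as one
    content difference.
    """
    old_lines = original.splitlines()
    new_lines = regenerated.splitlines()
    content, whitespace = [], []
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        old, new = old_lines[i1:i2], new_lines[j1:j2]
        if tag == "replace" and len(old) == len(new):
            for o, n in zip(old, new):
                if o == n:
                    continue
                target = whitespace if _squash(o) == _squash(n) else content
                target.append((o, n))
        else:
            # Inserted, deleted, or unevenly replaced lines.
            content.append(("\n".join(old), "\n".join(new)))
    return content, whitespace
```

Called with the text of the downloaded entry and the text written back out, the first list holds the differences that change content and the second those that only change spacing.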

BioPerl does not have parsers for these formats:

  • ASN.1
  • genbank XML
  • INSD XML

fasta

  • minor: the length of the sequence line can vary (settable using method Bio::SeqIO::fasta::width() )
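Why the line width is only a minor difference can be shown with a small standalone sketch (Python, not BioPerl code). The helper names `format_fasta` and `read_fasta_sequence` are hypothetical, and the default width of 60 is an assumption chosen for the example.

```python
def format_fasta(seq_id, sequence, width=60):
    """Return one FASTA record with sequence lines wrapped at `width` characters."""
    if width < 1:
        raise ValueError("width must be a positive integer")
    lines = [">" + seq_id]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


def read_fasta_sequence(text):
    """Join all non-header lines, so the input line width does not matter."""
    return "".join(
        line.strip() for line in text.splitlines() if not line.startswith(">")
    )
```

Two records written with different widths differ as text but yield the same sequence when read back, so a round-trip comparison should either fix the width or compare the parsed sequence.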

embl

  • MAJOR: sequence name and accession lost in conversion
    • The downloaded sequence file was mysteriously mangled. New file uploaded. Parser works.
    • Note: EMBL format does not have a separate name any more. The primary accession number is now the name on the ID line.
  • MAJOR: OX line for TaxId is lost
    • Another error caused by the mangled entry
    • The FT source feature contains a db_xref qualifier pointing to the taxon
  • minor: only the actual date on the DT (date) line is kept; release information and document version are lost.
    • Note: We now track only the sequence version from the ID line, not the document version from the second DT line.
    • Note: BioSQL can store both versions. Should we update the BioPerl Bio::Seq::RichSeqI API to have document version, too?
      DT   27-FEB-1998 (Rel. 54, Created)
      DT   14-NOV-2006 (Rel. 89, Last updated, Version 6)
      ->
      DT   27-FEB-1998
      DT   14-NOV-2006
      
  • minor: The RC (Reference Comment) lines in the Reference section are ignored.
    RC   revised by [4]
    
  • minor: Word wrapping differences in free text lines, especially in author lists
  • minor: the feature key/value pairs (FT) are not returned in order
  • minor: SQ line does not contain CRC32 value
    • the current EMBL format does not use CRC32 any more!
    • note: there is a method for CRC64 in Bio::SeqIO::swiss::_crc64
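To illustrate what is lost on the DT lines, the sketch below parses a DT line into its date, release, event and document version. It is standalone Python, not the BioPerl parser. The function name `parse_dt_line` and the dictionary keys are hypothetical, and the pattern is inferred only from the two example lines shown above.

```python
import re

# Date, optionally followed by a parenthesised note such as
# "(Rel. 89, Last updated, Version 6)".
DT_PATTERN = re.compile(
    r"^DT\s+(?P<date>\d{2}-[A-Z]{3}-\d{4})(?:\s+\((?P<note>[^)]*)\))?\s*$"
)


def parse_dt_line(line):
    """Split an EMBL DT line into date, release, event and document version."""
    match = DT_PATTERN.match(line)
    if match is None:
        raise ValueError("not a recognised DT line: %r" % line)
    record = {"date": match.group("date"), "release": None,
              "event": None, "version": None}
    note = match.group("note")
    if note:
        for part in (p.strip() for p in note.split(",")):
            if part.startswith("Rel."):
                record["release"] = int(part[len("Rel."):].strip())
            elif part.startswith("Version"):
                record["version"] = int(part[len("Version"):].strip())
            else:
                record["event"] = part
    return record
```

A parser that keeps only `date` reproduces the truncated output shown above; keeping the other three fields would be enough to write the original lines back.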

genbank

  • MAJOR: SOURCE line gets a full stop added to the end of the line (following an old GenBank convention?)
  • minor: the BASE COUNT line is not present in recent GenBank files but is still generated by BioPerl
  • minor: features are not returned in order

swiss-prot

  • minor: No full stop at the end of the DT lines
  • MAJOR: GN line returns only the values from the key/value pairs, e.g.
    GN   Name=DOF3.7; Synonyms=BBFA, DAG1;...   
    ->  
    GN   DOF3.7 OR BBFA OR DAG1 ...
    
  • minor: OC line word wrapping differences
  • minor: extra spaces at the end of the first RT line when there is more than one RT line
  • MAJOR: RX line: DOI key/value pair lost
  • MAJOR: PE (evidence) line returned between CC and DR lines when it should be between DR and KW lines
  • minor: extra space after first FT line
  • minor: FTid sometimes not written on its own line
  • minor: extra space written to the end of the sequence line
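The GN problem above comes from discarding the keys. A minimal standalone sketch (Python, not BioPerl code) of a key-preserving parse and write follows. The names `parse_gn_line` and `format_gn_line` are hypothetical. The sketch assumes a single-line GN entry with `Key=value, value;` fields and uses only the visible part of the example above; continuation lines and any other GN syntax are not handled.

```python
def parse_gn_line(line):
    """Parse 'GN   Name=X; Synonyms=A, B;' into {key: [values]}, keeping key order."""
    if not line.startswith("GN"):
        raise ValueError("not a GN line: %r" % line)
    fields = {}
    for item in line[2:].split(";"):
        item = item.strip()
        if not item:
            continue
        key, sep, value = item.partition("=")
        if not sep:
            raise ValueError("GN field without '=': %r" % item)
        fields[key.strip()] = [v.strip() for v in value.split(",")]
    return fields


def format_gn_line(fields):
    """Write the fields back in key/value form instead of joining values with OR."""
    parts = ["%s=%s;" % (key, ", ".join(values)) for key, values in fields.items()]
    return "GN   " + " ".join(parts)
```

Because the keys are stored together with their values, writing the line back reproduces the `Name=...; Synonyms=...;` form rather than the `OR`-joined list.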