Version 1 (modified by heikki, 10 years ago)

--

BioPerl Round Trip, Second Pass

This pass looks into the problems identified in the first pass and solves the major ones where possible. The tag MAJOR marks an issue that should be solved if possible; items tagged minor are logged for the record only.

Based on bioperl-live SVN revision 14501.
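
As an illustration of what a round-trip check amounts to, here is a minimal sketch in Python (not BioPerl code). It assumes the original file and the re-written file are both available as text; the function name round_trip_diff is hypothetical. It lists only the lines that were lost or added.

```python
import difflib


def round_trip_diff(original: str, regenerated: str) -> list[str]:
    """Return the lines that differ between the original and round-tripped text.

    Lines prefixed with "- " exist only in the original,
    lines prefixed with "+ " exist only in the regenerated output.
    """
    diff = difflib.ndiff(original.splitlines(), regenerated.splitlines())
    # ndiff also emits unchanged lines ("  ") and hint lines ("? "); drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]
```

An empty result means the file survived the round trip unchanged; anything else is a candidate for a MAJOR or minor entry in the lists below.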

BioPerl does not have parsers for these formats:

  • ASN.1
  • genbank XML
  • INSD XML

fasta

  • minor: the length of the sequence line can vary (settable using the method Bio::SeqIO::fasta::width())
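
Because line width is only a presentation detail, a comparison can ignore it by joining the wrapped sequence lines before comparing. The following is a rough Python sketch of that idea, not part of BioPerl; normalize_fasta is a hypothetical helper name.

```python
def normalize_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs, joining wrapped lines."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Two files that differ only in sequence line width give the same list of records.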

embl

  • MAJOR: sequence name and accession lost in conversion
    • The downloaded sequence file was mysteriously mangled. New file uploaded. Parser works.
    • Note: EMBL format does not have a separate name any more. The primary accession number is now the name on the ID line.
  • MAJOR: OX line for TaxId is lost
  • minor: only the actual data on the DT (date) line is kept
    DT   27-FEB-1998 (Rel. 54, Created)
    DT   14-NOV-2006 (Rel. 89, Last updated, Version 6)
    ->
    DT   27-FEB-1998
    DT   14-NOV-2006
    
  • minor: The RC (Reference Comment) lines in the Reference section are ignored.
    RC   revised by [4]
    
  • minor: Word-wrapping differences in free-text lines, especially in author lists
  • minor: the feature key/value pairs (FT) are not returned in order
  • minor: SQ line does not contain CRC32 value
    • note: there is a method for CRC64 in Bio::SeqIO::swiss::_crc64

genbank

  • MAJOR: SOURCE line adds full stop to the end of the line (following old genbank conversion?)
  • minor: the BASE COUNT line is not present in recent GenBank files but is still generated by BioPerl
  • minor: features are not returned in order
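
Since feature order is only a minor issue, a round-trip check may want to confirm that no feature was lost while ignoring the order. A small Python sketch of such an order-insensitive comparison follows. The tuple layout (key, location, qualifiers) is an assumption made for this example, not a BioPerl data structure.

```python
from collections import Counter

# Assumed layout: (feature key, location string, tuple of (qualifier, value) pairs).
Feature = tuple[str, str, tuple[tuple[str, str], ...]]


def same_features_ignoring_order(a: list[Feature], b: list[Feature]) -> bool:
    """True when both lists hold the same features, duplicates included."""
    # Counter compares as a multiset, so order is ignored but counts are not.
    return Counter(a) == Counter(b)
```

If this returns True while a line-by-line diff still reports differences in the feature table, the difference is one of ordering only.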

swiss-prot

  • minor: No full stop at the end of the DT lines
  • MAJOR: GN line returns only the values from the key/value pairs, e.g.:
    GN   Name=DOF3.7; Synonyms=BBFA, DAG1;...   
    ->  
    GN   DOF3.7 OR BBFA OR DAG1 ...
    
  • minor: OC line word wrapping differences
  • minor: extra spaces at the end of the first RT line when there is more than one RT line
  • MAJOR: RX line: DOI key/value pair lost
  • MAJOR: PE (evidence) line returned between CC and DR lines when it should be between DR and KW lines
  • minor: extra space after first FT line
  • minor: FTid sometimes not written on its own line
  • minor: extra space written to the end of the sequence line
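
The GN problem above is a loss of keys: once the line is flattened to "DOF3.7 OR BBFA OR DAG1", the Name/Synonyms distinction cannot be recovered. As a rough sketch of a key-preserving approach, the Python code below parses a single-gene GN line into a mapping and writes it back. The names parse_gn and format_gn are hypothetical, and the sketch deliberately ignores continuation lines, multiple genes and evidence tags, so it is not a full parser.

```python
def parse_gn(line: str) -> dict[str, list[str]]:
    """Parse 'GN   Key=v1, v2; Key2=v3;' into {key: [values]} (single gene only)."""
    if not line.startswith("GN"):
        raise ValueError(f"not a GN line: {line!r}")
    fields: dict[str, list[str]] = {}
    for part in line[2:].split(";"):
        part = part.strip()
        if not part:
            continue  # skip the empty piece after the final ';'
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {part!r}")
        fields[key.strip()] = [v.strip() for v in value.split(",") if v.strip()]
    return fields


def format_gn(fields: dict[str, list[str]]) -> str:
    """Write the mapping back in key=value form, keeping insertion order."""
    parts = [f"{key}={', '.join(values)};" for key, values in fields.items()]
    return "GN   " + " ".join(parts)
```

With the visible part of the example line, parse_gn("GN   Name=DOF3.7; Synonyms=BBFA, DAG1;") keeps "Name" and "Synonyms" as separate keys, and format_gn reproduces the input line.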