Version 6 (modified by heikki, 17 years ago)
BioPerl Round Trip, Second Pass
This pass looks into the problems identified in the first pass and solves the major ones when possible. The tag MAJOR marks a problem that should be solved if possible; minor comments are for logging only.
Based on bioperl-live SVN revision 14501.
BioPerl does not have parsers for these formats:
- ASN.1
- genbank XML
- INSD XML
fasta
- minor: the length of the sequence line can vary (settable using method Bio::SeqIO::fasta::width() )
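The width difference is cosmetic: two FASTA files wrapped at different widths carry the same sequence. The following is a minimal Python sketch of that idea, not BioPerl code; the function names `format_fasta` and `read_fasta_sequence` are hypothetical and exist only for illustration.

```python
def format_fasta(name, sequence, width=60):
    """Return a FASTA record with sequence lines wrapped at `width` characters."""
    if width < 1:
        raise ValueError("width must be a positive integer")
    lines = [">" + name]
    # Slice the sequence into consecutive chunks of at most `width` residues.
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


def read_fasta_sequence(text):
    """Join all non-header lines, so the line width no longer matters."""
    return "".join(
        line.strip() for line in text.splitlines() if not line.startswith(">")
    )


seq = "ACGT" * 25  # 100 residues
narrow = format_fasta("demo", seq, width=50)
wide = format_fasta("demo", seq, width=80)
```

A round-trip comparison that joins the sequence lines before comparing, as `read_fasta_sequence` does here, would not flag this difference at all.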
embl
- MAJOR: sequence name and accession lost in conversion
- The downloaded sequence file was mysteriously mangled. A new file was uploaded, and the parser works.
- Note: EMBL format does not have a separate name any more. The primary accession number is now the name on the ID line.
- MAJOR: OX line for TaxId is lost
- Another error caused by the mangled entry
- The source feature (FT key) contains a db_xref qualifier pointing to the taxon
- minor: only the actual date on the DT (date) line is kept; release information and document version are lost.
- Note: We now track only the sequence version from the ID line, not the document version from the second DT line.
- Note: BioSQL can store both versions. Should we update the BioPerl Bio::Seq::RichSeqI API to have document version, too?
DT 27-FEB-1998 (Rel. 54, Created)
DT 14-NOV-2006 (Rel. 89, Last updated, Version 6)
->
DT 27-FEB-1998
DT 14-NOV-2006
- minor: The RC (Reference Comment) lines in the Reference section are ignored.
RC revised by [4]
- minor: Word wrapping differences in free text lines, especially in author lists
- minor: the feature key/value pairs (FT) are not returned in order
- minor: SQ line does not contain CRC32 value
- the current EMBL format does not use CRC32 any more!
- note: there is a method for CRC64 in Bio::SeqIO::swiss::_crc64
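To make the DT line loss noted in this section concrete, here is a minimal Python sketch that splits a DT line into its parts. It is not BioPerl code: the pattern is inferred only from the two example lines in these notes, and the names `DT_PATTERN` and `parse_dt` are hypothetical.

```python
import re

# Assumed DT layout, taken from the example lines in these notes:
#   DT 27-FEB-1998 (Rel. 54, Created)
#   DT 14-NOV-2006 (Rel. 89, Last updated, Version 6)
DT_PATTERN = re.compile(
    r"^DT\s+(?P<date>\d{2}-[A-Z]{3}-\d{4})"
    r"(?:\s+\(Rel\. (?P<release>\d+), (?P<event>[^,)]+)"
    r"(?:, Version (?P<version>\d+))?\))?\s*$"
)


def parse_dt(line):
    """Return date, release, event and document version from one DT line.

    Fields missing from the line are returned as None.
    """
    match = DT_PATTERN.match(line)
    if match is None:
        raise ValueError("unrecognised DT line: %r" % line)
    fields = match.groupdict()
    # Convert the numeric fields, leaving absent ones as None.
    for key in ("release", "version"):
        if fields[key] is not None:
            fields[key] = int(fields[key])
    return fields


original = parse_dt("DT 14-NOV-2006 (Rel. 89, Last updated, Version 6)")
round_tripped = parse_dt("DT 14-NOV-2006")
```

Comparing `original` with `round_tripped` shows which fields an object model would need to carry (release, event, document version) for the DT lines to survive a round trip unchanged.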
genbank
- MAJOR: SOURCE line adds a full stop to the end of the line (following old genbank convention?)
- fixed: SVN revision 14502
- MAJOR: BASE line is not present in the recent genbank format, but BioPerl still generates it
- This should be a safe line to drop, because GenPept has never had it and all (most?) parsers can deal with both.
- fixed: SVN revision 14503
- minor: features are not returned in order
swiss-prot
- minor: No full stop at the end of the DT lines
- MAJOR: GN line returns only the values from the key/value pairs, e.g.
GN Name=DOF3.7; Synonyms=BBFA, DAG1;... -> GN DOF3.7 OR BBFA OR DAG1 ...
- minor: OC line word wrapping differences
- minor: extra spaces at the end of the first RT line when there is more than one RT line
- MAJOR: RX line: DOI key/value pair lost
- MAJOR: PE (evidence) line returned between CC and DR lines when it should be between DR and KW lines
- minor: extra space after first FT line
- minor: FTid sometimes not written on its own line
- minor: extra space written to the end of the sequence line
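The GN problem listed above comes down to discarding the keys when the line is parsed. The Python sketch below, which is not BioPerl code, shows one way to keep them so the line can be written back unchanged. It assumes the simple `Key=value, value;` layout seen in the example and does not handle evidence tags or multi-gene entries; `parse_gn` and `format_gn` are hypothetical names.

```python
def parse_gn(line):
    """Parse a 'GN Key=value1, value2; ...' line into a dict of value lists.

    Key order is preserved (Python dicts keep insertion order).
    """
    if not line.startswith("GN"):
        raise ValueError("not a GN line: %r" % line)
    body = line[2:].strip()
    fields = {}
    for part in body.split(";"):
        part = part.strip()
        if not part:
            continue  # skip the empty piece after the final ';'
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError("expected Key=value, got %r" % part)
        fields[key.strip()] = [v.strip() for v in value.split(",")]
    return fields


def format_gn(fields):
    """Write the parsed fields back in key/value form."""
    parts = ["%s=%s;" % (key, ", ".join(values)) for key, values in fields.items()]
    return "GN   " + " ".join(parts)


line = "GN   Name=DOF3.7; Synonyms=BBFA, DAG1;"
fields = parse_gn(line)
rewritten = format_gn(fields)
```

Because the keys are stored alongside the values, `rewritten` matches the input line, whereas a parser that keeps only the values can at best emit the old `DOF3.7 OR BBFA OR DAG1` style.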