= !BioPerl Round Trip, Second Pass =

Looking into problems identified in the first pass and solving major problems where possible. The tag '''MAJOR''' means that the problem should be solved if possible; '''minor''' comments are for logging only.

Based on bioperl-live SVN revision 14501.

I noticed that it is not enough to read in a database entry and write it out. You must also be able to read the first output back in, write out a second output file, and show that it is identical (or near enough) to the original entry. The similarities of this process to [http://en.wikipedia.org/wiki/Koch's_postulates Koch's postulates] are clear. :)

!BioPerl does not have parsers for these formats:
 * ASN.1
 * genbank XML
 * INSD XML

== fasta ==

 * minor: the length of the sequence line can vary (settable using the method Bio::SeqIO::fasta::width())

== embl ==

 * ~~MAJOR: sequence name and accession lost in conversion~~
   * The downloaded sequence file was mysteriously mangled. New file uploaded. Parser works.
   * '''Note:''' EMBL format does not have a separate name any more. The primary accession number is now the name on the ID line.
 * ~~MAJOR: OX line for !TaxId is lost~~
   * Another error caused by the mangled entry
   * The FT key '''source''' contains qualifier db_xref to the taxon
 * minor: only the actual date on the DT (date) line is kept; release information and document version are lost.
   * '''Note:''' We now track only the sequence version from the ID line, not the document version from the second DT line.
   * '''Note:''' BioSQL can store both versions. Should we update the !BioPerl Bio::Seq::RichSeqI API to have document version, too?
{{{
DT 27-FEB-1998 (Rel. 54, Created)
DT 14-NOV-2006 (Rel. 89, Last updated, Version 6)
->
DT 27-FEB-1998
DT 14-NOV-2006
}}}
 * minor: The RC (Reference Comment) lines in the Reference section are ignored.
{{{
RC revised by [4]
}}}
 * minor: Word wrapping differences in free text lines, especially in author lists
 * minor: the feature key/value pairs (FT) are not returned in order
 * ~~minor: SQ line does not contain CRC32 value~~
   * the current EMBL format does not use CRC32 any more!
   * note: there is a method for CRC64 in Bio::SeqIO::swiss::_crc64

== genbank ==

 * ~~MAJOR: SOURCE line adds full stop to the end of the line (following old genbank convention?)~~
   * fixed: SVN revision 14502
 * ~~MAJOR: line BASE not present in recent genbank format, but it is still generated by !BioPerl~~
   * This should be a safe line to drop, because GenPept has never had it and all (most?) parsers can deal with both.
   * fixed: SVN revision 14503.
 * minor: features are not returned in order

== swiss-prot ==

 * ~~MAJOR: No full stop at the end of the DT lines~~
   * This is actually important to have. The parser will not grab the sequence version without the stop character.
   * Changed the parser to work regardless of the end stop character
   * write_seq now writes out the stops
   * fixed: SVN revision 14504.
 * ~~MAJOR: PE (evidence) line returned between CC and DR lines when it should be between DR and KW lines~~
   * fixed: SVN revision 14505.
 * ~~MAJOR: RX line: DOI key/value pair lost~~
   * The parser regexps depend on the order of references. It will be easier to maintain without this restriction.
   * The parser does not take into account that any of the refs can be missing
   * Rewrote RX line parsing and writing
   * fixed: SVN revision 14505.
 * MAJOR: GN line returning only values from key/value pairs, e.g.
{{{
GN Name=DOF3.7; Synonyms=BBFA, DAG1;...
->
GN DOF3.7 OR BBFA OR DAG1 ...
}}}
 * MAJOR: Extra spaces and a stop added to FT HELIX and STRAND lines:
{{{
FT STRAND 910 913
->
FT STRAND 910 913 .
}}}
 * minor: OC line word wrapping differences
 * minor: extra spaces at the end of the first RT line when there is more than one of them
 * minor: extra space after first FT line
 * minor: FTid sometimes not written on its own line
 * minor: extra space written to the end of the sequence line
 * '''Note:''' All these extra spaces at the end of the line come from _write_line_swissprot_regex(). Check if this can be fixed!
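
== comparing the outputs ==

The "identical (or near enough)" check between the original entry and the second output can be partly automated. What follows is only a minimal sketch in Python, not part of !BioPerl: the helper names `normalize` and `round_trip_diff` are made up for illustration, and it assumes that trailing whitespace (such as the extra spaces logged above) should be ignored while every other difference is reported.

```python
import difflib
from typing import List


def normalize(entry: str) -> List[str]:
    """Split an entry into lines, dropping trailing whitespace and trailing blank lines."""
    lines = [line.rstrip() for line in entry.splitlines()]
    while lines and not lines[-1]:
        lines.pop()
    return lines


def round_trip_diff(original: str, second_output: str) -> List[str]:
    """Return the removed ('-') and added ('+') lines between two entries.

    Hypothetical helper: whitespace at the end of a line is not counted
    as a difference; anything else is.
    """
    diff = list(
        difflib.unified_diff(
            normalize(original),
            normalize(second_output),
            fromfile="original",
            tofile="second_pass",
            lineterm="",
            n=0,  # no context lines, only the changes themselves
        )
    )
    # The first two lines are the file headers; '@@' lines are hunk markers.
    return [line for line in diff[2:] if not line.startswith("@@")]


if __name__ == "__main__":
    before = "DT 27-FEB-1998 (Rel. 54, Created)\nFT STRAND 910 913\n"
    after = "DT 27-FEB-1998\nFT STRAND 910 913   \n"
    for line in round_trip_diff(before, after):
        print(line)
```

With the two sample entries in the script, only the DT line is reported; the FT line differs in trailing spaces alone and is treated as unchanged. Word wrapping and feature order differences would still show up and have to be judged by eye.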