In-silico PCR
From CLAB
CLAB has been involved in several in-silico PCR projects in 2008.
Contents |
[edit] RT376
You can read the details and grab the source code below. This is a summary of the current state of this request and solution to date. --Jhannah 15:53, 29 September 2008 (CDT)
[edit] Request
Given this set of PCR primers
id => 1, name => 'First primers', forward => "aaaccgacgacatcgaccacttc", reverse => "tacggcgtttcgatgaacccgaac", id => 2, name => 'IS 6110', forward => "gccgatctcgtccagcgc", reverse => "gcgtaggcgtcggtgacaaa", id => 3, name => 'IS 1081', forward => "tcgcgtgatccttcgaaacg", reverse => "gcgagttggtcagccagaa", id => 4, name => 'Interesting 1', forward => "aggtcgacgacatcgaccacttc", reverse => "tacggcgtctcgatgaagccgaac",
Search all known bacterial genomes and locate all PCR products of lengths 70-520. For each product found, report several GenBank headers and all annotation information available for every feature overlapping any product found.
[edit] Methodology
Our data source is 29GB of plain text GenBank files sitting on a Linux server. (klab.ist.unomaha.edu:/sequences/genbank/167-20080901). This dataset is refreshed from the NCBI FTP site as frequently as once a month, as new releases become available.
The program do_stuff.pl creates a single table SQLite database to store all hits found. BioPerl is used to parse each GenBank file, one file at a time, one sequence at a time. A regular expression is run against each sequence, searching for each forward and reverse primer of interest, in both strands. (Other projects using this same software also leverage BioPerl for IUPAC ambiguity code expansion so primers can contain gaps or ambiguities. This specific ticket doesn't currently use that functionality, but could.)
Each of these hits is inserted into the SQLite database. A run across all bacterial sequences takes several hours.
Once all hits have been found, they are examined for products. A product occurs when the forward and reverse primers are oriented correctly in the same sequence. The program products.pl reads through all hits in the database, locating all products of the desired length. For each product found, the desired GenBank header information is reported.
The GenBank information is reported by the program subseq.pl which uses seqget.pl to re-fetch the sequence from the original text data source. Using BioPerl for the parsing again, subseq.pl prints a series of headers from the sequence, and then walks through each feature annotated on the sequence. For each feature that overlaps any of the products found, all available information about that feature is printed.
[edit] Next steps
The current (Sep 22, 2008) output of the programs is functional and accurate (as far as I know), but ugly. Fancy HTML / CSS / Javascripting of the output may make the result set easier to consume for the scientist using a web browser. I left off wondering if I should convert subseq.pl output to an XML format, using Template Toolkit to produce pretty reports. (SVN r171 started that conversion, which I'm not positive is the way to go and haven't circled back to this project again yet.)
Other than "looks cool" we haven't received any feedback about the software from the requestors yet.
--Jhannah 16:36, 29 September 2008 (CDT)
[edit] Links
- RT376 username: guest password: guest
- on the web
- source code:
svn co https://clabsvn.ist.unomaha.edu/anonsvn/CLAB/RT376
- A CLOSER LOOK INTO THE rRNA GENE COMPLEX OF PATHOGENIC MEMBERS OF ACTINOBACTERIA by Laura E. Heuermann
- RT262 is an older ticket which was superseded by RT376.
- UCSC In-silico PCR is very cool, and runs instantly instead of my process which takes several hours. But it only does one primer set at a time, and doesn't let you set up your own arbitrary database to search. (It appears the source code is available so you could host your own version with custom datasets. And looping primersets to the tool should be trivial. But there's still no annotation of overlapping features report...? But you could write one based on the output of the UCSC engine instead of my engine...?)

