RT11
From CLAB
This page is for discussion of RT 11
GVRG: Build a web-based search driven by any keyword across all available data
Problem
RT 10 describes a search function that is driven by gene ID. This project extends RT 10 to enable searching by any keyword or phrase (e.g. "isomerase") across all available data sources.
Design
- Choose an index engine
- Perhaps Swish-e? It also has Perl hooks.
- I'm feeling pretty good about Solr so far. It's a Lucene derivative.
- Looks like the GMOD Java folks have done some work for me: LuceGene
- Uses Readseq, a rather powerful-looking Java sequence reader / reformatter.
- EB-eye Search also has potential. I emailed their help desk asking whether their content-indexing source code is available. No response. I added a complaint to the LuceGene homepage.
- Also see LuceGene homepage for Lucene search at Uniprot.
- Hmm... Lucene doesn't natively do numeric fields? eek!
- But Solr does. phew!
- Other Bio search engines
- Write programs to build the index from all available data sources (LuceGene has done the hard work already?)
- GenBank: /home/jhannah/apache2/htdocs/gbrowse/databases/misc/*gbk
- "10 cluster RNA I+II+III PROTECTED pwdchip 04-25-06.xls"
- Extend this tool to use the search engine if the search string isn't a gene ID: http://kiran.homelinux.net:8081/cgi-bin/gvrg.pl
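To make the last two design items concrete, here is a rough sketch of the intended flow: build a keyword index from GenBank records, then fall back to keyword search whenever the search string isn't a gene ID. It is not Solr or LuceGene, just a toy in-memory inverted index in Python standing in for whichever engine gets chosen. The gene ID pattern, the function names, and the two sample records are all made up for illustration; the parser only reads the LOCUS and DEFINITION lines and is no substitute for Readseq.

```python
import re
from collections import defaultdict

# ASSUMPTION: hypothetical gene ID shape (letters, optional underscore, digits).
# The real RT 10 gene ID format is not specified on this page.
GENE_ID_RE = re.compile(r"^[A-Za-z]+_?\d+$")


def parse_genbank(text):
    """Yield (locus, definition) pairs from GenBank flat-file text."""
    locus, definition, in_def = None, [], False
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            locus = line.split()[1]
            in_def = False
        elif line.startswith("DEFINITION"):
            definition = [line[len("DEFINITION"):].strip()]
            in_def = True
        elif in_def and line.startswith(" "):
            # Indented line: continuation of a multi-line DEFINITION.
            definition.append(line.strip())
        elif line.startswith("//"):
            # "//" terminates a GenBank record.
            if locus is not None:
                yield locus, " ".join(definition)
            locus, definition, in_def = None, [], False
        else:
            in_def = False


def build_index(records):
    """Map each lowercased word to the set of record IDs containing it."""
    index = defaultdict(set)
    for rec_id, text in records:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(rec_id)
    return index


def keyword_search(index, phrase):
    """Return IDs of records containing every word of the phrase (AND search)."""
    words = re.findall(r"[a-z0-9]+", phrase.lower())
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits


def search(index, known_ids, query):
    """Use gene ID lookup when the query is a known ID, else keyword search."""
    query = query.strip()
    if GENE_ID_RE.match(query) and query in known_ids:
        return ("gene_id", {query})
    return ("keyword", keyword_search(index, query))


# Invented sample records, for illustration only.
SAMPLE = """\
LOCUS       ABC_0001     1200 bp    DNA     linear   BCT 01-JAN-2006
DEFINITION  Hypothetical organism peptidyl-prolyl cis-trans
            isomerase gene, complete cds.
//
LOCUS       ABC_0002      900 bp    DNA     linear   BCT 01-JAN-2006
DEFINITION  Hypothetical organism DNA polymerase gene.
//
"""

records = list(parse_genbank(SAMPLE))
index = build_index(records)
known_ids = {rec_id for rec_id, _ in records}
```

With this in place, `search(index, known_ids, "isomerase")` takes the keyword path and `search(index, known_ids, "ABC_0001")` takes the gene ID path. A real engine such as Solr would replace `build_index` and `keyword_search`, and would add ranking, stemming, and numeric fields; only the dispatch in `search` would stay roughly the same.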

