MotifHound

MotifHound Algorithm

1. Download

MotifHound is distributed as a tar archive, which you can download here. After downloading, extract the archive and enter the resulting directory:

> tar -xvzf MotifHound_130801.tar.gz
> cd MotifHound/

2. Third-party libraries

For working with S. cerevisiae and H. sapiens datasets, we also provide precomputed data on disorder, Pfam domains, evolution, and functional descriptions. These data are not required to run MotifHound, but they are recommended for using it to its full potential. They can be downloaded here and copied into the "data" directory of MotifHound.

MotifHound depends on the following programs and libraries, which must be installed:

"Judy arrays" is a C library that can be installed on Ubuntu systems with the following command:

> sudo apt-get install libjudydebian1

Alternatively, the source of this library can be downloaded here.

BLAST is required if masking of homologous regions is desired. It can be installed on Ubuntu systems with the following command:

> sudo apt-get install blast2

Alternatively, please refer to the NCBI BLAST pages for other installation instructions.

Some Perl modules are also necessary. To install them, we recommend the following commands:

> sudo apt-get install perl cpanminus

followed by:

> sudo cpanm Tk Tk::TableMatrix File::Basename File::Copy List::MoreUtils Cwd Getopt::Long FindBin Benchmark
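Before moving on, it can be useful to confirm that the Perl modules listed above actually load. The following is a minimal sketch and not part of MotifHound: the function name check_perl_modules is hypothetical, and it relies only on the standard "perl -MModule -e 1" idiom, which exits with a non-zero status when a module cannot be loaded.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with MotifHound): report Perl modules
# that cannot be loaded. Prints one "missing: <module>" line per failure
# and returns 1 if at least one module is missing, 0 otherwise.
check_perl_modules() {
    cpm_status=0
    for cpm_module in "$@"; do
        # "perl -M<module> -e 1" loads the module and runs an empty program.
        if ! perl -M"$cpm_module" -e 1 >/dev/null 2>&1; then
            echo "missing: $cpm_module"
            cpm_status=1
        fi
    done
    return "$cpm_status"
}

# Example call, using module names from the cpanm command above:
# check_perl_modules Tk Tk::TableMatrix List::MoreUtils Getopt::Long
```

If the function prints nothing, all listed modules were found. Any module reported as missing can be installed again with cpanm.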

3. Running the program

Once these dependencies are installed, you can run MotifHound with the following command, adjusting the options as needed:

> perl ./Scripts/MotifHound.pl --Setfile ./Data/Seq/Set/YEAST_Set_TEST.faa --Proteome ./Data/Seq/Proteome/YEAST_Proteome_TEST.fasta --Size 3 10 --Scan --WD ./Results --H --Blastfile ./Data/Blast/Blast_YEAST_Proteome.blast --D --Disofile ./Data/Disorder/YEAST_Proteome_DISORDER.dat --Pfam_annot ./Data/Domains/YEAST_Proteome_Pfam_Domains.txt --Gene_annot ./Data/Genes/YEAST.data --HTML
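Because the command above references several input files, a small pre-flight check can catch a wrong path before a run starts. The sketch below is an illustration only: require_files is a hypothetical helper, not a MotifHound feature, and the paths in the usage comment are simply those from the example command.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with MotifHound): verify that every
# argument is a readable regular file. Names each problem path on stderr
# and returns 1 if any path fails the check, 0 otherwise.
require_files() {
    rf_missing=0
    for rf_path in "$@"; do
        if [ ! -f "$rf_path" ] || [ ! -r "$rf_path" ]; then
            echo "not found: $rf_path" >&2
            rf_missing=1
        fi
    done
    return "$rf_missing"
}

# Example call, using paths from the command above; MotifHound is only
# started when both input files are present:
# require_files ./Data/Seq/Set/YEAST_Set_TEST.faa \
#               ./Data/Seq/Proteome/YEAST_Proteome_TEST.fasta \
#   && perl ./Scripts/MotifHound.pl \
#        --Setfile ./Data/Seq/Set/YEAST_Set_TEST.faa \
#        --Proteome ./Data/Seq/Proteome/YEAST_Proteome_TEST.fasta \
#        --Size 3 10 --Scan --WD ./Results
```

The same check can be extended to the BLAST, disorder, Pfam, and gene annotation files when the corresponding options are used.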

To display the help:

> perl ./Scripts/MotifHound.pl --help

Benchmark data

We benchmarked MotifHound by creating datasets of protein sequences from S. cerevisiae into which we spiked known motifs. The spiked-in motifs vary in length, number of defined positions, and number of repeats. To exhaustively cover the combinations of these three parameters, we created 11,880 datasets. These datasets are available for download, and we hope they will help in the development of future motif discovery algorithms.