Cacao Example Project

This example shows how cacao ESTs from GenBank can be sorted into usable and unusable sequences with the PyMood Sequence Processor.

Processing and Masking Cacao ESTs

Cacao EST sequences were retrieved from GenBank at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=nucleotide using the following Boolean string: txid3641[Organism] AND EST[PROP]

We downloaded the 6557 ESTs into a single FASTA file and saved it as cacao.fasta.

Step 1. Blast for Sequence Processor

In this step the query FASTA file is compared against a reference FASTA file that contains undesired sequences.

We selected:

"blastn" (nucleotides vs nucleotides)
12 for maximum number of hits
3 for maximum number of alignments
1e-5 (1 x 10^-5) for the expectation value cutoff
OFF for Filter
query FASTA file: cacao.fasta
reference target FASTA file: Vector_M_ATGC_R.fasta
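For readers who want to see these settings outside the graphical launcher, the sketch below assembles a roughly equivalent argument list for the legacy NCBI blastall program. It is an illustration only: this page does not say which BLAST binary PyMood calls, and mapping "maximum number of hits" to -v and "maximum number of alignments" to -b is an assumption. The function name is hypothetical, and the reference file would first have to be formatted as a BLAST database (for example with formatdb).

```python
def build_blastall_command(query, reference, out_path,
                           evalue=1e-5, max_hits=12, max_alignments=3,
                           use_filter=False):
    """Return an argument list for the legacy NCBI blastall program.

    Assumption: the "hits" and "alignments" settings correspond to
    blastall's -v (one-line descriptions) and -b (alignments) options.
    """
    return [
        "blastall",
        "-p", "blastn",                    # nucleotides vs nucleotides
        "-i", query,                       # query FASTA file
        "-d", reference,                   # formatted reference database
        "-e", str(evalue),                 # expectation value cutoff
        "-v", str(max_hits),               # assumed: maximum number of hits
        "-b", str(max_alignments),         # assumed: maximum number of alignments
        "-F", "T" if use_filter else "F",  # low-complexity filter on/off
        "-o", out_path,                    # raw BLAST report
    ]


command = build_blastall_command(
    "cacao.fasta",
    "Vector_M_ATGC_R.fasta",
    "cacao_vs_Vector_M_ATGC_R.blastn",
)
print(" ".join(command))
```

The list is only built and printed here, not executed, so the sketch can be read without BLAST installed.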

The BLAST run can take from a few minutes to many hours, depending on the size of the files and the speed of the computer. The 6557 cacao ESTs are typically BLASTed against the supplied reference file "Vector_M_ATGC_R.fasta" in a few minutes.

Upon completion of the BLAST run, PyMood BLAST Launcher / Parser produced the following nine files:

cacao.annotation
Vector_M_ATGC_R.annotation
cacao_vs_Vector_M_ATGC_R.blastn
cacao_vs_Vector_M_ATGC_R.blastn.matrix
cacao_vs_Vector_M_ATGC_R.blastn.matrix.subj.annotation
cacao_vs_Vector_M_ATGC_R.blastn.matrix.all_hits
cacao_vs_Vector_M_ATGC_R.blastn.matrix.blast_stat
cacao_vs_Vector_M_ATGC_R.blastn.matrix.info1
cacao_vs_Vector_M_ATGC_R.blastn.matrix.info2
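The nine names above follow a visible pattern built from the query and reference file names. The sketch below reproduces that pattern so the expected outputs can be checked after a run. The pattern is inferred from this list only and is not a documented PyMood rule; the function name is hypothetical.

```python
from pathlib import Path

# Suffixes appended to "<query>_vs_<reference>.<program>.matrix",
# in the order the files are listed above.
MATRIX_SUFFIXES = ["", ".subj.annotation", ".all_hits",
                   ".blast_stat", ".info1", ".info2"]


def expected_blast_outputs(query_fasta, reference_fasta, program="blastn"):
    """Return the output file names inferred from the example run."""
    query = Path(query_fasta).stem
    reference = Path(reference_fasta).stem
    run = f"{query}_vs_{reference}.{program}"
    names = [f"{query}.annotation", f"{reference}.annotation", run]
    names += [f"{run}.matrix{suffix}" for suffix in MATRIX_SUFFIXES]
    return names


names = expected_blast_outputs("cacao.fasta", "Vector_M_ATGC_R.fasta")
for name in names:
    print(name)
```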

A detailed description of these file types is available on our PyMood BLAST Launcher / Parser page.

Step 2. Sequence Processor

In this step the new FASTA files and the tab-delimited summary files are produced.

Here we made the following selections for cutoffs:

60 nucleotides

12%

20% to 80%

All sequences that meet these first three options are written to a ".good.fasta" file; those that do not are written to a ".bad.fasta" file.
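The good/bad split can be pictured with a minimal sketch. It applies only one criterion, and it assumes that the 60-nucleotide cutoff is a minimum sequence length. What the two percentage cutoffs measure is not stated here, so they are left out rather than guessed. The function name is hypothetical and this is not PyMood's own code.

```python
def split_by_min_length(records, min_length=60):
    """Partition (header, sequence) pairs into good and bad lists.

    Assumption: the 60-nucleotide cutoff is a minimum length.
    """
    good, bad = [], []
    for header, sequence in records:
        target = good if len(sequence) >= min_length else bad
        target.append((header, sequence))
    return good, bad


# Two made-up records: one of 80 nucleotides, one of 20.
records = [("est_1", "ACGT" * 20), ("est_2", "ACGT" * 5)]
good, bad = split_by_min_length(records)
print(len(good), "good;", len(bad), "bad")
```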

The last two options affect only ".good.fasta" sequences. We selected:

To produce the masked file, the Sequence Processor compares the query FASTA file with the corresponding .all_hits file.

Here we selected:
cacao.fasta as the query file for processing.
cacao_vs_Vector_M_ATGC_R.blastn.matrix.all_hits as the corresponding .all_hits file.

The processing of this particular combination took a few seconds.

Output Files produced by PyMood Sequence Processor

cacao.proc.stat – A tab-delimited file with data on sequence composition for each sequence in the original query cacao.fasta file.

cacao.proc.all.fasta – The original query fasta file formatted so that each sequence is in one line.
cacao.proc.bad.fasta – A new fasta file containing only sequences that have not passed the first three selected options in the Sequence Processor.
cacao.proc.good.fasta – A new fasta file containing only sequences that have passed the first three selected options in the Sequence Processor.
cacao.proc.good.masked.fasta – A new fasta file containing only sequences that have passed all selected options in the Sequence Processor and have the undesired parts masked with the masking letter.
cacao.proc.good.maskedx – A new fasta file containing sequences that passed the first three selected options in the Sequence Processor but have not passed the last option.
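Two of the transformations described above, writing each sequence on one line and replacing an undesired region with a masking letter, can be sketched as follows. The masking letter "X", the 0-based end-exclusive coordinates, and both function names are assumptions for illustration. The page does not specify how PyMood derives the regions from the .all_hits file, so the region is passed in directly.

```python
def fasta_to_one_line(lines):
    """Return FASTA lines with each sequence joined onto a single line."""
    output, chunks = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if chunks:  # flush the previous record's sequence
                output.append("".join(chunks))
                chunks = []
            output.append(line)
        else:
            chunks.append(line)
    if chunks:
        output.append("".join(chunks))
    return output


def mask_region(sequence, start, end, mask_letter="X"):
    """Replace sequence[start:end] (0-based, end exclusive) with mask_letter."""
    start = max(0, start)
    end = min(len(sequence), end)
    if start >= end:
        return sequence
    return sequence[:start] + mask_letter * (end - start) + sequence[end:]


wrapped = [">seq_a", "ACGT", "TTAA", ">seq_b", "GG"]
print(fasta_to_one_line(wrapped))
print(mask_region("ACGTACGT", 2, 5))
```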

cacao.proc.masked.list – A tab-delimited file produced during processing. It contains four columns with information on the cacao sequences that have hits in the target FASTA file:

A. cacao sequence GI ID
B. ID(s) of the hit sequence(s) from the target fasta file Vector_M_ATGC_R.fasta
C. Length of the alignment
D. Length of the cacao sequence
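A file with these four columns can be read with a few lines of Python, for example to see what share of each sequence is covered by the alignment. The sketch assumes one record per line with integer lengths in columns C and D, and keeps column B as a raw string because the separator between multiple hit IDs is not described here. The sample row is made up, and the class and function names are hypothetical.

```python
import csv
import io
from typing import NamedTuple


class MaskedRow(NamedTuple):
    query_id: str          # column A
    hit_ids: str           # column B, kept as written in the file
    alignment_length: int  # column C
    sequence_length: int   # column D


def parse_masked_list(handle):
    """Parse tab-delimited rows, skipping lines without two integer lengths."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) < 4:
            continue
        if not (fields[2].isdigit() and fields[3].isdigit()):
            continue  # e.g. a header line, if the file has one
        rows.append(MaskedRow(fields[0], fields[1],
                              int(fields[2]), int(fields[3])))
    return rows


sample = io.StringIO("id_1\tvec_A\t45\t300\n")  # made-up example row
rows = parse_masked_list(sample)
for row in rows:
    covered = 100 * row.alignment_length / row.sequence_length
    print(row.query_id, f"{covered:.1f}% aligned")
```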

cacao.proc.tab_all – a tab-delimited file produced from the cacao.proc.all.fasta file
cacao.proc.tab_bad – a tab-delimited file produced from the cacao.proc.bad.fasta file
cacao.proc.tab_good – a tab-delimited file produced from the cacao.proc.good.fasta file
These three files have three columns:

A. cacao sequence GI ID
B. sequence length
C. sequence composition
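As a rough illustration of such a three-column row, the sketch below builds one from a sequence ID and its sequence. How PyMood actually encodes the composition column is not shown on this page, so the "A:n,C:n,G:n,T:n,other:n" layout, like the function name, is purely an assumption.

```python
from collections import Counter


def tab_row(seq_id, sequence):
    """Return 'ID<TAB>length<TAB>composition' for one sequence.

    Assumption: composition is reported as per-base counts.
    """
    counts = Counter(sequence.upper())
    other = len(sequence) - sum(counts[base] for base in "ACGT")
    composition = ",".join(f"{base}:{counts[base]}" for base in "ACGT")
    composition += f",other:{other}"
    return "\t".join([seq_id, str(len(sequence)), composition])


row = tab_row("example_id", "ACGTNNAC")
print(row)
```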
