annotate patser-v3e.xml @ 1:4d9823e0f6f7 draft default tip
planemo upload for repository http://stormo.wustl.edu/resources.html
author | mvdbeek |
---|---|
date | Mon, 29 Jun 2015 05:57:00 -0400 |
parents | f9ab3aa3e538 |
children |
rev | line source |
---|---|
1 <tool id="patser-v3e" name="patser" version="0.1.2"> |
2 <description>finds putative transcription factor binding sites</description> |
0 | 3 <requirements> |
4 <requirement type="package" version="v3e">patser</requirement> | |
5 </requirements> | |
6 <stdio> | |
7 <exit_code range="1:" /> | |
8 </stdio> | |
9 | |
10 <command><![CDATA[ | |
11 ## Patser cannot read plain fasta: each sequence name must be followed by its | |
12 ## nucleotide sequence enclosed between backslashes. | |
13 ## We therefore print a backslash line before and after each fasta header, drop the first | |
14 ## (leading) backslash line, and append a closing backslash at the end of the file. | |
15 awk '/>/{print "\\"}1' "$input_fasta"|awk '/>/{print;print "\\";next}1'|tail -n +2 >> special.fa; | |
16 echo "\\" >> special.fa; | |
17 patser-v3e -A a:t "$at" c:g "$gc" -m "$input_matrix" $v -b "$b" $c -d1 -ls "$ls" -f special.fa $p > "$output1" | |
18 ]]></command> | |
19 <inputs> | |
20 <param type="data" name="input_matrix" format="txt" help="Provide alignment matrix file"/> | |
21 <param type="data" name="input_fasta" format="fasta" help="Fasta file with sequence"/> | |
22 <param name="v" type="boolean" label="the matrix is a vertical matrix (default: horizontal matrix)" | |
23 truevalue="-v" falsevalue="" | |
24 help="commandline option -v" /> | |
25 <param name="b" type="integer" label="Correction added to the elements of the alignment matrix" | |
26 value="1" | |
27 help="commandline option -b" /> | |
28 <param name="gc" type="float" label="Enter the GC frequency" | |
29 value="0.25" min="0" max="1" | |
30 help="commandline option -A c:g (value)" /> | |
31 <param name="at" type="float" label="Enter the AT frequency" | |
32 value="0.25" min="0" max="1" | |
33 help="commandline option -A a:t (value)" /> | |
34 <param name="c" type="boolean" label="Also score the complementary sequences" | |
35 truevalue="-c" falsevalue="" checked="true" | |
36 help="commandline option -c: Also score the complementary sequences. The complements are determined by the program and are not explicitly stated in the sequence fasta" /> | |
37 <param name="p" type="boolean" label="print the weight matrix derived from the alignment matrix" | |
38 truevalue="-p" falsevalue="" checked="true" | |
39 help="commandline option -p" /> | |
40 <param name="ls" type="float" label="Lower-threshold score, inclusive" | |
41 value="7" | |
42 help="commandline option -ls" /> | |
43 </inputs> | |
44 <outputs> | |
45 <data name="output1" format="txt" from_work_dir="output.txt" /> | |
46 </outputs> | |
47 <tests> | |
48 <test> | |
49 <param name="input_matrix" value="PWM_training_EcR-USP.txt"/> | |
50 <param name="input_fasta" value="EcR_USP_224.fa"/> | |
51 <output name="output1" file="output.txt" lines_diff="6"/> | |
52 </test> | |
53 </tests> | |
54 <help><![CDATA[ | |
55 | |
56 This wrapper has been written by Marius van den Beek (m.vandenbeek at gmail.com). | |
57 Patser is available from http://stormo.wustl.edu/resources.html . | |
58 | |
59 ------------------------------------------------------------------------------- | |
60 | |
61 The following options can be determined on the command line: | |
62 | |
63 :: | |
64 | |
65 0) -h: print these directions. | |
66 | |
67 1) Matrix options. | |
68 -m filename: (default name is "matrix") file containing the matrix. | |
69 -w: the matrix is a weight matrix (default: alignment matrix) | |
70 -b number: a non-negative number indicating the total number of | |
71 pseudo-counts added to each alignment position (default: 1). | |
72 Before converting an alignment matrix to a weight matrix, the | |
73 total pseudo-counts multiplied by the a priori probability | |
74 (see section 3 below) of the corresponding letter is added | |
75 to each matrix element. | |
76 -v: the matrix is a vertical matrix (default: horizontal matrix). | |
77 -p: print the weight matrix derived from the alignment matrix. | |
78 | |
79 2) -f filename: this file (default: read from the standard input) contains | |
80 the names of the sequences. The corresponding sequence may follow | |
81 its name if the sequence is enclosed between backslashes (\). | |
82 Otherwise, the sequence is assumed to be in a separate file having | |
83 the indicated name. | |
84 | |
85 In the sequences, whitespace, slashes (/), periods, dashes (unless | |
86 part of an integer when the "-i" option is used), and comments | |
87 beginning with ';', '%', or '#' are ignored. When using letter | |
88 characters (i.e., with the "-a" or "-A" alphabet option), integers | |
89 are also ignored so that the sequence file can contain positional | |
90 information. When using integer characters (i.e., with the "-i" | |
91 alphabet option) the integers must be separated by whitespace. | |
92 | |
93 A "-c" preceding the name of a sequence file indicates that the | |
94 corresponding sequence is circular. | |
95 | |
96 3) Alphabet options---the three options in this section are mutually | |
97 exclusive (default: "-a alphabet"). The a priori probabilities mentioned | |
98 below are used when converting an alignment matrix to a weight matrix. | |
99 -a filename: file containing the alphabet and normalization information. | |
100 | |
101 Each line contains a letter (a symbol in the alphabet) followed by an | |
102 optional normalization number (default: 1.0). The normalization is | |
103 based on the relative a priori probabilities of the letters. For | |
104 nucleic acids, this might be the genomic frequency of the bases |
105 or the frequencies observed in the data used to generate the alignment. | |
106 In nucleic acid alphabets, a letter and its complement appear on the | |
107 same line, separated by a colon (a letter can be its own complement, | |
108 e.g. when using a dimer alphabet). Complementary letters may use the | |
109 same normalization number. Only the standard 26 letters are | |
110 permissible; however, when the "-CS" option is used, the alphabet is | |
111 case sensitive so that a total of 52 different characters are possible. | |
112 | |
113 POSSIBLE LINE FORMATS WITHOUT COMPLEMENTARY LETTERS: | |
114 letter | |
115 letter normalization | |
116 | |
117 POSSIBLE LINE FORMATS WITH COMPLEMENTARY LETTERS: | |
118 letter:complement | |
119 letter:complement normalization | |
120 letter:complement normalization:complement's_normalization | |
121 | |
122 -i filename: same as the "-a" option, except that the symbols of | |
123 the alphabet are represented by integers rather than by letters. | |
124 Any integer permitted by the machine is a permissible symbol. | |
125 | |
126 -A alphabet_and_normalization_information: same as "-a" option, except | |
127 information appears on the command line (e.g., -A a:t 3 c:g 2). | |
128 | |
129 4) Alphabet modifiers indicating whether ascii alphabets are case | |
130 sensitive---the two options in this section are mutually exclusive | |
131 with each other and with the "-i" option (default: ascii alphabets are | |
132 case insensitive). | |
133 -CS: ascii alphabets are case sensitive. | |
134 -CM: ascii alphabets are case insensitive, but mark the location | |
135 of lowercase letters by printing a line containing their locations. | |
136 This option is useful when lowercase letters indicate a functional | |
137 landmark such as a transcriptional start in a DNA sequence. | |
138 | |
139 5) Options for adjusting or restricting which information | |
140 and scores are printed. | |
141 The "-ls", "-li", and "-lp" options are mutually exclusive. | |
142 -c: also score the complementary sequences. The complements are | |
143 determined by the program and are not explicitly stated in the | |
144 sequence files. | |
145 -ls number: lower threshold for printing scores, inclusive | |
146 (formerly the -l option). | |
147 -li: assume that the maximum ln(p-value) for printing scores equals | |
148 the negative of the sample-size adjusted information content; | |
149 indirectly determines the lower threshold for printing scores. | |
150 -lp number: the maximum ln(p-value) for printing scores; indirectly | |
151 determines the lower threshold for printing scores. | |
152 -u number: upper threshold for printing scores, exclusive. | |
153 | |
154 -t: just print the top score for each sequence. | |
155 -t number: print the indicated number of top scores for each sequence. | |
156 -ds: if the "-t number" option is used, print the top scores for each | |
157 sequence in the order of decreasing score (default: order the | |
158 scores according to their position within the sequence). | |
159 -e number: the small difference for considering 2 scores equal | |
160 (default: 0.000001) | |
161 | |
162 -s: print the sequence corresponding to each score that is printed. | |
163 | |
164 6) Options indicating how unrecognized symbols are treated (default: -d1). | |
165 Symbols are letters when option "-a" or "-A" is used; | |
166 symbols are integers when option "-i" is used. | |
167 The following three options are mutually exclusive. | |
168 -d0: treat unrecognized symbols as errors and exit the program. | |
169 -d1: treat unrecognized symbols as discontinuities, but print a warning. | |
170 Treating a symbol as a discontinuity means that any L-mer | |
171 containing the unrecognized symbol will be ignored. | |
172 -d2: treat unrecognized symbols as discontinuities, and print NO warning. | |
173 | |
174 7) Options for adjusting the estimation of p-value. | |
175 If the -R option is set to zero, the p-value is not estimated. | |
176 -R number: the range for approximating a column of the weight matrix with | |
177 integers (default: 10000). This number is the difference | |
178 between the largest and smallest integers used to estimate | |
179 the scores. Higher values increase precision, but will take | |
180 longer to calculate the table of p-values. | |
181 -M number: the minimum score for approximating p-values (default: 0). | |
182 Higher values will increase precision, | |
183 but may miss interesting scores. | |
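
Section 2 above allows a sequence to follow its name when it is enclosed between backslashes. The following is a minimal Python sketch of the conversion this wrapper performs with awk before calling patser. The helper name `fasta_to_patser` is hypothetical, and unlike the awk pattern it only treats lines that start with '>' as headers.

```python
def fasta_to_patser(fasta_text: str) -> str:
    """Wrap every fasta record's sequence lines between backslash lines."""
    out = []
    seen_header = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if seen_header:
                out.append("\\")   # close the previous sequence
            out.append(line)       # the header serves as the sequence name
            out.append("\\")       # open the sequence that follows
            seen_header = True
        else:
            out.append(line)       # sequence lines are passed through as-is
    if seen_header:
        out.append("\\")           # close the last sequence
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(fasta_to_patser(">s1\nACGT\n>s2\nTT\nGG\n"), end="")
```

Each record then reads: header line, a backslash line, the sequence lines, and a closing backslash line.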
184 | |
185 | |
186 :: | |
187 | |
188 ---------------------------------------------------------------------- | |
189 | |
190 Copyright 1990, 1994, 1995, 1996, 2000, 2001, 2002 Gerald Z. Hertz | |
191 May be copied for noncommercial purposes. | |
192 | |
193 Author: | |
194 Gerald Z. Hertz | |
195 gzhertz AT alum.mit.edu | |
196 | |
197 PATSER (version 3e) | |
198 | |
199 This program scores the L-mers (subsequences of length L) of the | |
200 indicated sequences against the indicated alignment or weight matrix. | |
201 The elements of an alignment matrix are simply the number of times | |
202 that the indicated letter is observed at the indicated position of a | |
203 sequence alignment. Such elements must be processed before the matrix | |
204 can be used to score an L-mer (e.g., Hertz and Stormo, 1999, | |
205 Bioinformatics, 15:563-577). A weight matrix is a matrix whose | |
206 elements are in a form considered appropriate for scoring an L-mer. | |
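
As an illustration of the scoring described above, here is a minimal Python sketch that slides a window of length L along a sequence and sums the weight-matrix elements. It is not patser's implementation: the function name `score_lmers`, the dictionary layout of the weight matrix, and the 1-based positions are assumptions made for this example. Windows containing an unrecognized symbol are skipped, mirroring the "discontinuity" behaviour of the -d1 and -d2 options.

```python
def score_lmers(sequence, weights, threshold=None):
    """Score every L-mer of `sequence` against a weight matrix.

    weights: dict mapping each lowercase letter to a list of per-position
             weights (assumed layout for this sketch).
    threshold: optional inclusive lower bound, comparable to -ls.
    Returns a list of (position, lmer, score) tuples, position 1-based.
    """
    length = len(next(iter(weights.values())))
    seq = sequence.lower()
    hits = []
    for start in range(len(seq) - length + 1):
        lmer = seq[start:start + length]
        if any(ch not in weights for ch in lmer):
            continue  # unrecognized symbol: ignore this L-mer
        score = sum(weights[ch][i] for i, ch in enumerate(lmer))
        if threshold is None or score >= threshold:
            hits.append((start + 1, lmer, score))
    return hits


if __name__ == "__main__":
    toy = {"a": [1.0, -1.0], "c": [0.0, 0.0], "g": [0.0, 0.0], "t": [-1.0, 1.0]}
    for hit in score_lmers("ATNAT", toy, threshold=2.0):
        print(hit)
```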
207 | |
208 Each element of an alignment matrix is converted to an element of a | |
209 weight matrix by first adding pseudo-counts in proportion to the a | |
210 priori probability of the corresponding letter (see option "-b" in | |
211 section 1 above) and dividing by the total number of sequences plus |
212 the total number of pseudo-counts. The resulting frequency is | |
213 normalized by the a priori probability for the corresponding letter. | |
214 The final quotient is converted to an element of a weight matrix by | |
215 taking its natural logarithm. The use of pseudo-counts here differs | |
216 from previous versions of this program by being proportional to the a | |
217 priori probability. | |
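
The conversion described in this paragraph can be sketched in Python as follows. This is an illustration based only on the description above, not patser's source code; the function name `alignment_to_weights` and the dictionary layout are assumptions. Each element becomes ln(((count + b * p) / (N + b)) / p), where p is the a priori probability of the letter, b the total pseudo-counts, and N the number of sequences.

```python
import math


def alignment_to_weights(counts, priors, pseudo=1.0):
    """Convert an alignment (count) matrix into a log-odds weight matrix.

    counts: dict letter -> list of counts per position (assumed layout).
    priors: dict letter -> relative a priori probability; normalized here.
    pseudo: total pseudo-counts added per position (the -b option).
    """
    total_prior = sum(priors.values())
    probs = {letter: value / total_prior for letter, value in priors.items()}
    length = len(next(iter(counts.values())))
    weights = {letter: [] for letter in counts}
    for pos in range(length):
        n_seqs = sum(counts[letter][pos] for letter in counts)
        for letter in counts:
            # pseudo-counts are added in proportion to the a priori probability
            freq = (counts[letter][pos] + pseudo * probs[letter]) / (n_seqs + pseudo)
            # normalize by the a priori probability and take the natural log;
            # note that math.log raises ValueError if pseudo=0 and the count is 0
            weights[letter].append(math.log(freq / probs[letter]))
    return weights


if __name__ == "__main__":
    counts = {"a": [4], "c": [0], "g": [0], "t": [0]}
    priors = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}
    print(alignment_to_weights(counts, priors, pseudo=1.0))
```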
218 | |
219 Version 3 of this program differs from previous versions by also | |
220 numerically estimating the p-value of the scores. The p-value | |
221 calculated here is the probability of observing a particular score or | |
222 higher at a particular sequence position and does NOT account for the | |
223 total amount of sequence being scored. P-values are estimated by the | |
224 method described in Staden, 1989, CABIOS, p. 89--96. The relative | |
225 value for each element of the weight matrix is approximated by | |
226 integers in a range determined by the "-R" and "-M" options (section 7 |
227 above). The p-value is calculated for each possible integer score and |
228 the values are stored. The actual scores for the sequences are | |
229 determined from the true weight matrix. The true scores are converted | |
230 to their corresponding integer values and their p-values are looked up. | |
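
The idea of approximating the weights by integers and tabulating a p-value for every integer score can be sketched in Python as below. This is a simplified illustration, not patser's algorithm: the name `make_pvalue_lookup` and the `resolution` parameter are hypothetical (only loosely comparable to "-R"), the "-M" cutoff is not modelled, and `probs` is assumed to already sum to 1.

```python
def make_pvalue_lookup(weights, probs, resolution=1000):
    """Build a function mapping a score to P(score or higher) at one position.

    weights: dict letter -> list of per-position weights (assumed layout).
    probs:   dict letter -> a priori probability, assumed to sum to 1.
    """
    letters = list(weights)
    length = len(weights[letters[0]])
    lo = min(min(column) for column in weights.values())
    hi = max(max(column) for column in weights.values())
    step = (hi - lo) / resolution or 1.0  # fall back to 1.0 for a flat matrix
    # integer approximation of every element, relative to the smallest weight
    ints = {a: [round((w - lo) / step) for w in weights[a]] for a in letters}

    # distribution of integer scores, built up one matrix position at a time
    dist = {0: 1.0}
    for pos in range(length):
        nxt = {}
        for total, p in dist.items():
            for a in letters:
                key = total + ints[a][pos]
                nxt[key] = nxt.get(key, 0.0) + p * probs[a]
        dist = nxt

    def pvalue(score):
        # convert the true score to its integer value, then sum the upper tail
        key = round((score - length * lo) / step)
        return sum(p for total, p in dist.items() if total >= key)

    return pvalue


if __name__ == "__main__":
    lookup = make_pvalue_lookup({"a": [1.0, 1.0], "t": [0.0, 0.0]},
                                {"a": 0.5, "t": 0.5}, resolution=1)
    print(lookup(2.0), lookup(1.0), lookup(0.0))
```

A larger `resolution` gives a finer integer grid at the cost of a bigger table, which matches the trade-off stated for the "-R" option.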
231 | |
232 Matrices can be either horizontal or vertical. In a horizontal | |
233 matrix, the columns correspond to the positions within the pattern, | |
234 and the rows correspond to the letters. Each row begins with the | |
235 corresponding letter (or integer, if the "-i" option is used). In a | |
236 vertical matrix, the rows correspond to the positions within the | |
237 pattern, and the columns correspond to the letters. The first row | |
238 contains the letters (or integers, if the "-i" option is used) | |
239 corresponding to each column. In both types of matrices, spaces, | |
240 tabs, and vertical bars (|) are ignored. The output of the "consensus" | |
241 and "wconsensus" programs consists of horizontal alignment matrices. | |
242 | |
243 The input files can contain comments according to the following | |
244 convention. The portion of a line following a ';', '%', or '#' is | |
245 considered a comment and is ignored. Comments can begin anywhere in a | |
246 line and always end at the end of the line. The output of this | |
247 program is sent to the standard output. | |
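
To make the two matrix layouts and the comment convention above concrete, here is a minimal Python sketch of a reader that accepts either orientation. The function name `parse_matrix` and the returned dictionary layout are assumptions for this example; it only handles letter alphabets and does no validation.

```python
def parse_matrix(text, vertical=False):
    """Parse a horizontal or vertical matrix into {letter: [values per position]}."""
    rows = []
    for raw in text.splitlines():
        # anything after ';', '%' or '#' is a comment
        for marker in ";%#":
            raw = raw.split(marker, 1)[0]
        # spaces, tabs and vertical bars only separate fields
        fields = raw.replace("|", " ").split()
        if fields:
            rows.append(fields)
    if vertical:
        # first row names the letters; each later row is one pattern position
        letters = rows[0]
        matrix = {letter: [] for letter in letters}
        for row in rows[1:]:
            for letter, value in zip(letters, row):
                matrix[letter].append(float(value))
        return matrix
    # horizontal: each row starts with its letter, columns are positions
    return {row[0]: [float(value) for value in row[1:]] for row in rows}


if __name__ == "__main__":
    horizontal = "A | 1 2\nC | 0 0 ; a comment\nG | 0 1\nT | 3 1\n"
    vertical = "A C G T\n1 0 0 3\n2 0 1 1\n"
    print(parse_matrix(horizontal) == parse_matrix(vertical, vertical=True))
```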
248 | |
249 | |
250 ]]></help> | |
251 <citations> | |
252 <citation type="doi">10.1093/bioinformatics/15.7.563</citation> | |
253 </citations> | |
254 </tool> |