comparison rnabob.xml @ 1:95756833bc6c draft default tip

Uploaded
author rnateam
date Fri, 09 Jan 2015 14:23:53 -0500
parents 0a63e16e1e84
children
0:0a63e16e1e84 1:95756833bc6c
4 <requirement type="package" version="2.2.1">rnabob</requirement> 4 <requirement type="package" version="2.2.1">rnabob</requirement>
5 </requirements> 5 </requirements>
6 <version_command>echo "2.2.1"</version_command> 6 <version_command>echo "2.2.1"</version_command>
7 <command> 7 <command>
8 <![CDATA[ 8 <![CDATA[
9 rnabob 9 rnabob
10 -q 10 -q
11 $fancy 11 $fancy
12 $compStrands 12 $compStrands
13 $skipOverlapping 13 $skipOverlapping
14 $descriptorFile 14 $descriptorFile
15 $sequenceFile > $stdout 15 $sequenceFile > $stdout
16 ]]> 16 ]]>
17 </command> 17 </command>
18 <stdio> 18 <stdio>
19 <exit_code range="1:" level="fatal" description="Error occurred. Please check Tool Standard Error" /> 19 <exit_code range="1:" level="fatal" description="Error occurred. Please check Tool Standard Error" />
20 <exit_code range=":-1" level="fatal" description="Error occurred. Please check Tool Standard Error" /> 20 <exit_code range=":-1" level="fatal" description="Error occurred. Please check Tool Standard Error" />
21 </stdio> 21 </stdio>
22 <inputs> 22 <inputs>
23 <param name="descriptorFile" type="data" format="txt" multiple="false" label="Motif Descriptor File" help="This file contains the description of the motif for which to search"/> 23 <param name="descriptorFile" type="data" format="txt" multiple="false" label="Motif Descriptor File" help="This file contains the description of the motif for which to search"/>
24 <param name="sequenceFile" type="data" format="fasta" multiple="false" label="Sequence File" help="This file specifies the sequence in which the motif will be searched"/> 24 <param name="sequenceFile" type="data" format="fasta" multiple="false" label="Sequence File" help="This file specifies the sequence in which the motif will be searched"/>
25 <param name="compStrands" type="boolean" truevalue="-c" falsevalue="" checked="false" label="Also search on complementary strands" help="-c : Search both strands of the supplied sequence"/> 25 <param name="compStrands" type="boolean" truevalue="-c" falsevalue="" checked="false" label="Also search on complementary strands" help="-c : Search both strands of the supplied sequence"/>
26 <param name="skipOverlapping" type="boolean" truevalue="-s" falsevalue="" checked="false" label="Skip overlapping matches" help="-s : This is a workaround to avoid a problem in the DNABANK, overlapping matches will be ignored"/> 26 <param name="skipOverlapping" type="boolean" truevalue="-s" falsevalue="" checked="false" label="Skip overlapping matches" help="-s : This is a workaround to avoid a problem in the DNABANK, overlapping matches will be ignored"/>
27 <param name="fancy" type="boolean" checked="false" truevalue="-F" falsevalue="" label="Show Alignments" help="Display full alignments to pattern"/> 27 <param name="fancy" type="boolean" checked="false" truevalue="-F" falsevalue="" label="Show Alignments" help="Display full alignments to pattern"/>
28 </inputs> 28 </inputs>
29 <outputs> 29 <outputs>
30 <data format="txt" name="stdout" label="${tool.name} on ${on_string}" /> 30 <data format="txt" name="stdout" label="${tool.name} on ${on_string}" />
31 </outputs> 31 </outputs>
32 <tests> 32 <tests>
46 <param name="fancy" value="False" /> 46 <param name="fancy" value="False" />
47 <output name="stdout" file="trna.bob" /> 47 <output name="stdout" file="trna.bob" />
48 </test> 48 </test>
49 </tests> 49 </tests>
50 <help> 50 <help>
51 <![CDATA[
51 **What RNABOB does** 52 **What RNABOB does**
52 53
53 RNABOB allows searching a sequence database for RNA structural motifs. 54 RNABOB allows searching a sequence database for RNA structural motifs.
54 The probe motif is specified in a *descriptor* file, 55 The probe motif is specified in a *descriptor* file,
55 which describes its primary sequence, secondary structure, and tertiary constraints. 56 which describes its primary sequence, secondary structure, and tertiary constraints.
57 58
58 ----- 59 -----
59 60
60 **Sequence database format** 61 **Sequence database format**
61 62
62 RNABOB is currently restricted to reading sequence files in FASTA format. 63 RNABOB is currently restricted to reading sequence files in FASTA format.
63 The command line version of RNABOB can also read sequence files in GCG, EMBL, GenBank and other formats. 64 The command line version of RNABOB can also read sequence files in GCG, EMBL, GenBank and other formats.
64 65
65 ----- 66 -----
66 67
67 **Descriptor file syntax** 68 **Descriptor file syntax**
68 69
69 The descriptor file syntax is fairly powerful, and allows a great deal of freedom for specifying 70 The descriptor file syntax is fairly powerful, and allows a great deal of freedom for specifying
70 RNA motifs. The syntax is therefore a bit complicated. 71 RNA motifs. The syntax is therefore a bit complicated.
71 72
72 The descriptor file has two parts: a **topology** description and an **explicit** description. 73 The descriptor file has two parts: a **topology** description and an **explicit** description.
73 74
74 The first non-blank, non-comment line of the file is the topology description. It defines the 75 The first non-blank, non-comment line of the file is the topology description. It defines the
75 order of occurrence of a series of single-stranded, double-stranded and related elements. Each 76 order of occurrence of a series of single-stranded, double-stranded and related elements. Each
76 element must be given a unique name (a number, typically) and must be prefixed with '**s**', 77 element must be given a unique name (a number, typically) and must be prefixed with '**s**',
77 '**h**', or '**r**', indicating single-strand, helical, or a relational element. Helical and 78 '**h**', or '**r**', indicating single-strand, helical, or a relational element. Helical and
78 relational elements are paired to other elements, which are suffixed by a prime, **\'**. 79 relational elements are paired to other elements, which are suffixed by a prime, **\'**.
79 80
80 For example:: 81 For example::
81 82
82 \ 83 \
83 h1 s1 h1' 84 h1 s1 h1'
84 85
85 describes a hairpin loop structure with a simple helix and single-stranded loop. If the helix 86 describes a hairpin loop structure with a simple helix and single-stranded loop. If the helix
86 always contained a non-canonical base pair at one position, the topology could be described as:: 87 always contained a non-canonical base pair at one position, the topology could be described as::
87 88
88 \ 89 \
89 h1 r1 h2 s1 h2' r1' h1' 90 h1 r1 h2 s1 h2' r1' h1'
90 91
91 where r1,r1' indicate a correlation, where the sequence r1 constrains the sequence of r1'. 92 where r1,r1' indicate a correlation, where the sequence r1 constrains the sequence of r1'.
92 (Helices are a special case of this.) 93 (Helices are a special case of this.)
93 94
94 The remaining non-comment, non-blank lines are explicit descriptions of each element in turn. Each 95 The remaining non-comment, non-blank lines are explicit descriptions of each element in turn. Each
95 line contains 3 or 4 fields, separated by tabs or blank space. The first field is the name of the 96 line contains 3 or 4 fields, separated by tabs or blank space. The first field is the name of the
96 element, from the topology description. The second field is the number of mismatches allowed in 97 element, from the topology description. The second field is the number of mismatches allowed in
97 this element. The third field is the primary sequence constraint to apply to this element. 98 this element. The third field is the primary sequence constraint to apply to this element.
98 99
99 Helices and relational element pairs are specified on a single line rather than two. Mismatches 100 Helices and relational element pairs are specified on a single line rather than two. Mismatches
100 and primary sequence constraints are given as pairs, separated by a colon '**:**'. The left side 101 and primary sequence constraints are given as pairs, separated by a colon '**:**'. The left side
101 is the constraint applied to the upstream element, and the right side is applied to the downstream 102 is the constraint applied to the upstream element, and the right side is applied to the downstream
102 element. 103 element.
103 104
104 The primary sequence constraint is given as a sequence of nucleotides. Any IUPAC single-letter 105 The primary sequence constraint is given as a sequence of nucleotides. Any IUPAC single-letter
105 code is recognized, including N if the position can have any base identity. Allowed length 106 code is recognized, including N if the position can have any base identity. Allowed length
106 variations are specified with asterisks ``'*'``, where each ``*`` will allow either 0 or 1 N at 107 variations are specified with asterisks ``'*'``, where each ``*`` will allow either 0 or 1 N at
107 that position. 108 that position.
108 109
109 For example:: 110 For example::
110 111
111 \ 112 \
112 GGAGG******NNNAUG 113 GGAGG******NNNAUG
113 114
114 specifies a GGAGG Shine/Dalgarno site and an AUG initiation codon, separated by a spacer of 3 to 9 115 specifies a GGAGG Shine/Dalgarno site and an AUG initiation codon, separated by a spacer of 3 to 9
115 nucleotides of any sequence. 116 nucleotides of any sequence.
116 117
117 An alternative syntax can be used for very long gaps:: 118 An alternative syntax can be used for very long gaps::
118 119
119 \ 120 \
120 GGAGG[10]NNNAUG is the same as GGAGG**********NNNAUG 121 GGAGG[10]NNNAUG is the same as GGAGG**********NNNAUG
121 122
122 Be careful defining variable length helices and relational elements; if the number and type (gap 123 Be careful defining variable length helices and relational elements; if the number and type (gap
123 or identity) of positions do not match on left and right sides, the program will refuse to accept 124 or identity) of positions do not match on left and right sides, the program will refuse to accept
124 the descriptor. 125 the descriptor.
125 126
126 Relational elements have an additional field which specifies a "transformation matrix" of four 127 Relational elements have an additional field which specifies a "transformation matrix" of four
127 nucleotides, specifying the rule for making the ``r'`` pattern from the ``r`` sequence in order 128 nucleotides, specifying the rule for making the ``r'`` pattern from the ``r`` sequence in order
128 ``A-C-G-T``. For example, the transformation matrix for a simple helix is ``TGCA``; if you allow 129 ``A-C-G-T``. For example, the transformation matrix for a simple helix is ``TGCA``; if you allow
129 ``G-U`` pairs, it is ``TGYR``. RNABOB allows ``G-U`` pairing by default and uses the ``TGYR`` 130 ``G-U`` pairs, it is ``TGYR``. RNABOB allows ``G-U`` pairing by default and uses the ``TGYR``
130 matrix for helical elements. 131 matrix for helical elements.
131 132
132 For example, the explicit description of our hairpin might be: 133 For example, the explicit description of our hairpin might be:
133 134
134 :: 135 ::
135 136
136 \ 137 \
137 h1 0:0 NNN:NNN 138 h1 0:0 NNN:NNN
138 r1 0:0 R:N GNAN 139 r1 0:0 R:N GNAN
139 h2 0:0 **NC:GN** 140 h2 0:0 **NC:GN**
140 s1 0 UUCG 141 s1 0 UUCG
141 142
142 This describes a stem of 6 to 8 base pairs, in which the 4th pair from the bottom of the stem must 143 This describes a stem of 6 to 8 base pairs, in which the 4th pair from the bottom of the stem must
143 be a non-canonical GA pair. Note that, in general, the left side of the primary constraint for 144 be a non-canonical GA pair. Note that, in general, the left side of the primary constraint for
144 helices and relational elements is redundant, and should be given as all N. In some cases it is 145 helices and relational elements is redundant, and should be given as all N. In some cases it is
145 convenient to constrain the right side to require a particular base pair (GU, for instance) at one 146 convenient to constrain the right side to require a particular base pair (GU, for instance) at one
146 position. 147 position.
147 148
148 A note on mismatches: The split format for helices and relational elements works like this. The 149 A note on mismatches: The split format for helices and relational elements works like this. The
149 number on the left constrains the primary sequence match of the left side of the primary 150 number on the left constrains the primary sequence match of the left side of the primary
150 constraint. The number on the right constrains the match of the right side of the primary 151 constraint. The number on the right constrains the match of the right side of the primary
151 constraint, *after* that side has been constructed according to the sequence on the left. In other 152 constraint, *after* that side has been constructed according to the sequence on the left. In other
152 words, the number on the left constrains the mismatches in primary sequence only, while the number 153 words, the number on the left constrains the mismatches in primary sequence only, while the number
153 on the right will constrain the number of mispaired positions in the helix. 154 on the right will constrain the number of mispaired positions in the helix.
154 155
155 Finally: any line that begins with a pound sign '#' is a comment line, and will not be interpreted 156 Finally: any line that begins with a pound sign '#' is a comment line, and will not be interpreted
156 by the pattern compiler. 157 by the pattern compiler.
157 158
158 **Options** 159 **Options**
159 160
160 The behavior of RNABOB can be modified by use of the following options: 161 The behavior of RNABOB can be modified by use of the following options:
161 162
162 *Complement*: Selecting this option will cause RNABOB to search for the pattern also on the 163 *Complement*: Selecting this option will cause RNABOB to search for the pattern also on the
163 complementary strands. 164 complementary strands.
164 165
165 *Skip*: This is a workaround to avoid a problem in the DNABANK. There are some sequences in the 166 *Skip*: This is a workaround to avoid a problem in the DNABANK. There are some sequences in the
166 database which have long stretches of ambiguous sequence (N's). Descriptors with no primary 167 database which have long stretches of ambiguous sequence (N's). Descriptors with no primary
167 sequence constraints will match these garbage sequences at many, many positions, and generate huge 168 sequence constraints will match these garbage sequences at many, many positions, and generate huge
168 outputs. This option toggles a search strategy that skips forward a pattern-length rather than a 169 outputs. This option toggles a search strategy that skips forward a pattern-length rather than a
169 single base when a match is found, thus printing out only a single match when overlapping matches 170 single base when a match is found, thus printing out only a single match when overlapping matches
170 are found. 171 are found.
171 172
172 **Examples** 173 **Examples**
173 174
174 The following example descriptors are included in the source distribution 175 The following example descriptors are included in the source distribution
175 (http://selab.janelia.org/software/rnabob/rnabob.tar.gz): 176 (http://selab.janelia.org/software/rnabob/rnabob.tar.gz):
176 177
177 - trna.des - a general descriptor of a tRNA structure 178 - trna.des - a general descriptor of a tRNA structure
178 - r17.des - descriptor of the consensus binding site for the r17 phage coat protein 179 - r17.des - descriptor of the consensus binding site for the r17 phage coat protein
179 - pseudoknot.des - description of a simple pseudoknotted structure 180 - pseudoknot.des - description of a simple pseudoknotted structure
180 181
181 An example cosmid ``F22B7.fa`` from the *C. elegans* genome sequencing project is also provided 182 An example cosmid ``F22B7.fa`` from the *C. elegans* genome sequencing project is also provided
182 for running these descriptors against. 183 for running these descriptors against.
183 184
184 :: 185 ::
185 186
186 \ 187 \
187 # trna.des 188 # trna.des
188 # 189 #
189 # Generalized descriptor of a tRNA cloverleaf. Doesn't 190 # Generalized descriptor of a tRNA cloverleaf. Doesn't
190 # find them all though. 191 # find them all though.
191 # 192 #
192 193
193 h1 s1 h2 s2 h2' s3 h3 s4 h3' s5 h4 s6 h4' h1' s8 194 h1 s1 h2 s2 h2' s3 h3 s4 h3' s5 h4 s6 h4' h1' s8
194 195
195 h1 0:2 NNNNNNN:NNNNNNN 196 h1 0:2 NNNNNNN:NNNNNNN
196 h2 0:1 *NNN:NNN* 197 h2 0:1 *NNN:NNN*
197 h3 0:1 NNNNN:NNNNN 198 h3 0:1 NNNNN:NNNNN
198 h4 0:1 NNNNN:NNNNN 199 h4 0:1 NNNNN:NNNNN
199 s1 0 TN 200 s1 0 TN
200 s2 0 NNNN********** 201 s2 0 NNNN**********
201 s3 0 N 202 s3 0 N
202 s4 0 NNNNNN* 203 s4 0 NNNNNN*
203 s5 0 NN******************** 204 s5 0 NN********************
204 s6 0 TTC**** 205 s6 0 TTC****
205 s8 0 NCCA 206 s8 0 NCCA
206 207
207 Running RNABOB with ``trna.des`` against ``F22B7.fa`` searches the top strand of the cosmid for 208 Running RNABOB with ``trna.des`` against ``F22B7.fa`` searches the top strand of the cosmid for
208 the above motif. ``trna.des`` hits twice, once on each strand. (F22B7 has several other tRNA genes 209 the above motif. ``trna.des`` hits twice, once on each strand. (F22B7 has several other tRNA genes
209 in it which the pattern fails to detect - this is *not* a pattern to use for tRNA genefinding!). 210 in it which the pattern fails to detect - this is *not* a pattern to use for tRNA genefinding!).
210 </help> 211 ]]>
212 </help>
211 <citations> 213 <citations>
212 <citation type="doi">10.1093/bioinformatics/6.4.325</citation> 214 <citation type="doi">10.1093/bioinformatics/6.4.325</citation>
213 <citation type="bibtex">@UNPUBLISHED{rnabob, 215 <citation type="bibtex">@UNPUBLISHED{rnabob,
214 author = {Eddy S.R}, 216 author = {Eddy S.R},
215 title = {RNABOB: a program to search for RNA secondary structure motifs in sequence databases}, 217 title = {RNABOB: a program to search for RNA secondary structure motifs in sequence databases},
216 note = {}}</citation> 218 note = {}}</citation>
217 </citations> 219 </citations>
218 </tool> 220 </tool>
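
The gap syntax described in the help text above (each ``*`` allows 0 or 1 N, and ``[10]`` is shorthand for ten asterisks) can be checked with a small sketch. This is not part of rnabob.xml or of RNABOB itself; the function names `expand_gaps` and `length_range` are hypothetical, and the code assumes a constraint contains only IUPAC letters, asterisks and bracketed gap counts.

```python
import re


def expand_gaps(constraint):
    """Rewrite the bracketed form, e.g. [10], as that many asterisks."""
    return re.sub(r"\[(\d+)\]", lambda m: "*" * int(m.group(1)), constraint)


def length_range(constraint):
    """Return (shortest, longest) match length for a primary sequence constraint.

    Every letter, including N, must match exactly one base; every asterisk
    may match zero or one base.
    """
    expanded = expand_gaps(constraint)
    optional = expanded.count("*")
    fixed = len(expanded) - optional
    return fixed, fixed + optional


if __name__ == "__main__":
    # Shine/Dalgarno example from the help text: the spacer is NNN plus six
    # optional positions, so the whole pattern spans 11 to 17 bases.
    print(length_range("GGAGG******NNNAUG"))
    print(expand_gaps("GGAGG[10]NNNAUG"))
```

For the Shine/Dalgarno example this gives a total length of 11 to 17 bases, which matches the 3 to 9 nucleotide spacer stated in the help text once the five GGAGG and three AUG positions are subtracted.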