comparison MUMmer/mummer_clustering.xml @ 0:59f302448cf6

Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
author abossers
date Tue, 07 Jun 2011 17:22:27 -0400
-1:000000000000 0:59f302448cf6
1 <tool id="mummer_clustering" name="MUMmer Clustering" version="0.9.alx" force_history_refresh="True">
2 <description>: order sequence matches in clusters</description>
3 <command>
4 <!-- update this path to the installed location -->
5 /opt/MUMmer/MUMmer/$tool.cmd
6 #if $tool.cmd=="gaps":
7 $tool.in_reference
8 #if $tool.gaps_r=="yes":
9 -r
10 #end if
11 #end if
12 #if $tool.cmd=="mgaps":
13 #if $tool.cmd_C=="yes":
14 -C
15 #end if
16 -d $tool.cmd_d
17 #if $tool.cmd_e=="yes":
18 -e
19 #end if
20 -f $tool.cmd_f
21 -l $tool.cmd_l
22 -s $tool.cmd_s
23 #end if
24 &lt; $tool.in_match_list
25 &gt; $out_tool
26
27 </command>
28 <inputs>
29 <conditional name="tool">
30 <param name="cmd" type="select" label="MUMmer clustering program" help="gaps takes only the optional -r flag; mgaps exposes its parameters below. For details on specific arguments see the help below" >
31 <option value="gaps" selected="true">gaps</option>
32 <option value="mgaps">mgaps</option>
33 </param>
34 <when value="gaps">
35 <param name="in_reference" type="data" format="fasta" label="Reference FastA file" />
36 <param name="gaps_r" type="select" label="Use reversed [-r]" >
37 <option value="no" selected="true">No</option>
38 <option value="yes">Yes</option>
39 </param>
40 <param name="in_match_list" type="data" format="text" label="MUMmer match list" help="See help for more details" />
41 </when>
42 <when value="mgaps">
43 <param name="in_match_list" type="data" format="text" label="MUMmer match list" help="See help for more details" />
44 <param name="cmd_C" type="select" label="Check that input header labels have the Reverse keyword [-C]" >
45 <option value="no" selected="true">No</option>
46 <option value="yes">Yes</option>
47 </param>
48 <param name="cmd_d" type="integer" size="5" value="5" label="Max fixed diagonal difference [-d]" />
49 <param name="cmd_e" type="select" label="Use extent of cluster [-e]" >
50 <option value="no" selected="true">No</option>
51 <option value="yes">Yes</option>
52 </param>
53 <param name="cmd_f" type="float" size="5" value="0.05" label="Max fraction separation for diagonal difference [-f]" />
54 <param name="cmd_l" type="integer" size="5" value="200" label="Min cluster length [-l]" />
55 <param name="cmd_s" type="integer" size="5" value="1000" label="Max separation between adjacent matches in cluster [-s]" />
56 </when>
57 </conditional>
58 </inputs>
59 <outputs>
60 <data name="out_tool" format="text" label="Clustering output" />
61 </outputs>
62 <requirements>
63 <requirement type="binary">gaps</requirement>
64 <requirement type="binary">mgaps</requirement>
65 </requirements>
66 <tests>
67 <test>
68 </test>
69 </tests>
70 <help>
71 |
72
73
74 **Reference**
75 =============
76
77 - **MUMmer clustering Galaxy tool wrapper:** Alex Bossers, CVI of Wageningen UR, The Netherlands.
78
79 - **MUMmer suite v3.22:** http://mummer.sourceforge.net
80
81 - **MUMmer tutorials:** http://mummer.sourceforge.net/examples/
82
83 If you found these tools/wrappers useful in your research, please acknowledge our work. If you improve
84 or modify the wrappers, please add yourself to the acknowledgement section rather than replacing existing names :)
85
86
87 **MUMmer Clustering**
88 =====================
89
90 MUMmer's clustering algorithms attempt to order small individual matches into larger match clusters
91 in order to make the output of mummer more intelligible. A dot plot makes it easy to spot alignment
92 regions in a match list. Without graphical aids, however, it is very difficult to draw any
93 reasonable conclusions from the simple flat-file list of matches. Clustering the matches
94 together into larger groups of neighboring matches makes this process much easier by ordering the
95 data and removing spurious matches.
96
97
98 Gaps
99 ----
100
101 *gaps* is the primary clustering algorithm for run-mummer1, and although classified as a "clustering"
102 step, gaps is more of a sorting routine. It implements the LIS (longest increasing subsequence) algorithm
103 to extract the longest consistent set of matches between two sequences, and generates a single
104 cluster that represents the best "straight-line" arrangement of matches between the sequences. By
105 straight-line, we mean no rearrangements or inversions, just a simple path of agreeing matches
106 between the two sequences. This limits the usability of this program to the alignment of genomes
107 that are very similar and have no large-scale mutations. *gaps* is best suited for the comparison of
108 near-identical sequences with the goal of finding minor mutations like SNPs and small indels.
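
The LIS idea can be illustrated with a minimal Python sketch. This is not the gaps implementation: it assumes matches are (reference start, query start, length) tuples, ignores match lengths and overlaps, and the name longest_increasing_matches is hypothetical.

```python
from bisect import bisect_left


def longest_increasing_matches(matches):
    """Return the longest chain of matches that increases in both coordinates.

    matches: list of (ref_start, query_start, length) tuples (assumed layout).
    """
    # Sort by reference start; ties are broken by descending query start so
    # that two matches sharing a reference start cannot both enter the chain.
    ordered = sorted(matches, key=lambda m: (m[0], -m[1]))
    tails = []       # tails[k]: smallest query start ending a chain of length k + 1
    tail_index = []  # position in `ordered` of the match stored in tails[k]
    previous = [None] * len(ordered)
    for i, (_, query_start, _) in enumerate(ordered):
        k = bisect_left(tails, query_start)
        if k == len(tails):
            tails.append(query_start)
            tail_index.append(i)
        else:
            tails[k] = query_start
            tail_index[k] = i
        previous[i] = tail_index[k - 1] if k > 0 else None
    # Walk the back-pointers from the end of the longest chain.
    chain = []
    i = tail_index[-1] if tail_index else None
    while i is not None:
        chain.append(ordered[i])
        i = previous[i]
    return chain[::-1]
```

Matches left out of the returned chain correspond loosely to what gaps reports under "Other matches".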
109
110 Input can be filtered mummer output. gaps reads the match list on standard input and, because of a
111 legacy issue described in the Known problems section of the MUMmer manual, requires that the header
112 be stripped from the mummer output first. In addition, gaps is designed to handle only a single
113 reference and a single query sequence, so the preceding mummer run must follow the same constraint.
114 The optional -r flag marks the incoming matches as reverse complement matches. These must reference
115 the reverse complement of the sequence, which means mummer must be run without the -c option.
116
117 Reference: http://mummer.sourceforge.net/manual/#gaps
118
119 **Output:**
120 ::
121
122     > /home/aphillip/data/GHP.1con Consistent matches
123         183      17    22  none    -    -
124         238      72   108  none   33   33
125         347     181    92  none    1    1
126         458     292    50  none   19   19
127         705     539    44  none    1    1
128         750     584    38  none    1    1
129         807     641    23   -16    0    4
130     (output continues ...)
131     > Wrap around
132      334398  329917    47  none    -  225
133      334446  329965    62  none    1    1
134      334539  330058    20  none   31   31
135      334560  330079    92  none    1    1
136      334653  330172    77  none    1    1
137      334740  330259    41  none   10   10
138     (output continues ...)
139     > /home/aphillip/data/GHP.1con Other matches
140     1317231    4891    21  none    -    -
141     1317275    4927    21  none    -    -
142     1317804    5399    25  none  508  451
143      947580    5436    36  none    -    -
144       23406    5518    34  none    -    -
145      333079    6592    32  none    -    -
146     (output continues ...)
147
148 The first line gives the location of the reference file, and the first three columns are the same
149 as the three-column match format described in the mummer section of the MUMmer manual. The final
150 three columns are the overlap between this match and the previous match, the gap between the start
151 of this match and the end of the previous match in the reference, and the gap between the start of
152 this match and the end of the previous match in the query, respectively.
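
As a rough illustration of the six-column layout, the following Python sketch parses a single match record. The function name parse_match_line and the dictionary keys are hypothetical, and the sketch assumes that "none" and "-" both mean that no value applies. Header lines starting with '>' are not handled.

```python
def parse_match_line(line):
    """Parse one six-column match record into a dictionary."""
    ref_start, query_start, length, overlap, ref_gap, query_gap = line.split()

    def optional_int(field):
        # Assumption: "none" and "-" both mean that no value applies.
        return None if field in ("none", "-") else int(field)

    return {
        "ref_start": int(ref_start),      # start of the match in the reference
        "query_start": int(query_start),  # start of the match in the query
        "length": int(length),            # length of the match
        "overlap": optional_int(overlap),      # overlap with the previous match
        "ref_gap": optional_int(ref_gap),      # gap to previous match, reference
        "query_gap": optional_int(query_gap),  # gap to previous match, query
    }
```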
153
154
155 mgaps
156 -----
157
158 *mgaps* was introduced into the MUMmer pipeline in an effort to better handle large-scale
159 rearrangements and duplications. Unlike gaps, mgaps is a full clustering algorithm that is capable
160 of generating multiple groups of consistently ordered matches. Clustering is controlled by a set of
161 command-line parameters that adjust the minimum cluster size, maximum gap between matches, etc. Only
162 matches that were included in clusters will appear in the output, so by adjusting the command-line
163 parameters it is possible to filter out many of the spurious matches, thus leaving only the larger
164 areas of conservation between the input sequences. The major advantage of mgaps is its ability to
165 identify these "islands" of conservation. This frees the user from the single-LIS constraint of the
166 gaps program and allows for the identification of large-scale rearrangements, duplications, gene
167 families and so on.
168
169 gaps can fail to identify clusters because they are not consistent with the LIS. With mgaps, all
170 regions of conservation can be identified. The only drawback is the increased complexity of the
171 output: where you once had only one cluster for the whole comparison, you now usually
172 get more. Because of this, it can sometimes be difficult to separate the repetitive clusters from
173 "correct" clusters, *making mgaps more suited for global alignments instead of localized error detection*.
174
175 Input can be raw mummer output. *mgaps* is only designed to handle a single reference and one or
176 more query sequences, thus the preceding mummer run must also follow this constraint. Please refer
177 to the run-mummer3 script (see online manual) for an example of how to use this program in an
178 alignment pipeline. Note that in order to cluster reverse complement matches, the reverse complement
179 matches must reference the reverse complement strand of the query sequence, therefore forcing mummer
180 to be run without the -c option. A rewrite of this algorithm to handle multiple reference sequences
181 and a better coordinate system (forward coordinates for reverse complement matches) is doubtful but
182 may eventually appear.
183
184 The -d option can be interpreted as the number of insertions allowed between two matches in the same
185 cluster, while the -f option is a fraction equal to (diagonal difference / match separation), where
186 a higher value increases the indel tolerance. Cluster length is measured as the sum of the contained
187 matches unless the -e option is used, in which case the extent of the cluster is used instead. The
188 best way to get a feel for what each parameter controls is to cluster the same data set several
189 times with different values and observe the resulting differences. It can also be helpful to set
190 these parameters to the size of the element you wish to capture: for example, set the minimum
191 cluster size to the smallest exon you expect and the max gap to the smallest intron you expect in
192 order to obtain clusters that could represent single exons (depending, of course, on the similarity of the two sequences).
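
One plausible reading of how -d, -f and -s interact is sketched below in Python. This is an assumption based on the descriptions above, not the mgaps source: the function name can_join is hypothetical, matches are assumed to be (reference start, query start, length) tuples, the diagonal is taken as query start minus reference start, and separation is measured in the reference. The default values are the ones offered by this tool form.

```python
def can_join(first, second, d=5, f=0.05, s=1000):
    """Decide whether two matches could belong to the same cluster.

    first, second: (ref_start, query_start, length) tuples, with `second`
    following `first` in the reference (assumed layout).
    """
    ref1, query1, length1 = first
    ref2, query2, _ = second
    # Separation: distance from the end of the first match to the start of the next.
    separation = ref2 - (ref1 + length1)
    if separation > s:
        return False
    # Diagonal difference: how far the second match is shifted off the
    # diagonal of the first, e.g. by an insertion or deletion between them.
    diagonal_difference = abs((query2 - ref2) - (query1 - ref1))
    # Tolerate a fixed shift (-d) or a shift proportional to the separation (-f),
    # whichever is larger.
    return max(d, f * separation) >= diagonal_difference
```

Under these assumptions, raising -f makes distant matches easier to join, while -d sets the tolerance for matches that lie close together.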
193
194 Reference: http://mummer.sourceforge.net/manual/#mgaps
195
196 **Output format**
197
198 Output of *mgaps* shares much in common with the output of mummer and gaps, with a slightly different
199 header formatting than gaps to allow for multiple query sequences and multiple clusters. The output
200 of mgaps run on both forward and reverse complement matches is as follows:
201 ::
202
203     > ID41
204     > ID41 Reverse
205     5177399       1   232  none    -    -
206     5177632     234  6794  none    1    1
207     5184433    7035    24  none    7    7
208     5184468    7069    23  none   11   10
209     > ID42
210       10181      43  1521  none    -    -
211     > ID42 Reverse
212     4654536      17    36  none    -    -
213     4654578      57   298  none    6    4
214     4654877     356   226  none    1    1
215     #
216     4655139     845    28  none    -    -
217     4655178     884   694  none   11   11
218     4655873    1579    20  none    1    1
219     #
220     4850044      17  1492  none    -    -
221     4851537    1510   711  none    1    1
222     4852249    2222    42  none    1    1
223     (output continues ...)
224
225
226 Headers containing the ID for each query sequence are listed after the '>' characters, and a
227 following Reverse keyword identifies the reverse matches for that query sequence. Individual clusters
228 for each sequence are separated by a '#' character, and the six columns are exactly the same as the
229 gaps output (see the gaps section for more details).
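
The header and separator conventions can be sketched as a small Python parser. This is only an illustration: the function name parse_mgaps is hypothetical, the input is assumed to be an iterable of complete output lines, and match fields are kept as strings.

```python
def parse_mgaps(lines):
    """Group mgaps output into clusters keyed by (query_id, is_reverse).

    Returns a dict mapping each key to a list of clusters, where a cluster
    is a list of six-field tuples of strings.
    """
    clusters = {}
    key = None
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: query ID, optionally followed by the Reverse keyword.
            fields = line[1:].split()
            is_reverse = len(fields) > 1 and fields[-1] == "Reverse"
            key = (fields[0], is_reverse)
            clusters.setdefault(key, [])
            current = None
        elif line == "#":
            # Separator: the next match line starts a new cluster.
            current = None
        else:
            if current is None:
                current = []
                clusters.setdefault(key, []).append(current)
            current.append(tuple(line.split()))
    return clusters
```

A query with a header but no match lines, such as the forward strand of ID41 in the example output, ends up with an empty cluster list.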
230
231
232 |
233 |
234
235 </help>
236 </tool>
237