comparison MUMmer/mummer_clustering.xml @ 0:59f302448cf6
Migrated tool version 1.0.0 from old tool shed archive to new tool shed repository
author | abossers |
---|---|
date | Tue, 07 Jun 2011 17:22:27 -0400 |
parents | |
children |
-1:000000000000 | 0:59f302448cf6 |
---|---|
1 <tool id="mummer_clustering" name="MUMmer Clustering" version="0.9.alx" force_history_refresh="True"> | |
2 <description>: order sequence matches in clusters</description> | |
3 <command> | |
4 <!-- update this path to the installed location --> | |
5 /opt/MUMmer/MUMmer/$tool.cmd | |
6 #if $tool.cmd=="gaps": | |
7 $tool.in_reference | |
8 #if $tool.gaps_r=="yes": | |
9 -r | |
10 #end if | |
11 #end if | |
12 #if $tool.cmd=="mgaps": | |
13 #if $tool.cmd_C=="yes": | |
14 -C | |
15 #end if | |
16 -d $tool.cmd_d | |
17 #if $tool.cmd_e=="yes": | |
18 -e | |
19 #end if | |
20 -f $tool.cmd_f | |
21 -l $tool.cmd_l | |
22 -s $tool.cmd_s | |
23 #end if | |
24 < $tool.in_match_list | |
25 > $out_tool | |
26 | |
27 </command> | |
28 <inputs> | |
29 <conditional name="tool"> | |
30 <param name="cmd" type="select" label="MUMmer clustering algorithm" help="Algorithms are run with default parameters (none). For specific args see help below" > | |
31 <option value="gaps" selected="true">gaps</option> | |
32 <option value="mgaps">mgaps</option> | |
33 </param> | |
34 <when value="gaps"> | |
35 <param name="in_reference" type="data" format="fasta" label="Reference FastA file" /> | |
36 <param name="gaps_r" type="select" label="Use reversed [-r]" > | |
37 <option value="no" selected="true">No</option> | |
38 <option value="yes">Yes</option> | |
39 </param> | |
40 <param name="in_match_list" type="data" format="text" label="MUMmer match list" help="See help for more details" /> | |
41 </when> | |
42 <when value="mgaps"> | |
43 <param name="in_match_list" type="data" format="text" label="MUMmer match list" help="See help for more details" /> | |
44 <param name="cmd_C" type="select" label="Check input header labels have reversed keyword [-C]" > | |
45 <option value="no" selected="true">No</option> | |
46 <option value="yes">Yes</option> | |
47 </param> | |
48 <param name="cmd_d" type="integer" size="5" value="5" label="Max fixed diagonal difference [-d]" /> | |
49 <param name="cmd_e" type="select" label="Use extent of cluster [-e]" > | |
50 <option value="no" selected="true">No</option> | |
51 <option value="yes">Yes</option> | |
52 </param> | |
53 <param name="cmd_f" type="float" size="5" value="0.05" label="Max fraction separation for diagonal difference [-f]" /> | |
54 <param name="cmd_l" type="integer" size="5" value="200" label="Min cluster length [-l]" /> | |
55 <param name="cmd_s" type="integer" size="5" value="1000" label="Max separation between adjacent matches in cluster [-s]" /> | |
56 </when> | |
57 </conditional> | |
58 </inputs> | |
59 <outputs> | |
60 <data name="out_tool" format="text" label="Clustering output" /> | |
61 </outputs> | |
62 <requirements> | |
63 <requirement type="binary">gaps</requirement> | |
64 <requirement type="binary">mgaps</requirement> | |
65 </requirements> | |
66 <tests> | |
67 <test> | |
68 </test> | |
69 </tests> | |
70 <help> | |
71 | | |
72 | |
73 | |
74 **Reference** | |
75 ============= | |
76 | |
77 - **MUMmer clustering Galaxy tool wrapper:** Alex Bossers, CVI of Wageningen UR, The Netherlands. | |
78 | |
79 - **MUMmer suite v3.22:** http://mummer.sourceforge.net | |
80 | |
81 - **MUMmer tutorials:** http://mummer.sourceforge.net/examples/ | |
82 | |
83 If you found these tools/wrappers useful in your research, please acknowledge our work. If you improve | |
84 or modify the wrappers, please add yourself to the acknowledgement section rather than replacing existing names :) | |
85 | |
86 | |
87 **MUMmer Clustering** | |
88 ===================== | |
89 | |
90 MUMmer's clustering algorithms attempt to order small individual matches into larger match clusters | |
91 in order to make the output of mummer more intelligible. A dot plot makes it easy to spot alignment | |
92 regions from a match list; however, when examining the data without graphical aids, it is very difficult | |
93 to draw any reasonable conclusions from the simple flat file list of matches. Clustering the matches | |
94 together into larger groups of neighboring matches makes this process much easier by ordering the | |
95 data and removing spurious matches. | |
96 | |
97 | |
98 Gaps | |
99 ---- | |
100 | |
101 *gaps* is the primary clustering algorithm for run-mummer1, and although classified as a "clustering" | |
102 step, gaps is more of a sorting routine. It implements the LIS (longest increasing subsequence) algorithm | |
103 to extract the longest consistent set of matches between two sequences, and generates a single | |
104 cluster that represents the best "straight-line" arrangement of matches between the sequences. By | |
105 straight-line, we mean no rearrangements or inversions, just a simple path of agreeing matches | |
106 between the two sequences. This limits the usability of this program to the alignment of genomes | |
107 that are very similar and with no large scale mutations. *gaps* is best suited for the comparison of | |
108 near identical sequences with the goal of finding minor mutations like SNPs and small indels. | |
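
As a rough illustration of the LIS idea described above, the following Python sketch picks the longest chain of matches whose reference and query positions both increase. It is a plain, unweighted LIS written for this help text, not the code used by *gaps*; the function name ``longest_increasing_matches`` and the tuple layout are assumptions made for the example.

```python
from bisect import bisect_left


def longest_increasing_matches(matches):
    """Return the longest chain of matches whose reference and query
    start positions both strictly increase.

    `matches` is a list of (ref_start, qry_start, length) tuples.
    Hypothetical helper: a plain, unweighted LIS for illustration only.
    """
    # Sort by reference start; break ties with descending query start so
    # two matches sharing a reference start can never both be chained.
    ordered = sorted(matches, key=lambda m: (m[0], -m[1]))

    tails = []     # tails[k]: smallest query start ending a chain of k + 1 matches
    tail_idx = []  # index in `ordered` of the match stored in tails[k]
    prev = [None] * len(ordered)  # back-pointers for chain reconstruction

    for i, (_, qry_start, _) in enumerate(ordered):
        k = bisect_left(tails, qry_start)
        if k == len(tails):
            tails.append(qry_start)
            tail_idx.append(i)
        else:
            tails[k] = qry_start
            tail_idx[k] = i
        prev[i] = tail_idx[k - 1] if k > 0 else None

    # Walk the back-pointers from the end of the longest chain.
    chain = []
    i = tail_idx[-1] if tail_idx else None
    while i is not None:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]
```

Matches left out of the returned chain correspond loosely to what the gaps output lists under "Other matches".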
109 | |
110 Input can be filtered mummer output. The strange syntax is a result of a legacy issue described in | |
111 the Known problems (manual) section, and requires the header be stripped from the mummer output. In | |
112 addition, gaps is only designed to handle a single reference and a single query sequence, thus the | |
113 preceding mummer run must also follow this constraint. The -r is optional and designates the incoming | |
114 matches as reverse complement matches which must reference the reverse complement of the sequence, | |
115 therefore forcing mummer to be run without the -c option. | |
116 | |
117 Reference: http://mummer.sourceforge.net/manual/#gaps | |
118 | |
119 **Output:** | |
120 :: | |
121 | |
122 > /home/aphillip/data/GHP.1con Consistent matches | |
123 183 17 22 none - - | |
124 238 72 108 none 33 33 | |
125 347 181 92 none 1 1 | |
126 458 292 50 none 19 19 | |
127 705 539 44 none 1 1 | |
128 750 584 38 none 1 1 | |
129 807 641 23 -16 0 4 | |
130 (output continues ...) | |
131 > Wrap around | |
132 334398 329917 47 none - 225 | |
133 334446 329965 62 none 1 1 | |
134 334539 330058 20 none 31 31 | |
135 334560 330079 92 none 1 1 | |
136 334653 330172 77 none 1 1 | |
137 334740 330259 41 none 10 10 | |
138 (output continues ...) | |
139 > /home/aphillip/data/GHP.1con Other matches | |
140 1317231 4891 21 none - - | |
141 1317275 4927 21 none - - | |
142 1317804 5399 25 none 508 451 | |
143 947580 5436 36 none - - | |
144 23406 5518 34 none - - | |
145 333079 6592 32 none - - | |
146 (output continues ...) | |
147 | |
148 Where the first line is the location of the reference file, and the first three columns are the same | |
149 as the three column match format described in the mummer section. The final three columns are the | |
150 overlap between this match and the previous match, the gap between the start of this match and the | |
151 end of the previous match in the reference, and the gap between the start of this match and the end | |
152 of the previous match in the query respectively. | |
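
The six-column layout described above could be read with a small Python helper such as the sketch below. The function name ``parse_cluster_row`` and the dictionary keys are hypothetical; the sketch assumes that ``none`` and ``-`` simply mean "no value", as the example output suggests.

```python
def parse_cluster_row(line):
    """Parse one six-column data row of gaps output into a dict.

    Hypothetical helper: 'none' and '-' are mapped to None.
    """
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected six columns, got %d" % len(fields))

    def optional_int(value):
        # Placeholder values carry no number.
        return None if value in ("none", "-") else int(value)

    return {
        "ref_start": int(fields[0]),   # match start in the reference
        "qry_start": int(fields[1]),   # match start in the query
        "length": int(fields[2]),      # match length
        "overlap": optional_int(fields[3]),  # overlap with previous match
        "ref_gap": optional_int(fields[4]),  # gap to previous match, reference
        "qry_gap": optional_int(fields[5]),  # gap to previous match, query
    }
```

Header lines (those starting with ``>``) are not data rows and would need to be skipped before calling this helper.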
153 | |
154 | |
155 mgaps | |
156 ----- | |
157 | |
158 *mgaps* was introduced into the MUMmer pipeline in an effort to better handle large-scale | |
159 rearrangements and duplications. Unlike gaps, mgaps is a full clustering algorithm that is capable | |
160 of generating multiple groups of consistently ordered matches. Clustering is controlled by a set of | |
161 command-line parameters that adjust the minimum cluster size, maximum gap between matches, etc. Only | |
162 matches that were included in clusters will appear in the output, so by adjusting the command-line | |
163 parameters it is possible to filter out many of the spurious matches, thus leaving only the larger | |
164 areas of conservation between the input sequences. The major advantage of mgaps is its ability to | |
165 identify these "islands" of conservation. This frees the user from the single-LIS constraint of the | |
166 gaps program and allows for the identification of large-scale rearrangements, duplications, gene | |
167 families and so on. | |
168 | |
169 Gaps can fail to identify clusters because they were not consistent with the LIS. However, by using | |
170 mgaps, all regions of conservation can now be identified. The only drawback is the increased | |
171 complexity of the output: where you once had only one cluster for the whole comparison, you usually | |
172 now get more. Because of this, it can sometimes be difficult to separate the repetitive clusters from | |
173 "correct" clusters, *making mgaps more suited for global alignments instead of localized error detection*. | |
174 | |
175 Input can be raw mummer output. *mgaps* is only designed to handle a single reference and one or | |
176 more query sequences, thus the preceding mummer run must also follow this constraint. Please refer | |
177 to the run-mummer3 script (see online manual) for an example of how to use this program in an | |
178 alignment pipeline. Note that in order to cluster reverse complement matches, the reverse complement | |
179 matches must reference the reverse complement strand of the query sequence, therefore forcing mummer | |
180 to be run without the -c option. A rewrite of this algorithm to handle multiple reference sequences | |
181 and a better coordinate system (forward coordinates for reverse complement matches) is doubtful but | |
182 may eventually appear. | |
183 | |
184 The -d option can be interpreted as the number of insertions allowed between two matches in the same | |
185 cluster, while the -f option is a fraction equal to (diagonal difference / match separation) where | |
186 a higher value will increase the indel tolerance. Minimum cluster length is the sum of the contained | |
187 matches unless the -e option is used. The best way to get a feel for what each parameter controls | |
188 is to cluster the same data set numerous times with different values and observe the resulting | |
189 differences. It can also be helpful to set these parameters to the size of the element you wish to | |
190 capture, i.e. set the minimum cluster size to say the smallest exon you expect and set the max gap | |
191 to the smallest intron you expect to obtain clusters that could represent single exons (depending | |
192 of course on the similarity of the two sequences). | |
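
To make the interplay of -d, -f and -s more concrete, here is a minimal Python sketch of one plausible reading of the description above: two matches may share a cluster when they are close enough and their diagonal difference stays within the larger of the fixed limit and the fractional limit. The function ``may_join``, its argument names, and the choice to measure separation on the reference are assumptions for illustration, not the mgaps source. The defaults mirror the values offered by this wrapper.

```python
def may_join(a, b, max_diag_diff=5, max_frac=0.05, max_sep=1000):
    """Decide whether match b could follow match a in the same cluster.

    a and b are (ref_start, qry_start, length) tuples, with a before b.
    Hypothetical rule, assumed from the help text:
      separation must not exceed max_sep (-s), and
      diagonal difference must not exceed max(-d, -f * separation).
    """
    ref_a, qry_a, len_a = a
    ref_b, qry_b, _ = b

    # Distance from the end of a to the start of b, on the reference.
    separation = ref_b - (ref_a + len_a)
    if separation > max_sep:
        return False

    # A match's diagonal is ref_start - qry_start; indels shift it.
    diag_diff = abs((ref_b - qry_b) - (ref_a - qry_a))
    allowed = max(max_diag_diff, max_frac * separation)
    return allowed >= diag_diff
```

Raising ``max_frac`` lets widely separated matches absorb larger indels, while ``max_diag_diff`` sets a floor that also applies to matches lying close together.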
193 | |
194 Reference: http://mummer.sourceforge.net/manual/#mgaps | |
195 | |
196 **Output format** | |
197 | |
198 Output of *mgaps* shares much in common with the output of mummer and gaps, with a slightly different | |
199 header formatting than gaps to allow for multiple query sequences and multiple clusters. The output | |
200 of mgaps run on both forward and reverse complement matches is as follows: | |
201 :: | |
202 | |
203 > ID41 | |
204 > ID41 Reverse | |
205 5177399 1 232 none - - | |
206 5177632 234 6794 none 1 1 | |
207 5184433 7035 24 none 7 7 | |
208 5184468 7069 23 none 11 10 | |
209 > ID42 | |
210 10181 43 1521 none - - | |
211 > ID42 Reverse | |
212 4654536 17 36 none - - | |
213 4654578 57 298 none 6 4 | |
214 4654877 356 226 none 1 1 | |
215 # | |
216 4655139 845 28 none - - | |
217 4655178 884 694 none 11 11 | |
218 4655873 1579 20 none 1 1 | |
219 # | |
220 4850044 17 1492 none - - | |
221 4851537 1510 711 none 1 1 | |
222 4852249 2222 42 none 1 1 | |
223 (output continues ...) | |
224 | |
225 | |
226 Headers containing the ID for each query sequence are listed after the '>' characters, and a | |
227 following Reverse keyword identifies the reverse matches for that query sequence. Individual clusters | |
228 for each sequence are separated by a '#' character, and the six columns are exactly the same as the | |
229 gaps output (see the gaps section for more details). | |
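
The header and separator conventions described above can be handled with a short Python sketch like the following. The function name ``parse_mgaps`` and the returned structure are hypothetical; rows are kept as tuples of strings and no attempt is made to interpret the columns.

```python
def parse_mgaps(text):
    """Group mgaps output rows into clusters per query sequence.

    Returns a dict keyed by (query_id, is_reverse); each value is a list
    of clusters, and each cluster is a list of six-field string tuples.
    Hypothetical helper for illustration.
    """
    results = {}
    key = None       # header currently in effect
    current = None   # cluster currently being filled

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            is_reverse = len(parts) > 1 and parts[1] == "Reverse"
            key = (parts[0], is_reverse)
            results[key] = []
            current = None
        elif line == "#":
            # Cluster separator: the next row opens a new cluster.
            current = None
        else:
            if key is None:
                raise ValueError("data row found before any header")
            if current is None:
                current = []
                results[key].append(current)
            current.append(tuple(line.split()))
    return results
```

A header with no rows, such as the forward ``> ID41`` entry in the example, yields an empty cluster list for that key.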
230 | |
231 | |
232 | | |
233 | | |
234 | |
235 </help> | |
236 </tool> | |
237 |