annotate README.rst @ 4:7d768ff419c0 draft default tip
"planemo upload for repository https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species commit 3f9f39ad808325a11d9967980d2cb82c96d69324"
author | peterjc |
---|---|
date | Wed, 09 Sep 2020 15:13:58 +0000 |
parents | a66c358bfbcf |
children |
rev | line source |
---|---|
0 | 1 Introduction |
2 ============ | |
3 | |
4 Galaxy is a web-based platform for biological data analysis, supporting | |
5 extension with additional tools (often wrappers for existing command line | |
6 tools) and datatypes. See http://www.galaxyproject.org/ and the public | |
7 server at http://usegalaxy.org for an example. | |
8 | |
9 The NCBI BLAST suite is a widely used set of tools for biological sequence | |
10 comparison. It is available as standalone binaries for use at the command | |
11 line, and via the NCBI website for smaller searches. For more details see | |
12 http://blast.ncbi.nlm.nih.gov/Blast.cgi | |
13 | |
14 This is an example workflow using the Galaxy wrappers for NCBI BLAST+, | |
15 see https://github.com/peterjc/galaxy_blast | |
16 | |
1 | 17 |
18 Galaxy workflow for counting species of top BLAST hits |
0 | 19 ====================================================== |
20 | |
21 This Galaxy workflow (file ``blast_top_hit_species.ga``) is intended for an | |
22 initial assessment of a transcriptome assembly to give a crude indication of | |
2 | 23 any major contamination present based on the species of the top BLAST hit |
0 | 24 of 1000 representative sequences. |
25 | |
26 .. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/blast_top_hit_species.png | |
27 | |
28 In words, the workflow proceeds as follows: | |
29 | |
30 1. Upload/import your transcriptome assembly or any nucleotide FASTA file. | |
31 2. Sample 1000 representative sequences, selected evenly through | |
32 the file. | |
33 3. Convert the sampled FASTA file into a three-column tabular file. | |
34 4. Run NCBI BLASTX of the sampled FASTA file against the latest NCBI ``nr`` | |
35 database (assuming this is already set up on your local Galaxy | |
36 under the alias ``nr``), requesting tabular output including the taxonomy | |
37 fields, and at most one matching target sequence. | |
38 5. Remove any duplicate alignments (multiple HSPs for the same match). | |
39 6. Combine the filtered BLAST output with the tabular version of the 1000 | |
40 sequences to give a new tabular file with exactly 1000 lines, adding | |
41 ``None`` for sequences missing a BLAST hit. | |
42 7. Count the BLAST species names in this file. | |
43 8. Sort the counts. | |
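
Outside Galaxy, the BLASTX step (step 4) corresponds roughly to a command line like the one built below. This is only a sketch: the flags and column specifiers are standard BLAST+ options, but the exact column selection used by the workflow is stored in the ``.ga`` file and is not reproduced here, so the lists below are assumptions. The file names are hypothetical. The taxonomy columns also require the NCBI ``taxdb`` files to be installed alongside the database.

```python
import shlex

# Real BLAST+ tabular column specifiers; which ones the workflow
# actually requests is an assumption made for this sketch.
STANDARD_COLUMNS = (
    "qseqid sseqid pident length mismatch gapopen "
    "qstart qend sstart send evalue bitscore"
)
TAXONOMY_COLUMNS = "staxids sscinames scomnames sblastnames sskingdoms"


def build_blastx_command(query_fasta, output_tsv, database="nr"):
    """Return an argument list for a BLASTX run resembling step 4."""
    return [
        "blastx",
        "-query", query_fasta,
        "-db", database,
        # Format 6 is tabular output with the listed columns.
        "-outfmt", "6 %s %s" % (STANDARD_COLUMNS, TAXONOMY_COLUMNS),
        # At most one matching target sequence per query.
        "-max_target_seqs", "1",
        "-out", output_tsv,
    ]


# Hypothetical file names, printed as a shell-safe command line.
command = build_blastx_command("sampled.fasta", "sampled_vs_nr.tsv")
print(" ".join(shlex.quote(arg) for arg in command))
```

The ``sblastnames`` column is the one BLAST+ describes as "Subject Blast Name", which is the label counted in the results table later in this document.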
44 | |
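The non-BLAST steps above (2 and 5 to 8) can be sketched in plain Python as follows. This is an illustration of the logic only, not the code of the Galaxy tools the workflow calls; the function names and the in-memory row layout (query ID in the first column) are assumptions made for the example.

```python
from collections import Counter


def sample_evenly(records, n=1000):
    """Pick up to n records spaced evenly through the input (step 2)."""
    total = len(records)
    if total <= n:
        return list(records)
    # Integer arithmetic spreads the n picks across the whole list.
    return [records[(i * total) // n] for i in range(n)]


def first_hit_per_query(blast_rows):
    """Keep one row per query ID, dropping additional HSPs (step 5)."""
    seen = set()
    kept = []
    for row in blast_rows:
        query_id = row[0]
        if query_id not in seen:
            seen.add(query_id)
            kept.append(row)
    return kept


def tally_species(sampled_ids, blast_rows, name_column):
    """Join hits to the sampled IDs, fill in None, count and sort (steps 6 to 8)."""
    name_by_query = {
        row[0]: row[name_column] for row in first_hit_per_query(blast_rows)
    }
    # Every sampled sequence contributes exactly one label.
    labels = [name_by_query.get(seq_id, "None") for seq_id in sampled_ids]
    # most_common() returns (label, count) pairs sorted by descending count.
    return Counter(labels).most_common()


# Tiny made-up example: 10 sequences, sample 5, three of them with hits.
ids = ["seq%d" % i for i in range(10)]
sampled = sample_evenly(ids, n=5)
hits = [
    ("seq0", "nematodes"),
    ("seq0", "nematodes"),  # second HSP for the same match
    ("seq4", "ascomycetes"),
    ("seq8", "nematodes"),
]
for name, count in tally_species(sampled, hits, name_column=1):
    print(count, name)
```

Because sequences without a hit are labelled ``None`` rather than dropped, the counts always add up to the number of sampled sequences.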
45 Finally we would suggest visualising the sorted tally table as a Pie Chart, |
46 as in the example below. |
0 | 47 |
48 | |
49 Sample Data | |
50 =========== | |
51 | |
52 As an example, you can upload the transcriptome assembly of the nematode | |
53 *Nacobbus aberrans* from Eves van den Akker *et al.* (2015), | |
54 https://doi.org/10.1093/gbe/evu171 using this URL: |
0 | 55 |
56 http://nematode.net/Data/nacobbus_aberrans_transcript_assembly/N.abberans_reference_no_contam.zip | |
57 | |
58 Running this workflow with a copy of the NCBI non-redundant ``nr`` database | |
59 from 16 Oct 2014 (which did **not** contain this *N. aberrans* dataset) gave | |
60 the following results - note 609 out of the 1000 sequences gave no BLAST hit. | |
61 | |
62 ===== ================== | |
63 Count Subject Blast Name | |
64 ----- ------------------ | |
65 609 None | |
66 244 nematodes | |
67 30 ascomycetes | |
68 27 eukaryotes | |
69 8 basidiomycetes | |
70 6 aphids | |
71 5 eudicots | |
72 5 flies | |
73 ... ... | |
74 ===== ================== | |
75 | |
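Since exactly 1000 sequences are sampled, each count converts directly to a percentage of the sample. A minimal sketch using the first rows of the table above (the remaining rows are elided in the table and are not filled in here):

```python
# First rows of the results table; later rows are elided in the source.
rows = [(609, "None"), (244, "nematodes"), (30, "ascomycetes"), (27, "eukaryotes")]
TOTAL = 1000  # number of sampled sequences


def as_percent(count, total=TOTAL):
    """Return count as a percentage of the sampled sequences."""
    return 100.0 * count / total


for count, name in rows:
    print("%-12s %5.1f%%" % (name, as_percent(count)))
```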
76 As you might guess from the filename ``N.abberans_reference_no_contam.fasta``, |
0 | 77 this transcriptome assembly has already had obvious contamination removed. |
78 | |
79 At the time of writing, Galaxy's visualizations could not be included in | |
80 a workflow. You can generate a pie chart from the final count file using | |
81 the counts (c1) and labels (c2), like this: | |
82 | |
83 .. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/N_abberans_piechart_mouseover.png | |
84 | |
85 Note the nematode count in this image was shown as a mouse-over effect. | |
86 | |
87 | |
1 | 88 Disclaimer |
89 ========== | |
90 | |
91 Species assignment by top BLAST hit is not suitable for any in-depth | |
2 | 92 analysis. It is particularly prone to false positives where contaminants |
93 in public datasets are mislabelled. See for example Ed Yong (2015), | |
1 | 94 "There's No Plague on the NYC Subway. No Platypuses Either.": |
95 | |
96 http://phenomena.nationalgeographic.com/2015/02/10/theres-no-plague-on-the-nyc-subway-no-platypuses-either/ | |
97 | |
98 | |
99 Known Issues | |
100 ============ | |
101 | |
3 | 102 Counts |
103 ------ | |
1 | 104 |
3 | 105 This workflow uses the Galaxy "Count" tool (tool id ``Count1``) version |
106 1.0.0, as shipped with the current stable release (Galaxy v15.03, i.e. | |
107 March 2015). | |
108 | |
109 The updated "Count" tool version 1.0.1 included a fix not to remove spaces | |
1 | 110 in the fields being counted. In the example above, while the top hits are |
111 not affected, minor entries like "cellular slime molds" are shown as | |
3 | 112 "cellularslimemolds" instead (look closely at the Pie Chart key). |
1 | 113 |
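The effect of the space-stripping behaviour can be illustrated with a small sketch. This is not the Galaxy "Count" tool's code; the ``strip_spaces`` switch is a hypothetical stand-in for the difference between versions 1.0.0 and 1.0.1 as described above.

```python
from collections import Counter

labels = ["cellular slime molds", "nematodes", "cellular slime molds"]


def count_labels(values, strip_spaces=False):
    """Count labels, optionally mimicking the removal of spaces."""
    if strip_spaces:
        # Hypothetical imitation of the behaviour described for version 1.0.0.
        values = [value.replace(" ", "") for value in values]
    return Counter(values)


print(count_labels(labels, strip_spaces=True))   # keys lose their spaces
print(count_labels(labels, strip_spaces=False))  # keys are kept as-is
```

The counts themselves are unchanged in this example; only the labels shown in the output (and hence in the pie chart key) differ.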
3 | 114 The updated "Count" tool version 1.0.2 added a new option to sort the |
115 output, which would allow skipping the final sorting step in the current | |
116 version of this workflow. | |
1 | 117 |
118 A future update to this workflow will use the revised "Count" tool, once | |
119 this is included in the next stable Galaxy release - or migrated to the | |
120 Galaxy Tool Shed. | |
121 | |
3 | 122 NCBI nr database |
123 ---------------- | |
124 | |
125 The use of external datasets within Galaxy via the ``*.loc`` configuration | |
126 files undermines provenance tracking within Galaxy. This is exacerbated | |
127 by the lack of officially versioned BLAST database releases by the NCBI. | |
128 | |
129 This workflow assumes that you have an entry ``nr`` in your ``blastdb_p.loc`` | |
130 (the configuration file listing locally installed BLAST databases external | |
131 to Galaxy - consult the NCBI BLAST+ wrapper documentation for more details), | |
132 and that this points to a mirror of the latest NCBI "non-redundant" database | |
133 from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | |
134 | |
135 That is, the workflow is intended to be used against the *latest* nr database, | |
136 and thus is not reproducible over the long term as the database changes. | |
137 | |
138 Note that if your ``blastdb_p.loc`` is missing an entry ``nr`` then the | |
139 workflow should abort. However, as of Galaxy v15.03 (March 2015) there is | |
140 a problem with how this is handled: https://trello.com/c/lkYlW14W/ | |
141 | |
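Before running the workflow, it may be worth confirming that the ``nr`` entry exists. The sketch below assumes the usual layout of Galaxy ``*.loc`` files (tab-separated columns with the key in the first column, and ``#`` marking comment lines); check the NCBI BLAST+ wrapper documentation for the definitive format. The sample path and caption are hypothetical.

```python
def loc_has_entry(loc_text, key):
    """Return True if a .loc file's text has a row whose first column is key."""
    for line in loc_text.splitlines():
        # Skip blank lines and comments (assumed to start with '#').
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if fields[0] == key:
            return True
    return False


# Hypothetical file content: key, caption, path to the database base name.
example = "# protein BLAST databases\nnr\tNCBI nr (latest)\t/data/blastdb/nr\n"
print(loc_has_entry(example, "nr"))
print(loc_has_entry(example, "nt"))
```

To check a real installation, read the contents of your own ``blastdb_p.loc`` and pass them as ``loc_text``.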
1 | 142 |
0 | 143 Availability |
144 ============ | |
145 | |
146 This workflow is available from myExperiment: |
147 |
148 http://www.myexperiment.org/workflows/4637 |
149 |
150 You can also download and/or install it from the main Galaxy Tool Shed: |
0 | 151 |
152 http://toolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species | |
153 | |
154 Test releases (which should not normally be used) are on the Test Tool Shed: | |
155 | |
156 http://testtoolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species | |
157 | |
158 Development is being done on GitHub here: | |
159 | |
160 https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species | |
161 | |
162 | |
163 Citation | |
164 ======== | |
165 | |
166 Please cite the following paper: |
0 | 167 |
168 NCBI BLAST+ integrated into Galaxy. | |
169 P.J.A. Cock, J.M. Chilton, B. Gruening, J.E. Johnson, N. Soranzo. |
170 GigaScience 2015, 4:1. |
171 https://doi.org/10.1186/s13742-015-0080-7 |
0 | 172 |
173 You should also cite Galaxy, and the NCBI BLAST+ tools: | |
174 | |
175 BLAST+: architecture and applications. | |
176 C. Camacho et al. BMC Bioinformatics 2009, 10:421. | |
177 https://doi.org/10.1186/1471-2105-10-421 |
0 | 178 |
179 | |
180 Automated Installation | |
181 ====================== | |
182 | |
183 Installation via the Galaxy Tool Shed should take care of the dependencies | |
184 on Galaxy tools including the NCBI BLAST+ wrappers and associated binaries. | |
185 | |
186 However, this workflow requires a current version of the NCBI nr protein | |
187 BLAST database to be listed in ``blastdb_p.loc`` with the key ``nr`` (lower | |
188 case). | |
189 | |
190 | |
191 History | |
192 ======= | |
193 | |
194 ======= ====================================================================== | |
195 Version Changes | |
196 ------- ---------------------------------------------------------------------- | |
197 v0.1.0 - Initial MyExperiment and Tool Shed release. |
198 - Targeting NCBI BLAST+ 2.2.29 |
0 | 199 ======= ====================================================================== |
200 | |
201 | |
202 Developers | |
203 ========== | |
204 | |
205 This workflow is under source code control here: | |
206 | |
207 https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species | |
208 | |
3 | 209 To prepare the tar-ball for uploading to the Tool Shed, I use this:: |
0 | 210 |
211 $ tar -czf blast_top_hit_species.tar.gz README.rst repository_dependencies.xml blast_top_hit_species.ga blast_top_hit_species.png N_abberans_piechart_mouseover.png | |
212 | |
3 | 213 Check this:: |
0 | 214 |
215 $ tar -tzf blast_top_hit_species.tar.gz | |
216 README.rst | |
217 repository_dependencies.xml | |
218 blast_top_hit_species.ga | |
219 blast_top_hit_species.png | |
220 N_abberans_piechart_mouseover.png | |
221 | |
222 | |
223 Licence (MIT) | |
224 ============= | |
225 | |
226 Permission is hereby granted, free of charge, to any person obtaining a copy | |
227 of this software and associated documentation files (the "Software"), to deal | |
228 in the Software without restriction, including without limitation the rights | |
229 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | |
230 copies of the Software, and to permit persons to whom the Software is | |
231 furnished to do so, subject to the following conditions: | |
232 | |
233 The above copyright notice and this permission notice shall be included in | |
234 all copies or substantial portions of the Software. | |
235 | |
236 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | |
237 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | |
238 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | |
239 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | |
240 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | |
241 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN | |
242 THE SOFTWARE. |