comparison README.rst @ 10:2c8931827fa5 draft
Uploaded with note about NR versioning
author | peterjc |
---|---|
date | Mon, 30 Mar 2015 11:46:13 -0400 |
parents | 3b5eecc9551e |
children | 99209ed2ec87 |
9:3b5eecc9551e | 10:2c8931827fa5 |
---|---|
1 This package is a Galaxy workflow for the identification of candidate | 1 Introduction |
2 secreted proteins from a given protein FASTA file. | 2 ============ |
3 | 3 |
4 It runs SignalP v3.0 (Bendtsen et al. 2004) and selects only proteins with a | 4 Galaxy is a web-based platform for biological data analysis, supporting |
5 strong predicted signal peptide, and then runs TMHMM v2.0 (Krogh et al. 2001) | 5 extension with additional tools (often wrappers for existing command line |
6 on those, and selects only proteins without a predicted trans-membrane helix. | 6 tools) and datatypes. See http://www.galaxyproject.org/ and the public |
7 This workflow was used in Kikuchi et al. (2011), and is a simplification of | 7 server at http://usegalaxy.org for an example. |
8 the candidate effector protocol described in Jones et al. (2009). | 8 |
9 | 9 The NCBI BLAST suite is a widely used set of tools for biological sequence |
10 See http://www.galaxyproject.org for information about the Galaxy Project. | 10 comparison. It is available as standalone binaries for use at the command |
11 line, and via the NCBI website for smaller searches. For more details see | |
12 http://blast.ncbi.nlm.nih.gov/Blast.cgi | |
13 | |
14 This is an example workflow using the Galaxy wrappers for NCBI BLAST+, | |
15 see https://github.com/peterjc/galaxy_blast | |
16 | |
17 | |
18 Galaxy workflow for counting species of top BLAST hits | |
19 ====================================================== | |
20 | |
21 This Galaxy workflow (file ``blast_top_hit_species.ga``) is intended for an | |
22 initial assessment of a transcriptome assembly to give a crude indication of | |
23 any major contamination present based on the species of the top BLAST hit | |
24 of 1000 representative sequences. | |
25 | |
26 .. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/blast_top_hit_species.png | |
27 | |
28 In words, the workflow proceeds as follows: | |
29 | |
30 1. Upload/import your transcriptome assembly or any nucleotide FASTA file. | |
31 2. Sample 1000 representative sequences, selected evenly through | |
32 the file. | |
33 3. Convert the sampled FASTA file into a three column tabular file. | |
34 4. Run NCBI BLASTX of the sampled FASTA file against the latest NCBI ``nr`` | |
35 database (assuming this is already set up on your local Galaxy | |
36 under the alias ``nr``), requesting tabular output including the taxonomy | |
37 fields, and at most one matching target sequence. | |
38 5. Remove any duplicate alignments (multiple HSPs for the same match). | |
39 6. Combine the filtered BLAST output with the tabular version of the 1000 | |
40 sequences to give a new tabular file with exactly 1000 lines, adding | |
41 ``None`` for sequences missing a BLAST hit. | |
42 7. Count the BLAST species names in this file. | |
43 8. Sort the counts. | |
44 | |
45 Finally we would suggest visualising the sorted tally table as a Pie Chart. | |
46 | |
47 | |
48 Sample Data | |
49 =========== | |
50 | |
51 As an example, you can upload the transcriptome assembly of the nematode | |
52 *Nacobbus aberrans* from Eves van den Akker *et al.* (2015), | |
53 http://dx.doi.org/10.1093/gbe/evu171 using this URL: | |
54 | |
55 http://nematode.net/Data/nacobbus_aberrans_transcript_assembly/N.abberans_reference_no_contam.zip | |
56 | |
57 Running this workflow with a copy of the NCBI non-redundant ``nr`` database | |
58 from 16 Oct 2014 (which did **not** contain this *N. aberrans* dataset) gave | |
59 the following results - note 609 out of the 1000 sequences gave no BLAST hit. | |
60 | |
61 ===== ================== | |
62 Count Subject Blast Name | |
63 ----- ------------------ | |
64 609 None | |
65 244 nematodes | |
66 30 ascomycetes | |
67 27 eukaryotes | |
68 8 basidiomycetes | |
69 6 aphids | |
70 5 eudicots | |
71 5 flies | |
72 ... ... | |
73 ===== ================== | |
74 | |
75 As you might guess from the filename ``N.abberans_reference_no_contam.fasta``, | |
76 this transcriptome assembly has already had obvious contamination removed. | |
77 | |
78 At the time of writing, Galaxy's visualizations could not be included in | |
79 a workflow. You can generate a pie chart from the final count file using | |
80 the counts (c1) and labels (c2), like this: | |
81 | |
82 .. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/N_abberans_piechart_mouseover.png | |
83 | |
84 Note the nematode count in this image was shown as a mouse-over effect. | |
85 | |
86 | |
87 Disclaimer | |
88 ========== | |
89 | |
90 Species assignment by top BLAST hit is not suitable for any in-depth | |
91 analysis. It is particularly prone to false positives where contaminants | |
92 in public datasets are mislabelled. See for example Ed Yong (2015), | |
93 "There's No Plague on the NYC Subway. No Platypuses Either.": | |
94 | |
95 http://phenomena.nationalgeographic.com/2015/02/10/theres-no-plague-on-the-nyc-subway-no-platypuses-either/ | |
96 | |
97 | |
98 Known Issues | |
99 ============ | |
100 | |
101 Counts | |
102 ------ | |
103 | |
104 This workflow uses the Galaxy "Count" tool, version 1.0.0, as shipped with | |
105 the current stable release (Galaxy v15.03, i.e. March 2015). | |
106 | |
107 The updated "Count" tool version 1.0.1 includes a fix not to remove spaces | |
108 in the fields being counted. In the example above, while the top hits are | |
109 not affected, minor entries like "cellular slime molds" are shown as | |
110 "cellularslimemolds" instead (look closely at the Pie Chart key). | |
111 | |
112 The updated "Count" tool version 1.0.1 also adds a new option to sort the | |
113 output, which avoids the additional sorting step in the current version of | |
114 the workflow. | |
115 | |
116 A future update to this workflow will use the revised "Count" tool, once | |
117 this is included in the next stable Galaxy release - or migrated to the | |
118 Galaxy Tool Shed. | |
119 | |
120 NCBI nr database | |
121 ---------------- | |
122 | |
123 The use of external datasets within Galaxy via the ``*.loc`` configuration | |
124 files undermines provenance tracking within Galaxy. This is exacerbated | |
125 by the lack of officially versioned BLAST database releases by the NCBI. | |
126 | |
127 This workflow assumes that you have an entry ``nr`` in your ``blastdb_p.loc`` | |
128 (the configuration file listing locally installed BLAST databases external | |
129 to Galaxy - consult the NCBI BLAST+ wrapper documentation for more details), | |
130 and that this points to a mirror of the latest NCBI "non-redundant" database | |
131 from ftp://ftp.ncbi.nlm.nih.gov/blast/db/ | |
132 | |
133 In other words, the workflow is intended to be used against the *latest* nr database, | |
134 and thus is not reproducible over the long term as the database changes. | |
11 | 135 |
12 | 136 |
13 Availability | 137 Availability |
14 ============ | 138 ============ |
15 | 139 |
16 This workflow is available to download and/or install from the main | 140 This workflow is available to download and/or install from the main Galaxy Tool Shed: |
17 Galaxy Tool Shed: | 141 |
18 | 142 http://toolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species |
19 http://toolshed.g2.bx.psu.edu/view/peterjc/secreted_protein_workflow | |
20 | 143 |
21 Test releases (which should not normally be used) are on the Test Tool Shed: | 144 Test releases (which should not normally be used) are on the Test Tool Shed: |
22 | 145 |
23 http://testtoolshed.g2.bx.psu.edu/view/peterjc/secreted_protein_workflow | 146 http://testtoolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species |
24 | 147 |
25 Development is being done on GitHub here: | 148 Development is being done on GitHub here: |
26 | 149 |
27 https://github.com/peterjc/pico_galaxy/tree/master/workflows/secreted_protein_workflow | 150 https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species |
28 | |
29 | |
30 Sample Data | |
31 =========== | |
32 | |
33 This workflow was developed and run on several nematode species. For example, | |
34 try the protein set for *Bursaphelenchus xylophilus* (Kikuchi et al. 2011): | |
35 | |
36 ftp://ftp.sanger.ac.uk/pub/pathogens/Bursaphelenchus/xylophilus/Assembly-v1.2/BUX.v1.2.genedb.protein.fa.gz | |
37 | |
38 You can upload this directly into Galaxy via this URL. Galaxy will handle | |
39 removing the gzip compression to give you the FASTA protein file which has | |
40 18,074 sequences. The expected result (selecting organism type Eukaryote) | |
41 is a FASTA protein file of 2,297 predicted secreted protein sequences. | |
42 | 151 |
43 | 152 |
44 Citation | 153 Citation |
45 ======== | 154 ======== |
46 | 155 |
47 If you use this workflow directly, or a derivative of it, in work leading | 156 Please cite the following paper (currently available as a preprint): |
48 to a scientific publication, please cite: | 157 |
49 | 158 NCBI BLAST+ integrated into Galaxy. |
50 Cock, P.J.A. and Pritchard, L. (2014). Galaxy as a platform for identifying | 159 P.J.A. Cock, J.M. Chilton, B. Gruening, J.E. Johnson, N. Soranzo |
51 candidate pathogen effectors. Chapter 1 in "Plant-Pathogen Interactions: | 160 bioRxiv DOI: http://dx.doi.org/10.1101/014043 (preprint) |
52 Methods and Protocols (Second Edition)"; P. Birch, J. Jones, and J.I. Bos, eds. | 161 |
53 Methods in Molecular Biology. Humana Press, Springer. ISBN 978-1-62703-985-7. | 162 You should also cite Galaxy, and the NCBI BLAST+ tools: |
54 http://www.springer.com/life+sciences/plant+sciences/book/978-1-62703-985-7 | 163 |
55 | 164 BLAST+: architecture and applications. |
56 Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013). | 165 C. Camacho et al. BMC Bioinformatics 2009, 10:421. |
57 Galaxy tools and workflows for sequence analysis with applications | 166 DOI: http://dx.doi.org/10.1186/1471-2105-10-421 |
58 in molecular plant pathology. PeerJ 1:e167 | 167 |
59 http://dx.doi.org/10.7717/peerj.167 | 168 |
60 | 169 Automated Installation |
61 Bendtsen, J.D., Nielsen, H., von Heijne, G., Brunak, S. (2004) | 170 ====================== |
62 Improved prediction of signal peptides: SignalP 3.0. J Mol Biol 340: 783–95. | 171 |
63 http://dx.doi.org/10.1016/j.jmb.2004.05.028 | 172 Installation via the Galaxy Tool Shed should take care of the dependencies |
64 | 173 on Galaxy tools including the NCBI BLAST+ wrappers and associated binaries. |
65 Krogh, A., Larsson, B., von Heijne, G., Sonnhammer, E. (2001) | 174 |
66 Predicting transmembrane protein topology with a hidden Markov model: | 175 However, this workflow requires a current version of the NCBI nr protein |
67 application to complete genomes. J Mol Biol 305: 567-580. | 177 case). |
68 http://dx.doi.org/10.1006/jmbi.2000.4315 | 177 case). |
69 | |
70 | |
71 Additional References | |
72 ===================== | |
73 | |
74 Kikuchi, T., Cotton, J.A., Dalzell, J.J., Hasegawa, K., et al. (2011) | |
75 Genomic insights into the origin of parasitism in the emerging plant | |
76 pathogen *Bursaphelenchus xylophilus*. PLoS Pathog 7: e1002219. | |
77 http://dx.doi.org/10.1371/journal.ppat.1002219 | |
78 | |
79 Jones, J.T., Kumar, A., Pylypenko, L.A., Thirugnanasambandam, A., et al. (2009) | |
80 Identification and functional characterization of effectors in expressed | |
81 sequence tags from various life cycle stages of the potato cyst nematode | |
82 *Globodera pallida*. Mol Plant Pathol 10: 815–28. | |
83 http://dx.doi.org/10.1111/j.1364-3703.2009.00585.x | |
84 | |
85 | |
86 Dependencies | |
87 ============ | |
88 | |
89 These dependencies should be resolved automatically via the Galaxy Tool Shed: | |
90 | |
91 * http://toolshed.g2.bx.psu.edu/view/peterjc/tmhmm_and_signalp | |
92 * http://toolshed.g2.bx.psu.edu/view/peterjc/seq_filter_by_id | |
93 | |
94 However, at the time of writing those Galaxy tools have their own | |
95 dependencies required for this workflow which require manual | |
96 installation (SignalP v3.0 and TMHMM v2.0). | |
97 | 178 |
98 | 179 |
99 History | 180 History |
100 ======= | 181 ======= |
101 | 182 |
102 ======= ====================================================================== | 183 ======= ====================================================================== |
103 Version Changes | 184 Version Changes |
104 ------- ---------------------------------------------------------------------- | 185 ------- ---------------------------------------------------------------------- |
105 v0.0.1 - Initial release to Tool Shed (May, 2013) | 186 v0.1.0 - Initial Tool Shed release, targeting NCBI BLAST+ 2.2.29 |
106 - Expanded README file to include example data | |
107 v0.0.2 - Updated versions of the tools used, inclulding core Galaxy Filter | |
108 tool to avoid warning about new ``header_lines`` parameter. | |
109 - Added link to Tool Shed in the workflow annotation explaining there | |
110 is a README file with sample data, and a requested citation. | |
111 ======= ====================================================================== | 187 ======= ====================================================================== |
112 | 188 |
113 | 189 |
114 Developers | 190 Developers |
115 ========== | 191 ========== |
116 | 192 |
117 This workflow is under source code control here: | 193 This workflow is under source code control here: |
118 | 194 |
119 https://github.com/peterjc/pico_galaxy/tree/master/workflows/secreted_protein_workflow | 195 https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species |
120 | 196 |
121 To prepare the tar-ball for uploading to the Tool Shed, I use this: | 197 To prepare the tar-ball for uploading to the Tool Shed, I use this: |
122 | 198 |
123 $ tar -czf secreted_protein_workflow.tar.gz README.rst repository_dependencies.xml secreted_protein_workflow.ga | 199 $ tar -czf blast_top_hit_species.tar.gz README.rst repository_dependencies.xml blast_top_hit_species.ga blast_top_hit_species.png N_abberans_piechart_mouseover.png |
124 | 200 |
125 Check this, | 201 Check this, |
126 | 202 |
127 $ tar -tzf secreted_protein_workflow.tar.gz | 203 $ tar -tzf blast_top_hit_species.tar.gz |
128 README.rst | 204 README.rst |
129 repository_dependencies.xml | 205 repository_dependencies.xml |
130 secreted_protein_workflow.ga | 206 blast_top_hit_species.ga |
207 blast_top_hit_species.png | |
208 N_abberans_piechart_mouseover.png | |
209 | |
210 | |
211 Licence (MIT) | |
212 ============= | |
213 | |
214 Permission is hereby granted, free of charge, to any person obtaining a copy | |
215 of this software and associated documentation files (the "Software"), to deal | |
216 in the Software without restriction, including without limitation the rights | |
217 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | |
218 copies of the Software, and to permit persons to whom the Software is | |
219 furnished to do so, subject to the following conditions: | |
220 | |
221 The above copyright notice and this permission notice shall be included in | |
222 all copies or substantial portions of the Software. | |
223 | |
224 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | |
225 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | |
226 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | |
227 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | |
228 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | |
229 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN | |
230 THE SOFTWARE. |
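
The numbered workflow steps in the new README (even sampling in step 2, then de-duplicating, padding with ``None``, counting and sorting in steps 5 to 8) can be approximated outside Galaxy as follows. This is only a rough sketch and not how the Galaxy tools are implemented: the function names ``sample_evenly`` and ``tally_top_hits`` are hypothetical, and the BLAST output is assumed to have already been reduced to ``(query_id, label)`` pairs in output order.

```python
from collections import Counter


def sample_evenly(items, n):
    """Pick n items spread evenly through a list (all of them if fewer than n)."""
    if len(items) <= n:
        return list(items)
    # Integer arithmetic gives indices 0, len/n, 2*len/n, ... without repeats.
    return [items[(i * len(items)) // n] for i in range(n)]


def tally_top_hits(query_ids, blast_rows):
    """Count the label of the first hit per query, sorted by descending count.

    blast_rows is an iterable of (query_id, label) pairs (an assumed layout).
    Later rows for the same query (extra HSPs) are ignored, and queries with
    no row at all are counted under the label "None".
    """
    first_label = {}
    for query_id, label in blast_rows:
        first_label.setdefault(query_id, label)  # keep only the first row per query
    counts = Counter(first_label.get(q, "None") for q in query_ids)
    # Largest counts first; ties are broken alphabetically by label.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    queries = ["q1", "q2", "q3", "q4"]
    rows = [("q1", "nematodes"), ("q1", "nematodes"),
            ("q3", "ascomycetes"), ("q4", "nematodes")]
    for label, count in tally_top_hits(queries, rows):
        print(count, label)
```

Because every sampled query contributes exactly one label, the counts always sum to the number of sampled sequences, which matches the README's description of a tabular file with exactly 1000 lines.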