comparison README.rst @ 10:2c8931827fa5 draft

Uploaded with note about NR versioning
author peterjc
date Mon, 30 Mar 2015 11:46:13 -0400
parents 3b5eecc9551e
children 99209ed2ec87
9:3b5eecc9551e 10:2c8931827fa5
1 This package is a Galaxy workflow for the identification of candidate 1 Introduction
2 secreted proteins from a given protein FASTA file. 2 ============
3 3
4 It runs SignalP v3.0 (Bendtsen et al. 2004) and selects only proteins with a 4 Galaxy is a web-based platform for biological data analysis, supporting
5 strong predicted signal peptide, and then runs TMHMM v2.0 (Krogh et al. 2001) 5 extension with additional tools (often wrappers for existing command line
6 on those, and selects only proteins without a predicted trans-membrane helix. 6 tools) and datatypes. See http://www.galaxyproject.org/ and the public
7 This workflow was used in Kikuchi et al. (2011), and is a simplification of 7 server at http://usegalaxy.org for an example.
8 the candidate effector protocol described in Jones et al. (2009). 8
9 9 The NCBI BLAST suite is a widely used set of tools for biological sequence
10 See http://www.galaxyproject.org for information about the Galaxy Project. 10 comparison. It is available as standalone binaries for use at the command
11 line, and via the NCBI website for smaller searches. For more details see
12 http://blast.ncbi.nlm.nih.gov/Blast.cgi
13
14 This is an example workflow using the Galaxy wrappers for NCBI BLAST+,
15 see https://github.com/peterjc/galaxy_blast
16
17
18 Galaxy workflow for counting species of top BLAST hits
19 ======================================================
20
21 This Galaxy workflow (file ``blast_top_hit_species.ga``) is intended for an
22 initial assessment of a transcriptome assembly to give a crude indication of
23 any major contamination present based on the species of the top BLAST hit
24 of 1000 representative sequences.
25
26 .. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/blast_top_hit_species.png
27
28 In words, the workflow proceeds as follows:
29
30 1. Upload/import your transcriptome assembly or any nucleotide FASTA file.
31 2. Sample 1000 representative sequences, selected evenly through
32 the file.
33 3. Convert the sampled FASTA file into a three column tabular file.
34 4. Run NCBI BLASTX on the sampled FASTA file against the latest NCBI ``nr``
35 database (assuming this is already set up on your local Galaxy
36 under the alias ``nr``), requesting tabular output including the taxonomy
37 fields, and at most one matching target sequence.
38 5. Remove any duplicate alignments (multiple HSPs for the same match).
39 6. Combine the filtered BLAST output with the tabular version of the 1000
40 sequences to give a new tabular file with exactly 1000 lines, adding
41 ``None`` for sequences missing a BLAST hit.
42 7. Count the BLAST species names in this file.
43 8. Sort the counts.
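
As a rough illustration of steps 2, 4 and 5 to 8 above, here is a minimal Python sketch. It is not how the workflow is implemented (the workflow chains existing Galaxy tools). The function names, the three-column output layout and the position of the species column are assumptions made for this example. The BLASTX flags shown are standard NCBI BLAST+ options, but the exact settings used by the Galaxy wrapper may differ.

```python
from collections import Counter


def sample_evenly(records, n):
    """Pick n records spaced evenly through the list (all of them if fewer)."""
    total = len(records)
    if total <= n:
        return list(records)
    # Integer arithmetic gives indices 0, total/n, 2*total/n, ... rounded down.
    return [records[(i * total) // n] for i in range(n)]


def blastx_command(query_fasta, out_tabular, db="nr"):
    """Build (but do not run) a BLASTX command line as an argument list."""
    return [
        "blastx",
        "-query", query_fasta,
        "-db", db,
        # Tabular output: query ID, subject ID, then the BLAST taxonomy name.
        "-outfmt", "6 qseqid sseqid sblastnames",
        "-max_target_seqs", "1",
        "-out", out_tabular,
    ]


def tally_top_hits(query_ids, blast_rows, species_col=2):
    """Count the species of the first hit per query, using 'None' for no hit.

    Each row is a list of fields; column 0 is the query ID and species_col
    (an assumed position) holds the species or group name.
    """
    top_hit = {}
    for row in blast_rows:
        # Keep only the first row per query, dropping any extra HSPs.
        top_hit.setdefault(row[0], row[species_col])
    counts = Counter(top_hit.get(q, "None") for q in query_ids)
    # Sort by descending count, then by name for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Toy data only: ten made-up query IDs and a handful of made-up hits.
queries = sample_evenly(["seq%d" % i for i in range(10)], 5)
rows = [
    ["seq0", "subjA", "nematodes"],
    ["seq0", "subjA", "nematodes"],  # second HSP for the same match
    ["seq2", "subjB", "nematodes"],
    ["seq4", "subjC", "ascomycetes"],
]
for name, count in tally_top_hits(queries, rows):
    print(count, name, sep="\t")
```

Sampling by index keeps the selection deterministic, so rerunning on the same file gives the same sequences.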
44
45 Finally, we suggest visualising the sorted tally table as a pie chart.
46
47
48 Sample Data
49 ===========
50
51 As an example, you can upload the transcriptome assembly of the nematode
52 *Nacobbus aberrans* from Eves-van den Akker *et al.* (2015),
53 http://dx.doi.org/10.1093/gbe/evu171 using this URL:
54
55 http://nematode.net/Data/nacobbus_aberrans_transcript_assembly/N.abberans_reference_no_contam.zip
56
57 Running this workflow with a copy of the NCBI non-redundant ``nr`` database
58 from 16 Oct 2014 (which did **not** contain this *N. aberrans* dataset) gave
59 the following results. Note that 609 of the 1000 sequences gave no BLAST hit.
60
61 ===== ==================
62 Count Subject Blast Name
63 ----- ------------------
64 609 None
65 244 nematodes
66 30 ascomycetes
67 27 eukaryotes
68 8 basidiomycetes
69 6 aphids
70 5 eudicots
71 5 flies
72 ... ...
73 ===== ==================
74
75 As you might guess from the filename ``N.abberans_reference_no_contam.fasta``,
76 this transcriptome assembly has already had obvious contamination removed.
77
78 At the time of writing, Galaxy's visualizations could not be included in
79 a workflow. You can generate a pie chart from the final count file using
80 the counts (c1) and labels (c2), like this:
81
82 .. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/N_abberans_piechart_mouseover.png
83
84 Note the nematode count in this image was shown as a mouse-over effect.
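
To make the pie chart input concrete, the sketch below turns a two-column count file (counts in c1, labels in c2) into percentage shares. It assumes tab-separated lines, as in a Galaxy tabular file. The function name ``pie_shares`` is hypothetical, and the three rows are copied from the table above, with the total of 1000 sequences supplied explicitly because that table is truncated.

```python
def pie_shares(lines, total=None):
    """Return (label, percentage) pairs from 'count<TAB>label' lines."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        count, label = line.split("\t", 1)
        rows.append((label, int(count)))
    if total is None:
        # Without an explicit total, the shares are relative to the rows given.
        total = sum(count for _, count in rows)
    return [(label, 100.0 * count / total) for label, count in rows]


example = ["609\tNone", "244\tnematodes", "30\tascomycetes"]
for label, percent in pie_shares(example, total=1000):
    print("%-12s %5.1f%%" % (label, percent))
```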
85
86
87 Disclaimer
88 ==========
89
90 Species assignment by top BLAST hit is not suitable for any in-depth
91 analysis. It is particularly prone to false positives where contaminants
92 in public datasets are mislabelled. See for example Ed Yong (2015),
93 "There's No Plague on the NYC Subway. No Platypuses Either.":
94
95 http://phenomena.nationalgeographic.com/2015/02/10/theres-no-plague-on-the-nyc-subway-no-platypuses-either/
96
97
98 Known Issues
99 ============
100
101 Counts
102 ------
103
104 This workflow uses the Galaxy "Count" tool, version 1.0.0, as shipped with
105 the current stable release (Galaxy v15.03, i.e. March 2015).
106
107 The updated "Count" tool version 1.0.1 includes a fix so that spaces are no
108 longer removed from the fields being counted. In the example above, the top
109 hits are not affected, but minor entries like "cellular slime molds" are shown
110 as "cellularslimemolds" instead (look closely at the pie chart key).
111
112 The updated "Count" tool version 1.0.1 also adds a new option to sort the
113 output, which avoids the additional sorting step in the current version of
114 the workflow.
115
116 A future update to this workflow will use the revised "Count" tool, once
117 this is included in the next stable Galaxy release - or migrated to the
118 Galaxy Tool Shed.
119
120 NCBI nr database
121 ----------------
122
123 The use of external datasets within Galaxy via the ``*.loc`` configuration
124 files undermines provenance tracking within Galaxy. This is exacerbated
125 by the lack of officially versioned BLAST database releases by the NCBI.
126
127 This workflow assumes that you have an entry ``nr`` in your ``blastdb_p.loc``
128 (the configuration file listing locally installed BLAST databases external
129 to Galaxy - consult the NCBI BLAST+ wrapper documentation for more details),
130 and that this points to a mirror of the latest NCBI "non-redundant" database
131 from ftp://ftp.ncbi.nlm.nih.gov/blast/db/
132
133 That is, the workflow is intended to be used against the *latest* nr database,
134 and is therefore not reproducible over the long term as the database changes.
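
Before running the workflow, it may help to confirm that the ``nr`` key is actually present. The sketch below parses the text of a ``blastdb_p.loc`` file, assuming the usual tab-separated layout of key, caption and database path, with ``#`` marking comment lines. The function name, the caption and the path in the example are hypothetical.

```python
def parse_blast_loc(text):
    """Map each database key to its (caption, path) from .loc file text."""
    entries = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # ignore malformed lines
        key, caption, path = fields[:3]
        entries[key] = (caption, path)
    return entries


# Hypothetical file content; the caption and path are made up.
example = (
    "# key, caption and path, separated by tabs\n"
    "nr\tNCBI non-redundant (latest)\t/data/blastdb/nr\n"
)
entries = parse_blast_loc(example)
if "nr" not in entries:
    raise SystemExit("No 'nr' entry found in blastdb_p.loc")
print("nr ->", entries["nr"][1])
```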
11 135
12 136
13 Availability 137 Availability
14 ============ 138 ============
15 139
16 This workflow is available to download and/or install from the main 140 This workflow is available to download and/or install from the main Galaxy Tool Shed:
17 Galaxy Tool Shed: 141
18 142 http://toolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species
19 http://toolshed.g2.bx.psu.edu/view/peterjc/secreted_protein_workflow
20 143
21 Test releases (which should not normally be used) are on the Test Tool Shed: 144 Test releases (which should not normally be used) are on the Test Tool Shed:
22 145
23 http://testtoolshed.g2.bx.psu.edu/view/peterjc/secreted_protein_workflow 146 http://testtoolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species
24 147
25 Development is being done on GitHub here: 148 Development is being done on GitHub here:
26 149
27 https://github.com/peterjc/pico_galaxy/tree/master/workflows/secreted_protein_workflow 150 https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species
28
29
30 Sample Data
31 ===========
32
33 This workflow was developed and run on several nematode species. For example,
34 try the protein set for *Bursaphelenchus xylophilus* (Kikuchi et al. 2011):
35
36 ftp://ftp.sanger.ac.uk/pub/pathogens/Bursaphelenchus/xylophilus/Assembly-v1.2/BUX.v1.2.genedb.protein.fa.gz
37
38 You can upload this directly into Galaxy via this URL. Galaxy will handle
39 removing the gzip compression to give you the FASTA protein file which has
40 18,074 sequences. The expected result (selecting organism type Eukaryote)
41 is a FASTA protein file of 2,297 predicted secreted protein sequences.
42 151
43 152
44 Citation 153 Citation
45 ======== 154 ========
46 155
47 If you use this workflow directly, or a derivative of it, in work leading 156 Please cite the following paper (currently available as a preprint):
48 to a scientific publication, please cite: 157
49 158 NCBI BLAST+ integrated into Galaxy.
50 Cock, P.J.A. and Pritchard, L. (2014). Galaxy as a platform for identifying 159 P.J.A. Cock, J.M. Chilton, B. Gruening, J.E. Johnson, N. Soranzo
51 candidate pathogen effectors. Chapter 1 in "Plant-Pathogen Interactions: 160 bioRxiv DOI: http://dx.doi.org/10.1101/014043 (preprint)
52 Methods and Protocols (Second Edition)"; P. Birch, J. Jones, and J.I. Bos, eds. 161
53 Methods in Molecular Biology. Humana Press, Springer. ISBN 978-1-62703-985-7. 162 You should also cite Galaxy, and the NCBI BLAST+ tools:
54 http://www.springer.com/life+sciences/plant+sciences/book/978-1-62703-985-7 163
55 164 BLAST+: architecture and applications.
56 Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013). 165 C. Camacho et al. BMC Bioinformatics 2009, 10:421.
57 Galaxy tools and workflows for sequence analysis with applications 166 DOI: http://dx.doi.org/10.1186/1471-2105-10-421
58 in molecular plant pathology. PeerJ 1:e167 167
59 http://dx.doi.org/10.7717/peerj.167 168
60 169 Automated Installation
61 Bendtsen, J.D., Nielsen, H., von Heijne, G., Brunak, S. (2004) 170 ======================
62 Improved prediction of signal peptides: SignalP 3.0. J Mol Biol 340: 783–95. 171
63 http://dx.doi.org/10.1016/j.jmb.2004.05.028 172 Installation via the Galaxy Tool Shed should take care of the dependencies
64 173 on Galaxy tools including the NCBI BLAST+ wrappers and associated binaries.
65 Krogh, A., Larsson, B., von Heijne, G., Sonnhammer, E. (2001) 174
66 Predicting transmembrane protein topology with a hidden Markov model: 175 However, this workflow requires a current version of the NCBI nr protein
67 application to complete genomes. J Mol Biol 305: 567–580. 176 BLAST database to be listed in ``blastdb_p.loc`` with the key ``nr`` (lower
68 http://dx.doi.org/10.1006/jmbi.2000.4315 177 case).
69
70
71 Additional References
72 =====================
73
74 Kikuchi, T., Cotton, J.A., Dalzell, J.J., Hasegawa, K., et al. (2011)
75 Genomic insights into the origin of parasitism in the emerging plant
76 pathogen *Bursaphelenchus xylophilus*. PLoS Pathog 7: e1002219.
77 http://dx.doi.org/10.1371/journal.ppat.1002219
78
79 Jones, J.T., Kumar, A., Pylypenko, L.A., Thirugnanasambandam, A., et al. (2009)
80 Identification and functional characterization of effectors in expressed
81 sequence tags from various life cycle stages of the potato cyst nematode
82 *Globodera pallida*. Mol Plant Pathol 10: 815–28.
83 http://dx.doi.org/10.1111/j.1364-3703.2009.00585.x
84
85
86 Dependencies
87 ============
88
89 These dependencies should be resolved automatically via the Galaxy Tool Shed:
90
91 * http://toolshed.g2.bx.psu.edu/view/peterjc/tmhmm_and_signalp
92 * http://toolshed.g2.bx.psu.edu/view/peterjc/seq_filter_by_id
93
94 However, at the time of writing those Galaxy tools have their own
95 dependencies required for this workflow which require manual
96 installation (SignalP v3.0 and TMHMM v2.0).
97 178
98 179
99 History 180 History
100 ======= 181 =======
101 182
102 ======= ====================================================================== 183 ======= ======================================================================
103 Version Changes 184 Version Changes
104 ------- ---------------------------------------------------------------------- 185 ------- ----------------------------------------------------------------------
105 v0.0.1 - Initial release to Tool Shed (May, 2013) 186 v0.1.0 - Initial Tool Shed release, targeting NCBI BLAST+ 2.2.29
106 - Expanded README file to include example data
107 v0.0.2 - Updated versions of the tools used, including core Galaxy Filter
108 tool to avoid warning about new ``header_lines`` parameter.
109 - Added link to Tool Shed in the workflow annotation explaining there
110 is a README file with sample data, and a requested citation.
111 ======= ====================================================================== 187 ======= ======================================================================
112 188
113 189
114 Developers 190 Developers
115 ========== 191 ==========
116 192
117 This workflow is under source code control here: 193 This workflow is under source code control here:
118 194
119 https://github.com/peterjc/pico_galaxy/tree/master/workflows/secreted_protein_workflow 195 https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species
120 196
121 To prepare the tar-ball for uploading to the Tool Shed, I use this: 197 To prepare the tar-ball for uploading to the Tool Shed, I use this:
122 198
123 $ tar -czf secreted_protein_workflow.tar.gz README.rst repository_dependencies.xml secreted_protein_workflow.ga 199 $ tar -czf blast_top_hit_species.tar.gz README.rst repository_dependencies.xml blast_top_hit_species.ga blast_top_hit_species.png N_abberans_piechart_mouseover.png
124 200
125 Check this, 201 Check this,
126 202
127 $ tar -tzf secreted_protein_workflow.tar.gz 203 $ tar -tzf blast_top_hit_species.tar.gz
128 README.rst 204 README.rst
129 repository_dependencies.xml 205 repository_dependencies.xml
130 secreted_protein_workflow.ga 206 blast_top_hit_species.ga
207 blast_top_hit_species.png
208 N_abberans_piechart_mouseover.png
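
The same packaging and checking steps could also be scripted. This is a minimal sketch using Python's standard ``tarfile`` module, not the author's actual procedure. The helper names are hypothetical, and the demonstration uses empty placeholder files in a temporary directory rather than the real workflow files. Opening the archive in ``w:gz`` mode writes gzip compression, which matches the ``.tar.gz`` extension.

```python
import os
import tarfile
import tempfile


def make_tarball(archive, filenames, base_dir="."):
    """Write the named files from base_dir into a gzip-compressed tar archive."""
    with tarfile.open(archive, "w:gz") as tar:
        for name in filenames:
            # arcname keeps the paths inside the archive flat.
            tar.add(os.path.join(base_dir, name), arcname=name)


def list_tarball(archive):
    """Return the member names, much like ``tar -tzf``."""
    with tarfile.open(archive, "r:gz") as tar:
        return tar.getnames()


# Demonstration with empty placeholder files in a temporary directory.
names = ["README.rst", "repository_dependencies.xml", "blast_top_hit_species.ga"]
with tempfile.TemporaryDirectory() as tmp:
    for name in names:
        open(os.path.join(tmp, name), "w").close()
    archive = os.path.join(tmp, "blast_top_hit_species.tar.gz")
    make_tarball(archive, names, base_dir=tmp)
    print("\n".join(list_tarball(archive)))
```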
209
210
211 Licence (MIT)
212 =============
213
214 Permission is hereby granted, free of charge, to any person obtaining a copy
215 of this software and associated documentation files (the "Software"), to deal
216 in the Software without restriction, including without limitation the rights
217 to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
218 copies of the Software, and to permit persons to whom the Software is
219 furnished to do so, subject to the following conditions:
220
221 The above copyright notice and this permission notice shall be included in
222 all copies or substantial portions of the Software.
223
224 THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
225 IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
226 FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
227 AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
228 LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
229 OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
230 THE SOFTWARE.