diff README.rst @ 10:2c8931827fa5 draft

Uploaded with note about NR versioning
author peterjc
date Mon, 30 Mar 2015 11:46:13 -0400
parents 3b5eecc9551e
children 99209ed2ec87
--- a/README.rst	Fri Oct 25 10:22:35 2013 -0400
+++ b/README.rst	Mon Mar 30 11:46:13 2015 -0400
@@ -1,99 +1,180 @@
-This is package is a Galaxy workflow for the identification of candidate
-secreted proteins from a given protein FASTA file.
+Introduction
+============
+
+Galaxy is a web-based platform for biological data analysis, supporting
+extension with additional tools (often wrappers for existing command line
+tools) and datatypes. See http://www.galaxyproject.org/ and the public
+server at http://usegalaxy.org for an example.
+
+The NCBI BLAST suite is a widely used set of tools for biological sequence
+comparison. It is available as standalone binaries for use at the command
+line, and via the NCBI website for smaller searches. For more details see
+http://blast.ncbi.nlm.nih.gov/Blast.cgi
+
+This is an example workflow using the Galaxy wrappers for NCBI BLAST+,
+see https://github.com/peterjc/galaxy_blast
+
+
+Galaxy workflow for counting species of top BLAST hits 
+======================================================
+
+This Galaxy workflow (file ``blast_top_hit_species.ga``) is intended for an
+initial assessment of a transcriptome assembly to give a crude indication of
+any major contamination present based on the species of the top BLAST hit
+of 1000 representative sequences.
+
+.. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/blast_top_hit_species.png
+
+In words, the workflow proceeds as follows:
+
+1. Upload/import your transcriptome assembly or any nucleotide FASTA file.
+2. Sample 1000 representative sequences, selected evenly through
+   the file.
+3. Convert the sampled FASTA file into a three column tabular file.
+4. Run NCBI BLASTX of the sampled FASTA file against the latest NCBI ``nr``
+   database (assuming this is already set up on your local Galaxy
+   under the alias ``nr``), requesting tabular output including the taxonomy
+   fields, and at most one matching target sequence.
+5. Remove any duplicate alignments (multiple HSPs for the same match).
+6. Combine the filtered BLAST output with the tabular version of the 1000
+   sequences to give a new tabular file with exactly 1000 lines, adding
+   ``None`` for sequences missing a BLAST hit.
+7. Count the BLAST species names in this file.
+8. Sort the counts.
+
+Finally, we suggest visualising the sorted tally table as a pie chart.
+
+
+Sample Data
+===========
+
+As an example, you can upload the transcriptome assembly of the nematode
+*Nacobbus aberrans* from Eves-van den Akker *et al.* (2014),
+http://dx.doi.org/10.1093/gbe/evu171 using this URL:
+
+http://nematode.net/Data/nacobbus_aberrans_transcript_assembly/N.abberans_reference_no_contam.zip
+
+Running this workflow with a copy of the NCBI non-redundant ``nr`` database
+from 16 Oct 2014 (which did **not** contain this *N. aberrans* dataset) gave
+the following results - note 609 out of the 1000 sequences gave no BLAST hit.
 
-It runs SignalP v3.0 (Bendtsen et al. 2004) and selects only proteins with a
-strong predicted signal peptide, and then runs TMHMM v2.0 (Krogh et al. 2001)
-on those, and selects only proteins without a predicted trans-membrane helix.
-This workflow was used in Kikuchi et al. (2011), and is a simplification of
-the candidate effector protocol described in Jones et al. (2009).
+===== ==================
+Count Subject Blast Name
+----- ------------------
+  609 None
+  244 nematodes
+   30 ascomycetes
+   27 eukaryotes
+    8 basidiomycetes
+    6 aphids
+    5 eudicots
+    5 flies
+  ... ...
+===== ==================
+
+As you might guess from the filename ``N.abberans_reference_no_contam.fasta``,
+this transcriptome assembly has already had obvious contamination removed.
+
+At the time of writing, Galaxy's visualizations could not be included in
+a workflow. You can generate a pie chart from the final count file using
+the counts (c1) and labels (c2), like this:
+
+.. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/N_abberans_piechart_mouseover.png
+
+Note the nematode count in this image was shown as a mouse-over effect.
+
+
+Disclaimer
+==========
+
+Species assignment by top BLAST hit is not suitable for any in-depth
+analysis. It is particularly prone to false positives where contaminants
+in public datasets are mislabelled. See for example Ed Yong (2015),
+"There's No Plague on the NYC Subway. No Platypuses Either.":
+
+http://phenomena.nationalgeographic.com/2015/02/10/theres-no-plague-on-the-nyc-subway-no-platypuses-either/
+
 
-See http://www.galaxyproject.org for information about the Galaxy Project.
+Known Issues
+============
+
+Counts
+------
+
+This workflow uses the Galaxy "Count" tool, version 1.0.0, as shipped with
+the current stable release (Galaxy v15.03, i.e. March 2015).
+
+The updated "Count" tool version 1.0.1 includes a fix so that spaces in the
+fields being counted are no longer removed. In the example above, while the
+top hits are not affected, minor entries like "cellular slime molds" are
+shown as "cellularslimemolds" instead (look closely at the pie chart key).
+
+The updated "Count" tool version 1.0.1 also adds a new option to sort the
+output, which avoids the additional sorting step in the current version of
+the workflow.
+
+A future update to this workflow will use the revised "Count" tool, once
+this is included in the next stable Galaxy release - or migrated to the
+Galaxy Tool Shed.
+
+NCBI nr database
+----------------
+
+The use of external datasets within Galaxy via the ``*.loc`` configuration
+files undermines provenance tracking within Galaxy. This is exacerbated
+by the lack of officially versioned BLAST database releases by the NCBI.
+
+This workflow assumes that you have an entry ``nr`` in your ``blastdb_p.loc``
+(the configuration file listing locally installed BLAST databases external
+to Galaxy - consult the NCBI BLAST+ wrapper documentation for more details),
+and that this points to a mirror of the latest NCBI "non-redundant" database
+from ftp://ftp.ncbi.nlm.nih.gov/blast/db/
+
+In other words, the workflow is intended to be used against the *latest* nr
+database, and thus is not reproducible over the long term as the database changes.
 
 
 Availability
 ============
 
-This workflow is available to download and/or install from the main
-Galaxy Tool Shed:
+This workflow is available to download and/or install from the main Galaxy Tool Shed:
 
-http://toolshed.g2.bx.psu.edu/view/peterjc/secreted_protein_workflow
+http://toolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species
 
 Test releases (which should not normally be used) are on the Test Tool Shed:
 
-http://testtoolshed.g2.bx.psu.edu/view/peterjc/secreted_protein_workflow
+http://testtoolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species
 
 Development is being done on github here:
 
-https://github.com/peterjc/pico_galaxy/tree/master/workflows/secreted_protein_workflow
-
-
-Sample Data
-===========
-
-This workflow was developed and run on several nematode species. For example,
-try the protein set for *Bursaphelenchus xylophilus* (Kikuchi et al. 2011):
-
-ftp://ftp.sanger.ac.uk/pub/pathogens/Bursaphelenchus/xylophilus/Assembly-v1.2/BUX.v1.2.genedb.protein.fa.gz
-
-You can upload this directly into Galaxy via this URL. Galaxy will handle
-removing the gzip compression to give you the FASTA protein file which has
-18,074 sequences. The expected result (selecting organism type Eukaryote)
-is a FASTA protein file of 2,297 predicted secreted protein sequences.
+https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species
 
 
 Citation
 ========
 
-If you use this workflow directly, or a derivative of it, in work leading
-to a scientific publication, please cite:
+Please cite the following paper (currently available as a preprint):
 
-Cock, P.J.A. and Pritchard, L. (2014). Galaxy as a platform for identifying
-candidate pathogen effectors. Chapter 1 in "Plant-Pathogen Interactions:
-Methods and Protocols (Second Edition)"; P. Birch, J. Jones, and J.I. Bos, eds.
-Methods in Molecular Biology. Humana Press, Springer. ISBN 978-1-62703-985-7.
-http://www.springer.com/life+sciences/plant+sciences/book/978-1-62703-985-7
+NCBI BLAST+ integrated into Galaxy.
+P.J.A. Cock, J.M. Chilton, B. Gruening, J.E. Johnson, N. Soranzo
+bioRxiv DOI: http://dx.doi.org/10.1101/014043 (preprint)
 
-Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013).
-Galaxy tools and workflows for sequence analysis with applications
-in molecular plant pathology. PeerJ 1:e167
-http://dx.doi.org/10.7717/peerj.167
+You should also cite Galaxy, and the NCBI BLAST+ tools:
 
-Bendtsen, J.D., Nielsen, H., von Heijne, G., Brunak, S. (2004)
-Improved prediction of signal peptides: SignalP 3.0. J Mol Biol 340: 783–95.
-http://dx.doi.org/10.1016/j.jmb.2004.05.028
-
-Krogh, A., Larsson, B., von Heijne, G., Sonnhammer, E. (2001)
-Predicting transmembrane protein topology with a hidden Markov model:
-application to complete genomes. J Mol Biol 305: 567- 580.
-http://dx.doi.org/10.1006/jmbi.2000.4315
+BLAST+: architecture and applications.
+C. Camacho et al. BMC Bioinformatics 2009, 10:421.
+DOI: http://dx.doi.org/10.1186/1471-2105-10-421
 
 
-Additional References
-=====================
-
-Kikuchi, T., Cotton, J.A., Dalzell, J.J., Hasegawa. K., et al. (2011)
-Genomic insights into the origin of parasitism in the emerging plant
-pathogen *Bursaphelenchus xylophilus*. PLoS Pathog 7: e1002219.
-http://dx.doi.org/10.1371/journal.ppat.1002219
+Automated Installation
+======================
 
-Jones, J.T., Kumar, A., Pylypenko, L.A., Thirugnanasambandam, A., et al. (2009)
-Identification and functional characterization of effectors in expressed
-sequence tags from various life cycle stages of the potato cyst nematode
-*Globodera pallida*. Mol Plant Pathol 10: 815–28.
-http://dx.doi.org/10.1111/j.1364-3703.2009.00585.x
-
+Installation via the Galaxy Tool Shed should take care of the dependencies
+on Galaxy tools including the NCBI BLAST+ wrappers and associated binaries.
 
-Dependencies
-============
-
-These dependencies should be resolved automatically via the Galaxy Tool Shed:
-
-* http://toolshed.g2.bx.psu.edu/view/peterjc/tmhmm_and_signalp
-* http://toolshed.g2.bx.psu.edu/view/peterjc/seq_filter_by_id
-
-However, at the time of writing those Galaxy tools have their own
-dependencies required for this workflow which require manual
-installation (SignalP v3.0 and TMHMM v2.0).
+However, this workflow requires a current version of the NCBI nr protein
+BLAST database to be listed in ``blastdb_p.loc`` with the key ``nr`` (lower
+case).
 
 
 History
@@ -102,12 +183,7 @@
 ======= ======================================================================
 Version Changes
 ------- ----------------------------------------------------------------------
-v0.0.1  - Initial release to Tool Shed (May, 2013)
-        - Expanded README file to include example data
-v0.0.2  - Updated versions of the tools used, inclulding core Galaxy Filter
-          tool to avoid warning about new ``header_lines`` parameter.
-        - Added link to Tool Shed in the workflow annotation explaining there
-          is a README file with sample data, and a requested citation.
+v0.1.0  - Initial Tool Shed release, targeting NCBI BLAST+ 2.2.29
 ======= ======================================================================
 
 
@@ -116,15 +192,39 @@
 
 This workflow is under source code control here:
 
-https://github.com/peterjc/pico_galaxy/tree/master/workflows/secreted_protein_workflow
+https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species
 
 To prepare the tar-ball for uploading to the Tool Shed, I use this:
 
-    $ tar -cf secreted_protein_workflow.tar.gz README.rst repository_dependencies.xml secreted_protein_workflow.ga
+    $ tar -czf blast_top_hit_species.tar.gz README.rst repository_dependencies.xml blast_top_hit_species.ga blast_top_hit_species.png N_abberans_piechart_mouseover.png
 
 Check this,
 
-    $ tar -tzf secreted_protein_workflow.tar.gz 
+    $ tar -tzf blast_top_hit_species.tar.gz
     README.rst
     repository_dependencies.xml
-    secreted_protein_workflow.ga
+    blast_top_hit_species.ga
+    blast_top_hit_species.png
+    N_abberans_piechart_mouseover.png
+
+
+Licence (MIT)
+=============
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
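
For illustration only, and not part of the changeset above: steps 5 to 8 of the workflow described in the README (drop duplicate HSP rows, pad sequences without a hit with ``None``, count, then sort) can be sketched in Python. This is a minimal sketch under stated assumptions, not how the Galaxy tools are implemented. The function name ``tally_top_hits`` and the in-memory ``(query_id, label)`` pairs are hypothetical stand-ins for the tabular files that the workflow passes between steps.

```python
from collections import Counter


def tally_top_hits(query_ids, blast_rows):
    """Tally one label per sampled query, using "None" where there is no hit.

    query_ids:  identifiers of the sampled sequences (hypothetical input).
    blast_rows: (query_id, label) pairs in BLAST output order; several rows
                may share a query_id when one match has multiple HSPs.
    """
    first_label = {}
    for query_id, label in blast_rows:
        # Keep only the first row seen per query, dropping extra HSP rows.
        first_label.setdefault(query_id, label)

    # One entry per sampled sequence, so the tally sums to len(query_ids).
    counts = Counter(first_label.get(query_id, "None") for query_id in query_ids)

    # Sort by descending count, breaking ties alphabetically by label.
    return sorted(
        ((n, label) for label, n in counts.items()),
        key=lambda pair: (-pair[0], pair[1]),
    )


if __name__ == "__main__":
    queries = ["seq1", "seq2", "seq3", "seq4"]
    hits = [
        ("seq1", "nematodes"),
        ("seq1", "nematodes"),  # second HSP for the same match
        ("seq3", "ascomycetes"),
        ("seq4", "nematodes"),
    ]
    for count, label in tally_top_hits(queries, hits):
        print(count, label, sep="\t")
```

Because every sampled identifier contributes exactly one label, the counts add up to the number of sampled sequences, mirroring the "exactly 1000 lines" property the README describes for step 6.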