changeset 10:2c8931827fa5 draft
Uploaded with note about NR versioning
| author | peterjc | 
|---|---|
| date | Mon, 30 Mar 2015 11:46:13 -0400 | 
| parents | 3b5eecc9551e | 
| children | 99209ed2ec87 | 
| files | N_abberans_piechart_mouseover.png README.rst blast_top_hit_species.ga blast_top_hit_species.png repository_dependencies.xml secreted_protein_workflow.ga | 
| diffstat | 6 files changed, 518 insertions(+), 373 deletions(-) | 
--- a/README.rst	Fri Oct 25 10:22:35 2013 -0400
+++ b/README.rst	Mon Mar 30 11:46:13 2015 -0400
@@ -1,99 +1,180 @@
-This is package is a Galaxy workflow for the identification of candidate
-secreted proteins from a given protein FASTA file.
+Introduction
+============
+
+Galaxy is a web-based platform for biological data analysis, supporting
+extension with additional tools (often wrappers for existing command line
+tools) and datatypes. See http://www.galaxyproject.org/ and the public
+server at http://usegalaxy.org for an example.
+
+The NCBI BLAST suite is a widely used set of tools for biological sequence
+comparison. It is available as standalone binaries for use at the command
+line, and via the NCBI website for smaller searches. For more details see
+http://blast.ncbi.nlm.nih.gov/Blast.cgi
+
+This is an example workflow using the Galaxy wrappers for NCBI BLAST+,
+see https://github.com/peterjc/galaxy_blast
+
+
+Galaxy workflow for counting species of top BLAST hits
+======================================================
+
+This Galaxy workflow (file ``blast_top_hit_species.ga``) is intended for an
+initial assessment of a transcriptome assembly to give a crude indication of
+any major contamination present based on the species of the top BLAST hit
+of 1000 representative sequences.
+
+.. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/blast_top_hit_species.png
+
+In words, the workflow proceeds as follows:
+
+1. Upload/import your transcriptome assembly or any nucleotide FASTA file.
+2. Sample 1000 representative sequences, selected uniformly/evenly through
+   the file.
+3. Convert the sampled FASTA file into a three column tabular file.
+4. Run NCBI BLASTX of the sampled FASTA file against the latest NCBI ``nr``
+   database (assuming this is already set up on your local Galaxy
+   under the alias ``nr``), requesting tabular output including the taxonomy
+   fields, and at most one matching target sequence.
+5. Remove any duplicate alignments (multiple HSPs for the same match).
+6. Combine the filtered BLAST output with the tabular version of the 1000
+   sequences to give a new tabular file with exactly 1000 lines, adding
+   ``None`` for sequences missing a BLAST hit.
+7. Count the BLAST species names in this file.
+8. Sort the counts.
+
+Finally we would suggest visualising the sorted tally table as a Pie Chart.
+
+
+Sample Data
+===========
+
+As an example, you can upload the transcriptome assembly of the nematode
+*Nacobbus abberans* from Eves van den Akker *et al.* (2015),
+http://dx.doi.org/10.1093/gbe/evu171 using this URL:
+
+http://nematode.net/Data/nacobbus_aberrans_transcript_assembly/N.abberans_reference_no_contam.zip
+
+Running this workflow with a copy of the NCBI non-redundant ``nr`` database
+from 16 Oct 2014 (which did **not** contain this *N. abberans* dataset) gave
+the following results - note 609 out of the 1000 sequences gave no BLAST hit.
 
-It runs SignalP v3.0 (Bendtsen et al. 2004) and selects only proteins with a
-strong predicted signal peptide, and then runs TMHMM v2.0 (Krogh et al. 2001)
-on those, and selects only proteins without a predicted trans-membrane helix.
-This workflow was used in Kikuchi et al. (2011), and is a simplification of
-the candidate effector protocol described in Jones et al. (2009).
+===== ==================
+Count Subject Blast Name
+----- ------------------
+  609 None
+  244 nematodes
+   30 ascomycetes
+   27 eukaryotes
+    8 basidiomycetes
+    6 aphids
+    5 eudicots
+    5 flies
+  ... ...
+===== ==================
+
+As you might guess from the filename ``N.abberans_reference_no_contam.fasta``,
+this transcriptome assembly has already had obvious contamination removed.
+
+At the time of writing, Galaxy's visualizations could not be included in
+a workflow. You can generate a pie chart from the final count file using
+the counts (c1) and labels (c2), like this:
+
+.. image:: https://raw.githubusercontent.com/peterjc/galaxy_blast/master/workflows/blast_top_hit_species/N_abberans_piechart_mouseover.png
+
+Note the nematode count in this image was shown as a mouse-over effect.
+
+
+Disclaimer
+==========
+
+Species assignment by top BLAST hit is not suitable for any in depth
+analysis. It is particularly prone to false positives where contaminants
+in public datasets are mislabelled. See for example Ed Yong (2015),
+"There's No Plague on the NYC Subway. No Platypuses Either.":
+
+http://phenomena.nationalgeographic.com/2015/02/10/theres-no-plague-on-the-nyc-subway-no-platypuses-either/
+
 
-See http://www.galaxyproject.org for information about the Galaxy Project.
+Known Issues
+============
+
+Counts
+------
+
+This workflow uses the Galaxy "Count" tool, version 1.0.0, as shipped with
+the current stable release (Galaxy v15.03, i.e. March 2015).
+
+The updated "Count" tool version 1.0.1 includes a fix not to remove spaces
+in the fields being counted. In the example above, while the top hits are
+not affected, minor entries like "cellular slime molds" are shown as
+"cellularslimemolds" instead (look closely at the Pie Chart key).
+
+The updated "Count" tool version 1.0.1 also adds a new option to sort the
+output, which avoids the additional sorting step in the current version of
+the workflow.
+
+A future update to this workflow will use the revised "Count" tool, once
+this is included in the next stable Galaxy release - or migrated to the
+Galaxy Tool Shed.
+
+NCBI nr database
+----------------
+
+The use of external datasets within Galaxy via the ``*.loc`` configuration
+files undermines provenance tracking within Galaxy. This is exacerbated
+by the lack of officially versioned BLAST database releases by the NCBI.
+
+This workflow assumes that you have an entry ``nr`` in your ``blastdb_p.loc``
+(the configuration file listing locally installed BLAST databases external
+to Galaxy - consult the NCBI BLAST+ wrapper documentation for more details),
+and that this points to a mirror of the latest NCBI "non-redundant" database
+from ftp://ftp.ncbi.nlm.nih.gov/blast/db/
+
+i.e. The workflow is intended to be used against the *latest* nr database,
+and thus is not reproducible over the long term as the database changes.
 
 
 Availability
 ============
 
-This workflow is available to download and/or install from the main
-Galaxy Tool Shed:
+This workflow is available to download and/or install from the main Galaxy Tool Shed:
 
-http://toolshed.g2.bx.psu.edu/view/peterjc/secreted_protein_workflow
+http://toolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species
 
 Test releases (which should not normally be used) are on the Test Tool Shed:
 
-http://testtoolshed.g2.bx.psu.edu/view/peterjc/secreted_protein_workflow
+http://testtoolshed.g2.bx.psu.edu/view/peterjc/blast_top_hit_species
 
 Development is being done on github here:
 
-https://github.com/peterjc/pico_galaxy/tree/master/workflows/secreted_protein_workflow
-
-
-Sample Data
-===========
-
-This workflow was developed and run on several nematode species. For example,
-try the protein set for *Bursaphelenchus xylophilus* (Kikuchi et al. 2011):
-
-ftp://ftp.sanger.ac.uk/pub/pathogens/Bursaphelenchus/xylophilus/Assembly-v1.2/BUX.v1.2.genedb.protein.fa.gz
-
-You can upload this directly into Galaxy via this URL. Galaxy will handle
-removing the gzip compression to give you the FASTA protein file which has
-18,074 sequences. The expected result (selecting organism type Eukaryote)
-is a FASTA protein file of 2,297 predicted secreted protein sequences.
+https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species
 
 
 Citation
 ========
 
-If you use this workflow directly, or a derivative of it, in work leading
-to a scientific publication, please cite:
+Please cite the following paper (currently available as a preprint):
 
-Cock, P.J.A. and Pritchard, L. (2014). Galaxy as a platform for identifying
-candidate pathogen effectors. Chapter 1 in "Plant-Pathogen Interactions:
-Methods and Protocols (Second Edition)"; P. Birch, J. Jones, and J.I. Bos, eds.
-Methods in Molecular Biology. Humana Press, Springer. ISBN 978-1-62703-985-7.
-http://www.springer.com/life+sciences/plant+sciences/book/978-1-62703-985-7
+NCBI BLAST+ integrated into Galaxy.
+P.J.A. Cock, J.M. Chilton, B. Gruening, J.E. Johnson, N. Soranzo
+bioRxiv DOI: http://dx.doi.org/10.1101/014043 (preprint)
 
-Peter J.A. Cock, Björn A. Grüning, Konrad Paszkiewicz and Leighton Pritchard (2013).
-Galaxy tools and workflows for sequence analysis with applications
-in molecular plant pathology. PeerJ 1:e167
-http://dx.doi.org/10.7717/peerj.167
+You should also cite Galaxy, and the NCBI BLAST+ tools:
 
-Bendtsen, J.D., Nielsen, H., von Heijne, G., Brunak, S. (2004)
-Improved prediction of signal peptides: SignalP 3.0. J Mol Biol 340: 783–95.
-http://dx.doi.org/10.1016/j.jmb.2004.05.028
-
-Krogh, A., Larsson, B., von Heijne, G., Sonnhammer, E. (2001)
-Predicting transmembrane protein topology with a hidden Markov model:
-application to complete genomes. J Mol Biol 305: 567- 580.
-http://dx.doi.org/10.1006/jmbi.2000.4315
+BLAST+: architecture and applications.
+C. Camacho et al. BMC Bioinformatics 2009, 10:421.
+DOI: http://dx.doi.org/10.1186/1471-2105-10-421
 
 
-Additional References
-=====================
-
-Kikuchi, T., Cotton, J.A., Dalzell, J.J., Hasegawa. K., et al. (2011)
-Genomic insights into the origin of parasitism in the emerging plant
-pathogen *Bursaphelenchus xylophilus*. PLoS Pathog 7: e1002219.
-http://dx.doi.org/10.1371/journal.ppat.1002219
+Automated Installation
+======================
 
-Jones, J.T., Kumar, A., Pylypenko, L.A., Thirugnanasambandam, A., et al. (2009)
-Identification and functional characterization of effectors in expressed
-sequence tags from various life cycle stages of the potato cyst nematode
-*Globodera pallida*. Mol Plant Pathol 10: 815–28.
-http://dx.doi.org/10.1111/j.1364-3703.2009.00585.x
-
+Installation via the Galaxy Tool Shed should take care of the dependencies
+on Galaxy tools including the NCBI BLAST+ wrappers and associated binaries.
 
-Dependencies
-============
-
-These dependencies should be resolved automatically via the Galaxy Tool Shed:
-
-* http://toolshed.g2.bx.psu.edu/view/peterjc/tmhmm_and_signalp
-* http://toolshed.g2.bx.psu.edu/view/peterjc/seq_filter_by_id
-
-However, at the time of writing those Galaxy tools have their own
-dependencies required for this workflow which require manual
-installation (SignalP v3.0 and TMHMM v2.0).
+However, this workflow requires a current version of the NCBI nr protein
+BLAST database to be listed in ``blastdb_p.loc`` with the key ``nr`` (lower
+case).
 
 
 History
@@ -102,12 +183,7 @@
 ======= ======================================================================
 Version Changes
 ------- ----------------------------------------------------------------------
-v0.0.1  - Initial release to Tool Shed (May, 2013)
-        - Expanded README file to include example data
-v0.0.2  - Updated versions of the tools used, inclulding core Galaxy Filter
-          tool to avoid warning about new ``header_lines`` parameter.
-        - Added link to Tool Shed in the workflow annotation explaining there
-          is a README file with sample data, and a requested citation.
+v0.1.0  - Initial Tool Shed release, targeting NCBI BLAST+ 2.2.29
 ======= ======================================================================
 
 
@@ -116,15 +192,39 @@
 
 This workflow is under source code control here:
 
-https://github.com/peterjc/pico_galaxy/tree/master/workflows/secreted_protein_workflow
+https://github.com/peterjc/galaxy_blast/tree/master/workflows/blast_top_hit_species
 
 To prepare the tar-ball for uploading to the Tool Shed, I use this:
 
-    $ tar -cf secreted_protein_workflow.tar.gz README.rst repository_dependencies.xml secreted_protein_workflow.ga
+    $ tar -czf blast_top_hit_species.tar.gz README.rst repository_dependencies.xml blast_top_hit_species.ga blast_top_hit_species.png N_abberans_piechart_mouseover.png
 
 Check this,
 
-    $ tar -tzf secreted_protein_workflow.tar.gz
+    $ tar -tzf blast_top_hit_species.tar.gz
     README.rst
     repository_dependencies.xml
-    secreted_protein_workflow.ga
+    blast_top_hit_species.ga
+    blast_top_hit_species.png
+    N_abberans_piechart_mouseover.png
+
+
+Licence (MIT)
+=============
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
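Steps 5 to 8 of the workflow described in the README above (remove duplicate HSP lines, join against all sampled sequences with ``None`` as the fill value, count, sort) can be sketched outside Galaxy in plain Python. This is only an illustration, not how the Galaxy tools are implemented. The function name ``tally_top_hits`` is hypothetical, it keeps the first row per query rather than keying on columns 1-2 as the Unique step does, and it assumes the 17-column BLAST tabular layout requested in the workflow (12 standard columns followed by ``staxids``, ``sscinames``, ``scomnames``, ``sblastnames``, ``sskingdoms``), which would make ``sblastnames`` the 16th column.

```python
from collections import Counter


def tally_top_hits(query_ids, blast_rows, name_col=15):
    """Count the BLAST name of the top hit for each query.

    query_ids  -- identifiers of all sampled sequences (hits or not)
    blast_rows -- tab-separated BLAST lines, best hit first per query
    name_col   -- 0-based index of the name column (assumed: sblastnames)
    """
    top = {}
    for row in blast_rows:
        fields = row.rstrip("\n").split("\t")
        # Keep only the first line per query, dropping extra HSP lines.
        top.setdefault(fields[0], fields)

    # Queries with no BLAST line are tallied under the label "None".
    counts = Counter(
        top[q][name_col] if q in top else "None" for q in query_ids
    )
    # Largest counts first; ties broken by label for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

Because every sampled identifier contributes exactly one entry, the counts always sum to the number of sampled sequences, which is what makes the ``None`` row in the README table meaningful.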
--- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/blast_top_hit_species.ga Mon Mar 30 11:46:13 2015 -0400 @@ -0,0 +1,331 @@ +{ + "a_galaxy_workflow": "true", + "annotation": "", + "format-version": "0.1", + "name": "Species of top BLAST hits", + "steps": { + "0": { + "annotation": "", + "id": 0, + "input_connections": {}, + "inputs": [ + { + "description": "", + "name": "Transcriptome FASTA file" + } + ], + "label": null, + "name": "Input dataset", + "outputs": [], + "position": { + "left": 242, + "top": 119 + }, + "tool_errors": null, + "tool_id": null, + "tool_state": "{\"name\": \"Transcriptome FASTA file\"}", + "tool_version": null, + "type": "data_input", + "user_outputs": [], + "uuid": "e445b44b-02a7-4fd1-8944-cd680f967062" + }, + "1": { + "annotation": "This workflow is deliberately a simple/crude assessment, and there is no need to run BLASTX on all the sequences - a sample of 1000 should be enough.", + "id": 1, + "input_connections": { + "input_file": { + "id": 0, + "output_name": "output" + } + }, + "inputs": [], + "label": null, + "name": "Sub-sample sequences files", + "outputs": [ + { + "name": "output_file", + "type": "input" + } + ], + "position": { + "left": 435, + "top": 119 + }, + "post_job_actions": { + "RenameDatasetActionoutput_file": { + "action_arguments": { + "newname": "1000 sequences from #{input_file}" + }, + "action_type": "RenameDatasetAction", + "output_name": "output_file" + } + }, + "tool_errors": null, + "tool_id": "toolshed.g2.bx.psu.edu/repos/peterjc/sample_seqs/sample_seqs/0.2.1", + "tool_state": "{\"__page__\": 0, \"input_file\": \"null\", \"__rerun_remap_job_id__\": null, \"sampling\": \"{\\\"count\\\": \\\"1000\\\", \\\"type\\\": \\\"desired_count\\\", \\\"__current_case__\\\": 2}\", \"chromInfo\": \"\\\"/mnt/galaxy/galaxy-dist/tool-data/shared/ucsc/chrom/?.len\\\"\", \"interleaved\": \"\\\"False\\\"\"}", + "tool_version": "0.2.1", + "type": "tool", + "user_outputs": [], + "uuid": "87ce69ef-5fb0-41b0-9575-d3b96544f8be" + }, + 
"2": { + "annotation": "We only want one line per query, so limit this to the best scoring target sequence. Assumes current NCBI nr database is available locally as \"nr\".", + "id": 2, + "input_connections": { + "query": { + "id": 1, + "output_name": "output_file" + } + }, + "inputs": [], + "label": null, + "name": "NCBI BLAST+ blastx", + "outputs": [ + { + "name": "output1", + "type": "tabular" + } + ], + "position": { + "left": 489, + "top": 263 + }, + "post_job_actions": { + "RenameDatasetActionoutput1": { + "action_arguments": { + "newname": "Top BLAST match" + }, + "action_type": "RenameDatasetAction", + "output_name": "output1" + } + }, + "tool_errors": null, + "tool_id": "toolshed.g2.bx.psu.edu/repos/devteam/ncbi_blast_plus/ncbi_blastx_wrapper/0.1.01", + "tool_state": "{\"evalue_cutoff\": \"\\\"0.001\\\"\", \"__page__\": 0, \"adv_opts\": \"{\\\"adv_optional_id_files_opts\\\": {\\\"adv_optional_id_files_opts_selector\\\": \\\"none\\\", \\\"__current_case__\\\": 0}, \\\"matrix\\\": \\\"BLOSUM62\\\", \\\"adv_opts_selector\\\": \\\"advanced\\\", \\\"ungapped\\\": \\\"False\\\", \\\"filter_query\\\": \\\"True\\\", \\\"word_size\\\": \\\"0\\\", \\\"__current_case__\\\": 1, \\\"parse_deflines\\\": \\\"False\\\", \\\"strand\\\": \\\"-strand both\\\", \\\"max_hits\\\": \\\"1\\\"}\", \"__rerun_remap_job_id__\": null, \"db_opts\": \"{\\\"db_opts_selector\\\": \\\"db\\\", \\\"subject\\\": \\\"\\\", \\\"histdb\\\": \\\"\\\", \\\"__current_case__\\\": 0, \\\"database\\\": \\\"nr\\\"}\", \"query_gencode\": \"\\\"1\\\"\", \"query\": \"null\", \"output\": \"{\\\"out_format\\\": \\\"cols\\\", \\\"std_cols\\\": [\\\"qseqid\\\", \\\"sseqid\\\", \\\"pident\\\", \\\"length\\\", \\\"mismatch\\\", \\\"gapopen\\\", \\\"qstart\\\", \\\"qend\\\", \\\"sstart\\\", \\\"send\\\", \\\"evalue\\\", \\\"bitscore\\\"], \\\"ids_cols\\\": null, \\\"tax_cols\\\": [\\\"staxids\\\", \\\"sscinames\\\", \\\"scomnames\\\", \\\"sblastnames\\\", \\\"sskingdoms\\\"], \\\"__current_case__\\\": 2, 
\\\"misc_cols\\\": null, \\\"ext_cols\\\": null}\", \"chromInfo\": \"\\\"/mnt/galaxy/galaxy-dist/tool-data/shared/ucsc/chrom/?.len\\\"\"}", + "tool_version": "0.1.01", + "type": "tool", + "user_outputs": [], + "uuid": "1559a0b0-0b66-40f9-b777-2f062fcda4cc" + }, + "3": { + "annotation": "Having a tabular file of all 1000 sequences is used in the \"join\" step to count the sequences giving no BLAST hit.", + "id": 3, + "input_connections": { + "input": { + "id": 1, + "output_name": "output_file" + } + }, + "inputs": [], + "label": null, + "name": "FASTA-to-Tabular", + "outputs": [ + { + "name": "output", + "type": "tabular" + } + ], + "position": { + "left": 696, + "top": 139 + }, + "post_job_actions": { + "HideDatasetActionoutput": { + "action_arguments": {}, + "action_type": "HideDatasetAction", + "output_name": "output" + }, + "RenameDatasetActionoutput": { + "action_arguments": { + "newname": "1000 sequences as tabular" + }, + "action_type": "RenameDatasetAction", + "output_name": "output" + } + }, + "tool_errors": null, + "tool_id": "toolshed.g2.bx.psu.edu/repos/devteam/fasta_to_tabular/fasta2tab/1.1.0", + "tool_state": "{\"__page__\": 0, \"keep_first\": \"\\\"0\\\"\", \"descr_columns\": \"\\\"2\\\"\", \"input\": \"null\", \"chromInfo\": \"\\\"/mnt/galaxy/galaxy-dist/tool-data/shared/ucsc/chrom/?.len\\\"\", \"__rerun_remap_job_id__\": null}", + "tool_version": "1.1.0", + "type": "tool", + "user_outputs": [], + "uuid": "31f11208-b2bd-4d9d-9745-dc1a6ed7ccf9" + }, + "4": { + "annotation": "Some BLAST matches will give multiple HSPs, and thus multiple lines in the tabular output. 
We only want one line per query.", + "id": 4, + "input_connections": { + "input": { + "id": 2, + "output_name": "output1" + } + }, + "inputs": [], + "label": null, + "name": "Unique", + "outputs": [ + { + "name": "outfile", + "type": "input" + } + ], + "position": { + "left": 665, + "top": 376 + }, + "post_job_actions": { + "HideDatasetActionoutfile": { + "action_arguments": {}, + "action_type": "HideDatasetAction", + "output_name": "outfile" + }, + "RenameDatasetActionoutfile": { + "action_arguments": { + "newname": "One HSP per BLAST hit" + }, + "action_type": "RenameDatasetAction", + "output_name": "outfile" + } + }, + "tool_errors": null, + "tool_id": "toolshed.g2.bx.psu.edu/repos/bgruening/unique/bg_uniq/0.3", + "tool_state": "{\"__page__\": 0, \"ignore_case\": \"\\\"False\\\"\", \"adv_opts\": \"{\\\"column_end\\\": {\\\"__class__\\\": \\\"UnvalidatedValue\\\", \\\"value\\\": \\\"2\\\"}, \\\"column_start\\\": {\\\"__class__\\\": \\\"UnvalidatedValue\\\", \\\"value\\\": \\\"1\\\"}, \\\"adv_opts_selector\\\": \\\"advanced\\\", \\\"__current_case__\\\": 1}\", \"__rerun_remap_job_id__\": null, \"is_numeric\": \"\\\"False\\\"\", \"input\": \"null\", \"chromInfo\": \"\\\"/mnt/galaxy/galaxy-dist/tool-data/shared/ucsc/chrom/?.len\\\"\"}", + "tool_version": "0.3", + "type": "tool", + "user_outputs": [], + "uuid": "acf948e3-71dc-4f35-8357-3998bd0abdd8" + }, + "5": { + "annotation": "We don't need all the columns in this join, but the key is to assign \"None\" to the sequences with no BLAST hits.", + "id": 5, + "input_connections": { + "input1": { + "id": 3, + "output_name": "output" + }, + "input2": { + "id": 4, + "output_name": "outfile" + } + }, + "inputs": [], + "label": null, + "name": "Join two Datasets", + "outputs": [ + { + "name": "out_file1", + "type": "input" + } + ], + "position": { + "left": 827, + "top": 263 + }, + "post_job_actions": { + "HideDatasetActionout_file1": { + "action_arguments": {}, + "action_type": "HideDatasetAction", + "output_name": 
"out_file1" + }, + "RenameDatasetActionout_file1": { + "action_arguments": { + "newname": "Top BLAST hits or None" + }, + "action_type": "RenameDatasetAction", + "output_name": "out_file1" + } + }, + "tool_errors": null, + "tool_id": "join1", + "tool_state": "{\"input2\": \"null\", \"__page__\": 0, \"field1\": \"{\\\"__class__\\\": \\\"UnvalidatedValue\\\", \\\"value\\\": \\\"1\\\"}\", \"partial\": \"\\\"\\\"\", \"field2\": \"{\\\"__class__\\\": \\\"UnvalidatedValue\\\", \\\"value\\\": \\\"1\\\"}\", \"__rerun_remap_job_id__\": null, \"fill_empty_columns\": \"{\\\"fill_empty_columns_switch\\\": \\\"fill_empty\\\", \\\"do_fill_empty_columns\\\": {\\\"column_fill_type\\\": \\\"single_fill_value\\\", \\\"fill_value\\\": \\\"None\\\", \\\"__current_case__\\\": 0}, \\\"fill_columns_by\\\": \\\"fill_unjoined_only\\\", \\\"__current_case__\\\": 1}\", \"unmatched\": \"\\\"-u\\\"\", \"input1\": \"null\", \"chromInfo\": \"\\\"/mnt/galaxy/galaxy-dist/tool-data/shared/ucsc/chrom/?.len\\\"\"}", + "tool_version": "2.0.2", + "type": "tool", + "user_outputs": [], + "uuid": "4c280b0e-b4a6-4ae4-8a81-d6e93932ef71" + }, + "6": { + "annotation": "Here we make a tally table of the BLAST species name column", + "id": 6, + "input_connections": { + "input": { + "id": 5, + "output_name": "out_file1" + } + }, + "inputs": [], + "label": null, + "name": "Count", + "outputs": [ + { + "name": "out_file1", + "type": "tabular" + } + ], + "position": { + "left": 952, + "top": 398 + }, + "post_job_actions": { + "HideDatasetActionout_file1": { + "action_arguments": {}, + "action_type": "HideDatasetAction", + "output_name": "out_file1" + }, + "RenameDatasetActionout_file1": { + "action_arguments": { + "newname": "Top BLAST hit species counts (unsorted)" + }, + "action_type": "RenameDatasetAction", + "output_name": "out_file1" + } + }, + "tool_errors": null, + "tool_id": "Count1", + "tool_state": "{\"__page__\": 0, \"column\": \"{\\\"__class__\\\": \\\"UnvalidatedValue\\\", \\\"value\\\": 
[\\\"19\\\"]}\", \"__rerun_remap_job_id__\": null, \"delim\": \"\\\"T\\\"\", \"input\": \"null\", \"chromInfo\": \"\\\"/mnt/galaxy/galaxy-dist/tool-data/shared/ucsc/chrom/?.len\\\"\"}", + "tool_version": "1.0.0", + "type": "tool", + "user_outputs": [], + "uuid": "d3322137-1911-426d-87a7-c82b5fc16825" + }, + "7": { + "annotation": "Sorting the counts makes the results easier to interpret directly.", + "id": 7, + "input_connections": { + "input": { + "id": 6, + "output_name": "out_file1" + } + }, + "inputs": [], + "label": null, + "name": "Sort", + "outputs": [ + { + "name": "out_file1", + "type": "input" + } + ], + "position": { + "left": 1056, + "top": 506 + }, + "post_job_actions": { + "RenameDatasetActionout_file1": { + "action_arguments": { + "newname": "Top BLAST hit species counts" + }, + "action_type": "RenameDatasetAction", + "output_name": "out_file1" + } + }, + "tool_errors": null, + "tool_id": "sort1", + "tool_state": "{\"__page__\": 0, \"style\": \"\\\"num\\\"\", \"column\": \"{\\\"__class__\\\": \\\"UnvalidatedValue\\\", \\\"value\\\": \\\"1\\\"}\", \"__rerun_remap_job_id__\": null, \"column_set\": \"[]\", \"input\": \"null\", \"chromInfo\": \"\\\"/mnt/galaxy/galaxy-dist/tool-data/shared/ucsc/chrom/?.len\\\"\", \"order\": \"\\\"DESC\\\"\"}", + "tool_version": "1.0.3", + "type": "tool", + "user_outputs": [], + "uuid": "c81cc61d-52a3-44ee-b646-b23e0e004c38" + } + }, + "uuid": "9fe8754a-3a87-4f6a-89a2-141b02b4793e" +} \ No newline at end of file
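The step wiring in ``blast_top_hit_species.ga`` is easier to review as a short summary than as raw JSON. The sketch below is a hypothetical helper (``summarize_workflow`` is not part of Galaxy) that relies only on keys visible in the file above: ``steps``, ``id``, ``name``, ``tool_id`` and ``input_connections``. Handling a list of connections per input is an assumption for robustness; this file only uses single connections.

```python
import json


def summarize_workflow(ga_text):
    """Return (step id, step name, tool id, upstream step ids) per step."""
    workflow = json.loads(ga_text)
    summary = []
    for step in sorted(workflow["steps"].values(), key=lambda s: s["id"]):
        upstream = set()
        for conn in step.get("input_connections", {}).values():
            # A connection is a dict here; tolerate a list of dicts too.
            conns = conn if isinstance(conn, list) else [conn]
            upstream.update(c["id"] for c in conns)
        summary.append(
            (step["id"], step["name"], step.get("tool_id"), sorted(upstream))
        )
    return summary
```

Applied to the workflow in this changeset, the "Join two Datasets" step (id 5) should list steps 3 and 4 as its inputs, matching the ``input1`` and ``input2`` connections in the diff.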
--- a/repository_dependencies.xml	Fri Oct 25 10:22:35 2013 -0400
+++ b/repository_dependencies.xml	Mon Mar 30 11:46:13 2015 -0400
@@ -1,7 +1,9 @@
 <?xml version="1.0"?>
-<repositories description="This requires my SignalP and TMHMM wrapers, and my FASTA filtering tool.">
-    <!-- Revision 15:6abd809cefdd on the main tool shed is v0.2.4, the current latest - but older should be OK -->
-    <repository changeset_revision="ee10017fcd80" name="tmhmm_and_signalp" owner="peterjc" toolshed="http://testtoolshed.g2.bx.psu.edu" />
-    <!-- Revision 2:abdd608c869b on the main tool shed is v0.0.5, the current latest - but older should be OK -->
-    <repository changeset_revision="8a34c565a473" name="seq_filter_by_id" owner="peterjc" toolshed="http://testtoolshed.g2.bx.psu.edu" />
+<repositories description="This workflow requires the NCBI BLAST+ tools etc">
+    <repository changeset_revision="5e9d5e536b79" name="ncbi_blast_plus" owner="devteam" toolshed="https://testtoolshed.g2.bx.psu.edu" />
+    <repository changeset_revision="ae709fd50581" name="fasta_to_tabular" owner="devteam" toolshed="https://testtoolshed.g2.bx.psu.edu" />
+    <repository changeset_revision="4231c585b6dd" name="sample_seqs" owner="peterjc" toolshed="https://testtoolshed.g2.bx.psu.edu" />
+    <repository changeset_revision="2064ae2602b1" name="unique" owner="bgruening" toolshed="https://testtoolshed.g2.bx.psu.edu" />
+    <!-- Also uses tool_id join1, Count1, and sort1 which are currently
+         still shipped with Galaxy itself rather than via the Tool Shed -->
 </repositories>
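To check which repositories and revisions ``repository_dependencies.xml`` pins, the file can be read with Python's standard ``xml.etree.ElementTree`` module. This is a minimal sketch: ``list_dependencies`` is a hypothetical helper name, and it assumes only the ``<repository>`` attributes shown in the diff above (``owner``, ``name``, ``changeset_revision``, ``toolshed``).

```python
import xml.etree.ElementTree as ET


def list_dependencies(xml_text):
    """Return (owner, name, changeset_revision, toolshed) per repository."""
    root = ET.fromstring(xml_text)
    return [
        (
            repo.get("owner"),
            repo.get("name"),
            repo.get("changeset_revision"),
            repo.get("toolshed"),
        )
        # XML comments are skipped by the default parser, so only
        # <repository> elements are returned.
        for repo in root.iter("repository")
    ]
```

The tools named in the XML comment (``join1``, ``Count1``, ``sort1``) would not appear in the result, since they are not declared as repository dependencies.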
--- a/secreted_protein_workflow.ga Fri Oct 25 10:22:35 2013 -0400 +++ /dev/null Thu Jan 01 00:00:00 1970 +0000 @@ -1,288 +0,0 @@ -{ - "a_galaxy_workflow": "true", - "annotation": "Runs SignalP v3.0 and TMHMM v2.0 to look for secreted proteins.<br />\n<br />\nThis workflow is <a href=\"http://toolshed.g2.bx.psu.edu/view/peterjc/secreted_protein_workflow\" target=\"_blank\">available on the Galaxy Tool Shed</a> with a README file giving more information including sample data, and full citation details (Cock and Pritchard 2014).", - "format-version": "0.1", - "name": "Find secreted proteins with TMHMM and SignalP", - "steps": { - "0": { - "annotation": "", - "id": 0, - "input_connections": {}, - "inputs": [ - { - "description": "", - "name": "Input Dataset" - } - ], - "name": "Input dataset", - "outputs": [], - "position": { - "left": 200, - "top": 200 - }, - "tool_errors": null, - "tool_id": null, - "tool_state": "{\"name\": \"Input Dataset\"}", - "tool_version": null, - "type": "data_input", - "user_outputs": [] - }, - "1": { - "annotation": "", - "id": 1, - "input_connections": { - "fasta_file": { - "id": 0, - "output_name": "output" - } - }, - "inputs": [ - { - "description": "runtime parameter for tool SignalP 3.0", - "name": "organism" - } - ], - "name": "SignalP 3.0", - "outputs": [ - { - "name": "tabular_file", - "type": "tabular" - } - ], - "position": { - "left": 240, - "top": 341 - }, - "post_job_actions": { - "HideDatasetActiontabular_file": { - "action_arguments": {}, - "action_type": "HideDatasetAction", - "output_name": "tabular_file" - } - }, - "tool_errors": null, - "tool_id": "signalp3", - "tool_state": "{\"__page__\": 0, \"truncate\": \"\\\"60\\\"\", \"chromInfo\": \"\\\"/opt/galaxy-dist/tool-data/shared/ucsc/chrom/?.len\\\"\", \"fasta_file\": \"null\", \"organism\": \"{\\\"__class__\\\": \\\"RuntimeValue\\\"}\", \"__rerun_remap_job_id__\": null}", - "tool_version": "0.0.12", - "type": "tool", - "user_outputs": [] - }, - "2": { - "annotation": 
"Select proteins with predicted signal peptide (SignalP NN D-Score or HMM)", - "id": 2, - "input_connections": { - "input": { - "id": 1, - "output_name": "tabular_file" - } - }, - "inputs": [], - "name": "Filter", - "outputs": [ - { - "name": "out_file1", - "type": "input" - } - ], - "position": { - "left": 323, - "top": 528 - }, - "post_job_actions": { - "HideDatasetActionout_file1": { - "action_arguments": {}, - "action_type": "HideDatasetAction", - "output_name": "out_file1" - }, - "RenameDatasetActionout_file1": { - "action_arguments": { - "newname": "Filtered SignalP results" - }, - "action_type": "RenameDatasetAction", - "output_name": "out_file1" - } - }, - "tool_errors": null, - "tool_id": "Filter1", - "tool_state": "{\"__page__\": 0, \"__rerun_remap_job_id__\": null, \"cond\": \"\\\"c14=='Y' or c15=='S'\\\"\", \"input\": \"null\", \"header_lines\": \"\\\"0\\\"\", \"chromInfo\": \"\\\"/opt/galaxy-dist/tool-data/shared/ucsc/chrom/?.len\\\"\"}", - "tool_version": "1.1.0", - "type": "tool", - "user_outputs": [] - }, - "3": { - "annotation": "Select those sequences with signal peptides.", - "id": 3, - "input_connections": { - "input_file": { - "id": 0, - "output_name": "output" - }, - "input_tabular": { - "id": 2, - "output_name": "out_file1" - } - }, - "inputs": [], - "name": "Filter sequences by ID", - "outputs": [ - { - "name": "output_pos", - "type": "fasta" - }, - { - "name": "output_neg", - "type": "fasta" - } - ], - "position": { - "left": 527, - "top": 200 - }, - "post_job_actions": { - "HideDatasetActionoutput_neg": { - "action_arguments": {}, - "action_type": "HideDatasetAction", - "output_name": "output_neg" - }, - "HideDatasetActionoutput_pos": { - "action_arguments": {}, - "action_type": "HideDatasetAction", - "output_name": "output_pos" - } - }, - "tool_errors": null, - "tool_id": "seq_filter_by_id", - "tool_state": "{\"__page__\": 0, \"output_choice_cond\": \"{\\\"output_choice\\\": \\\"pos\\\", \\\"__current_case__\\\": 1}\", \"input_file\": 
\"null\", \"__rerun_remap_job_id__\": null, \"input_tabular\": \"null\", \"chromInfo\": \"\\\"/opt/galaxy-dist/tool-data/shared/ucsc/chrom/?.len\\\"\", \"columns\": \"{\\\"__class__\\\": \\\"UnvalidatedValue\\\", \\\"value\\\": [\\\"1\\\"]}\"}", - "tool_version": "0.0.5", - "type": "tool", - "user_outputs": [] - }, - "4": { - "annotation": "", - "id": 4, - "input_connections": { - "fasta_file": { - "id": 3, - "output_name": "output_pos" - } - }, - "inputs": [], - "name": "TMHMM 2.0", - "outputs": [ - { - "name": "tabular_file", - "type": "tabular" - } - ], - "position": { - "left": 643, - "top": 443 - }, - "post_job_actions": { - "HideDatasetActiontabular_file": { - "action_arguments": {}, - "action_type": "HideDatasetAction", - "output_name": "tabular_file" - } - }, - "tool_errors": null, - "tool_id": "tmhmm2", - "tool_state": "{\"__page__\": 0, \"fasta_file\": \"null\", \"chromInfo\": \"\\\"/opt/galaxy-dist/tool-data/shared/ucsc/chrom/?.len\\\"\", \"__rerun_remap_job_id__\": null}", - "tool_version": "0.0.11", - "type": "tool", - "user_outputs": [] - }, - "5": { - "annotation": "Select proteins with no predicted transmembrane helices.", - "id": 5, - "input_connections": { - "input": { - "id": 4, - "output_name": "tabular_file" - } - }, - "inputs": [], - "name": "Filter", - "outputs": [ - { - "name": "out_file1", - "type": "input" - } - ], - "position": { - "left": 729, - "top": 566 - }, - "post_job_actions": { - "HideDatasetActionout_file1": { - "action_arguments": {}, - "action_type": "HideDatasetAction", - "output_name": "out_file1" - }, - "RenameDatasetActionout_file1": { - "action_arguments": { - "newname": "Filtered TMHMM results" - }, - "action_type": "RenameDatasetAction", - "output_name": "out_file1" - } - }, - "tool_errors": null, - "tool_id": "Filter1", - "tool_state": "{\"__page__\": 0, \"__rerun_remap_job_id__\": null, \"cond\": \"\\\"c5== 0\\\"\", \"input\": \"null\", \"header_lines\": \"\\\"0\\\"\", \"chromInfo\": 
\"\\\"/opt/galaxy-dist/tool-data/shared/ucsc/chrom/?.len\\\"\"}", - "tool_version": "1.1.0", - "type": "tool", - "user_outputs": [] - }, - "6": { - "annotation": "Select those sequences with no transmembrane helices (from those with signal peptides).", - "id": 6, - "input_connections": { - "input_file": { - "id": 3, - "output_name": "output_pos" - }, - "input_tabular": { - "id": 5, - "output_name": "out_file1" - } - }, - "inputs": [], - "name": "Filter sequences by ID", - "outputs": [ - { - "name": "output_pos", - "type": "fasta" - }, - { - "name": "output_neg", - "type": "fasta" - } - ], - "position": { - "left": 893, - "top": 281 - }, - "post_job_actions": { - "HideDatasetActionoutput_neg": { - "action_arguments": {}, - "action_type": "HideDatasetAction", - "output_name": "output_neg" - }, - "RenameDatasetActionoutput_pos": { - "action_arguments": { - "newname": "Secreted proteins" - }, - "action_type": "RenameDatasetAction", - "output_name": "output_pos" - } - }, - "tool_errors": null, - "tool_id": "seq_filter_by_id", - "tool_state": "{\"__page__\": 0, \"output_choice_cond\": \"{\\\"output_choice\\\": \\\"pos\\\", \\\"__current_case__\\\": 1}\", \"input_file\": \"null\", \"__rerun_remap_job_id__\": null, \"input_tabular\": \"null\", \"chromInfo\": \"\\\"/opt/galaxy-dist/tool-data/shared/ucsc/chrom/?.len\\\"\", \"columns\": \"{\\\"__class__\\\": \\\"UnvalidatedValue\\\", \\\"value\\\": [\\\"1\\\"]}\"}", - "tool_version": "0.0.5", - "type": "tool", - "user_outputs": [] - } - } -} \ No newline at end of file
