Mercurial > repos > pjbriggs > amplicon_analysis_pipeline
view README.rst @ 4:013bf1e2cc8f draft
Updated to Amplicon_Analysis_pipeline v1.2.2.
| author | pjbriggs | 
|---|---|
| date | Wed, 13 Jun 2018 07:43:57 -0400 | 
| parents | a898ee628343 | 
| children | bbfc9638ba84 | 
line wrap: on
 line source
Amplicon_analysis-galaxy ======================== A Galaxy tool wrapper to Mauro Tutino's ``Amplicon_analysis`` pipeline script at https://github.com/MTutino/Amplicon_analysis The pipeline can analyse paired-end 16S rRNA data from Illumina Miseq (Casava >= 1.8) and performs the following operations: * QC and clean up of input data * Removal of singletons and chimeras and building of OTU table and phylogenetic tree * Beta and alpha diversity of analysis Usage documentation =================== Usage of the tool (including required inputs) is documented within the ``help`` section of the tool XML. Installing the tool in a Galaxy instance ======================================== The following sections describe how to install the tool files, dependencies and reference data, and how to configure the Galaxy instance to detect the dependencies and reference data correctly at run time. 1. Install the dependencies --------------------------- The ``install_tool_deps.sh`` script can be used to fetch and install the dependencies locally, for example:: install_tool_deps.sh /path/to/local_tool_dependencies This can take some time to complete. When finished it should have created a set of directories containing the dependencies under the specified top level directory. 2. Install the tool files ------------------------- The core tool is hosted on the Galaxy toolshed, so it can be installed directly from there (this is the recommended route): * https://toolshed.g2.bx.psu.edu/view/pjbriggs/amplicon_analysis_pipeline/ Alternatively it can be installed manually; in this case there are two files to install: * ``amplicon_analysis_pipeline.xml`` (the Galaxy tool definition) * ``amplicon_analysis_pipeline.py`` (the Python wrapper script) Put these in a directory that is visible to Galaxy (e.g. a ``tools/Amplicon_analysis/`` folder), and modify the ``tools_conf.xml`` file to tell Galaxy to offer the tool by adding the line e.g.:: <tool file="Amplicon_analysis/amplicon_analysis_pipeline.xml" /> 3. Install the reference data ----------------------------- The script ``References.sh`` from the pipeline package at https://github.com/MTutino/Amplicon_analysis can be run to install the reference data, for example:: cd /path/to/pipeline/data wget https://github.com/MTutino/Amplicon_analysis/raw/master/References.sh /bin/bash ./References.sh will install the data in ``/path/to/pipeline/data``. **NB** The final amount of data downloaded and uncompressed will be around 6GB. 4. Configure dependencies and reference data in Galaxy ------------------------------------------------------ The final steps are to make your Galaxy installation aware of the tool dependencies and reference data, so it can locate them both when the tool is run. To target the tool dependencies installed previously, add the following lines to the ``dependency_resolvers_conf.xml`` file in the Galaxy ``config`` directory:: <dependency_resolvers> ... <galaxy_packages base_path="/path/to/local_tool_dependencies" /> <galaxy_packages base_path="/path/to/local_tool_dependencies" versionless="true" /> ... </dependency_resolvers> (NB it is recommended to place these *before* the ``<conda ... />`` resolvers) (If you're not familiar with dependency resolvers in Galaxy then see the documentation at https://docs.galaxyproject.org/en/master/admin/dependency_resolvers.html for more details.) The tool locates the reference data via an environment variable called ``AMPLICON_ANALYSIS_REF_DATA_PATH``, which needs to set to the parent directory where the reference data has been installed. There are various ways to do this, depending on how your Galaxy installation is configured: * **For local instances:** add a line to set it in the ``config/local_env.sh`` file of your Galaxy installation, e.g.:: export AMPLICON_ANALYSIS_REF_DATA_PATH=/path/to/pipeline/data * **For production instances:** set the value in the ``job_conf.xml`` configuration file, e.g.:: <destination id="amplicon_analysis"> <env id="AMPLICON_ANALYSIS_REF_DATA_PATH">/path/to/pipeline/data</env> </destination> and then specify that the pipeline tool uses this destination:: <tool id="amplicon_analysis_pipeline" destination="amplicon_analysis"/> (For more about job destinations see the Galaxy documentation at https://galaxyproject.org/admin/config/jobs/#job-destinations) 5. Enable rendering of HTML outputs from pipeline ------------------------------------------------- To ensure that HTML outputs are displayed correctly in Galaxy (for example the Vsearch OTU table heatmaps), Galaxy needs to be configured not to sanitize the outputs from the ``Amplicon_analysis`` tool. Either: * **For local instances:** set ``sanitize_all_html = False`` in ``config/galaxy.ini`` (nb don't do this on production servers or public instances!); or * **For production instances:** add the ``Amplicon_analysis`` tool to the display whitelist in the Galaxy instance: - Set ``sanitize_whitelist_file = config/whitelist.txt`` in ``config/galaxy.ini`` and restart Galaxy; - Go to ``Admin>Manage Display Whitelist``, check the box for ``Amplicon_analysis`` (hint: use your browser's 'find-in-page' search function to help locate it) and click on ``Submit new whitelist`` to update the settings. Additional details ================== Some other things to be aware of: * Note that using the Silva database requires a minimum of 18Gb RAM Known problems ============== * Only the ``VSEARCH`` pipeline in Mauro's script is currently available via the Galaxy tool; the ``USEARCH`` and ``QIIME`` pipelines have yet to be implemented. * The images in the tool help section are not visible if the tool has been installed locally, or if it has been installed in a Galaxy instance which is served from a subdirectory. These are both problems with Galaxy and not the tool, see https://github.com/galaxyproject/galaxy/issues/4490 and https://github.com/galaxyproject/galaxy/issues/1676 Appendix: availability of tool dependencies =========================================== The tool takes its dependencies from the underlying pipeline script (see https://github.com/MTutino/Amplicon_analysis/blob/master/README.md for details). As noted above, currently the ``install_tool_deps.sh`` script can be used to manually install the dependencies for a local tool install. In principle these should also be available if the tool were installed from a toolshed. However it would be preferrable in this case to get as many of the dependencies as possible via the ``conda`` dependency resolver. The following are known to be available via conda, with the required version: - cutadapt 1.8.1 - sickle-trim 1.33 - bioawk 1.0 - fastqc 0.11.3 - R 3.2.0 Some dependencies are available but with the "wrong" versions: - spades (need 3.5.0) - qiime (need 1.8.0) - blast (need 2.2.26) - vsearch (need 1.1.3) The following dependencies are currently unavailable: - fasta_number (need 02jun2015) - fasta-splitter (need 0.2.4) - rdp_classifier (need 2.2) - microbiomeutil (need r20110519) (NB usearch 6.1.544 and 8.0.1623 are special cases which must be handled outside of Galaxy's dependency management systems.) History ======= ========== ====================================================================== Version Changes ---------- ---------------------------------------------------------------------- 1.2.2.0 Updated to Amplicon_Analysis_Pipeline version 1.2.2 (removes jackknifed analysis which is not captured by Galaxy tool) 1.2.1.0 Updated to Amplicon_Analysis_Pipeline version 1.2.1 (adds option to use the Human Oral Microbiome Database v15.1, and updates SILVA database to v123) 1.1.0 First official version on Galaxy toolshed. 1.0.6 Expand inline documentation to provide detailed usage guidance. 1.0.5 Updates including: - Capture read counts from quality control as new output dataset - Capture FastQC per-base quality boxplots for each sample as new output dataset - Add support for -l option (sliding window length for trimming) - Default for -L set to "200" 1.0.4 Various updates: - Additional outputs are captured when a "Categories" file is supplied (alpha diversity rarefaction curves and boxplots) - Sample names derived from Fastqs in a collection of pairs are trimmed to SAMPLE_S* (for Illumina-style Fastq filenames) - Input Fastqs can now be of more general ``fastq`` type - Log file outputs are captured in new output dataset - User can specify a "title" for the job which is copied into the dataset names (to distinguish outputs from different runs) - Improved detection and reporting of problems with input Metatable 1.0.3 Take the sample names from the collection dataset names when using collection as input (this is now the default input mode); collect additional output dataset; disable ``usearch``-based pipelines (i.e. ``UPARSE`` and ``QIIME``). 1.0.2 Enable support for FASTQs supplied via dataset collections and fix some broken output datasets. 1.0.1 Initial version ========== ======================================================================
