diff README.md @ 14:1d36ecf93e67 draft
"planemo upload commit 9639dde5737c9aa2330bb603c2299345939407cf"
author | eschen42 |
---|---|
date | Thu, 11 Mar 2021 20:44:23 +0000 |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/README.md Thu Mar 11 20:44:23 2021 +0000
@@ -0,0 +1,238 @@
+[Latest public release](https://zenodo.org/badge/latestdoi/90571457)
+
+[Current build status for master branch on GitHub](https://travis-ci.org/HegemanLab/w4mclassfilter_galaxy_wrapper)
+
+[Repository 'w4mclassfilter' in Galaxy Toolshed](https://toolshed.g2.bx.psu.edu/repository?repository_id=5f24951d82ab40fa)
+
+# W4M Data Subset
+
+#### A Galaxy tool to select a subset of Workflow4Metabolomics data
+
+*W4M Data Subset* is a [Galaxy tool-wrapper](https://docs.galaxyproject.org/en/latest/dev/schema.htm) to wrap the
+[w4mclassfilter R package](https://github.com/HegemanLab/w4mclassfilter) for use with the
+[Workflow4Metabolomics](http://workflow4metabolomics.org/) flavor of
+[Galaxy](https://galaxyproject.org/).
+This tool was developed and tested with [planemo](http://planemo.readthedocs.io/en/latest/).
+
+#### Author
+
+Arthur Eschenlauer (University of Minnesota, esch0041@umn.edu)
+
+#### R package wrapped by this tool
+
+The *w4mclassfilter* package is available from the Hegeman lab GitHub repository [https://github.com/HegemanLab/w4mclassfilter/releases](https://github.com/HegemanLab/w4mclassfilter/releases).
+
+#### Tool in Galaxy Toolshed
+
+The "w4mclassfilter" Galaxy tool, built from this repository, is in the main Galaxy Toolshed at [https://toolshed.g2.bx.psu.edu/repository?repository_id=5f24951d82ab40fa](https://toolshed.g2.bx.psu.edu/repository?repository_id=5f24951d82ab40fa).
+
+#### Tool updates
+
+See the **NEWS** section near the bottom of this page.
+
+## Motivation
+
+LC-MS metabolomics experiments seek to resolve "features", i.e., species that have a distinct chromatographic retention time ("rt") and (after ionization) mass-to-charge ratio ("*m/z*" or "mz").
+(If a chemical is fragmented or may have a variety of adducts, several features will result.)
+Data for a sample are collected as mass-spectral intensities, each of
+which is associated with a position on a 2D plane with dimensions of rt
+and *m/z*. Ideally, features would be sufficiently reproducible among
+sample-runs to distinguish features that are similar among samples from
+those that differ.
+
+For liquid chromatography, the retention time for a species can vary considerably from one chromatography
+run to the next. The Workflow4Metabolomics suite of Galaxy tools
+(W4M, [Giacomoni *et al.*, 2014; Guitton *et al.*, 2017])
+uses the XCMS preprocessing tools [Smith *et al.*, 2006] for "retention-time correction" to
+align features among samples. Features may be better aligned if pooled
+samples and blanks are included.
+
+Multivariate statistical tools may be used to discover clusters of
+similar samples [Thévenot *et al.*, 2015].
+However, once retention-time alignment of features has been achieved
+among samples in LC-MS datasets:
+
+- The presence of pools and blanks may confound identification and separation of sample clusters.
+- Multivariate statistical algorithms may be impacted by missing values or dimensions that have zero variance.
+
+
+## Description
+
+The W4M Data Subset tool selects subsets of samples, features, or data values and conditions the data for further analysis.
+
+- The tool takes as input the *dataMatrix*, *sampleMetadata*, and *variableMetadata* datasets produced by W4M's XCMS and CAMERA [Kuhl *et al.*, 2012] tools.
+- The tool produces the same trio of output datasets, modified as described below.
+
+This tool can perform several operations to reduce the number of samples or features to be analyzed (although *this should be done only in a statistically sound manner* consistent with the nature of the experiment):
+
+- *Sample filtering:* Samples may be selected by designating a "sample class" column in *sampleMetadata* and specifying criteria to include or exclude samples based on the contents of this column.
+- *Feature filtering:* Features may be selected by specifying the minimum or maximum value (or both) allowed in columns of *variableMetadata*.
+- *Intensity filtering:* To exclude low-intensity features from consideration, a lower bound may be specified for the maximum intensity for a feature across all samples (i.e., for a row in *dataMatrix*).
+
+This tool also conditions data for statistical analysis:
+
+- Samples that are missing from either *sampleMetadata* or *dataMatrix* are eliminated.
+- Features that are missing from either *variableMetadata* or *dataMatrix* are eliminated.
+- Features and samples that have zero variance are eliminated.
+- Samples and features are ordered consistently in *variableMetadata*, *sampleMetadata*, and *dataMatrix*. (The columns for sorting *variableMetadata* or *sampleMetadata* may be specified.)
+- The names of the first columns of *variableMetadata* and *sampleMetadata* are set to "variableMetadata" and "sampleMetadata", respectively.
+- If desired, the values in the *dataMatrix* may be log-transformed.
+- Negative intensities become missing values (before missing-value replacement is performed).
+- If desired, each missing value in *dataMatrix* may be replaced with zero or the median value observed for the corresponding feature.
+- If desired, a "center" for each treatment can be computed in lieu of the samples for that treatment.
+
+This tool may be applied several times sequentially, which may be useful for:
+
+- analyzing subsets of samples for progressively smaller sets of treatment levels, or
+- choosing subsets of samples or features based on criteria in columns of *sampleMetadata* or *variableMetadata*, respectively.
+
+## NEWS
+
+### Changes in version 0.98.18
+
+#### New features
+
+* Enhancement: Added option "compute center for each treatment" [https://github.com/HegemanLab/w4mclassfilter/issues/6](https://github.com/HegemanLab/w4mclassfilter/issues/6).
+* Enhancement: Added option "enable sorting on multiple columns of metadata" [https://github.com/HegemanLab/w4mclassfilter/issues/7](https://github.com/HegemanLab/w4mclassfilter/issues/7).
+* Enhancement: Added option "always treat negative intensities as missing values" [https://github.com/HegemanLab/w4mclassfilter\_galaxy\_wrapper/issues/7](https://github.com/HegemanLab/w4mclassfilter_galaxy_wrapper/issues/7).
+
+#### Internal modifications
+
+* Use v0.98.18 of the [w4mclassfilter bioconda package](https://bioconda.github.io/recipes/w4mclassfilter/README.html).
+
+
+### (Version numbers 0.98.15-0.98.17 skipped)
+
+
+### Changes in version 0.98.14
+
+#### New features
+
+* Enhancement [https://github.com/HegemanLab/w4mclassfilter\_galaxy\_wrapper/issues/6](https://github.com/HegemanLab/w4mclassfilter_galaxy_wrapper/issues/6) - "Provide sort options for features and samples".
+
+#### Internal modifications
+
+* Use v0.98.14 of the [w4mclassfilter bioconda package](https://bioconda.github.io/recipes/w4mclassfilter/README.html).
+
+### Changes in version 0.98.13
+
+#### New features
+
+* Support enhancement https://github.com/HegemanLab/w4mclassfilter/issues/4 - "add and test no-imputation and centering-imputation functions":
+  - Support no imputation.
+  - Support imputing missing feature-intensities as median intensity for the corresponding feature.
+
+#### Internal modifications
+
+* Use v0.98.13 of the [w4mclassfilter bioconda package](https://bioconda.github.io/recipes/w4mclassfilter/README.html).
+
+
+### (Version number 0.98.12 skipped)
+
+
+### Changes in version 0.98.11
+
+#### New features
+
+* none
+
+#### Internal modifications
+
+* Use v0.98.8 of the [w4mclassfilter bioconda package](https://bioconda.github.io/recipes/w4mclassfilter/README.html).
+
+### Changes in version 0.98.10
+
+#### New features
+
+* none
+
+#### Internal modifications
+
+* Quality-assurance improvements - Changes for IUC conformance and automated Planemo testing on Travis CI.
+* Forbid hyphens in sample and variable names because R does not permit them.
+
+### Changes in version 0.98.9
+
+#### New features
+
+* none
+
+#### Internal modifications
+
+* Added missing support for the hyphen character in regular expressions.
+
+### Changes in version 0.98.8
+
+#### New features
+
+* The tool now appears in Galaxy with a new, more representative name: "W4M Data Subset". (Earlier versions of this tool appeared in Galaxy with the name "Sample Subset".)
+* Option was added to log-transform data matrix values.
+* Output datasets are named in conformance with the W4M convention of appending the name of each preprocessing tool to the input dataset name.
+* Superfluous "Column that names the sample" input parameter was eliminated.
+* Some documentation was updated or clarified.
+
+#### Internal modifications
+
+* none
+
+### Changes in version 0.98.7
+
+#### New features
+
+* First column of output variableMetadata (which has feature names) is now always named "variableMetadata".
+* First column of output sampleMetadata (which has sample names) is now always named "sampleMetadata".
+
+#### Internal modifications
+
+* Now uses w4mclassfilter R package v0.98.7.
+
+### Changes in version 0.98.6
+
+#### New features
+
+* Added support for filtering out features whose attributes fall outside specified ranges. For more detail, see "Variable-range filters" above.
+
+#### Internal modifications
+
+* Now uses w4mclassfilter R package v0.98.6.
+* Now sorts sample names and feature names in output files because some statistical tools expect the same order in dataMatrix row and column names as in the corresponding metadata files.
+
+### Changes in version 0.98.3
+
+#### New features
+
+* Improved reference-list.
+
+#### Internal modifications
+
+* Improved input handling.
+* Now uses w4mclassfilter R package v0.98.3, although that version has no functional implications for this tool.
+
+### Changes in version 0.98.1
+
+#### New features
+
+* First release - Wrap the w4mclassfilter R package that implements filtering of W4M data matrix, variable metadata, and sample metadata by class of sample.
+* *dataMatrix* *is* modified by the tool, so it *does* appear as an output file.
+* *sampleMetadata* *is* modified by the tool, so it *does* appear as an output file.
+* *variableMetadata* *is* modified by the tool, so it *does* appear as an output file.
+
+#### Internal modifications
+
+* none
+
+## Citations
+
+Benjamini, Yoav and Hochberg, Yosef (1995). **Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.** In *Journal of the Royal Statistical Society. Series B (Methodological), 57 (1), pp. 289-300.* [http://www.jstor.org/stable/2346101](http://www.jstor.org/stable/2346101)
+
+Kuhl, Carsten and Tautenhahn, Ralf and Böttcher, Christoph and Larson, Tony R. and Neumann, Steffen (2012). **CAMERA: An Integrated Strategy for Compound Spectra Extraction and Annotation of Liquid Chromatography/Mass Spectrometry Data Sets.** In *Analytical Chemistry, 84 (1), pp. 283-289.* [doi:10.1021/ac202450g](http://dx.doi.org/10.1021/ac202450g)
+
+Giacomoni, F. and Le Corguille, G. and Monsoor, M. and Landi, M. and Pericard, P. and Petera, M. and Duperier, C. and Tremblay-Franco, M. and Martin, J.-F. and Jacob, D. *et al.* (2014). **Workflow4Metabolomics: a collaborative research infrastructure for computational metabolomics.** In *Bioinformatics, 31 (9), pp. 1493–1495.* [doi:10.1093/bioinformatics/btu813](http://dx.doi.org/10.1093/bioinformatics/btu813)
+
+Guitton, Yann and Tremblay-Franco, Marie and Le Corguillé, Gildas and Martin, Jean-François and Pétéra, Mélanie and Roger-Mele, Pierrick and Delabrière, Alexis and Goulitquer, Sophie and Monsoor, Misharl and Duperier, Christophe *et al.* (2017). **Create, run, share, publish, and reference your LC–MS, FIA–MS, GC–MS, and NMR data analysis workflows with the Workflow4Metabolomics 3.0 Galaxy online infrastructure for metabolomics.** In *The International Journal of Biochemistry & Cell Biology, pp. 89-101.* [doi:10.1016/j.biocel.2017.07.002](http://dx.doi.org/10.1016/j.biocel.2017.07.002)
+
+Smith, Colin A. and Want, Elizabeth J. and O'Maille, Grace and Abagyan, Ruben and Siuzdak, Gary (2006). **XCMS: Processing Mass Spectrometry Data for Metabolite Profiling Using Nonlinear Peak Alignment, Matching, and Identification.** In *Analytical Chemistry, 78 (3), pp. 779–787.* [doi:10.1021/ac051437y](http://dx.doi.org/10.1021/ac051437y)
+
+Thévenot, Etienne A. and Roux, Aurélie and Xu, Ying and Ezan, Eric and Junot, Christophe (2015). **Analysis of the Human Adult Urinary Metabolome Variations with Age, Body Mass Index, and Gender by Implementing a Comprehensive Workflow for Univariate and OPLS Statistical Analyses.** In *Journal of Proteome Research, 14 (8), pp. 3322–3335.* [doi:10.1021/acs.jproteome.5b00354](http://dx.doi.org/10.1021/acs.jproteome.5b00354)
+
+Yekutieli, Daniel and Benjamini, Yoav (2001). **The control of the false discovery rate in multiple testing under dependency.** In *The Annals of Statistics, 29 (4), pp. 1165-1188.* [doi:10.1214/aos/1013699998](http://dx.doi.org/10.1214/aos/1013699998)
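
The data-conditioning rules listed in the README's Description section (negative intensities become missing values, zero-variance features are eliminated, missing values may be replaced with zero or the feature median) can be illustrated with a small sketch. This is not the w4mclassfilter R code. The function name `condition_matrix`, the dict-of-lists layout, the feature names, and the choice to test variance on observed values before imputation are all assumptions made for illustration.

```python
from statistics import median, pvariance


def condition_matrix(data_matrix, impute="median"):
    """Condition a feature-by-sample intensity table (hypothetical helper).

    data_matrix maps a feature name to a list of intensities, one per
    sample; None marks a missing value. impute is "median", "zero", or
    anything else for no imputation.
    """
    conditioned = {}
    for feature, intensities in data_matrix.items():
        # Negative intensities become missing values before any replacement.
        values = [None if v is not None and v < 0 else v for v in intensities]
        observed = [v for v in values if v is not None]
        # Assumption: a feature is dropped when it has no observed values
        # or when its observed values have zero variance.
        if not observed or pvariance(observed) == 0:
            continue
        if impute == "median":
            fill = median(observed)
        elif impute == "zero":
            fill = 0.0
        else:
            fill = None  # leave missing values in place
        conditioned[feature] = [fill if v is None else v for v in values]
    return conditioned


# Hypothetical feature names and intensities for three samples.
raw = {
    "M100T50": [10.0, -1.0, 30.0],  # negative value is treated as missing
    "M200T60": [5.0, 5.0, 5.0],     # zero variance, so the feature is dropped
    "M300T70": [None, 2.0, 4.0],    # missing value is imputed
}
result = condition_matrix(raw)
```

With median imputation, `M100T50` keeps its two observed values and receives their median in place of the negative reading, while `M200T60` is removed entirely. The actual tool also filters and reorders the two metadata tables to match, which this sketch leaves out.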