Mercurial > repos > pjbriggs > motif_tools
changeset 0:b42da9dc4507 draft
Uploaded initial version 1.0.1.
| author | pjbriggs | 
|---|---|
| date | Wed, 21 Mar 2018 05:44:12 -0400 | 
| parents | |
| children | 2f34d5e91bc7 | 
| files | CountUniqueIDs.xml README.rst Scan_IUPAC_output_each_match.xml Scan_IUPAC_output_matches_per_seq.xml TFBScluster_candidates_2TFBS.xml TFBScluster_candidates_3TFBS.xml motif_tools_macros.xml | 
| diffstat | 7 files changed, 543 insertions(+), 0 deletions(-) | 
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/CountUniqueIDs.xml	Wed Mar 21 05:44:12 2018 -0400
@@ -0,0 +1,45 @@
+<?xml version="1.0" encoding="utf-8"?>
+<tool id="gff_unique_count" name="Count unique seq in GFF" version="@VERSION@">
+  <description>Gives the non-redundant count of sequences</description>
+  <macros>
+    <import>motif_tools_macros.xml</import>
+  </macros>
+  <expand macro="requirements" />
+  <command><![CDATA[
+  perl $__tool_directory__/CountUniqueIDs.pl $input $output
+  ]]></command>
+  <inputs>
+    <param format="gff" name="input" type="data" label="GFF file" help="Select a GFF file."/>
+  </inputs>
+  <outputs>
+    <data format="txt" name="output" />
+  </outputs>
+
+  <help>
+.. class:: infomark
+
+**What it does**
+
+This tool counts the number of non-redundant sequence identifiers (seqname) in a GFF file. The tool was originally written to read a GFF file containing a set of motif matches and report the number of sequences that contain one or more instances of the scanned motif.
+
+----
+
+.. class:: infomark
+
+**Options**
+
+A GFF formatted file is required.
+
+----
+
+.. class:: infomark
+
+**Credits**
+
+This Galaxy tool has been developed within the Bioinformatics Core Facility at the University of Manchester. It runs the CountUniqueIDs.pl Perl script that was written by Ian Donaldson.
+
+Please kindly acknowledge both this Galaxy tool and CountUniqueIDs.pl if you use it.
+  </help>
+
+</tool>
+
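The help text in CountUniqueIDs.xml describes counting the distinct seqname values in a GFF file. CountUniqueIDs.pl itself is not part of this changeset, so the following is only a minimal Python sketch of that idea, not the script's actual logic. The function name `count_unique_seqnames` and the sample records are hypothetical, and the sketch assumes seqname is the first tab-separated column and that lines starting with `#` are comments.

```python
def count_unique_seqnames(gff_lines):
    """Count distinct values in the first (seqname) column of GFF records."""
    seqnames = set()
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/directive lines (assumed to start with '#')
        if not line.strip() or line.startswith("#"):
            continue
        # seqname is assumed to be the first tab-separated field
        seqnames.add(line.split("\t", 1)[0])
    return len(seqnames)


# Hypothetical motif matches: three hits spread over two sequences
example = [
    "##gff-version 2",
    "peak_1\tscan\tmotif\t10\t15\t.\t+\t.\tWGATAR",
    "peak_1\tscan\tmotif\t40\t45\t.\t-\t.\tWGATAR",
    "peak_2\tscan\tmotif\t7\t12\t.\t+\t.\tWGATAR",
]
unique_count = count_unique_seqnames(example)
```

With this input, `unique_count` is the number of sequences that carry at least one match, regardless of how many matches each one has.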
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/README.rst	Wed Mar 21 05:44:12 2018 -0400
@@ -0,0 +1,77 @@
+motif_tools
+===========
+
+Galaxy tools for various motif-finding utilities developed by Ian Donaldson.
+
+There are five tools available:
+
+ * **IUPAC scan and output each match** Returns all matches to a given IUPAC in
+   GFF format
+
+ * **IUPAC scan and output matches per seq** Counts the matches to a given IUPAC
+
+ * **Count unique seq in GFF** Gives the non-redundant count of sequences
+
+ * **TFBScluster two TFBS** Identifies clusters of two TFBS
+
+ * **TFBScluster three TFBS** Identifies clusters of three TFBS
+
+Automated installation
+======================
+
+Installation via the Galaxy Tool Shed will take care of installing the tools
+and the underlying dependencies.
+
+Manual Installation
+===================
+
+To add these to Galaxy put the following lines in tool_conf.xml for each
+tool that you want:
+
+    <tool file="motif_tools/Scan_IUPAC_output_each_match.xml" />
+    <tool file="motif_tools/Scan_IUPAC_output_matches_per_seq.xml" />
+    <tool file="motif_tools/CountUniqueIDs.xml" />
+    <tool file="motif_tools/TFBScluster_candidates_2TFBS.xml" />
+    <tool file="motif_tools/TFBScluster_candidates_3TFBS.xml" />
+
+The tools also require Perl and ``Bioperl`` to be installed.
+
+History
+=======
+
+========== =====================================================================
+=
+Version Changes
+---------- ---------------------------------------------------------------------
+- 1.0.1 Updates to use conda dependency resolution and tidy up XML
+- 1.0.0 Initial version
+========== =====================================================================
+=
+
+Developers
+==========
+
+This tool is developed on the following GitHub repository:
+https://github.com/fls-bioinformatics-core/galaxy-tools/tree/master/tools/macs21
+
+
+Licence (MIT)
+=============
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/Scan_IUPAC_output_each_match.xml	Wed Mar 21 05:44:12 2018 -0400
@@ -0,0 +1,87 @@
+<?xml version="1.0" encoding="utf-8"?>
+<tool id="fasta_scan_iupac_each" name="IUPAC scan and output each match" version="@VERSION@">
+  <description>Returns all matches to a given IUPAC in GFF format</description>
+  <macros>
+    <import>motif_tools_macros.xml</import>
+  </macros>
+  <expand macro="requirements" />
+  <command><![CDATA[
+  perl $__tool_directory__/Scan_IUPAC_output_each_match.pl $iupac $fasta $output $label $strand
+  ]]></command>
+  <inputs>
+    <param name="iupac" type="text" label="IUPAC string" value="e.g. WGATAR" help="Enter an IUPAC string." size="20"/>
+    <param format="fasta" name="fasta" type="data" label="FASTA file" help="Select a FASTA file containing the sequences to be scanned."/>
+    <param name="label" type="text" label="Attribute in GFF output" value="IUPAC_or_name" help="The label will be included in the final (attribute) field of each GFF line. This could be the IUPAC string used or the name of the motif." size="20"/>
+    <param name="strand" type="select" label="Select sequence strands to scan" help="Scan either both strands or only the forward strand.">
+      <option value="0">Scan both strands</option>
+      <option value="1">Only scan forward strand</option>
+    </param>
+  </inputs>
+  <outputs>
+    <data format="gff" name="output" />
+  </outputs>
+
+  <help>
+.. class:: infomark
+
+**What it does**
+
+This tool will find all matches to a DNA pattern, represented by an IUPAC string, in the input DNA sequences. The matches are non-overlapping, so searching with 'TTTT' in 'TTTTTTTT' will find two hits to the IUPAC. The output is in GFF format and the last 'attribute' field can be specified using the 'Attribute in GFF output' option.
+
+IUPAC = Nucleotide(s):
+
+A = A
+
+C = C
+
+G = G
+
+T = T
+
+M = A/C
+
+R = A/G
+
+W = A/T
+
+S = C/G
+
+Y = C/T
+
+K = G/T
+
+V = A/C/G
+
+H = A/C/T
+
+D = A/G/T
+
+B = C/G/T
+
+N = A/C/G/T
+
+----
+
+.. class:: infomark
+
+**Options**
+
+'IUPAC string' - can be entered in upper or lower case, as the tool converts it to upper case, but only the IUPAC codes listed above are accepted.
+
+'Attribute in GFF output' - the last field of each GFF line ('attribute') can be set with this option. It should only include letters and numbers, with no spaces.
+
+'Select sequence strands to scan' - Only scanning the forward strand of the input sequence is useful if the IUPAC is a palindrome (e.g. CANNTG).
+
+----
+
+.. class:: infomark
+
+**Credits**
+
+This Galaxy tool has been developed within the Bioinformatics Core Facility at the University of Manchester. It runs the Scan_IUPAC_output_each_match.pl Perl script that was written by Ian Donaldson.
+
+Please kindly acknowledge both this Galaxy tool and Scan_IUPAC_output_each_match.pl if you use it.
+  </help>
+
+</tool>
+
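The help text in Scan_IUPAC_output_each_match.xml lists the IUPAC codes and states that matches are non-overlapping. One way to picture this is to translate the IUPAC string into a regular expression and scan left to right. The sketch below assumes that approach. It is not taken from Scan_IUPAC_output_each_match.pl, which is not part of this changeset; the names `iupac_to_regex` and `scan_forward` are hypothetical, and only the forward strand is handled.

```python
import re

# IUPAC code -> regex character class, following the table in the help text
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "[AC]", "R": "[AG]", "W": "[AT]", "S": "[CG]",
    "Y": "[CT]", "K": "[GT]", "V": "[ACG]", "H": "[ACT]",
    "D": "[AGT]", "B": "[CGT]", "N": "[ACGT]",
}


def iupac_to_regex(iupac):
    """Translate an IUPAC string (any case) into a regex pattern string."""
    iupac = iupac.upper()
    unknown = sorted(set(iupac) - set(IUPAC_CODES))
    if unknown:
        raise ValueError("Unsupported IUPAC code(s): " + ", ".join(unknown))
    return "".join(IUPAC_CODES[code] for code in iupac)


def scan_forward(iupac, sequence):
    """Return 1-based inclusive (start, end) pairs of non-overlapping matches."""
    pattern = re.compile(iupac_to_regex(iupac))
    # re.finditer resumes after the end of each match, so hits never overlap
    return [(m.start() + 1, m.end()) for m in pattern.finditer(sequence.upper())]


poly_t_hits = scan_forward("TTTT", "TTTTTTTT")
gata_hits = scan_forward("wgatar", "ccAGATAAgg")
```

The first call reproduces the example from the help text: two hits, at positions 1-4 and 5-8. Scanning both strands would additionally need a reverse-complement step, which is left out here because the source does not say how the Perl script reports reverse-strand coordinates.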
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/Scan_IUPAC_output_matches_per_seq.xml	Wed Mar 21 05:44:12 2018 -0400
@@ -0,0 +1,84 @@
+<?xml version="1.0" encoding="utf-8"?>
+<tool id="fasta_scan_iupac_per_seq" name="IUPAC scan and output matches per seq" version="@VERSION@">
+  <description>Counts the matches to a given IUPAC</description>
+  <macros>
+    <import>motif_tools_macros.xml</import>
+  </macros>
+  <expand macro="requirements" />
+  <command><![CDATA[
+  perl $__tool_directory__/Scan_IUPAC_output_matches_per_seq.pl $iupac $fasta $output $strand
+  ]]></command>
+  <inputs>
+    <param name="iupac" type="text" label="IUPAC string" value="e.g. WGATAR" help="Enter an IUPAC string." size="20"/>
+    <param format="fasta" name="fasta" type="data" label="FASTA file" help="Select a FASTA file containing the sequences to be scanned."/>
+    <param name="strand" type="select" label="Select sequence strands to scan" help="Scan either both strands or only the forward strand.">
+      <option value="0">Scan both strands</option>
+      <option value="1">Only scan forward strand</option>
+    </param>
+  </inputs>
+  <outputs>
+    <data format="tabular" name="output" />
+  </outputs>
+
+  <help>
+.. class:: infomark
+
+**What it does**
+
+This tool will find all matches to a DNA pattern, represented by an IUPAC string, in the input DNA sequences. The matches are non-overlapping, so searching with 'TTTT' in 'TTTTTTTT' will find two hits to the IUPAC. The output is a table that gives the seqname and the number of matches to the IUPAC per sequence. This version is useful if you want to get a count of IUPAC matches per sequence (e.g. a binding region) and paste the numbers back into a spreadsheet.
+
+IUPAC = Nucleotide(s):
+
+A = A
+
+C = C
+
+G = G
+
+T = T
+
+M = A/C
+
+R = A/G
+
+W = A/T
+
+S = C/G
+
+Y = C/T
+
+K = G/T
+
+V = A/C/G
+
+H = A/C/T
+
+D = A/G/T
+
+B = C/G/T
+
+N = A/C/G/T
+
+----
+
+.. class:: infomark
+
+**Options**
+
+'IUPAC string' - can be entered in upper or lower case, as the tool converts it to upper case, but only the IUPAC codes listed above are accepted.
+
+'Select sequence strands to scan' - Only scanning the forward strand of the input sequence is useful if the IUPAC is a palindrome (e.g. CANNTG).
+
+----
+
+.. class:: infomark
+
+**Credits**
+
+This Galaxy tool has been developed within the Bioinformatics Core Facility at the University of Manchester. It runs the Scan_IUPAC_output_matches_per_seq.pl Perl script that was written by Ian Donaldson.
+
+Please kindly acknowledge both this Galaxy tool and Scan_IUPAC_output_matches_per_seq.pl if you use it.
+  </help>
+
+</tool>
+
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/TFBScluster_candidates_2TFBS.xml	Wed Mar 21 05:44:12 2018 -0400
@@ -0,0 +1,116 @@
+<?xml version="1.0" encoding="utf-8"?>
+<tool id="tfbscluster2" name="TFBScluster two TFBS" version="@VERSION@">
+  <description>Identifies clusters of two TFBS</description>
+  <macros>
+    <import>motif_tools_macros.xml</import>
+  </macros>
+  <expand macro="requirements" />
+  <command><![CDATA[
+  perl $__tool_directory__/TFBScluster_candidates.pl
+
+  ##TF libraries (comma delimited NO SPACES)
+  $lib1,$lib2
+
+  ##Number of flanking 'N's for subject files (comma delimited NO SPACES)
+  0,0
+
+  ##Minimum number of occurrences (comma delimited NO SPACES)
+  $occ1,$occ2
+
+  ##TF IDs (comma delimited NO SPACES)
+  $id1,$id2
+
+  ##Single range value in bp (+/-) query start and end values
+  $range
+
+  ##Include overlapping TFBSs (include/exclude)
+  $overlap
+
+  ##Output file
+  $output
+
+  > $output_log
+
+  ]]></command>
+  <inputs>
+    <!-- TFBS GFF libraries -->
+    <param format="gff" name="lib1" type="data" label="TFBS #1 GFF file" help="Select the first GFF file containing TFBS positions."/>
+    <param format="gff" name="lib2" type="data" label="TFBS #2 GFF file" help="Select the second GFF file containing TFBS positions."/>
+
+    <!-- Min occurrences -->
+    <param name="occ1" type="select" label="Minimum occurrence of TFBS #1" help="Select the minimum number of times that an instance of TFBS #1 should be present in a cluster.">
+      <option value="1">1</option>
+      <option value="2">2</option>
+      <option value="3">3</option>
+      <option value="4">4</option>
+      <option value="5">5</option>
+    </param>
+    <param name="occ2" type="select" label="Minimum occurrence of TFBS #2" help="Select the minimum number of times that an instance of TFBS #2 should be present in a cluster.">
+      <option value="1">1</option>
+      <option value="2">2</option>
+      <option value="3">3</option>
+      <option value="4">4</option>
+      <option value="5">5</option>
+    </param>
+
+    <!-- TFBS identifiers -->
+    <param name="id1" type="text" label="Identifier for TFBS #1" value="TFBS1" help="Enter an identifier for TFBS #1." size="20"/>
+    <param name="id2" type="text" label="Identifier for TFBS #2" value="TFBS2" help="Enter an identifier for TFBS #2." size="20"/>
+
+    <!-- Cluster length -->
+    <param name="range" type="text" label="Minimum length of clusters" value="50" help="Enter a number for the minimum length of the clusters, for example 50bp (start to end)" size="5"/>
+
+    <!-- Allow overlapping TFBS? -->
+    <param name="overlap" type="select" label="Include or exclude overlapping TFBS" help="Decide whether to allow TFBS binding sites to overlap.">
+      <option value="exclude">Exclude overlapping TFBS</option>
+      <option value="include">Include overlapping TFBS</option>
+    </param>
+  </inputs>
+
+  <outputs>
+    <data format="gff" name="output" label="TFBScluster on ${on_string} (clusters)"/>
+    <data format="txt" name="output_log" label="TFBScluster on ${on_string} (log file)"/>
+  </outputs>
+
+  <help>
+.. class:: infomark
+
+**What it does**
+
+This tool takes two GFF files containing the positions of genomic features, typically transcription factor binding sites (TFBS), and looks for clusters with certain properties. The GFF file inputs could be different TFBS (e.g. combinatorial binding of different factors) or the same TFBS (clustering of multiple instances of the same factor).
+
+The cluster properties are explained in more detail in the **Options** section.
+
+----
+
+.. class:: infomark
+
+**Options**
+
+'TFBS GFF files' - Each file contains genomic coordinates, typically matches between an IUPAC string representing a TFBS and a set of target sequences, such as those from a ChIP-seq experiment. However, the positions could be for any genomic feature over the whole genome. The important thing is that the different files have the same genome build in common.
+
+'Minimum occurrence of TFBS' - When clusters are determined you can ensure that a minimum number of occurrences from each TFBS are present.
+
+'Identifier for TFBS' - This allows information about the different TFBS sets to be propagated through to the output. The identifier could be the TFBS name or the IUPAC used to search for the sites. It should only include letters and numbers, with no spaces.
+
+'Minimum length of clusters' - The length is a window of sequence in which the specified number of TFBS must be located. Initially TFBScluster will identify all clusters matching the input criteria. It will then merge any overlapping clusters, which can result in lengths greater than the input length.
+
+'Include or exclude overlapping TFBS' - You can choose to exclude any TFBS that overlaps with another when counting the number of co-occurring TFBS. By default such TFBS are excluded, as a basic assumption about co-occurring/cooperative TFBS in a module is that both factors can bind at the same time, which they are unlikely to do if their binding sites overlap.
+
+----
+
+.. class:: infomark
+
+**Credits**
+
+This Galaxy tool has been developed within the Bioinformatics Core Facility at the University of Manchester. It runs the TFBScluster_candidates.pl Perl script, written by Ian Donaldson, which is a modification of the script from the original web tool. Articles below:
+
+http://www.ncbi.nlm.nih.gov/pubmed/15855248
+
+http://www.ncbi.nlm.nih.gov/pubmed/16845063
+
+Please kindly acknowledge both this Galaxy tool and the TFBScluster articles if you use it.
+  </help>
+
+</tool>
+
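The 'Minimum length of clusters' option says that overlapping clusters are merged, which is why a reported cluster can be longer than the requested window. The merge step on its own can be sketched as a standard interval merge. This is only an illustration of that one sentence under stated assumptions: TFBScluster_candidates.pl is not part of this changeset, the function name `merge_overlapping` is hypothetical, and the clusters are assumed to be inclusive (start, end) pairs on a single sequence.

```python
def merge_overlapping(clusters):
    """Merge overlapping (start, end) intervals that lie on one sequence."""
    merged = []
    for start, end in sorted(clusters):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous cluster: extend it instead of adding a new one
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical candidate clusters found with a 50 bp window
candidates = [(130, 180), (100, 150), (400, 450)]
merged_clusters = merge_overlapping(candidates)
```

Here the first two candidates overlap and collapse into a single cluster spanning 100-180, which is longer than the 50 bp window that produced them, while the third is left unchanged.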
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/TFBScluster_candidates_3TFBS.xml	Wed Mar 21 05:44:12 2018 -0400
@@ -0,0 +1,125 @@
+<?xml version="1.0" encoding="utf-8"?>
+<tool id="tfbscluster3" name="TFBScluster three TFBS" version="@VERSION@">
+  <description>Identifies clusters of three TFBS</description>
+  <macros>
+    <import>motif_tools_macros.xml</import>
+  </macros>
+  <expand macro="requirements" />
+  <command><![CDATA[
+  perl $__tool_directory__/TFBScluster_candidates.pl
+
+  ##TF libraries (comma delimited NO SPACES)
+  $lib1,$lib2,$lib3
+
+  ##Number of flanking 'N's for subject files (comma delimited NO SPACES)
+  0,0,0
+
+  ##Minimum number of occurrences (comma delimited NO SPACES)
+  $occ1,$occ2,$occ3
+
+  ##TF IDs (comma delimited NO SPACES)
+  $id1,$id2,$id3
+
+  ##Single range value in bp (+/-) query start and end values
+  $range
+
+  ##Include overlapping TFBSs (include/exclude)
+  $overlap
+
+  ##Output file
+  $output
+
+  > $output_log
+
+  ]]></command>
+  <inputs>
+    <!-- TFBS GFF libraries -->
+    <param format="gff" name="lib1" type="data" label="TFBS #1 GFF file" help="Select the first GFF file containing TFBS positions."/>
+    <param format="gff" name="lib2" type="data" label="TFBS #2 GFF file" help="Select the second GFF file containing TFBS positions."/>
+    <param format="gff" name="lib3" type="data" label="TFBS #3 GFF file" help="Select the third GFF file containing TFBS positions."/>
+
+    <!-- Min occurrences -->
+    <param name="occ1" type="select" label="Minimum occurrence of TFBS #1" help="Select the minimum number of times that an instance of TFBS #1 should be present in a cluster.">
+      <option value="1">1</option>
+      <option value="2">2</option>
+      <option value="3">3</option>
+      <option value="4">4</option>
+      <option value="5">5</option>
+    </param>
+    <param name="occ2" type="select" label="Minimum occurrence of TFBS #2" help="Select the minimum number of times that an instance of TFBS #2 should be present in a cluster.">
+      <option value="1">1</option>
+      <option value="2">2</option>
+      <option value="3">3</option>
+      <option value="4">4</option>
+      <option value="5">5</option>
+    </param>
+    <param name="occ3" type="select" label="Minimum occurrence of TFBS #3" help="Select the minimum number of times that an instance of TFBS #3 should be present in a cluster.">
+      <option value="1">1</option>
+      <option value="2">2</option>
+      <option value="3">3</option>
+      <option value="4">4</option>
+      <option value="5">5</option>
+    </param>
+
+    <!-- TFBS identifiers -->
+    <param name="id1" type="text" label="Identifier for TFBS #1" value="TFBS1" help="Enter an identifier for TFBS #1." size="20"/>
+    <param name="id2" type="text" label="Identifier for TFBS #2" value="TFBS2" help="Enter an identifier for TFBS #2." size="20"/>
+    <param name="id3" type="text" label="Identifier for TFBS #3" value="TFBS3" help="Enter an identifier for TFBS #3." size="20"/>
+
+    <!-- Cluster length -->
+    <param name="range" type="text" label="Minimum length of clusters" value="50" help="Enter a number for the minimum length of the clusters, for example 50bp (start to end)" size="5"/>
+
+    <!-- Allow overlapping TFBS? -->
+    <param name="overlap" type="select" label="Include or exclude overlapping TFBS" help="Decide whether to allow TFBS binding sites to overlap.">
+      <option value="exclude">Exclude overlapping TFBS</option>
+      <option value="include">Include overlapping TFBS</option>
+    </param>
+  </inputs>
+
+  <outputs>
+    <data format="gff" name="output" label="TFBScluster on ${on_string} (clusters)"/>
+    <data format="txt" name="output_log" label="TFBScluster on ${on_string} (log file)"/>
+  </outputs>
+
+  <help>
+.. class:: infomark
+
+**What it does**
+
+This tool takes three GFF files containing the positions of genomic features, typically transcription factor binding sites (TFBS), and looks for clusters with certain properties. The GFF file inputs could be different TFBS (e.g. combinatorial binding of different factors) or the same TFBS (clustering of multiple instances of the same factor).
+
+The cluster properties are explained in more detail in the **Options** section.
+
+----
+
+.. class:: infomark
+
+**Options**
+
+'TFBS GFF files' - Each file contains genomic coordinates, typically matches between an IUPAC string representing a TFBS and a set of target sequences, such as those from a ChIP-seq experiment. However, the positions could be for any genomic feature over the whole genome. The important thing is that the different files have the same genome build in common.
+
+'Minimum occurrence of TFBS' - When clusters are determined you can ensure that a minimum number of occurrences from each TFBS are present.
+
+'Identifier for TFBS' - This allows information about the different TFBS sets to be propagated through to the output. The identifier could be the TFBS name or the IUPAC used to search for the sites. It should only include letters and numbers, with no spaces.
+
+'Minimum length of clusters' - The length is a window of sequence in which the specified number of TFBS must be located. Initially TFBScluster will identify all clusters matching the input criteria. It will then merge any overlapping clusters, which can result in lengths greater than the input length.
+
+'Include or exclude overlapping TFBS' - You can choose to exclude any TFBS that overlaps with another when counting the number of co-occurring TFBS. By default such TFBS are excluded, as a basic assumption about co-occurring/cooperative TFBS in a module is that both factors can bind at the same time, which they are unlikely to do if their binding sites overlap.
+
+----
+
+.. class:: infomark
+
+**Credits**
+
+This Galaxy tool has been developed within the Bioinformatics Core Facility at the University of Manchester. It runs the TFBScluster_candidates.pl Perl script, written by Ian Donaldson, which is a modification of the script from the original web tool. Articles below:
+
+http://www.ncbi.nlm.nih.gov/pubmed/15855248
+
+http://www.ncbi.nlm.nih.gov/pubmed/16845063
+
+Please kindly acknowledge both this Galaxy tool and the TFBScluster articles if you use it.
+  </help>
+
+</tool>
+
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/motif_tools_macros.xml	Wed Mar 21 05:44:12 2018 -0400
@@ -0,0 +1,9 @@
+<macros>
+  <token name="@VERSION@">1.0.1</token>
+  <xml name="requirements">
+    <requirements>
+      <requirement type="package" version="1.6.924">perl-bioperl</requirement
+>
+    </requirements>
+  </xml>
+</macros>
