comparison transit_gumbel.xml @ 5:dfd652f412bd draft

planemo upload for repository https://github.com/galaxyproject/tools-iuc/tree/master/tools/transit/ commit 73c6b2baf9dda26c6809a4f36582f7cbdb161ea1
author iuc
date Mon, 22 Apr 2019 14:39:34 -0400
parents b2f6cbdc5858
children 6d55d88fb999
comparison
4:b2f6cbdc5858 5:dfd652f412bd
34 <output name="sites" file="gumbel-sites1.txt" ftype="tabular" compare="sim_size" /> 34 <output name="sites" file="gumbel-sites1.txt" ftype="tabular" compare="sim_size" />
35 </test> 35 </test>
36 </tests> 36 </tests>
37 37
38 <help><![CDATA[ 38 <help><![CDATA[
39 .. class:: infomark
40
41 **What it does**
42
43 -------------------
44
45 The **Gumbel** method can be used to determine which genes are essential in a single condition. It performs a gene-by-gene analysis of the insertions at TA sites within each gene, makes a call based on the longest consecutive sequence of TA sites without insertions in the gene, and calculates the probability of this using a Bayesian model.
46
47 Note : Intended only for Himar1 datasets.
48
49 -------------------
50
51 **Inputs**
52
53 -------------------
54
55 Input files for Gumbel need to be:
56
57 - .wig files: Tabulated files containing one column with the TA site coordinate and one column with the read count at this site.
58 - annotation .prot_table: Annotation file generated by the `Convert Gff3 to prot_table for TRANSIT` tool.
39 59
40 60
41 .. class:: infomark 61 -------------------
42 62
43 **What it does** 63 **Parameters**
44 64
45 ------------------- 65 -------------------
66
67 Optional Arguments:
68 -s <integer> := Number of samples. Default: -s 10000
69 -b <integer> := Number of Burn-in samples. Default: -b 500
70 -m <integer> := Smallest read-count to consider. Default: -m 1
71 -t <integer> := Trims all but every t-th value. Default: -t 1
72 -r <string> := How to handle replicates. Sum or Mean. Default: -r Sum
73 --iN <float> := Ignore TAs occurring at given fraction of the N terminus. Default: -iN 0.0
74 --iC <float> := Ignore TAs occurring at given fraction of the C terminus. Default: -iC 0.0
46 75
47 76
48 The **Gumbel** method can be used to determine which genes are essential in a single condition. It performs a gene-by-gene analysis of the insertions at TA sites within each gene, makes a call based on the longest consecutive sequence of TA sites without insertions in the gene, and calculates the probability of this using a Bayesian model. 77 - Samples: Gumbel uses Metropolis-Hastings (MH) to generate samples of posterior distributions. The default setting is to run the simulation for 10,000 iterations. This is usually enough to ensure convergence of the sampler and to provide accurate estimates of posterior probabilities. Fewer iterations may work, but at the risk of lower accuracy.
78 - Burn-In: Because the MH sampler may not have stabilized in the first few iterations, a “burn-in” period is defined. Samples obtained in this “burn-in” period are discarded, and do not count towards estimates.
79 - Trim: The MH sampler produces Markov samples that are correlated. This parameter dictates how many samples must be attempted for every sample obtained. Increasing this parameter will decrease the auto-correlation, at the cost of dramatically increasing the run-time. For most situations, this parameter should be left at the default of “1”.
80 - Minimum Read: The minimum read count that is considered a true read. Because the Gumbel method depends on determining gaps of TA sites lacking insertions, it may be susceptible to spurious reads (e.g. errors). The default value of 1 will consider all reads as true reads. A value of 2, for example, will ignore read counts of 1.
81 - Replicates: Determines how to deal with replicates by averaging the read-counts or summing read counts across datasets. This should not have an effect for the Gumbel method, aside from potentially affecting spurious reads.
49 82
50 Note : Intended only for Himar1 datasets. 83
84 -------------------
85
86 **Outputs**
87
88 -------------------
89
90 ============================================= ========================================================================================================================
91 **Column Header** **Column Definition**
92 --------------------------------------------- ------------------------------------------------------------------------------------------------------------------------
93 Orf Gene ID
94 Name Gene Name
95 Desc Gene Description
96 k Number of Transposon Insertions Observed within the ORF.
97 n Total Number of TA dinucleotides within the ORF.
98 r Length of the Maximum Run of Non-Insertions observed.
99 s Span of nucleotides for the Maximum Run of Non-Insertions.
100 zbar Posterior Probability of Essentiality.
101 State Call Essentiality call for the gene. Depends on FDR-corrected thresholds. E=Essential, U=Uncertain, NE=Non-Essential, S=too short
102 ============================================= ========================================================================================================================
51 103
52 104
53 105
54 ------------------- 106 Note: Technically, Bayesian models are used to calculate posterior probabilities, not p-values (which is a concept associated with the frequentist framework). However, we have implemented a method for computing the approximate false-discovery rate (FDR) that serves a similar purpose. This determines a threshold for significance on the posterior probabilities that is corrected for multiple tests. The actual thresholds used are reported in the headers of the output file (and are near 1 for essentials and near 0 for non-essentials). There can be many genes that score between the two thresholds (t1 < zbar < t2). This reflects intrinsic uncertainty associated with either low read counts, sparse insertion density, or small genes. If the insertion_density is too low (< ~30%), the method may not work as well, and might indicate an unusually large number of Uncertain or Essential genes.
55 107
56 **Inputs** 108 -------------------
57 109
58 ------------------- 110 **More Information**
59 111
60 Input files for Gumbel need to be: 112 -------------------
61 113
62 - .wig files : Tabulated files containing one column with the TA site coordinate and one column with the read count at this site. 114 See `TRANSIT documentation`
63 - annotation .prot_table : Annotation file generated by the `Convert Gff3 to prot_table for TRANSIT` tool.
64 115
65 116 - TRANSIT: https://transit.readthedocs.io/en/latest/index.html
66 ------------------- 117 - `TRANSIT Gumbel`: https://transit.readthedocs.io/en/latest/transit_methods.html#gumbel
67
68 **Parameters**
69
70 -------------------
71
72 Optional Arguments:
73 -s <integer> := Number of samples. Default: -s 10000
74 -b <integer> := Number of Burn-in samples. Default: -b 500
75 -m <integer> := Smallest read-count to consider. Default: -m 1
76 -t <integer> := Trims all but every t-th value. Default: -t 1
77 -r <string> := How to handle replicates. Sum or Mean. Default: -r Sum
78 --iN <float> := Ignore TAs occurring at given fraction of the N terminus. Default: -iN 0.0
79 --iC <float> := Ignore TAs occurring at given fraction of the C terminus. Default: -iC 0.0
80
81 - Samples: Gumbel uses Metropolis-Hastings (MH) to generate samples of posterior distributions. The default setting is to run the simulation for 10,000 iterations. This is usually enough to ensure convergence of the sampler and to provide accurate estimates of posterior probabilities. Fewer iterations may work, but at the risk of lower accuracy.
82 - Burn-In: Because the MH sampler may not have stabilized in the first few iterations, a “burn-in” period is defined. Samples obtained in this “burn-in” period are discarded, and do not count towards estimates.
83 - Trim: The MH sampler produces Markov samples that are correlated. This parameter dictates how many samples must be attempted for every sample obtained. Increasing this parameter will decrease the auto-correlation, at the cost of dramatically increasing the run-time. For most situations, this parameter should be left at the default of “1”.
84 - Minimum Read: The minimum read count that is considered a true read. Because the Gumbel method depends on determining gaps of TA sites lacking insertions, it may be susceptible to spurious reads (e.g. errors). The default value of 1 will consider all reads as true reads. A value of 2, for example, will ignore read counts of 1.
85 - Replicates: Determines how to deal with replicates by averaging the read-counts or summing read counts across datasets. This should not have an effect for the Gumbel method, aside from potentially affecting spurious reads.
86
87
88 -------------------
89
90 **Outputs**
91
92 -------------------
93
94 ============================================= ========================================================================================================================
95 **Column Header** **Column Definition**
96 --------------------------------------------- ------------------------------------------------------------------------------------------------------------------------
97 Orf Gene ID
98 Name Gene Name
99 Desc Gene Description
100 k Number of Transposon Insertions Observed within the ORF.
101 n Total Number of TA dinucleotides within the ORF.
102 r Length of the Maximum Run of Non-Insertions observed.
103 s Span of nucleotides for the Maximum Run of Non-Insertions.
104 zbar Posterior Probability of Essentiality.
105 State Call Essentiality call for the gene. Depends on FDR-corrected thresholds. E=Essential, U=Uncertain, NE=Non-Essential, S=too short
106 ============================================= ========================================================================================================================
107
108
109
110 Note: Technically, Bayesian models are used to calculate posterior probabilities, not p-values (which is a concept associated with the frequentist framework). However, we have implemented a method for computing the approximate false-discovery rate (FDR) that serves a similar purpose. This determines a threshold for significance on the posterior probabilities that is corrected for multiple tests. The actual thresholds used are reported in the headers of the output file (and are near 1 for essentials and near 0 for non-essentials). There can be many genes that score between the two thresholds (t1 < zbar < t2). This reflects intrinsic uncertainty associated with either low read counts, sparse insertion density, or small genes. If the insertion_density is too low (< ~30%), the method may not work as well, and might indicate an unusually large number of Uncertain or Essential genes.
111
112 -------------------
113
114 **More Information**
115
116 -------------------
117
118 See `TRANSIT documentation`
119
120 - TRANSIT: https://transit.readthedocs.io/en/latest/index.html
121 - `TRANSIT Gumbel`: https://transit.readthedocs.io/en/latest/transit_methods.html#gumbel
122
123
124
125
126
127 ]]></help> 118 ]]></help>
128 119
129 <expand macro="citations" /> 120 <expand macro="citations" />
130 121
131
132 </tool> 122 </tool>
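
Reviewer sketch: the help text describes how the Minimum Read, Burn-In, and Trim parameters and the two FDR thresholds behave, and a few lines of Python may make that behavior easier to picture. This is only an illustration under stated assumptions, not TRANSIT's implementation. The function names (`apply_min_read`, `keep_samples`, `call_state`) and the threshold arguments `t1` and `t2` are hypothetical; the real thresholds are the ones reported in the output file header, and the "S" (too short) call is not modeled here.

```python
from typing import List, Sequence


def apply_min_read(counts: Sequence[int], min_read: int = 1) -> List[int]:
    """Zero out read counts below min_read so they do not count as insertions.

    Assumption: a count below the minimum is treated as "no insertion".
    """
    return [c if c >= min_read else 0 for c in counts]


def keep_samples(samples: Sequence[float], burn_in: int = 500, trim: int = 1) -> List[float]:
    """Discard the first burn_in samples, then keep every trim-th sample.

    Assumption: burn-in samples are dropped before trimming is applied.
    """
    if burn_in < 0 or trim < 1:
        raise ValueError("burn_in must be >= 0 and trim must be >= 1")
    return list(samples[burn_in::trim])


def call_state(zbar: float, t1: float, t2: float) -> str:
    """Map a posterior probability to a call using two thresholds (t1 < t2).

    Assumption: values exactly equal to a threshold are treated as uncertain.
    """
    if zbar > t2:
        return "E"   # essential: above the upper threshold
    if zbar < t1:
        return "NE"  # non-essential: below the lower threshold
    return "U"       # uncertain: between the two thresholds
```

For example, `apply_min_read([0, 1, 2, 5], min_read=2)` treats the single read as spurious, and `keep_samples(chain, burn_in=500, trim=1)` mirrors the documented defaults by dropping only the burn-in period.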