comparison notes.txt @ 0:56d5be6c03b9 draft
planemo upload for repository https://github.com/bernt-matthias/mb-galaxy-tools/tree/topic/dada2/tools/dada2 commit d63c84012410608b3b5d23e130f0beff475ce1f8-dirty
| author | matthias |
|---|---|
| date | Fri, 08 Mar 2019 06:30:11 -0500 |
| parents | |
| children | |
| -1:000000000000 | 0:56d5be6c03b9 |
|---|---|
| 1 TODO | |
| 2 ==== | |
| 3 | |
| 4 | |
| 5 | |
| 6 If we make a monolithic tool: | |
| 7 | |
| 8 * implement sanity checks between important compute intensive steps (user definable criteria, abort if violated) | |
| 9 | |
| 10 If we keep separate tools: | |
| 11 | |
| 12 - make Rdata data types specific (like xcms https://github.com/workflow4metabolomics/xcms/tree/dev/datatypes) | |
| 13 * alternatively, the dataset types could be derived from tabular and the Rdata attached via | |
| 14 `.extra_files_path`; this way the user gets intermediate output that they can inspect. | |
| 15 | |
| 16 | |
| 17 In both cases: | |
| 18 | |
| 19 * allow input of single end data, single pair, single pair in separate data sets, ... | |
| 20 * add mergePairsByID functionality to mergePairs tool | |
| 21 | |
| 22 | |
| 23 Datatypes: | |
| 24 ========== | |
| 25 | |
| 26 **derep-class**: list with 3 members | |
| 27 - uniques: Named integer vector. Named by the unique sequence, valued by abundance. | |
| 28 • quals: Numeric matrix of average quality scores by position for each unique. Uniques are | |
| 29 rows, positions are cols. | |
| 30 * map: Integer vector of length the number of reads, and value the index (in uniques) of the | |
| 31 unique to which that read was assigned. | |
| 32 | |
| 33 **learnErrorsOutput**: A named list with three entries | |
| 34 - err_out: A numeric matrix with the learned error rates. | |
| 35 - err_in: The initialization error rates (unimportant). | |
| 36 - trans: A feature table of observed transitions for each type (eg. A->C) and quality score. | |
| 37 | |
| 38 **dada-class**: A multi-item List with the following named values... | |
| 39 • denoised: Integer vector, named by sequence valued by abundance, of the denoised sequences. | |
| 40 • clustering: An informative data.frame containing information on each cluster. | |
| 41 • sequence: A character vector of each denoised sequence. Identical to names(denoised). | |
| 42 • quality: The average quality scores for each cluster (row) by position (col). | |
| 43 • map: Integer vector that maps the unique (index of derep.unique) to the denoised sequence (index of dada.denoised). | |
| 44 • birth_subs: A data.frame containing the substitutions at the birth of each new cluster. | |
| 45 • trans: The matrix of transitions by type (row), eg. A2A, A2C..., and quality score (col) | |
| 46 observed in the final output of the dada algorithm. | |
| 47 • err_in: The err matrix used for this invocation of dada. | |
| 48 • err_out: The err matrix estimated from the output of dada. NULL if err_function not provided. | |
| 49 • opts: A list of the dada_opts used for this invocation of dada. | |
| 50 • call: The function call used for this invocation of dada. | |
| 51 | |
| 52 **uniques**: derep, dada, mergepairs (or data frame with sequence and abundance columns) | |
| 53 | |
| 54 **mergepairs**: | |
| 55 | |
| 56 data.frame(s) has a row for each unique pairing of forward/reverse denoised sequences, and the following columns: | |
| 57 • abundance: Number of reads corresponding to this forward/reverse combination. | |
| 58 • sequence: The merged sequence. | |
| 59 • forward: The index of the forward denoised sequence. | |
| 60 • reverse: The index of the reverse denoised sequence. | |
| 61 • nmatch: Number of matching nts in the overlap region. | |
| 62 • nmismatch: Number of mismatches in the overlap region. | |
| 63 • nindel: Number of indels in the overlap region. | |
| 64 • prefer: The sequence used for the overlap region. 1=forward; 2=reverse. | |
| 65 • accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise. | |
| 66 • ...: Additional columns specified in propagateCol | |
| 67 | |
| 68 | |
| 69 | |
| 70 Tools: | |
| 71 ====== | |
| 72 | |
| 73 • Quality filtering | |
| 74 | |
| 75 filterAndTrim IO=(fastq -> fastq) | |
| 76 | |
| 77 • Dereplication | |
| 78 | |
| 79 derepFastq (fastq -> derep-class object) | |
| 80 | |
| 81 • Learn error rates | |
| 82 | |
| 83 learnErrors + plotErrors | |
| 84 - in: input list, or vector, of file names (or a list of derep-class objects; open question: why? Learning should be done on the full data) | |
| 85 - out: named list with entries | |
| 86 - $err_out: A numeric matrix with the learned error rates. | |
| 87 - $err_in: The initialization error rates (unimportant). | |
| 88 - $trans: A feature table of observed transitions for each type (e.g. A->C) and quality score. | |
| 89 | |
| 90 • Sample Inference (dada) | |
| 91 in: (list of) derep-class object | |
| 92 out: (list of) dada-class object | |
| 93 | |
| 94 • Chimera Removal | |
| 95 | |
| 96 removeBimeraDenovo | |
| 97 | |
| 98 in: A uniques-vector or any object that can be coerced into one with getUniques. | |
| 99 out: A uniques vector, or an object of matching class if a data.frame or sequence table is provided | |
| 100 | |
| 101 • Merging of Paired Reads | |
| 102 | |
| 103 mergePairs | |
| 104 in: 2x dada-class object(s), 2x derep-class object(s) | |
| 105 out: A data.frame, or a list of data.frames. | |
| 106 - The return data.frame(s) has a row for each unique pairing of forward/reverse denoised sequences, | |
| 107 - cols | |
| 108 - $abundance: Number of reads corresponding to this forward/reverse combination. | |
| 109 - $sequence: The merged sequence. | |
| 110 - $forward: The index of the forward denoised sequence. | |
| 111 - $reverse: The index of the reverse denoised sequence. | |
| 112 - $nmatch: Number of matching nts in the overlap region. | |
| 113 - $nmismatch: Number of mismatches in the overlap region. | |
| 114 - $nindel: Number of indels in the overlap region. | |
| 115 - $prefer: The sequence used for the overlap region. 1=forward; 2=reverse. | |
| 116 - $accept: TRUE if overlap between forward and reverse denoised sequences was at least minOverlap and had at most maxMismatch differences. FALSE otherwise. | |
| 117 - $...: Additional columns specified in propagateCol. | |
| 118 | |
| 119 • Taxonomic Classification (assignTaxonomy, assignSpecies) | |
| 120 | |
| 121 * Other | |
| 122 | |
| 123 makeSequenceTable | |
| 124 in: A list of the samples to include in the sequence table. Samples can be provided in any format that can be processed by getUniques. | |
| 125 out: Named integer matrix (row for each sample, column for each unique sequence) | |
| 126 | |
| 127 mergeSequenceTables | |
| 128 | |
| 129 uniquesToFasta | |
| 130 in: A uniques-vector or any object that can be coerced into one with getUniques. | |
| 131 | |
| 132 getSequences | |
| 133 | |
| 134 extracts the sequences from several different data objects, including dada-class | |
| 135 and derep-class objects, as well as data.frame objects that have both $sequence and $abundance | |
| 136 columns. | |
| 137 | |
| 138 getUniques | |
| 139 | |
| 140 extracts the uniques-vector from several different data objects, including dada-class | |
| 141 and derep-class objects, as well as data.frame objects that have both $sequence and $abundance | |
| 142 columns. | |
| 143 | |
| 144 plotQualityProfile | |
| 145 | |
| 146 seqComplexity | |
| 147 | |
| 148 setDadaOpt(...) | |
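
The **derep-class** layout listed under Datatypes (uniques, quals, map) can be illustrated with a small sketch. It is written in plain Python rather than R and is not dada2's implementation. The function name `dereplicate`, the input format (a list of sequence/quality pairs), the 0-based indices, and the ordering of uniques by decreasing abundance are all assumptions made for illustration.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse (sequence, qualities) reads into a derep-like dict.

    Hypothetical helper for illustration only. Assumes every quality
    list has the same length as its sequence.
    """
    counts = Counter(seq for seq, _ in reads)
    # uniques: sequence -> abundance, most abundant first (assumed ordering)
    uniques = dict(counts.most_common())
    # position of each unique sequence within `uniques` (0-based here)
    index = {seq: i for i, seq in enumerate(uniques)}

    # quals: per-unique average quality score at each position
    sums = {seq: [0.0] * len(seq) for seq in uniques}
    for seq, quals in reads:
        for pos, q in enumerate(quals):
            sums[seq][pos] += q
    quals = [[total / uniques[seq] for total in sums[seq]] for seq in uniques]

    # map: for each read, the index of the unique it was assigned to
    read_map = [index[seq] for seq, _ in reads]
    return {"uniques": uniques, "quals": quals, "map": read_map}


reads = [
    ("ACGT", [30, 30, 20, 10]),
    ("ACGA", [40, 40, 40, 40]),
    ("ACGT", [10, 30, 40, 30]),
]
derep = dereplicate(reads)
print(derep["uniques"])
print(derep["quals"])
print(derep["map"])
```

In this toy layout the abundances in `uniques` always sum to the length of `map`, which is the kind of consistency check a sanity-check step between tools could use.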
