#+TITLE: TAREAN output description
#+HTML_HEAD_EXTRA: <link rel="stylesheet" type="text/css" href="style1.css" />
#+LANGUAGE: en

* Introduction
TAREAN output includes an *HTML report* listing all analyzed clusters. The clusters are classified into five categories:
+ high confidence satellites
+ low confidence satellites
+ potential LTR elements
+ rDNA
+ other clusters
Each cluster for which consensus sequences were reconstructed also has its own detailed report, linked from the main report.

* Main HTML report
This report contains basic information about all clusters larger than the specified threshold (the default value is 0.01% of analyzed reads).
** Table legend
+ Cluster :: Cluster identifier
+ Genome Proportion[%] :: /(Number of sequences in cluster / Number of sequences in clustering) x 100%/
+ Size :: Number of reads in the cluster
+ Satellite probability :: Empirical probability estimate that cluster sequences
     are derived from a satellite repeat. This estimate is based on analysis of more
     than xxx clusters, including yyy manually annotated and zzz experimentally
     validated satellite repeats
+ Consensus :: Consensus sequence is the outcome of k-mer-based
     analysis and represents the most probable satellite monomer
     sequence
+ Kmer analysis ::
     Link to the analysis report for the individual cluster
+ Graph layout :: Graph-based visualization of similarities among sequence
     reads
+ Connected component index :: Proportion of nodes of the graph which are part
     of the largest strongly connected component
+ Pair completeness index :: Proportion of reads with an available
     mate-pair within the same cluster
+ Kmer coverage :: Sum of relative frequencies of all k-mers used for consensus
     sequence reconstruction
+ |V| :: Number of vertices of the graph
+ |E| :: Number of edges of the graph
+ PBS score :: Primer binding site detection score
+ The longest ORF length :: Length of the longest open reading frame found in
     any of the six possible reading frames. The search was done on a dimer of
     the consensus, so ORFs can be longer than the 'monomer' length
+ Similarity-based annotation :: Annotation based on
     similarity search using blastn/blastx against a database of known
     repeats.
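The /Genome Proportion/ column and the reporting threshold can be illustrated with a short Python sketch. The function name =genome_proportion=, the read counts and the strict "greater than" comparison against the 0.01% default are assumptions made for this example; they are not taken from the TAREAN code.

#+BEGIN_SRC python
def genome_proportion(cluster_size, total_reads):
    """Percentage of all clustered reads that belong to one cluster."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return cluster_size / total_reads * 100.0


# Hypothetical counts: three clusters from a clustering of 100000 reads.
total_reads = 100_000
sizes = {"CL1": 2500, "CL2": 11, "CL3": 6}
threshold = 0.01  # default reporting threshold, in percent of analyzed reads

# Keep only clusters whose proportion exceeds the threshold (assumed strict).
reported = {}
for name, size in sizes.items():
    proportion = genome_proportion(size, total_reads)
    if proportion > threshold:
        reported[name] = proportion

print(reported)
#+END_SRC

With these made-up numbers, =CL3= (6 reads) falls below the threshold and would not appear in the main report.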
* Detailed cluster report
The cluster report includes a list of major monomer sequence variants reconstructed from the most frequent k-mers. The reconstructed consensus sequences are sorted by their significance (that is, the proportion of k-mers they represent).
** Table legend
- kmer :: length of the k-mer used for consensus reconstruction.
- variant :: identifier of the consensus variant.
- total score :: measure of significance of the consensus variant. The score is calculated as the sum of weights of all k-mers used for consensus reconstruction.
- monomer length :: length of the consensus.
- consensus :: consensus sequence without ambiguous bases.
- graph image :: part of the de Bruijn graph based on the abundant k-mers. The size of
     vertices corresponds to k-mer frequencies. Paths in the graph that were used
     for reconstruction of consensus sequences are colored gray.
- logo image :: consensus sequences shown as a DNA logo. The height of letters corresponds to k-mer frequencies. Logo images are linked to the corresponding position probability matrices.

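Several columns above are sums over k-mers. The following Python sketch shows one way such a sum could be computed: it counts k-mers in a set of reads, converts the counts to relative frequencies and adds up the frequencies of a chosen subset. The function names, the toy reads and the use of relative frequency as the k-mer weight are assumptions for illustration; how TAREAN selects and weights k-mers is not described here.

#+BEGIN_SRC python
from collections import Counter


def kmer_frequencies(reads, k):
    """Relative frequency of every k-mer found in the reads."""
    counts = Counter()
    for read in reads:
        # Reads shorter than k contribute nothing (the range is empty).
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def summed_frequency(freqs, used_kmers):
    """Sum of relative frequencies of the k-mers in used_kmers."""
    return sum(freqs.get(kmer, 0.0) for kmer in used_kmers)


# Toy reads, assumed to be already in the same orientation.
reads = ["ACGTAC", "CGTACG"]
freqs = kmer_frequencies(reads, 3)
coverage = summed_frequency(freqs, {"ACG", "CGT"})
print(freqs)
print(coverage)
#+END_SRC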
* Structure of the output archive
Complete results from TAREAN analysis can be downloaded as a zip archive which contains the following
files and directories:

#+BEGIN_SRC files & directories
.
├── clusters_info.csv <------------ list of clusters in tab delimited format
├── index.html        <------------ main HTML report
├── seqclust
│   ├── assembly                  # not implemented yet
│   ├── blastn        <------------ results of read comparison with DNA database
│   ├── blastx        <------------ results of read comparison with protein database
│   ├── clustering
│   │   ├── clusters
│   │   │   ├── dir_CL0001  <----┐- detailed information about clusters
│   │   │   ├── dir_CL0002  <----│
│   │   │   ├── dir_CL0003  <----│
│   │   │   ....            <----┘
│   │   │
│   │   └── hitsort.cls  <--------- list of reads in individual clusters
│   ├── mgblast
│   ├── prerun
│   └── sequences        <--------- input reads
├── summary                       # not implemented yet
├── TR_consensus_rank_1_.fasta  <-- reconstructed monomer sequences for HIGH confidence satellites
├── TR_consensus_rank_2_.fasta  <-- reconstructed monomer sequences for LOW confidence satellites
├── TR_consensus_rank_3_.fasta  <-- reconstructed sequences of potential LTR elements
└── TR_consensus_rank_4_.fasta  <-- reconstructed consensus for rDNA

#+END_SRC

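The rank number in the =TR_consensus_rank_N_.fasta= file names corresponds to the cluster categories listed in the tree above. A minimal Python sketch of that mapping follows; the helper name =category_for= and the regular expression are assumptions for this example, not part of TAREAN.

#+BEGIN_SRC python
import re

# Rank number -> category, as annotated in the archive listing.
RANK_CATEGORY = {
    1: "high confidence satellites",
    2: "low confidence satellites",
    3: "potential LTR elements",
    4: "rDNA",
}

_RANK_FILE = re.compile(r"^TR_consensus_rank_(\d+)_\.fasta$")


def category_for(filename):
    """Return the category for a TR_consensus_rank_N_.fasta name, else None."""
    match = _RANK_FILE.match(filename)
    if match is None:
        return None
    return RANK_CATEGORY.get(int(match.group(1)))


for name in ["index.html", "TR_consensus_rank_1_.fasta", "TR_consensus_rank_4_.fasta"]:
    print(name, "->", category_for(name))
#+END_SRC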
The list of all clusters shown in the HTML file =index.html= is also
available in tab-delimited format in the file =clusters_info.csv=, which can be
easily viewed and edited in spreadsheet programs. The list of all clusters
and the corresponding reads is in the file =hitsort.cls=, which has the following
format:

  :  >CL1    11
  :  134234r 55494f  85525f  136746r 96742f  91926f  239729r 105445f 222518r 136402r 9013
  :  >CL2    10
  :  76205r  120735r 69527r  12235r  176778f 189307f 131952f 163507f 100038r 178475r
  :  >CL3    6
  :  99835r  222598f 29715r  102023f 99524r  30116f
  :  >CL4    6
  :  51723r  69073r  218774r 146425f 136314r 41744f
  :  >CL5    5
  :  70686f  65565f  234078r 50430r  68247r

where =CL1 11= is the cluster ID followed by the number of reads in the cluster;
the next line lists all read names belonging to the cluster.
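The =hitsort.cls= format described above can be read with a few lines of Python. This is a minimal sketch that assumes fields are separated by whitespace (tabs or spaces) and that every header line starts with =>=; the function name =parse_hitsort= is hypothetical.

#+BEGIN_SRC python
def parse_hitsort(lines):
    """Map each cluster ID to the list of read names that follow its header."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: ">CL1    11" -> cluster ID, then the read count.
            current = line[1:].split()[0]
            clusters[current] = []
        elif current is not None:
            clusters[current].extend(line.split())
    return clusters


example = """\
>CL3    6
99835r  222598f 29715r  102023f 99524r  30116f
>CL5    5
70686f  65565f  234078r 50430r  68247r
"""

clusters = parse_hitsort(example.splitlines())
for cluster_id, reads in clusters.items():
    print(cluster_id, len(reads), reads[:3])
#+END_SRC

For a real archive, the same function could be given an open file handle for =seqclust/clustering/hitsort.cls= instead of the example lines.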
** Structure of cluster directories

Detailed information for each cluster is stored in subdirectories:

#+BEGIN_SRC folder directories
dir_CL0011
├── blast.csv        <------------ tab delimited file, all-to-all comparison of reads within cluster
├── CL11_directed_graph.RData <---- directed graph representation of cluster saved as R igraph object
├── CL11.GL     <----------------- undirected graph representation of cluster saved as R igraph object
├── CL11.png         <-----------┐- images with graph visualization
├── CL11_tmb.png     <-----------┘
├── dna_database_annotation.csv <-- annotation of cluster reads based on the DNA database of repeats
├── reads_all.fas   <---------------- all reads included in the cluster in fasta format
├── reads.fas      <---------------- subset of reads used for monomer reconstruction
├── reads_oriented.fas <------------ subset of reads all in the same orientation
└── tarean
    ├── consensus.fasta <----------- fasta file with tandem repeat consensus variants
    ├── ggmin.RData
    ├── img
    │   ├── graph_11mer_1.png  <-----┐
    │   ├── graph_11mer_2.png  <-----│
    │   ├── graph_15mer_2.png  <-----│
    │   ├── graph_15mer_3.png  <-----│
    │   ├── graph_15mer_4.png  <-----│ images of kmer-based graphs used for reconstruction of
    │   ├── graph_19mer_2.png  <-----│ monomer variants
    │   ├── graph_19mer_4.png  <-----│
    │   ├── graph_19mer_5.png  <-----│
    │   ├── graph_23mer_2.png  <-----│
    │   ├── graph_27mer_3.png  <-----┘
    │   │
    │   ├── logo_11mer_1.png  <-----┐
    │   ├── logo_11mer_2.png  <-----│
    │   ├── logo_15mer_2.png  <-----│
    │   ├── logo_15mer_3.png  <-----│
    │   ├── logo_15mer_4.png  <-----│ images with DNA logos representing consensus sequences
    │   ├── logo_19mer_2.png  <-----│ of monomer variants
    │   ├── logo_19mer_4.png  <-----│
    │   ├── logo_19mer_5.png  <-----│
    │   ├── logo_23mer_2.png  <-----│
    │   └── logo_27mer_3.png  <-----┘
    │
    ├── ppm_11mer_1.csv  <-----┐
    ├── ppm_11mer_2.csv  <-----│
    ├── ppm_15mer_2.csv  <-----│
    ├── ppm_15mer_3.csv  <-----│
    ├── ppm_15mer_4.csv  <-----│ position probability matrices for individual monomer
    ├── ppm_19mer_2.csv  <-----│ variants derived from k-mer frequencies
    ├── ppm_19mer_4.csv  <-----│
    ├── ppm_19mer_5.csv  <-----│
    ├── ppm_23mer_2.csv  <-----│
    ├── ppm_27mer_3.csv  <-----┘
    │
    ├── reads_oriented.fas_11.kmers  <-----┐
    ├── reads_oriented.fas_15.kmers  <-----│
    ├── reads_oriented.fas_19.kmers  <-----│ k-mer frequencies calculated on oriented reads
    ├── reads_oriented.fas_23.kmers  <-----│ for k-mer lengths 11 - 27
    ├── reads_oriented.fas_27.kmers  <-----┘
    ├── reads_oriented.fasblast_out.cvs  <---------┐ results of blastn search against database of tRNA
    ├── reads_oriented.fasblast_out.cvs_L.csv <----│ for purposes of LTR detection
    ├── reads_oriented.fasblast_out.cvs_R.csv <----┘
    └── report.html       <--- cluster analysis HTML summary
#+END_SRC
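The graph, logo and PPM files in the =tarean= directory share one naming pattern, so the files for a given monomer variant can be grouped by name. The Python sketch below reads the two numbers in a name such as =ppm_15mer_3.csv= as the k-mer length and the variant identifier; that reading, and the helper name =parse_variant_file=, are assumptions for this example.

#+BEGIN_SRC python
import re

_VARIANT_FILE = re.compile(r"^(graph|logo|ppm)_(\d+)mer_(\d+)\.(png|csv)$")


def parse_variant_file(filename):
    """Return (kind, k, variant) for names such as ppm_15mer_3.csv, else None."""
    match = _VARIANT_FILE.match(filename)
    if match is None:
        return None
    kind, k, variant, _extension = match.groups()
    return kind, int(k), int(variant)


# Group file kinds by (k-mer length, variant); other files are skipped.
files = ["ppm_11mer_1.csv", "logo_11mer_1.png", "graph_27mer_3.png", "report.html"]
by_variant = {}
for name in files:
    parsed = parse_variant_file(name)
    if parsed is not None:
        kind, k, variant = parsed
        by_variant.setdefault((k, variant), []).append(kind)

print(by_variant)
#+END_SRC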