#+TITLE: TAREAN output description
#+HTML_HEAD_EXTRA: <link rel="stylesheet" type="text/css" href="style1.css" />
#+LANGUAGE: en

* Introduction
TAREAN output includes an *HTML report* listing all analyzed clusters. The clusters are classified into five categories:
+ high confidence satellites
+ low confidence satellites
+ potential LTR elements
+ rDNA
+ other clusters
Each cluster for which consensus sequences were reconstructed also has its own detailed report, linked from the main report.

* Main HTML report
This report contains basic information about all clusters larger than the specified threshold (the default value is 0.01% of analyzed reads).
** Table legend
+ Cluster :: Cluster identifier
+ Genome Proportion[%] :: /(Number of sequences in cluster/Number of sequences in clustering) x 100%/
+ Size :: Number of reads in the cluster
+ Satellite probability :: Empirical estimate of the probability that cluster sequences
     are derived from a satellite repeat. This estimate is based on analysis of more
     than xxx clusters including yyy manually annotated and zzz experimentally
     validated satellite repeats
+ Consensus :: Consensus sequence is the outcome of kmer-based
     analysis and represents the most probable satellite monomer
     sequence
+ Kmer analysis ::
     Link to the analysis report for the individual cluster
+ Graph layout :: Graph-based visualization of similarities among sequence
     reads
+ Connected component index :: Proportion of nodes of the graph which are part
     of the largest strongly connected component
+ Pair completeness index :: Proportion of reads with an available
     mate-pair within the same cluster
+ Kmer coverage :: Sum of relative frequencies of all kmers used for consensus
     sequence reconstruction
+ |V| :: Number of vertices of the graph
+ |E| :: Number of edges of the graph
+ PBS score :: Primer binding site detection score
+ The longest ORF length :: Length of the longest open reading frame found in
     any of the six possible reading frames. The search was done on a dimer of the
     consensus, so ORFs can be longer than the 'monomer' length
+ Similarity-based annotation :: Annotation based on
     similarity search using blastn/blastx against a database of known
     repeats.
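
The "longest ORF length" entry above can be illustrated with a minimal Python sketch. It is not TAREAN's own code: the function names are hypothetical, and it assumes that an ORF is any stretch of codons uninterrupted by a stop codon (no start codon required), with its length measured in nucleotides. The consensus is joined to itself before the six reading frames are scanned, which is why the result can exceed the monomer length.

```python
# Hypothetical helper names; a sketch under stated assumptions,
# not the implementation used by TAREAN.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_orf_in_frame(seq, offset):
    """Longest run of non-stop codons, in nucleotides, in one reading frame."""
    best = current = 0
    for i in range(offset, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            current = 0          # a stop codon ends the current stretch
        else:
            current += 3
            best = max(best, current)
    return best


def longest_orf_on_dimer(monomer):
    """Scan all six reading frames of the monomer joined to itself."""
    dimer = (monomer + monomer).upper()
    strands = (dimer, reverse_complement(dimer))
    return max(longest_orf_in_frame(s, off) for s in strands for off in range(3))
```

For the 9 nt monomer =ATGAAACCC=, =longest_orf_on_dimer= returns 18, twice the monomer length, because the first reading frame of the dimer contains no stop codon.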
* Detailed cluster report
The cluster report includes a list of major monomer sequence variants reconstructed from the most frequent k-mers. The reconstructed consensus sequences are sorted by their significance (that is, by the proportion of k-mers they represent).
** Table legend
- kmer :: length of the k-mer used for consensus reconstruction.
- variant :: identifier of the consensus variant.
- total score :: measure of significance of the consensus variant. The score is calculated as the sum of weights of all k-mers used for consensus reconstruction.
- monomer length :: length of the consensus
- consensus :: consensus sequence without ambiguous bases.
- graph image :: part of the de Bruijn graph based on the abundant k-mers. The size of
     vertices corresponds to k-mer frequencies. Paths in the graph which were used
     for reconstruction of consensus sequences are colored gray.
- logo image :: consensus sequences shown as a DNA logo. The height of letters corresponds to k-mer frequencies. Logo images are linked to the corresponding position probability matrices.
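
To show how a position probability matrix relates to a sequence without ambiguous bases, the sketch below picks the most probable base at each position. This is only an illustration: the in-memory layout (one row per position, probabilities in the order A, C, G, T), the function name and the example values are assumptions, and TAREAN itself derives its consensus from k-mer analysis as described above.

```python
BASES = "ACGT"  # assumed column order of the matrix


def consensus_from_ppm(ppm):
    """Pick the most probable base at each position of a PPM.

    `ppm` is a list with one row per position; each row holds four
    probabilities in the order A, C, G, T (an assumed layout).
    """
    consensus = []
    for row in ppm:
        if len(row) != len(BASES):
            raise ValueError("each row needs exactly four probabilities")
        best_index = max(range(len(BASES)), key=lambda i: row[i])
        consensus.append(BASES[best_index])
    return "".join(consensus)


# Made-up example values, for illustration only.
example_ppm = [
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.10, 0.10, 0.70],
    [0.05, 0.15, 0.75, 0.05],
]
print(consensus_from_ppm(example_ppm))  # prints ATG
```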

* Structure of the output archive
Complete results from the TAREAN analysis can be downloaded as a zip archive which contains the following
files and directories:

#+BEGIN_SRC files & directories
.
├── clusters_info.csv <------------ list of clusters in tab delimited format
├── index.html        <------------ main html report
├── seqclust
│   ├── assembly                  # not implemented yet
│   ├── blastn        <------------ results of read comparison with DNA database
│   ├── blastx        <------------ results of read comparison with protein database
│   ├── clustering
│   │   ├── clusters
│   │   │   ├── dir_CL0001  <----┐- detailed information about clusters
│   │   │   ├── dir_CL0002  <----│
│   │   │   ├── dir_CL0003  <----│
│   │   │   ....            <----┘
│   │   │
│   │   └── hitsort.cls  <--------- list of reads in individual clusters
│   ├── mgblast
│   ├── prerun
│   └── sequences        <--------- input reads
├── summary                       # not implemented yet
├── TR_consensus_rank_1_.fasta  <-- reconstructed monomer sequences for HIGH confidence satellites
├── TR_consensus_rank_2_.fasta  <-- reconstructed monomer sequences for LOW confidence satellites
├── TR_consensus_rank_3_.fasta  <-- reconstructed sequences of potential LTR elements
└── TR_consensus_rank_4_.fasta  <-- reconstructed consensus for rDNA
#+END_SRC
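
Although =clusters_info.csv= has a =.csv= extension, the listing above describes it as tab delimited, so a reader has to be told to split on tabs. The following Python sketch shows one way to load it. The function name is hypothetical, and the sketch assumes that the first row of the file holds the column names; the code does not rely on any particular column name.

```python
import csv


def read_clusters_info(path):
    """Load a tab-delimited table into a list of dicts keyed by the header row.

    Assumes the first line of the file contains the column names.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return list(reader)
```

All values are returned as strings, so numeric columns need to be converted before sorting or filtering.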

The list of all clusters shown in the HTML file =index.html= is also
available in tab-delimited format in the file =clusters_info.csv=, which can be
easily viewed and edited in spreadsheet programs. The list of all clusters
and the corresponding reads is in the file =hitsort.cls=, which has the following
format:

  :  >CL1    11
  :  134234r 55494f  85525f  136746r 96742f  91926f  239729r 105445f 222518r 136402r 9013
  :  >CL2    10
  :  76205r  120735r 69527r  12235r  176778f 189307f 131952f 163507f 100038r 178475r
  :  >CL3    6
  :  99835r  222598f 29715r  102023f 99524r  30116f
  :  >CL4    6
  :  51723r  69073r  218774r 146425f 136314r 41744f
  :  >CL5    5
  :  70686f  65565f  234078r 50430r  68247r

where =CL1 11= is the cluster ID followed by the number of reads in the cluster;
the next line lists the names of all reads belonging to the cluster.
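
The format above is simple enough to parse by hand. Below is a minimal Python sketch, with a hypothetical function name, that maps each cluster ID to its declared size and its list of read names. It assumes that fields are separated by any whitespace (tabs or spaces) and that the read names follow the header line.

```python
def parse_hitsort_cls(lines):
    """Map each cluster ID to (declared_size, list_of_read_names)."""
    clusters = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: cluster ID followed by the number of reads.
            fields = line[1:].split()
            current_id = fields[0]
            clusters[current_id] = (int(fields[1]), [])
        elif current_id is not None:
            # Read names belonging to the most recent header.
            clusters[current_id][1].extend(line.split())
    return clusters


sample = """\
>CL3    6
99835r  222598f 29715r  102023f 99524r  30116f
>CL4    6
51723r  69073r  218774r 146425f 136314r 41744f
>CL5    5
70686f  65565f  234078r 50430r  68247r
"""
clusters = parse_hitsort_cls(sample.splitlines())
for cluster_id, (size, reads) in clusters.items():
    # The declared size should match the number of listed reads.
    print(cluster_id, size, len(reads) == size)
```

For a real file, the same function can be fed an open file handle, for example =parse_hitsort_cls(open("hitsort.cls"))=, since it only iterates over lines.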
** Structure of cluster directories

Detailed information for each cluster is stored in subdirectories:

#+BEGIN_SRC folder directories
dir_CL0011
├── blast.csv        <------------ tab delimited file, all-to-all comparison of reads within cluster
├── CL11_directed_graph.RData <---- directed graph representation of cluster saved as R igraph object
├── CL11.GL     <----------------- undirected graph representation of cluster saved as R igraph object
├── CL11.png         <-----------┐- images with graph visualization
├── CL11_tmb.png     <-----------┘
├── dna_database_annotation.csv <-- annotation of cluster reads based on the DNA database of repeats
├── reads_all.fas   <---------------- all reads included in the cluster in fasta format
├── reads.fas      <---------------- subset of reads used for monomer reconstruction
├── reads_oriented.fas <------------ subset of reads all in the same orientation
└── tarean
    ├── consensus.fasta <----------- fasta file with tandem repeat consensus variants
    ├── ggmin.RData
    ├── img
    │   ├── graph_11mer_1.png  <-----┐
    │   ├── graph_11mer_2.png  <-----│
    │   ├── graph_15mer_2.png  <-----│
    │   ├── graph_15mer_3.png  <-----│
    │   ├── graph_15mer_4.png  <-----│ images of kmer-based graphs used for reconstruction of
    │   ├── graph_19mer_2.png  <-----│ monomer variants
    │   ├── graph_19mer_4.png  <-----│
    │   ├── graph_19mer_5.png  <-----│
    │   ├── graph_23mer_2.png  <-----│
    │   ├── graph_27mer_3.png  <-----┘
    │   │
    │   ├── logo_11mer_1.png  <-----┐
    │   ├── logo_11mer_2.png  <-----│
    │   ├── logo_15mer_2.png  <-----│
    │   ├── logo_15mer_3.png  <-----│
    │   ├── logo_15mer_4.png  <-----│ images with DNA logos representing consensus sequences
    │   ├── logo_19mer_2.png  <-----│ of monomer variants
    │   ├── logo_19mer_4.png  <-----│
    │   ├── logo_19mer_5.png  <-----│
    │   ├── logo_23mer_2.png  <-----│
    │   └── logo_27mer_3.png  <-----┘
    │
    ├── ppm_11mer_1.csv  <-----┐
    ├── ppm_11mer_2.csv  <-----│
    ├── ppm_15mer_2.csv  <-----│
    ├── ppm_15mer_3.csv  <-----│
    ├── ppm_15mer_4.csv  <-----│ position probability matrices for individual monomer
    ├── ppm_19mer_2.csv  <-----│ variants derived from k-mer frequencies
    ├── ppm_19mer_4.csv  <-----│
    ├── ppm_19mer_5.csv  <-----│
    ├── ppm_23mer_2.csv  <-----│
    ├── ppm_27mer_3.csv  <-----┘
    │
    ├── reads_oriented.fas_11.kmers  <-----┐
    ├── reads_oriented.fas_15.kmers  <-----│
    ├── reads_oriented.fas_19.kmers  <-----│ k-mer frequencies calculated on oriented reads
    ├── reads_oriented.fas_23.kmers  <-----│ for k-mer lengths 11 - 27
    ├── reads_oriented.fas_27.kmers  <-----┘
    ├── reads_oriented.fasblast_out.cvs  <---------┐ results of blastn search against database of tRNA
    ├── reads_oriented.fasblast_out.cvs_L.csv <----│ for purposes of LTR detection
    ├── reads_oriented.fasblast_out.cvs_R.csv <----┘
    └── report.html       <--- cluster analysis HTML summary
#+END_SRC
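
The =.kmers= files listed above hold k-mer frequencies calculated on oriented reads, and the "Kmer coverage" value in the main report is defined as the sum of relative frequencies of the k-mers used for consensus reconstruction. The Python sketch below illustrates both ideas on in-memory reads. It is not TAREAN's code: the function names are hypothetical, the layout of the =.kmers= files is not described here and is not read, and k-mers are counted on the given strand only, on the assumption that the reads are already oriented.

```python
from collections import Counter


def kmer_relative_frequencies(reads, k):
    """Count every k-mer in the reads and scale the counts to sum to 1."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for start in range(len(read) - k + 1):
            counts[read[start:start + k]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}  # no read was long enough to contain a k-mer
    return {kmer: n / total for kmer, n in counts.items()}


def kmer_coverage(frequencies, used_kmers):
    """Sum the relative frequencies of the k-mers used for a consensus."""
    return sum(frequencies.get(kmer, 0.0) for kmer in set(used_kmers))


# Toy input with a short k for readability; the files above use k = 11 to 27.
freqs = kmer_relative_frequencies(["ACGTACGT"], k=4)
coverage = kmer_coverage(freqs, ["ACGT", "CGTA"])
```

In this toy input =ACGT= occurs twice among five 4-mers, so its relative frequency is 0.4, and a consensus built from =ACGT= and =CGTA= would cover 0.6 of all k-mer occurrences.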