# sickle - A windowed adaptive trimming tool for FASTQ files using quality

## About

Most modern sequencing technologies produce reads whose quality
deteriorates towards the 3'-end, and some towards the 5'-end as well.
Incorrectly called bases in both regions negatively impact assemblies,
mapping, and downstream bioinformatics analyses.

Sickle is a tool that uses sliding windows along with quality and
length thresholds to determine when quality is sufficiently low to
trim the 3'-end of reads, and when quality is sufficiently high to
trim the 5'-end of reads. It also discards reads based on the
length threshold. It takes the quality values and slides a window
across them whose length is 0.1 times the length of the read. If this
length is less than 1, the window is set equal to the length of the
read. Otherwise, the window slides along the quality values until the
average quality in the window rises above the threshold, at which point
the algorithm determines where within the window the rise occurs and
cuts the read and quality strings there for the 5'-end cut. Then, when
the average quality in the window drops below the threshold, the
algorithm determines where in the window the drop occurs and cuts both
the read and quality strings there for the 3'-end cut. However, if the
length of the remaining sequence is less than the minimum length
threshold, the read is discarded entirely. 5'-end trimming can be disabled.

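The windowed trimming described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Sickle's actual C implementation: the function name `window_trim`, its parameters, the default thresholds, the integer truncation of the window length, and the rule used to locate the cut inside a window (first base at or above the threshold for the 5'-end, first base below it for the 3'-end) are all hypothetical choices made for this example.

```python
def window_trim(quals, threshold=20, min_len=20, trim_five_prime=True):
    """Return (start, end) of the region to keep, or None if the read is discarded.

    quals is a list of numeric quality scores, one per base.
    All names and default values here are illustrative assumptions.
    """
    n = len(quals)
    if n == 0:
        return None

    # Window length is 0.1 x read length (truncated here, an assumption);
    # if that is less than 1, the window spans the whole read.
    window = n // 10
    if window < 1:
        window = n

    start = None if trim_five_prime else 0  # 5'-end cut (None = not found yet)
    end = n                                 # 3'-end cut

    for i in range(n - window + 1):
        win = quals[i:i + window]
        avg = sum(win) / window
        if start is None:
            if avg >= threshold:
                # The rise happens somewhere in this window: cut at the
                # first base that is at or above the threshold.
                start = next(i + j for j, q in enumerate(win) if q >= threshold)
        elif avg < threshold:
            # The drop happens somewhere in this window: cut at the
            # first base that is below the threshold.
            end = next(i + j for j, q in enumerate(win) if q < threshold)
            break

    if start is None:
        return None  # the window average never reached the threshold
    if end - start < min_len:
        return None  # remaining sequence is shorter than the length threshold
    return start, end
```

For example, `window_trim([2] * 10 + [30] * 30 + [2] * 10)` returns `(10, 40)`: the ten low-quality bases at each end are cut and the 30 bases in between are kept.
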
Sickle also has an option to discard reads with any Ns in them.

Sickle supports three types of quality values: Illumina, Solexa,
and Sanger. Note that the Solexa quality setting is an approximation
(the actual conversion is a non-linear transformation), but the
resulting values are close. Illumina quality refers to qualities encoded
with the CASAVA pipeline between versions 1.3 and 1.7. Illumina quality
using CASAVA >= 1.8 is Sanger encoded.

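To make the three encodings concrete, here is a small Python sketch that decodes a quality string into numeric scores. It assumes the conventional ASCII offsets (33 for Sanger, 64 for Illumina 1.3 to 1.7 and for Solexa) and, for Solexa, applies the exact non-linear conversion to Phred scores rather than the approximation Sickle uses. The names `OFFSETS` and `decode_quality` are hypothetical and are not part of Sickle.

```python
import math

# Conventional ASCII offsets for each encoding (assumed, not read from Sickle).
OFFSETS = {"sanger": 33, "illumina": 64, "solexa": 64}


def decode_quality(qual_string, qual_type):
    """Convert a FASTQ quality string into a list of Phred-scaled scores."""
    offset = OFFSETS[qual_type]
    scores = [ord(ch) - offset for ch in qual_string]
    if qual_type == "solexa":
        # Exact Solexa-to-Phred conversion:
        # Q_phred = 10 * log10(10 ** (Q_solexa / 10) + 1)
        scores = [10 * math.log10(10 ** (q / 10) + 1) for q in scores]
    return scores
```

The Solexa and Phred scales nearly coincide at high quality and diverge at low quality, which is why a simple approximation gives close results for most bases.
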
Note that Sickle will remove the second FASTQ record header (on the "+" line) and replace it
with simply a "+". This is the default format for CASAVA >= 1.8.

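The effect on a single record can be illustrated as follows. This is only a sketch of the output format, assuming a record is held as a tuple of its four lines; `normalize_plus_line` and the sample record are hypothetical.

```python
def normalize_plus_line(record):
    """Return the 4-line FASTQ record with the '+' line reduced to a bare '+'."""
    header, sequence, plus, quality = record
    if not plus.startswith("+"):
        raise ValueError("third line of a FASTQ record must start with '+'")
    return (header, sequence, "+", quality)


# Made-up record whose "+" line repeats the read name.
record = ("@read1/1", "GATTACA", "+read1/1", "IIIIIII")
print("\n".join(normalize_plus_line(record)))
```
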
Sickle also supports gzipped file inputs. There is also a sickle.xml file
included in the package that can be used to add sickle to your local [Galaxy](http://galaxy.psu.edu/) server.

## Requirements

Sickle requires a C compiler; GCC or clang are recommended. Sickle
relies on Heng Li's kseq.h, which is bundled with the source.

Sickle also requires Zlib, which can be obtained at
<http://www.zlib.net/>.

## Building and Installing Sickle

To build Sickle, enter:

    make

Then, copy or move "sickle" to a directory in your $PATH.

## Usage

Sickle has two modes, one for single-end reads and one for paired-end
reads: `sickle se` and `sickle pe`.

Running sickle by itself will print the help:

    sickle

Running sickle with either the "se" or "pe" commands will give help
specific to those commands:

    sickle se
    sickle pe

### Sickle Single End (`sickle se`)

`sickle se` takes an input fastq file and outputs a trimmed version of
that file. It also has options to change the length and quality
thresholds for trimming, to disable 5'-trimming, and to enable removal
of sequences with Ns.

#### Examples

    sickle se -f input_file.fastq -t illumina -o trimmed_output_file.fastq
    sickle se -f input_file.fastq -t illumina -o trimmed_output_file.fastq -q 33 -l 40
    sickle se -f input_file.fastq -t illumina -o trimmed_output_file.fastq -x -n

### Sickle Paired End (`sickle pe`)

`sickle pe` takes two paired-end files as input and outputs two
trimmed paired-end files as well as a "singles" file. The "singles"
file contains reads that passed filter in one of the paired-end files
but not the other. You can also change the length and quality
thresholds for trimming, disable 5'-trimming, and enable removal
of sequences with Ns.

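The routing of a read pair to the output files can be summarized with a short Python sketch. It assumes each mate has already been judged as passing or failing the filter; the function `route_pair` and its return labels are hypothetical and only mirror the description above.

```python
def route_pair(forward_passed, reverse_passed):
    """Decide where the mates of one read pair end up.

    Returns "paired" when both mates pass (each goes to its own trimmed
    paired-end file), "singles" when exactly one passes (that mate goes to
    the singles file), and "discarded" when neither passes.
    """
    if forward_passed and reverse_passed:
        return "paired"
    if forward_passed or reverse_passed:
        return "singles"
    return "discarded"
```

Keeping only fully surviving pairs in the two paired-end outputs means the two files stay in the same read order, which downstream paired-end tools generally expect.
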
#### Examples

    sickle pe -f input_file1.fastq -r input_file2.fastq -t sanger \
    -o trimmed_output_file1.fastq -p trimmed_output_file2.fastq \
    -s trimmed_singles_file.fastq

    sickle pe -f input_file1.fastq -r input_file2.fastq -t sanger \
    -o trimmed_output_file1.fastq -p trimmed_output_file2.fastq \
    -s trimmed_singles_file.fastq -q 12 -l 15

    sickle pe -f input_file1.fastq -r input_file2.fastq -t sanger \
    -o trimmed_output_file1.fastq -p trimmed_output_file2.fastq \
    -s trimmed_singles_file.fastq -n