<tool id="qiime_split_libraries" name="Split libraries" version="1.9.1">
    <description>according to barcodes specified in mapping file</description>

    <macros>
        <import>macros.xml</import>
    </macros>

    <expand macro="requirements" />

    <version_command><![CDATA[
        split_libraries.py --version
    ]]></version_command>

    <command><![CDATA[
        split_libraries.py
            -m $mapping_fp
            -o split_libraries

            #set $seq_files = ''
            #set $sep = ''
            #for $file in $input_files_fasta
                #set $seq_files += $sep + str($file)
                #set $sep = ','
            #end for
            -f $seq_files

            #if str($input_files_qual) != 'None':
                #set $files = ''
                #set $sep = ''
                #for $file in $input_files_qual
                    #set $files += $sep + str($file)
                    #set $sep = ','
                #end for
                -q $files
            #end if

            -l $min_seq_length
            -L $max_seq_length
            $trim_seq_length

            $keep_primer
            $keep_barcode

            -a $max_ambig
            -H $max_homopolymer
            -M $max_primer_mismatch

            #if str( $barcode_type.barcode_selector ) != "custom_length"
                -b $barcode_type.barcode_selector
            #else
                -b $barcode_type.barcode_length
            #end if

            -e $max_barcode_errors
            -n $start_numbering_at
            $retain_unassigned_reads
            $disable_bc_correction

            #if str($input_files_qual) != 'None':
                -s $min_qual_score
                -w $qual_score_window
                $discard_bad_windows
                $record_qual_scores
            #end if

            $disable_primers
            $truncate_ambi_bases

            $reverse_primers.reverse_primers_test
            #if str($reverse_primers.reverse_primers_test) == '--reverse_primers':
                --reverse_primer_mismatches $reverse_primers.reverse_primer_mismatches
            #end if

            #if str($median_length_filtering):
                -i $median_length_filtering
            #end if

            #if str($added_demultiplex_field):
                -j $added_demultiplex_field
            #end if
]]>
    </command>

    <inputs>
        <param name="mapping_fp" label="Metadata mapping filepath" type="data"
            format="tabular,txt,tsv,csv" help="The file must contain a header
            line with SampleID in the first column, BarcodeSequence in
            the second, and LinkerPrimerSequence in the third. It is recommended to
            check the mapping file with the dedicated tool first (-m/--mapping_fp)"/>

        <param name="input_files_fasta" type="data" format="fasta"
            label="Input fasta files" multiple="True" help="(-f/--fasta)"/>

        <param name="input_files_qual" type="data"
            format="qual,qual454,qualillumina,qualsolexa,qualsolid"
            label="Input quality files (optional)" multiple="True"
            help="(-q/--qual)" optional="True"/>

        <param name="min_seq_length" type="integer" value="200"
            label="Minimum sequence length" help="(-l/--min_seq_length)"/>

        <param name="max_seq_length" type="integer" value="1000"
            label="Maximum sequence length" help="(-L/--max_seq_length)"/>

        <param name="trim_seq_length" type="boolean" label="Compute sequence
            lengths after trimming primers and barcodes?" truevalue="-t" falsevalue=""
            checked="False" help="(-t/--trim_seq_length)" />

        <param name="min_qual_score" type="integer" value="25"
            label="Minimum average quality score allowed in read"
            help="(-s/--min_qual_score)"/>

        <param name="keep_primer" type="boolean" label="Remove primer from
            sequences?" truevalue="" falsevalue="--keep_primer"
            checked="True" help="(-k/--keep_primer)" />

        <param name="keep_barcode" type="boolean" label="Remove barcode from
            sequences?" truevalue="" falsevalue="--keep_barcode"
            checked="True" help="(-B/--keep_barcode)" />

        <param name="max_ambig" type="integer" value="6"
            label="Maximum number of ambiguous bases" help="(-a/--max_ambig)"/>

        <param name="max_homopolymer" type="integer" value="6"
            label="Maximum length of homopolymer run" help="(-H/--max_homopolymer)"/>

        <param name="max_primer_mismatch" type="integer" value="0"
            label="Maximum number of primer mismatches" help="(-M/--max_primer_mismatch)"/>

        <conditional name="barcode_type">
            <param name="barcode_selector" type="select" label="Type of barcode"
                help="(-b/--barcode_type)">
                <option value="hamming_8">hamming_8</option>
                <option value="golay_12" selected="true">golay_12</option>
                <option value="variable_length">variable_length (disable any barcode correction)</option>
                <option value="custom_length">Custom length</option>
            </param>
            <when value="hamming_8" />
            <when value="golay_12" />
            <when value="variable_length" />
            <when value="custom_length">
                <param name="barcode_length" type="integer" value="4"
                label="Barcode length"/>
            </when>
        </conditional>

        <param name="max_barcode_errors" type="float" value="1.5"
            label="Maximum number of errors in barcode"
            help="(-e/--max_barcode_errors)"/>

        <param name="start_numbering_at" type="integer" value="1"
            label="Sequence id to use for the first sequence"
            help="(-n/--start_numbering_at)"/>

        <param name="retain_unassigned_reads" type="boolean" label="Retain
            sequences which are Unassigned in the output sequence file?"
            truevalue="--retain_unassigned_reads" falsevalue=""
            checked="False" help="(--retain_unassigned_reads)" />

        <param name="disable_bc_correction" type="boolean" label="Disable attempts
            to find nearest corrected barcode?"
            truevalue="--disable_bc_correction" falsevalue=""
            checked="False" help="It can improve performance.
            (-c/--disable_bc_correction)" />

        <param name="qual_score_window" type="integer" value="0"
            label="Size of the sliding window" help="If the average score of a
            continuous set of w nucleotides falls below the threshold, the sequence
            is discarded. A good value would be 50. 0 (zero) means no filtering.
            Must pass a .qual file (see -q parameter) if this functionality is
            enabled. Default behavior for this function is to truncate the sequence
            at the beginning of the poor quality window, and test for minimal
            length (-l parameter) of the resulting sequence (-w/--qual_score_window)"/>

        <param name="discard_bad_windows" type="boolean" label="Discard any
            sequences where a bad window is found?"
            truevalue="--discard_bad_windows" falsevalue=""
            checked="False" help="It only takes effect if the sliding window size
            is greater than 0 (-g/--discard_bad_windows)" />

        <param name="disable_primers" type="boolean" label="Disable primer usage
            when demultiplexing?" truevalue="--disable_primers" falsevalue=""
            checked="False" help="It should be enabled only in unusual circumstances,
            such as analyzing Sanger sequence data generated with different primers
             (-p/--disable_primers)" />

        <conditional name="reverse_primers">
            <param name="reverse_primers_test" type="select" label="Enable removal
                of the reverse primer and any subsequent sequence from the end
                of each read?" help="(-z/--reverse_primers)" >
                <option value="--reverse_primers">Yes</option>
                <option value="" selected="true">No</option>
            </param>
            <when value="" />
            <when value="--reverse_primers" >
                <param name="reverse_primer_mismatches" type="integer" value="0"
                label="Number of allowed mismatches for reverse primers"
                help="(--reverse_primer_mismatches)"/>
            </when>
        </conditional>

        <param name="record_qual_scores" type="boolean" label="Record quality
            scores for all sequences that are recorded?" truevalue="--record_qual_scores"
            falsevalue="" checked="False" help="If this option is enabled, a file
            named seqs_filtered.qual will be created in the output directory. It
            will contain the same sequence IDs as the seqs.fna file, with
            quality scores matching the bases present in the seqs.fna file
             (-d/--record_qual_scores)" />

        <param name="median_length_filtering" type="integer"
            label="Median length filtering (optional)" help="It disables minimum
            and maximum sequence length filtering, and instead calculates the median
            sequence length and filters the sequences based upon the number of median
            absolute deviations specified by this parameter. Any sequences with
            lengths outside the number of deviations will be removed
            (-i/--median_length_filtering)"
            optional="True"/>

        <param name="added_demultiplex_field" type="text" label="Field
            to use in the mapping file as additional demultiplexing (optional)"
            help="It can be used with or without barcodes.  All combinations of
            barcodes/primers and these fields must be unique. The fields must contain
            values that can be parsed from the fasta labels such as 'plate=R_2008_12_09'.
            In this case, 'plate' would be the column header and 'R_2008_12_09'
             would be the field data (minus quotes) in the mapping file.
             To use the run prefix from the fasta label, such as 'FLP3FBN01ELBSX',
            where 'FLP3FBN01' is generated from the run ID, use 'run_prefix' and
            set the run prefix to be used as the data under the column header
            'run_prefix' (-j/--added_demultiplex_field)" optional="True"/>

        <param name="truncate_ambi_bases" type="boolean" label="Truncate
            at the first N character encountered in the sequences?"
            truevalue="--truncate_ambi_bases" falsevalue="" checked="False"
            help="This will disable testing for ambiguous bases
            (-x/--truncate_ambi_bases)"/>
    </inputs>

    <outputs>
        <data name="sequences" format="fasta"
            from_work_dir="split_libraries/seqs.fna"
            label="${tool.name} on ${on_string}: sequences"/>

        <data name="log" format="txt"
            from_work_dir="split_libraries/split_library_log.txt"
            label="${tool.name} on ${on_string}: log"/>

        <data name="histograms" format="txt"
            from_work_dir="split_libraries/histograms.txt"
            label="${tool.name} on ${on_string}: histograms"/>

        <data name="quality" format="qual454"
            from_work_dir="split_libraries/seqs_filtered.qual"
            label="${tool.name} on ${on_string}: quality">
            <filter>record_qual_scores is True</filter>
        </data>
    </outputs>

    <tests>
        <test>
        </test>
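        <!-- A minimal sketch of what a functional test could look like. The
             test-data file names (mapping.txt, reads.fna) are hypothetical
             placeholders, not files shipped with this wrapper, and the
             assertion only checks that the output looks like FASTA.
        
        -->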
    </tests>

    <help><![CDATA[

**What it does**

This tool splits libraries according to barcodes specified in a mapping file.

Since newer sequencing technologies provide many reads per run (e.g. the 454 GS FLX Titanium series can produce 400-600 million base pairs with 400-500 base pair read lengths), researchers are now finding it useful to combine multiple samples into a single 454 run. This multiplexing is achieved through the application of a pyrosequencing-tailored nucleotide barcode design (described in Parameswaran et al., 2007). By assigning individual, unique sample-specific barcodes, multiple sequencing runs may be performed in parallel and the resulting reads can later be binned according to sample. The script split_libraries.py performs this task, in addition to several quality filtering steps including user-defined cut-offs for sequence lengths, end-trimming, and minimum quality score. To summarize, by using the fasta, mapping, and quality files, split_libraries.py will parse sequences that meet user-defined quality thresholds and then rename each read with the appropriate Sample ID, thus formatting the sequence data for downstream analysis. If a combination of different sequencing technologies is used in any particular study, split_libraries.py can be used to perform the quality filtering for each library individually and the output may then be combined.

Sequences from samples that are not found in the mapping file (no corresponding barcode) and sequences without the correct primer sequence will be excluded. Additional scripts can be used to exclude sequences that match a given reference sequence (e.g. the human genome; exclude_seqs_by_blast.py) and/or sequences that are flagged as chimeras (identify_chimeric_seqs.py).
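
**Example**

As a rough sketch of the expected input, a mapping file could look like the block below. The sample IDs, barcodes, primer, and descriptions are made-up placeholders rather than data from a real run; columns are separated by tabs in an actual file and are only aligned with spaces here::

    #SampleID   BarcodeSequence   LinkerPrimerSequence    Description
    Sample.1    AGCACGAGCCTA      CATGCTGCCTCCCGTAGGAGT   first_sample
    Sample.2    AACTCGTCGATG      CATGCTGCCTCCCGTAGGAGT   second_sample

Assuming hypothetical input names mapping.txt and reads.fna, and the default barcode type, the command assembled by this wrapper is then roughly equivalent to::

    split_libraries.py -m mapping.txt -f reads.fna -o split_libraries -b golay_12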

More information about this tool is available on
`QIIME documentation <http://qiime.org/scripts/split_libraries.html>`_.
    ]]>
    </help>

    <citations>
        <expand macro="citations" />
    </citations>
</tool>
author	bebatut
date	Tue, 02 Feb 2016 05:50:37 -0500