<tool id="longdust" name="longdust" version="@TOOL_VERSION@+galaxy@VERSION_SUFFIX@" profile="@PROFILE@" license="MIT">
    <description>Detect low-complexity regions in long sequences</description>
    <macros>
        <import>macros.xml</import>
    </macros>
    <expand macro="requirements"/>
    <command detect_errors="exit_code"><![CDATA[
        longdust '$input'
        -k $k
        -w $w
        -g $g
        -t $t
        -e $e
        $f
        $a
        > '$output'
    ]]></command>
    <inputs>
        <param name="input" type="data" format="fasta,fasta.gz" label="Input FASTA file"/>
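        <param argument="-k" type="integer" label="k-mer length" value="7" help="Length of the k-mers used to score complexity. For performance, choose k so that 4^k exceeds the window size."/>
        <param argument="-w" type="integer" label="Window size" value="5000" help="Size of the scanning window. For performance, keep the window size below 4^k, especially for large windows."/>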
        <param argument="-g" type="float" label="Genome-wide GC content" value="0.5" help="Specify genome-wide GC content"/>
        <param argument="-t" type="float" label="Score threshold" value="0.6" help="Score threshold"/>
        <param argument="-e" type="integer" label="Extension X-drop length" value="50" help="Extension X-drop length (0 to disable)"/>
        <param argument="-f" type="boolean" label="Forward strand only" truevalue="-f" falsevalue="" checked="false" help="Limit analysis to forward strand only" />
        <param argument="-a" type="boolean" label="Enable guaranteed O(Lw) algorithm" truevalue="-a" falsevalue="" checked="false" help="Use the guaranteed O(Lw) algorithm, which trades accuracy for speed on large genomes. This mode evaluates only the smallest candidate start per position, which bounds the runtime at O(Lw) but may miss ~5-10% of the low-complexity regions found by the default algorithm."/>
    </inputs>
    <outputs>
        <data name="output" format="bed"/>
    </outputs>
    <tests>
        <test expect_num_outputs="1">
            <param name="input" location="https://zenodo.org/records/17226147/files/GCF_000146045.2_R64_genomic.fna.gz"/>
            <param name="k" value="6"/>
            <param name="w" value="1000"/>
            <param name="t" value="0.55"/>
            <param name="g" value="0.5"/>
            <param name="e" value="0"/>
            <param name="f" value="false"/>
            <param name="a" value="false"/>
            <output name="output" ftype="bed">
                <assert_contents>
                    <has_n_columns n="3"/>
                    <has_n_lines n="7426"/>
                </assert_contents>
            </output>
        </test>
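        <!-- Illustrative sketch, not taken from the original wrapper: a second test
             that enables the forward-strand-only flag (-f) on the same input.
             It assumes the flag does not change the three-column BED layout, so it
             asserts only the column count. The expected line count is unknown and
             is deliberately left out. -->
        <test expect_num_outputs="1">
            <param name="input" location="https://zenodo.org/records/17226147/files/GCF_000146045.2_R64_genomic.fna.gz"/>
            <param name="k" value="6"/>
            <param name="w" value="1000"/>
            <param name="t" value="0.55"/>
            <param name="f" value="true"/>
            <output name="output" ftype="bed">
                <assert_contents>
                    <has_n_columns n="3"/>
                </assert_contents>
            </output>
        </test>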
    </tests>
    <help><![CDATA[

.. class:: infomark

**What it does**

*longdust* detects low-complexity (dusty) regions in long DNA sequences. It scans input FASTA sequences using k-mer statistics and reports regions that fall below a complexity threshold.
These regions are often repetitive or homopolymeric stretches that may interfere with sequence analysis, alignment, or downstream bioinformatics pipelines.

The method is tunable via parameters for k-mer size, window size, score threshold, and extension length, allowing you to control how strict or relaxed the detection should be.


**Input**

- A FASTA file containing DNA sequences (typically long reads or assembled contigs).
- Optional parameters to configure detection:

  - **-k** : k-mer length (default 7)
  - **-w** : window size (default 5000)
  - **-g** : genome-wide GC content (default 0.5)
  - **-t** : score threshold (default 0.6)
  - **-e** : extension X-drop length; 0 disables extension (default 50)
  - **-f** : analyze the forward strand only (optional flag)
  - **-a** : guaranteed O(Lw) algorithm with increased approximation (optional flag)

  * For performance, keep w < 4^k, especially when w is large
  * Use ``-k6 -w1000 -t.55`` for a more relaxed setting that reports shorter regions
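
As a rough sketch of how these settings map onto the command line, the relaxed preset above corresponds to a call like the one below. The file names ``genome.fa`` and ``regions.bed`` are hypothetical, and the argument order simply mirrors the command this wrapper builds. Both the defaults (4^7 = 16384 > 5000) and this preset (4^6 = 4096 > 1000) satisfy w < 4^k::

    longdust genome.fa -k 6 -w 1000 -t 0.55 > regions.bed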

**Output**

- A three-column BED file (sequence name, start, end) listing the detected low-complexity regions, one region per line

    ]]></help>
    <expand macro="citations"/>
    <expand macro="creator"/>
</tool>
author	iuc
date	Mon, 01 Dec 2025 12:21:21 +0000
parents	9607b6eccee4