annotate awk.xml @ 0:ec66f9d90ef0 draft

initial uploaded
author bgruening
date Thu, 05 Sep 2013 04:58:21 -0400
parents
children a4ad586d1403
<tool id="unixtools_awk_tool" name="Awk" version="0.1.1">
    <description></description>
    <requirements>
        <requirement type="package" version="4.1.0">gnu_awk</requirement>
    </requirements>
    <command>
        awk --sandbox -v FS=\$'\t' -v OFS=\$'\t' --re-interval -f '$awk_script' '$input' &gt; '$output'
    </command>
    <inputs>
        <param format="txt" name="input" type="data" label="File to process" />
        <param name="url_paste" type="text" area="true" size="5x35" label="AWK Program" help="">
            <sanitizer>
                <valid initial="string.printable">
                    <remove value="&apos;"/>
                </valid>
            </sanitizer>
        </param>
    </inputs>
    <tests>
        <test>
            <param name="input" value="unix_awk_input1.txt" />
            <param name="url_paste" value="$2>0.5 { print $2*9, $1 }" />
            <output name="output" file="unix_awk_output1.txt" />
        </test>
    </tests>
    <outputs>
        <data format_source="input" name="output" metadata_source="input" />
    </outputs>
    <configfiles>
        <configfile name="awk_script">
$url_paste
        </configfile>
    </configfiles>
    <help>

**What it does**

This tool runs the Unix **awk** command on the selected data file.

.. class:: infomark

**TIP:** This tool uses the **extended regular expression** syntax (not the Perl syntax).


**Further reading**

- Awk by Example (http://www.ibm.com/developerworks/linux/library/l-awk1.html)
- Long AWK tutorial (http://www.grymoire.com/Unix/Awk.html)
- Learn AWK in 1 hour (http://www.selectorweb.com/awk.html)
- awk cheat-sheet (http://cbi.med.harvard.edu/people/peshkin/sb302/awk_cheatsheets.pdf)
- Collection of useful awk one-liners (http://student.northpark.edu/pemente/awk/awk1line.txt)

-----

**AWK programs**

Most AWK programs consist of **patterns** (i.e. rules that match lines of text) and **actions** (i.e. commands to execute when a pattern matches a line).

The basic form of an AWK program is::

    pattern { action 1; action 2; action 3; }


**Pattern Examples**

- **$2 == "chr3"** will match lines whose second column is the string 'chr3'.
- **$5-$4>23** will match lines where the value of the fifth column minus the value of the fourth column is larger than 23.
- **/AG..AG/** will match lines that contain the regular expression **AG..AG** (meaning the characters AG followed by any two characters followed by AG). (This is the way to specify regular expressions on the entire line, similar to grep.)
- **$7 ~ /A{4}U/** will match lines whose seventh column contains 4 consecutive A's followed by a U. (This is the way to specify regular expressions on a specific field.)
- **10000 &lt; $4 &amp;&amp; $4 &lt; 20000** will match lines whose fourth column value is larger than 10,000 but smaller than 20,000.
- If no pattern is specified, all lines match (meaning the **action** part will be executed on all lines).


**Action Examples**

- **{ print }** or **{ print $0 }** will print the entire input line (the line that matched in **pattern**). **$0** is a special marker meaning 'the entire line'.
- **{ print $1, $4, $5 }** will print only the first, fourth and fifth fields of the input line.
- **{ print $4, $5-$4 }** will print the fourth column and the difference between the fifth and fourth columns. (If the fourth column is the start position and the fifth column is the end position, the output will contain the start position and the length.)
- If no action part is specified (not even the curly brackets), the default action is to print the entire line.
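
**Complete Program Examples**

As a sketch of how a pattern and an action combine into a full program, the two programs below reuse the examples above. They assume, purely for illustration, an input in which the fourth and fifth columns hold start and end positions. Lines starting with # are AWK comments. Because the tool sets the output field separator to a tab, the printed values are separated by tabs.

Print the first column, the start position and the length, but only for lines where the length is larger than 23::

    # pattern: $5-$4>23, action: print three values
    $5-$4>23 { print $1, $4, $5-$4 }

Print every line whose second column is chr3 (no action is given, so the whole line is printed)::

    # pattern only, the default action prints the entire line
    $2 == "chr3"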


**AWK's Regular Expression Syntax**

A regular expression is a pattern describing a certain amount of text. AWK can select lines (or fields) that contain, or do not contain, a match to the given pattern.

- **( ) { } [ ] . * ? + \ ^ $** are all special characters. **\\** can be used to "escape" a special character, allowing that special character to be searched for.
- **^** matches the beginning of a string (but not an internal line).
- **(** .. **)** groups a particular pattern.
- **{** n or n, or n,m **}** specifies an expected number of repetitions of the preceding pattern.

  - **{n}** The preceding item is matched exactly n times.
  - **{n,}** The preceding item is matched n or more times.
  - **{n,m}** The preceding item is matched at least n times but not more than m times.

- **[** ... **]** creates a character class. Within the brackets, single characters can be placed. A dash (-) may be used to indicate a range such as **a-z**.
- **.** Matches any single character.
- ***** The preceding item will be matched zero or more times.
- **?** The preceding item is optional and matched at most once.
- **+** The preceding item will be matched one or more times.
- **^** has two meanings:

  - matches the beginning of a line or string.
  - indicates negation in a character class. For example, [^...] matches every character except the ones inside brackets.

- **$** matches the end of a line or string.
- **\|** Separates alternate possibilities.
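
Some patterns built from the operators above, as an illustrative sketch (the column numbers are arbitrary and do not refer to any particular file format). Each pattern is a separate program; with no action given, matching lines are printed unchanged::

    # first column is exactly chr followed by one or more digits
    $1 ~ /^chr[0-9]+$/

    # the line contains exon or intron
    /exon|intron/

    # third column does not start with a lowercase letter
    $3 !~ /^[a-z]/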


**Note**: AWK uses extended regular expression syntax, not Perl syntax. Perl shorthand classes such as **\\d** are **not** supported; use a bracket expression such as **[0-9]** or **[[:digit:]]** instead. (GNU awk accepts **\\w** and **\\s** as extensions, but they are not portable to other AWK implementations.)

    </help>
</tool>