comparison readme.rst @ 0:ec66f9d90ef0 draft
initial uploaded
| author | bgruening |
|---|---|
| date | Thu, 05 Sep 2013 04:58:21 -0400 |
| parents | |
| children | a4ad586d1403 |
| -1:000000000000 | 0:ec66f9d90ef0 |
|---|---|
| 1 These are Galaxy wrappers for common Unix text-processing tools | |
| 2 =============================================================== | |
| 3 | |
| 4 The initial work was done by Assaf Gordon and Greg Hannon's lab ( http://hannonlab.cshl.edu ) | |
| 5 at Cold Spring Harbor Laboratory ( http://www.cshl.edu ). | |
| 6 | |
| 7 | |
| 8 The tools are: | |
| 9 | |
| 10 * awk - The AWK programming language ( http://www.gnu.org/software/gawk/ ) | |
| 11 * sed - Stream Editor ( http://sed.sf.net ) | |
| 12 * grep - Search files ( http://www.gnu.org/software/grep/ ) | |
| 13 * sort_columns - Sorting every line according to its columns | |
| 14 * GNU Coreutils programs ( http://www.gnu.org/software/coreutils/ ): | |
| 15 * sort - sort files | |
| 16 * join - join two files, based on common key field. | |
| 17 * cut - keep/discard fields from a file | |
| 18 * unsorted_uniq - keep unique/duplicated lines in a file | |
| 19 * sorted_uniq - keep unique/duplicated lines in a file | |
| 20 * head - keep the first X lines in a file. | |
| 21 * tail - keep the last X lines in a file. | |
| 22 | |
| 23 A few improvements over the standard tools: | |
| 24 | |
| 25 * EasyJoin - A Join tool that does not require pre-sorted files ( https://github.com/agordon/filo/blob/scripts/src/scripts/easyjoin ) | |
| 26 * Multi-Join - Join multiple (>2) files ( https://github.com/agordon/filo/blob/scripts/src/scripts/multijoin ) | |
| 27 * Find_and_Replace - Find/Replace text in a line or specific column. | |
| 28 * Grep with Perl syntax - uses grep with Perl-Compatible regular expressions. | |
| 29 * HTML'd Grep - grep text in a file and produce highlighted HTML output, for easier viewing ( uses https://github.com/agordon/filo/blob/scripts/src/scripts/sort-header ) | |
| 30 | |
| 31 | |
| 32 Requirements | |
| 33 ------------ | |
| 34 | |
| 35 1. Coreutils version 8.19 or later. | |
| 36 2. AWK version 4.0.1 or later. | |
| 37 3. SED version 4.2 *with* a special patch | |
| 38 4. Grep with PCRE support | |
| 39 | |
| 40 These will be installed automatically by the Galaxy Tool Shed. | |
| 41 | |
| 42 | |
| 43 ------------------- | |
| 44 NOTE About Security | |
| 45 ------------------- | |
| 46 | |
| 47 The included tools are secure (barring unintentional bugs). | |
| 48 The main concern is executing system commands with awk's "system" function and sed's "e" command, | |
| 49 or reading/writing arbitrary files with awk's redirection and sed's "r"/"w" commands. | |
| 50 These commands are DISABLED by passing the "--sandbox" parameter to awk and sed. | |
| 51 | |
| 52 A user trying to run an awk program similar to: | |
| 53 BEGIN { system("ls") } | |
| 54 will get an error (in Galaxy) saying: | |
| 55 fatal: 'system' function not allowed in sandbox mode. | |
| 56 | |
| 57 A user trying to run a sed program similar to: | |
| 58 1els | |
| 59 will get an error (in Galaxy) saying: | |
| 60 sed: -e expression #1, char 2: e/r/w commands disabled in sandbox mode | |
| 61 | |
| 62 That being said, if you do find a vulnerability in these tools, please let me know and I'll try to fix it. | |
| 63 | |
| 64 ------------ | |
| 65 Installation | |
| 66 ------------ | |
| 67 | |
| 68 Installation should be done through the Galaxy `Tool Shed`_. | |
| 69 | |
| 70 .. _`Tool Shed`: http://wiki.galaxyproject.org/Tool%20Shed | |
| 71 | |
| 72 | |
| 73 ---- | |
| 74 TODO | |
| 75 ---- | |
| 76 | |
| 77 - unit-tests | |
| 78 - uniq will get a new --group function with the coreutils 8.22 release; it is currently commented out | |
| 79 - shuf will also get a major performance improvement with large files http://git.savannah.gnu.org/gitweb/?p=coreutils.git;a=commit;h=20d7bce0f7e57d9a98f0ee811e31c757e9fedfff | |
| 80 so we can remove the random feature from sort and use shuf instead | |
| 81 - move some advanced settings under a conditional; for example, the cut tool offers to cut bytes | |
| 82 | |
| 83 | |
| 84 | |
| 85 | |
| 86 |
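
The EasyJoin entry in the tool list describes a join that does not require pre-sorted files. The following is a minimal sketch of that idea, not the easyjoin script itself: it sorts both inputs on the key column and then calls plain `join`. The function name `join_unsorted`, the choice of tab as the delimiter, and joining on the first column are all assumptions made for illustration.

```shell
# Hypothetical helper (not part of the wrappers): join two tab-separated
# files on their first column without requiring pre-sorted input.
# Variables are not declared local, to stay within POSIX sh.
join_unsorted() {
    tmp_a=$(mktemp) || return 1
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 1; }
    tab=$(printf '\t')

    # sort and join must agree on collation order, so force the C locale
    # for all three commands.
    LC_ALL=C sort -t "$tab" -k1,1 "$1" > "$tmp_a"
    LC_ALL=C sort -t "$tab" -k1,1 "$2" > "$tmp_b"
    LC_ALL=C join -t "$tab" "$tmp_a" "$tmp_b"
    status=$?

    # Clean up the temporary sorted copies and report join's exit status.
    rm -f "$tmp_a" "$tmp_b"
    return $status
}
```

It could be called as `join_unsorted left.tsv right.tsv`, where both file names are placeholders. Sorting into temporary files keeps the original inputs untouched, at the cost of extra disk space for large files.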
