annotate tools/protein_analysis/signalp3.py @ 30:6d9d7cdf00fc draft

v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
author peterjc
date Thu, 21 Sep 2017 11:15:55 -0400
parents 3cb02adf4326
children 20da7f48b56f
#!/usr/bin/env python
"""Wrapper for SignalP v3.0 for use in Galaxy.

This script takes exactly five command line arguments:
 * the organism type (euk, gram+ or gram-)
 * length to truncate sequences to (integer)
 * number of threads to use (integer, defaults to one)
 * an input protein FASTA filename
 * output tabular filename.

There are two further optional arguments:
 * cut type (NN_Cmax, NN_Ymax, NN_Smax or HMM_Cmax)
 * output GFF3 filename

It then calls the standalone SignalP v3.0 program (not the webservice)
requesting the short output (one line per protein) using both NN and HMM
for predictions.

The first major feature is cleaning up the output. The raw output from SignalP
v3.0 looks like this (21 columns space separated):

# SignalP-NN euk predictions # SignalP-HMM euk predictions
# name Cmax pos ? Ymax pos ? Smax pos ? Smean ? D ? # name ! Cmax pos ? Sprob ?
gi|2781234|pdb|1JLY| 0.061 17 N 0.043 17 N 0.199 1 N 0.067 N 0.055 N gi|2781234|pdb|1JLY|B Q 0.000 17 N 0.000 N
gi|4959044|gb|AAD342 0.099 191 N 0.012 38 N 0.023 12 N 0.014 N 0.013 N gi|4959044|gb|AAD34209.1|AF069992_1 Q 0.000 0 N 0.000 N
gi|671626|emb|CAA856 0.139 381 N 0.020 8 N 0.121 4 N 0.067 N 0.044 N gi|671626|emb|CAA85685.1| Q 0.000 0 N 0.000 N
gi|3298468|dbj|BAA31 0.208 24 N 0.184 38 N 0.980 32 Y 0.613 Y 0.398 N gi|3298468|dbj|BAA31520.1| Q 0.066 24 N 0.139 N

In order to make it easier to use in Galaxy, this wrapper script reformats
this to use tab separators. Also it removes the redundant truncated name
column, and assigns unique column names in the header:

#ID NN_Cmax_score NN_Cmax_pos NN_Cmax_pred NN_Ymax_score NN_Ymax_pos NN_Ymax_pred NN_Smax_score NN_Smax_pos NN_Smax_pred NN_Smean_score NN_Smean_pred NN_D_score NN_D_pred HMM_type HMM_Cmax_score HMM_Cmax_pos HMM_Cmax_pred HMM_Sprob_score HMM_Sprob_pred
gi|2781234|pdb|1JLY|B 0.061 17 N 0.043 17 N 0.199 1 N 0.067 N 0.055 N Q 0.000 17 N 0.000 N
gi|4959044|gb|AAD34209.1|AF069992_1 0.099 191 N 0.012 38 N 0.023 12 N 0.014 N 0.013 N Q 0.000 0 N 0.000 N
gi|671626|emb|CAA85685.1| 0.139 381 N 0.020 8 N 0.121 4 N 0.067 N 0.044 N Q 0.000 0 N 0.000 N
gi|3298468|dbj|BAA31520.1| 0.208 24 N 0.184 38 N 0.980 32 Y 0.613 Y 0.398 N Q 0.066 24 N 0.139 N

The second major feature is overcoming SignalP's built-in limit of 4000
sequences by breaking up the input FASTA file into chunks. This also allows
us to pre-trim the sequences since SignalP only needs their starts.

The third major feature is taking advantage of multiple cores (since SignalP
v3.0 itself is single-threaded) by using the individual FASTA input files to
run multiple copies of SignalP in parallel. I would normally use Python's
multiprocessing library in this situation but it requires at least Python 2.6
and at the time of writing Galaxy still supports Python 2.4.

Note that this is somewhat redundant with job-splitting available in Galaxy
itself (see the SignalP XML file for settings).

Finally, you can opt to have a GFF3 file produced which will describe the
predicted signal peptide and mature peptide for each protein (using one of
the predictors which gives a cleavage site). *WORK IN PROGRESS*
"""  # noqa: E501

from __future__ import print_function

import os
import sys
import tempfile

from seq_analysis_utils import fasta_iterator, split_fasta
from seq_analysis_utils import run_jobs, thread_count

FASTA_CHUNK = 500
MAX_LEN = 6000  # Found by trial and error

# Report the wrapper version and the SignalP version, then stop.
if "-v" in sys.argv or "--version" in sys.argv:
    print("SignalP Galaxy wrapper version 0.0.19")
    sys.exit(os.system("signalp -version"))

if len(sys.argv) not in [6, 8]:
    sys.exit("Require five (or seven) arguments: organism, truncate, threads, "
             "input protein FASTA file & output tabular file (plus "
             "optionally cut method and GFF3 output file). "
             "Got %i arguments." % (len(sys.argv) - 1))

organism = sys.argv[1]
if organism not in ["euk", "gram+", "gram-"]:
    sys.exit("Organism argument %s is not one of euk, gram+ or gram-" % organism)

# A non-integer truncate argument is treated as zero.
try:
    truncate = int(sys.argv[2])
except ValueError:
    truncate = 0
if truncate < 0:
    sys.exit("Truncate argument %s is not a positive integer (or zero)" % sys.argv[2])

num_threads = thread_count(sys.argv[3], default=4)
fasta_file = sys.argv[4]
tabular_file = sys.argv[5]

# The cut method and GFF3 filename are only given together.
if len(sys.argv) == 8:
    cut_method = sys.argv[6]
    if cut_method not in ["NN_Cmax", "NN_Ymax", "NN_Smax", "HMM_Cmax"]:
        sys.exit("Invalid cut method %r" % cut_method)
    gff3_file = sys.argv[7]
else:
    cut_method = None
    gff3_file = None


tmp_dir = tempfile.mkdtemp()


def clean_tabular(raw_handle, out_handle, gff_handle=None):
    """Clean up SignalP output to make it tabular."""
    for line in raw_handle:
        # Skip blank lines (which still carry a newline) and "#" header lines
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.rstrip("\r\n").split()
        assert len(parts) == 21, repr(line)
        assert parts[14].startswith(parts[0]), \
            "Bad entry in SignalP output, ID mismatch:\n%r" % line
        # Remove redundant truncated name column (col 0)
        # and put full name at start (col 14)
        parts = parts[14:15] + parts[1:14] + parts[15:]
        out_handle.write("\t".join(parts) + "\n")
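
As an illustration of the column reordering in clean_tabular, the sketch below applies the same slicing to the first sample line from the module docstring. The helper name `reorder_columns` is made up for this example and is not part of the wrapper.

```python
# Raw SignalP v3.0 short-format line, copied from the module docstring.
RAW_LINE = (
    "gi|2781234|pdb|1JLY| 0.061 17 N 0.043 17 N 0.199 1 N 0.067 N 0.055 N "
    "gi|2781234|pdb|1JLY|B Q 0.000 17 N 0.000 N"
)


def reorder_columns(line):
    """Return the 20 cleaned fields for one raw 21-column line."""
    parts = line.split()
    if len(parts) != 21:
        raise ValueError("Expected 21 columns, got %i" % len(parts))
    if not parts[14].startswith(parts[0]):
        raise ValueError("Truncated and full names disagree")
    # Full name (column 14) first, then the NN columns, then the HMM columns.
    return parts[14:15] + parts[1:14] + parts[15:]


cleaned = reorder_columns(RAW_LINE)
print("\t".join(cleaned))
```

The truncated name in column 0 is dropped, so each cleaned row has 20 fields, matching the 20 header names written to the tabular file.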


def make_gff(fasta_file, tabular_file, gff_file, cut_method):
    """Make a GFF file."""
    # Column indices (cut position, score) in the cleaned tabular output
    cut_col, score_col = {"NN_Cmax": (2, 1),
                          "NN_Ymax": (5, 4),
                          "NN_Smax": (8, 7),
                          "HMM_Cmax": (16, 15),
                          }[cut_method]

    source = "SignalP"
    strand = "."  # not stranded
    phase = "."  # not phased
    tags = "Note=%s" % cut_method

    tab_handle = open(tabular_file)
    line = tab_handle.readline()
    assert line.startswith("#ID\t"), line

    gff_handle = open(gff_file, "w")
    gff_handle.write("##gff-version 3\n")

    for (title, seq), line in zip(fasta_iterator(fasta_file), tab_handle):
        parts = line.rstrip("\n").split("\t")
        seqid = parts[0]
        assert title.startswith(seqid), "%s vs %s" % (seqid, title)
        if not seq:
            # Is it possible to have a zero length reference in GFF3?
            continue
        cut = int(parts[cut_col])
        if cut == 0:
            assert cut_method == "HMM_Cmax", cut_method
            # TODO - Why does it do this?
            cut = 1
        assert 1 <= cut <= len(seq), "%i for %s len %i" % (cut, seqid, len(seq))
        score = parts[score_col]
        gff_handle.write("##sequence-region %s %i %i\n"
                         % (seqid, 1, len(seq)))
        # If the cut is at the very beginning, there is no signal peptide!
        if cut > 1:
            # signal_peptide = SO:0000418
            gff_handle.write("%s\t%s\t%s\t%i\t%i\t%s\t%s\t%s\t%s\n"
                             % (seqid, source,
                                "signal_peptide", 1, cut - 1,
                                score, strand, phase, tags))
        # mature_protein_region = SO:0000419
        gff_handle.write("%s\t%s\t%s\t%i\t%i\t%s\t%s\t%s\t%s\n"
                         % (seqid, source,
                            "mature_protein_region", cut, len(seq),
                            score, strand, phase, tags))
    tab_handle.close()
    gff_handle.close()
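
To make the coordinate logic in make_gff easier to follow, here is a minimal sketch that builds the feature lines for a single protein. The helper `gff3_features` and the example values (`protein1`, a length of 120, a cut at 25, a score of 0.913) are invented for illustration; only the column layout and the `1..cut-1` / `cut..length` split are taken from the function itself.

```python
def gff3_features(seqid, seq_len, cut, score, cut_method, source="SignalP"):
    """Return GFF3 feature lines for one protein (illustrative helper)."""
    # Strand and phase are "." because protein features have neither.
    template = "%s\t%s\t%s\t%i\t%i\t%s\t.\t.\tNote=%s"
    lines = []
    if cut > 1:
        # Residues before the cut position form the signal peptide.
        lines.append(template % (seqid, source, "signal_peptide",
                                 1, cut - 1, score, cut_method))
    # The mature protein runs from the cut position to the end.
    lines.append(template % (seqid, source, "mature_protein_region",
                             cut, seq_len, score, cut_method))
    return lines


for feature in gff3_features("protein1", 120, 25, "0.913", "NN_Cmax"):
    print(feature)
```

With a cut position of 1 the helper returns only the mature_protein_region line, mirroring the `if cut > 1` test in make_gff.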


fasta_files = split_fasta(fasta_file, os.path.join(tmp_dir, "signalp"),
                          n=FASTA_CHUNK, truncate=truncate, max_len=MAX_LEN)
temp_files = [f + ".out" for f in fasta_files]
assert len(fasta_files) == len(temp_files)
jobs = ["signalp -short -t %s %s > %s" % (organism, fasta, temp)
        for (fasta, temp) in zip(fasta_files, temp_files)]
assert len(fasta_files) == len(temp_files) == len(jobs)
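
The split_fasta helper lives in seq_analysis_utils and is not shown on this page. As a rough sketch of the chunk-and-truncate idea described in the module docstring, the hypothetical generator below groups (title, sequence) pairs into fixed-size chunks and trims each sequence. It does not write files and ignores the max_len check, so it should not be read as the real implementation.

```python
def chunk_records(records, chunk_size, truncate=0):
    """Yield lists of at most chunk_size (title, sequence) pairs.

    Hypothetical helper for illustration only. If truncate is positive,
    each sequence is cut down to its first `truncate` residues.
    """
    chunk = []
    for title, seq in records:
        if truncate > 0:
            seq = seq[:truncate]
        chunk.append((title, seq))
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        # Final, possibly shorter, chunk
        yield chunk


records = [("seq%i" % i, "MKT" * 40) for i in range(5)]
chunks = list(chunk_records(records, chunk_size=2, truncate=70))
print([len(c) for c in chunks])  # prints [2, 2, 1]
```

Keeping each chunk well under 4000 records (the wrapper uses FASTA_CHUNK = 500) both avoids the SignalP limit and gives enough separate jobs to spread over several threads.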


def clean_up(file_list):
    """Remove temp files, and if possible the temp directory."""
    for f in file_list:
        if os.path.isfile(f):
            os.remove(f)
    try:
        os.rmdir(tmp_dir)
    except Exception:
        pass


if len(jobs) > 1 and num_threads > 1:
    # A small "info" message for Galaxy to show the user.
    print("Using %i threads for %i tasks" % (min(num_threads, len(jobs)), len(jobs)))
results = run_jobs(jobs, num_threads)
assert len(fasta_files) == len(temp_files) == len(jobs)
for fasta, temp, cmd in zip(fasta_files, temp_files, jobs):
    error_level = results[cmd]
    try:
        # Only the first line is needed to spot an error message
        with open(temp) as handle:
            output = handle.readline()
    except IOError:
        output = "(no output)"
    if error_level or output.lower().startswith("error running"):
        clean_up(fasta_files + temp_files)
        if output:
            sys.stderr.write("One or more tasks failed, e.g. %i from %r gave:\n%s" % (error_level, cmd, output))
        else:
            sys.stderr.write("One or more tasks failed, e.g. %i from %r with no output\n" % (error_level, cmd))
        # Exit non-zero even if SignalP itself returned zero
        sys.exit(error_level or 1)
del results
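
The run_jobs function is imported from seq_analysis_utils and its implementation is not shown here. The loop above only relies on it returning a mapping from each command string to that command's return code. A minimal stand-in with that contract, assuming a modern Python and using a thread pool, might look like the following; the name `run_shell_jobs` is hypothetical.

```python
import subprocess
import sys
from multiprocessing.pool import ThreadPool


def run_shell_jobs(commands, threads=1):
    """Run shell commands in parallel, returning {command: return code}.

    Hypothetical stand-in for illustration, not the seq_analysis_utils code.
    """
    pool = ThreadPool(max(1, min(threads, len(commands))))
    try:
        # Each worker thread just waits on one child process at a time.
        codes = pool.map(lambda cmd: subprocess.call(cmd, shell=True), commands)
    finally:
        pool.close()
        pool.join()
    return dict(zip(commands, codes))


# Two trivial commands: one succeeds, one exits with status 3.
ok_cmd = '"%s" -c "import sys; sys.exit(0)"' % sys.executable
bad_cmd = '"%s" -c "import sys; sys.exit(3)"' % sys.executable
demo_results = run_shell_jobs([ok_cmd, bad_cmd], threads=2)
print(demo_results[ok_cmd], demo_results[bad_cmd])
```

Threads are enough here because the real work happens in the child processes, so the Python side spends its time waiting rather than computing.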

out_handle = open(tabular_file, "w")
fields = ["ID"]
# NN results:
for name in ["Cmax", "Ymax", "Smax"]:
    fields.extend(["NN_%s_score" % name, "NN_%s_pos" % name, "NN_%s_pred" % name])
fields.extend(["NN_Smean_score", "NN_Smean_pred", "NN_D_score", "NN_D_pred"])
# HMM results:
fields.extend(["HMM_type", "HMM_Cmax_score", "HMM_Cmax_pos", "HMM_Cmax_pred",
               "HMM_Sprob_score", "HMM_Sprob_pred"])
out_handle.write("#" + "\t".join(fields) + "\n")
for temp in temp_files:
    data_handle = open(temp)
    clean_tabular(data_handle, out_handle)
    data_handle.close()
out_handle.close()

# GFF3:
if cut_method:
    make_gff(fasta_file, tabular_file, gff3_file, cut_method)

clean_up(fasta_files + temp_files)