annotate tools/protein_analysis/psortb.py @ 30:6d9d7cdf00fc draft

v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
author peterjc
date Thu, 21 Sep 2017 11:15:55 -0400
parents 3cb02adf4326
children 20da7f48b56f
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
1 #!/usr/bin/env python
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
2 """Wrapper for psortb for use in Galaxy.
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
3
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
4 This script takes exactly six command line arguments - which includes the
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
5 number of threads, and the input protein FASTA filename and output
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
6 tabular filename. It then splits up the FASTA input and calls multiple
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
7 copies of the standalone psortb v3 program, then collates the output.
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
8 e.g. Rather than this,
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
9
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
10 psort $type -c $cutoff -d $divergent -o long $sequence > $outfile
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
11
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
12 Call this:
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
13
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
14 psort $threads $type $cutoff $divergent $sequence $outfile
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
15
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
16 If ommitting -c or -d options, set $cutoff and $divergent to zero or blank.
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
17
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
18 Note that this is somewhat redundant with job-splitting available in Galaxy
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
19 itself (see the SignalP XML file for settings), but both can be applied.
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
20
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
21 Additionally it ensures the header line (with the column names) starts
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
22 with a # character as used elsewhere in Galaxy.
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
23 """
30
6d9d7cdf00fc v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
peterjc
parents: 29
diff changeset
24
6d9d7cdf00fc v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
peterjc
parents: 29
diff changeset
25 from __future__ import print_function
6d9d7cdf00fc v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
peterjc
parents: 29
diff changeset
26
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
27 import os
30
6d9d7cdf00fc v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
peterjc
parents: 29
diff changeset
28 import sys
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
29 import tempfile
30
6d9d7cdf00fc v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
peterjc
parents: 29
diff changeset
30
6d9d7cdf00fc v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
peterjc
parents: 29
diff changeset
31 from seq_analysis_utils import run_jobs, split_fasta, thread_count
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
32
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
33 FASTA_CHUNK = 500
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
34
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
35 if "-v" in sys.argv or "--version" in sys.argv:
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
36 """Return underlying PSORTb's version"""
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
37 sys.exit(os.system("psort --version"))
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
38
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
39 if len(sys.argv) != 8:
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
40 sys.exit("Require 7 arguments, number of threads (int), type (e.g. archaea), "
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
41 "output (e.g. terse/normal/long), cutoff, divergent, input protein "
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
42 "FASTA file & output tabular file")
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
43
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
44 num_threads = thread_count(sys.argv[1], default=4)
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
45 org_type = sys.argv[2]
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
46 out_type = sys.argv[3]
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
47 cutoff = sys.argv[4]
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
48 if cutoff.strip() and float(cutoff.strip()) != 0.0:
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
49 cutoff = "-c %s" % cutoff
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
50 else:
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
51 cutoff = ""
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
52 divergent = sys.argv[5]
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
53 if divergent.strip() and float(divergent.strip()) != 0.0:
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
54 divergent = "-d %s" % divergent
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
55 else:
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
56 divergent = ""
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
57 fasta_file = sys.argv[6]
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
58 tabular_file = sys.argv[7]
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
59
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
60 if out_type == "terse":
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
61 header = ['SeqID', 'Localization', 'Score']
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
62 elif out_type == "normal":
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
63 sys.exit("Normal output not implemented yet, sorry.")
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
64 elif out_type == "long":
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
65 if org_type == "-n":
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
66 # Gram negative bacteria
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
67 header = ['SeqID', 'CMSVM-_Localization', 'CMSVM-_Details', 'CytoSVM-_Localization', 'CytoSVM-_Details',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
68 'ECSVM-_Localization', 'ECSVM-_Details', 'ModHMM-_Localization', 'ModHMM-_Details',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
69 'Motif-_Localization', 'Motif-_Details', 'OMPMotif-_Localization', 'OMPMotif-_Details',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
70 'OMSVM-_Localization', 'OMSVM-_Details', 'PPSVM-_Localization', 'PPSVM-_Details',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
71 'Profile-_Localization', 'Profile-_Details',
30
6d9d7cdf00fc v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
peterjc
parents: 29
diff changeset
72 'SCL-BLAST-_Localization', 'SCL-BLAST-_Details',
6d9d7cdf00fc v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
peterjc
parents: 29
diff changeset
73 'SCL-BLASTe-_Localization', 'SCL-BLASTe-_Details',
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
74 'Signal-_Localization', 'Signal-_Details',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
75 'Cytoplasmic_Score', 'CytoplasmicMembrane_Score', 'Periplasmic_Score', 'OuterMembrane_Score',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
76 'Extracellular_Score', 'Final_Localization', 'Final_Localization_Details', 'Final_Score',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
77 'Secondary_Localization', 'PSortb_Version']
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
78 elif org_type == "-p":
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
79 # Gram positive bacteria
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
80 header = ['SeqID', 'CMSVM+_Localization', 'CMSVM+_Details', 'CWSVM+_Localization', 'CWSVM+_Details',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
81 'CytoSVM+_Localization', 'CytoSVM+_Details', 'ECSVM+_Localization', 'ECSVM+_Details',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
82 'ModHMM+_Localization', 'ModHMM+_Details', 'Motif+_Localization', 'Motif+_Details',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
83 'Profile+_Localization', 'Profile+_Details',
30
6d9d7cdf00fc v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
peterjc
parents: 29
diff changeset
84 'SCL-BLAST+_Localization', 'SCL-BLAST+_Details',
6d9d7cdf00fc v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
peterjc
parents: 29
diff changeset
85 'SCL-BLASTe+_Localization', 'SCL-BLASTe+_Details',
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
86 'Signal+_Localization', 'Signal+_Details',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
87 'Cytoplasmic_Score', 'CytoplasmicMembrane_Score', 'Cellwall_Score',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
88 'Extracellular_Score', 'Final_Localization', 'Final_Localization_Details', 'Final_Score',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
89 'Secondary_Localization', 'PSortb_Version']
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
90 elif org_type == "-a":
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
91 # Archaea
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
92 header = ['SeqID', 'CMSVM_a_Localization', 'CMSVM_a_Details', 'CWSVM_a_Localization', 'CWSVM_a_Details',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
93 'CytoSVM_a_Localization', 'CytoSVM_a_Details', 'ECSVM_a_Localization', 'ECSVM_a_Details',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
94 'ModHMM_a_Localization', 'ModHMM_a_Details', 'Motif_a_Localization', 'Motif_a_Details',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
95 'Profile_a_Localization', 'Profile_a_Details',
30
6d9d7cdf00fc v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
peterjc
parents: 29
diff changeset
96 'SCL-BLAST_a_Localization', 'SCL-BLAST_a_Details',
6d9d7cdf00fc v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
peterjc
parents: 29
diff changeset
97 'SCL-BLASTe_a_Localization', 'SCL-BLASTe_a_Details',
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
98 'Signal_a_Localization', 'Signal_a_Details',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
99 'Cytoplasmic_Score', 'CytoplasmicMembrane_Score', 'Cellwall_Score',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
100 'Extracellular_Score', 'Final_Localization', 'Final_Localization_Details', 'Final_Score',
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
101 'Secondary_Localization', 'PSortb_Version']
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
102 else:
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
103 sys.exit("Expected -n, -p or -a for the organism type, not %r" % org_type)
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
104 else:
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
105 sys.exit("Expected terse, normal or long for the output type, not %r" % out_type)
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
106
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
107 tmp_dir = tempfile.mkdtemp()
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
108
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
109
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
110 def clean_tabular(raw_handle, out_handle):
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
111 """Clean up tabular TMHMM output, returns output line count."""
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
112 global header
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
113 count = 0
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
114 for line in raw_handle:
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
115 if not line.strip() or line.startswith("#"):
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
116 # Ignore any blank lines or comment lines
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
117 continue
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
118 parts = [x.strip() for x in line.rstrip("\r\n").split("\t")]
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
119 if parts == header:
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
120 # Ignore the header line
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
121 continue
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
122 if not parts[-1] and len(parts) == len(header) + 1:
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
123 # Ignore dummy blank extra column, e.g.
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
124 # "...2.0\t\tPSORTb version 3.0\t\n"
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
125 parts = parts[:-1]
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
126 assert len(parts) == len(header), \
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
127 "%i fields, not %i, in line:\n%r" % (len(line), len(header), line)
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
128 out_handle.write(line)
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
129 count += 1
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
130 return count
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
131
30
6d9d7cdf00fc v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
peterjc
parents: 29
diff changeset
132
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
133 # Note that if the input FASTA file contains no sequences,
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
134 # split_fasta returns an empty list (i.e. zero temp files).
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
135 fasta_files = split_fasta(fasta_file, os.path.join(tmp_dir, "tmhmm"), FASTA_CHUNK)
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
136 temp_files = [f + ".out" for f in fasta_files]
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
137 jobs = ["psort %s %s %s -o %s %s > %s" % (org_type, cutoff, divergent, out_type, fasta, temp)
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
138 for fasta, temp in zip(fasta_files, temp_files)]
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
139
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
140
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
141 def clean_up(file_list):
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
142 for f in file_list:
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
143 if os.path.isfile(f):
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
144 os.remove(f)
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
145 try:
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
146 os.rmdir(tmp_dir)
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
147 except Exception:
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
148 pass
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
149
30
6d9d7cdf00fc v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
peterjc
parents: 29
diff changeset
150
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
151 if len(jobs) > 1 and num_threads > 1:
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
152 # A small "info" message for Galaxy to show the user.
30
6d9d7cdf00fc v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
peterjc
parents: 29
diff changeset
153 print("Using %i threads for %i tasks" % (min(num_threads, len(jobs)), len(jobs)))
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
154 results = run_jobs(jobs, num_threads)
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
155 for fasta, temp, cmd in zip(fasta_files, temp_files, jobs):
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
156 error_level = results[cmd]
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
157 if error_level:
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
158 try:
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
159 output = open(temp).readline()
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
160 except IOError:
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
161 output = ""
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
162 clean_up(fasta_files + temp_files)
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
163 sys.exit("One or more tasks failed, e.g. %i from %r gave:\n%s" % (error_level, cmd, output),
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
164 error_level)
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
165 del results
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
166 del jobs
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
167
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
168 out_handle = open(tabular_file, "w")
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
169 out_handle.write("#%s\n" % "\t".join(header))
11
3d74c1176d67 Uploaded minor fix
peterjc
parents: 8
diff changeset
170 count = 0
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
171 for temp in temp_files:
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
172 data_handle = open(temp)
11
3d74c1176d67 Uploaded minor fix
peterjc
parents: 8
diff changeset
173 count += clean_tabular(data_handle, out_handle)
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
174 data_handle.close()
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
175 if not count:
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
176 clean_up(fasta_files + temp_files)
29
3cb02adf4326 v0.2.9 Python style improvements
peterjc
parents: 26
diff changeset
177 sys.exit("No output from psortb")
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
178 out_handle.close()
30
6d9d7cdf00fc v0.2.11 Job splitting fast-fail; RXLR tools supports HMMER2 from BioConda; Capture more version information; misc internal changes
peterjc
parents: 29
diff changeset
179 print("%i records" % count)
8
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
180
391a142c1e60 Uploaded
peterjc
parents:
diff changeset
181 clean_up(fasta_files + temp_files)