Mercurial > repos > peterjc > tmhmm_and_signalp
annotate tools/protein_analysis/tmhmm2.py @ 29:3cb02adf4326 draft
v0.2.9 Python style improvements
author | peterjc |
---|---|
date | Wed, 01 Feb 2017 09:46:14 -0500 |
parents | 20139cb4c844 |
children | 6d9d7cdf00fc |
rev | line source |
---|---|
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
1 #!/usr/bin/env python |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
2 """Wrapper for TMHMM v2.0 for use in Galaxy. |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
3 |
7 | 4 This script takes exactly three command line arguments - number of threads, |
5 an input protein FASTA filename, and an output tabular filename. It then | |
6 calls the standalone TMHMM v2.0 program (not the webservice) requesting | |
7 the short output (one line per protein). | |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
8 |
2
747cec3192d3
Migrated tool version 0.0.5 from old tool shed archive to new tool shed repository
peterjc
parents:
1
diff
changeset
|
9 The first major feature is cleaning up the tabular output. The short form raw |
747cec3192d3
Migrated tool version 0.0.5 from old tool shed archive to new tool shed repository
peterjc
parents:
1
diff
changeset
|
10 output from TMHMM v2.0 looks like this (six columns tab separated): |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
11 |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
12 gi|2781234|pdb|1JLY|B len=304 ExpAA=0.01 First60=0.00 PredHel=0 Topology=o |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
13 gi|4959044|gb|AAD34209.1|AF069992_1 len=600 ExpAA=0.00 First60=0.00 PredHel=0 Topology=o |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
14 gi|671626|emb|CAA85685.1| len=473 ExpAA=0.19 First60=0.00 PredHel=0 Topology=o |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
15 gi|3298468|dbj|BAA31520.1| len=107 ExpAA=59.37 First60=31.17 PredHel=3 Topology=o23-45i52-74o89-106i |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
16 |
2
747cec3192d3
Migrated tool version 0.0.5 from old tool shed archive to new tool shed repository
peterjc
parents:
1
diff
changeset
|
17 If there are any additional 'comment' lines starting with the hash (#) |
747cec3192d3
Migrated tool version 0.0.5 from old tool shed archive to new tool shed repository
peterjc
parents:
1
diff
changeset
|
18 character these are ignored by this script. |
747cec3192d3
Migrated tool version 0.0.5 from old tool shed archive to new tool shed repository
peterjc
parents:
1
diff
changeset
|
19 |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
20 In order to make it easier to use in Galaxy, this wrapper script simplifies |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
21 this to remove the redundant tags, and instead adds a comment line at the |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
22 top with the column names: |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
23 |
29 | 24 #ID len ExpAA First60 PredHel Topology |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
25 gi|2781234|pdb|1JLY|B 304 0.01 60 0.00 0 o |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
26 gi|4959044|gb|AAD34209.1|AF069992_1 600 0.00 0 0.00 0 o |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
27 gi|671626|emb|CAA85685.1| 473 0.19 0.00 0 o |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
28 gi|3298468|dbj|BAA31520.1| 107 59.37 31.17 3 o23-45i52-74o89-106i |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
29 |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
30 The second major potential feature is taking advantage of multiple cores |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
31 (since TMHMM v2.0 itself is single threaded) by dividing the input FASTA file |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
32 into chunks and running multiple copies of TMHMM in parallel. I would normally |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
33 use Python's multiprocessing library in this situation but it requires at |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
34 least Python 2.6 and at the time of writing Galaxy still supports Python 2.4. |
1
9a8a7f680dd6
Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents:
0
diff
changeset
|
35 |
7 | 36 Note that this is somewhat redundant with job-splitting available in Galaxy |
37 itself (see the SignalP XML file for settings). | |
38 | |
1
9a8a7f680dd6
Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents:
0
diff
changeset
|
39 Also tmhmm2 can fail without returning an error code, for example if run on a |
9a8a7f680dd6
Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents:
0
diff
changeset
|
40 64 bit machine with only the 32 bit binaries installed. This script will spot |
9a8a7f680dd6
Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents:
0
diff
changeset
|
41 when there is no output from tmhmm2, and raise an error. |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
42 """ |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
43 import sys |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
44 import os |
7 | 45 import tempfile |
29 | 46 from seq_analysis_utils import split_fasta, run_jobs, thread_count |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
47 |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
48 FASTA_CHUNK = 500 |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
49 |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
50 if len(sys.argv) != 4: |
29 | 51 sys.exit("Require three arguments, number of threads (int), input protein FASTA file & output tabular file") |
7 | 52 |
53 num_threads = thread_count(sys.argv[1], default=4) | |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
54 fasta_file = sys.argv[2] |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
55 tabular_file = sys.argv[3] |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
56 |
7 | 57 tmp_dir = tempfile.mkdtemp() |
58 | |
29 | 59 |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
60 def clean_tabular(raw_handle, out_handle): |
1
9a8a7f680dd6
Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents:
0
diff
changeset
|
61 """Clean up tabular TMHMM output, returns output line count.""" |
9a8a7f680dd6
Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents:
0
diff
changeset
|
62 count = 0 |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
63 for line in raw_handle: |
2
747cec3192d3
Migrated tool version 0.0.5 from old tool shed archive to new tool shed repository
peterjc
parents:
1
diff
changeset
|
64 if not line.strip() or line.startswith("#"): |
29 | 65 # Ignore any blank lines or comment lines |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
66 continue |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
67 parts = line.rstrip("\r\n").split("\t") |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
68 try: |
29 | 69 identifier, length, exp_aa, first60, predhel, topology = parts |
70 except ValueError: | |
71 assert len(parts) != 6 | |
72 sys.exit("Bad line: %r" % line) | |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
73 assert length.startswith("len="), line |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
74 length = length[4:] |
29 | 75 assert exp_aa.startswith("ExpAA="), line |
76 exp_aa = exp_aa[6:] | |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
77 assert first60.startswith("First60="), line |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
78 first60 = first60[8:] |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
79 assert predhel.startswith("PredHel="), line |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
80 predhel = predhel[8:] |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
81 assert topology.startswith("Topology="), line |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
82 topology = topology[9:] |
29 | 83 out_handle.write("%s\t%s\t%s\t%s\t%s\t%s\n" |
84 % (identifier, length, exp_aa, first60, predhel, topology)) | |
1
9a8a7f680dd6
Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents:
0
diff
changeset
|
85 count += 1 |
9a8a7f680dd6
Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents:
0
diff
changeset
|
86 return count |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
87 |
29 | 88 # Note that if the input FASTA file contains no sequences, |
89 # split_fasta returns an empty list (i.e. zero temp files). | |
7 | 90 fasta_files = split_fasta(fasta_file, os.path.join(tmp_dir, "tmhmm"), FASTA_CHUNK) |
29 | 91 temp_files = [f + ".out" for f in fasta_files] |
2
747cec3192d3
Migrated tool version 0.0.5 from old tool shed archive to new tool shed repository
peterjc
parents:
1
diff
changeset
|
92 jobs = ["tmhmm -short %s > %s" % (fasta, temp) |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
93 for fasta, temp in zip(fasta_files, temp_files)] |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
94 |
29 | 95 |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
96 def clean_up(file_list): |
29 | 97 """Remove temp files, and if possible the temp directory.""" |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
98 for f in file_list: |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
99 if os.path.isfile(f): |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
100 os.remove(f) |
7 | 101 try: |
102 os.rmdir(tmp_dir) | |
29 | 103 except Exception: |
7 | 104 pass |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
105 |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
106 if len(jobs) > 1 and num_threads > 1: |
29 | 107 # A small "info" message for Galaxy to show the user. |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
108 print "Using %i threads for %i tasks" % (min(num_threads, len(jobs)), len(jobs)) |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
109 results = run_jobs(jobs, num_threads) |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
110 for fasta, temp, cmd in zip(fasta_files, temp_files, jobs): |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
111 error_level = results[cmd] |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
112 if error_level: |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
113 try: |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
114 output = open(temp).readline() |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
115 except IOError: |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
116 output = "" |
7 | 117 clean_up(fasta_files + temp_files) |
29 | 118 sys.exit("One or more tasks failed, e.g. %i from %r gave:\n%s" % (error_level, cmd, output), |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
119 error_level) |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
120 del results |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
121 del jobs |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
122 |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
123 out_handle = open(tabular_file, "w") |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
124 out_handle.write("#ID\tlen\tExpAA\tFirst60\tPredHel\tTopology\n") |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
125 for temp in temp_files: |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
126 data_handle = open(temp) |
1
9a8a7f680dd6
Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents:
0
diff
changeset
|
127 count = clean_tabular(data_handle, out_handle) |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
128 data_handle.close() |
1
9a8a7f680dd6
Migrated tool version 0.0.3 from old tool shed archive to new tool shed repository
peterjc
parents:
0
diff
changeset
|
129 if not count: |
7 | 130 clean_up(fasta_files + temp_files) |
29 | 131 sys.exit("No output from tmhmm2") |
0
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
132 out_handle.close() |
a2eeeaa6f75e
Migrated tool version 0.0.1 from old tool shed archive to new tool shed repository
peterjc
parents:
diff
changeset
|
133 |
7 | 134 clean_up(fasta_files + temp_files) |