Mercurial > repos > jjohnson > defuse
comparison defuse_results_to_vcf.py @ 25:2ecf82136986
Define defuse.results.tsv ext as subclass of tabular, add defuse_results_to_vcf to generate vcf form DeFuse results
author | Jim Johnson <jj@umn.edu> |
---|---|
date | Fri, 09 Aug 2013 11:19:26 -0500 |
parents | |
children | d57fcac025e2 |
comparison
equal
deleted
inserted
replaced
24:2cbb285cc54e | 25:2ecf82136986 |
---|---|
1 #!/usr/bin/env python | |
2 """ | |
3 # | |
4 #------------------------------------------------------------------------------ | |
5 # University of Minnesota | |
6 # Copyright 2012, Regents of the University of Minnesota | |
7 #------------------------------------------------------------------------------ | |
8 # Author: | |
9 # | |
10 # James E Johnson | |
11 # Jesse Erdmann | |
12 # | |
13 #------------------------------------------------------------------------------ | |
14 """ | |
15 | |
16 | |
17 """ | |
18 This tool takes the defuse results.tsv tab-delimited file as input and creates a Variant Call Format file as output. | |
19 """ | |
20 | |
21 import sys,re,os.path | |
22 import optparse | |
23 from optparse import OptionParser | |
24 | |
25 """ | |
26 http://www.1000genomes.org/wiki/analysis/variant-call-format/vcf-variant-call-format-version-42 | |
27 | |
28 5. INFO keys used for structural variants | |
29 When the INFO keys reserved for encoding structural variants are used for imprecise variants, the values should be best estimates. When a key reflects a property of a single alt allele (e.g. SVLEN), then when there are multiple alt alleles there will be multiple values for the key corresponding to each alelle (e.g. SVLEN=-100,-110 for a deletion with two distinct alt alleles). | |
30 The following INFO keys are reserved for encoding structural variants. In general, when these keys are used by imprecise variants, the values should be best estimates. When a key reflects a property of a single alt allele (e.g. SVLEN), then when there are multiple alt alleles there will be multiple values for the key corresponding to each alelle (e.g. SVLEN=-100,-110 for a deletion with two distinct alt alleles). | |
31 ##INFO=<ID=IMPRECISE,Number=0,Type=Flag,Description="Imprecise structural variation"> | |
32 ##INFO=<ID=NOVEL,Number=0,Type=Flag,Description="Indicates a novel structural variation"> | |
33 ##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record"> | |
34 For precise variants, END is POS + length of REF allele - 1, and the for imprecise variants the corresponding best estimate. | |
35 ##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant"> | |
36 Value should be one of DEL, INS, DUP, INV, CNV, BND. This key can be derived from the REF/ALT fields but is useful for filtering. | |
37 ##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles"> | |
38 One value for each ALT allele. Longer ALT alleles (e.g. insertions) have positive values, shorter ALT alleles (e.g. deletions) have negative values. | |
39 ##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants"> | |
40 ##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants"> | |
41 ##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints"> | |
42 ##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Sequence of base pair identical micro-homology at event breakpoints"> | |
43 ##INFO=<ID=BKPTID,Number=.,Type=String,Description="ID of the assembled alternate allele in the assembly file"> | |
44 For precise variants, the consensus sequence the alternate allele assembly is derivable from the REF and ALT fields. However, the alternate allele assembly file may contain additional information about the characteristics of the alt allele contigs. | |
45 ##INFO=<ID=MEINFO,Number=4,Type=String,Description="Mobile element info of the form NAME,START,END,POLARITY"> | |
46 ##INFO=<ID=METRANS,Number=4,Type=String,Description="Mobile element transduction info of the form CHR,START,END,POLARITY"> | |
47 ##INFO=<ID=DGVID,Number=1,Type=String,Description="ID of this element in Database of Genomic Variation"> | |
48 ##INFO=<ID=DBVARID,Number=1,Type=String,Description="ID of this element in DBVAR"> | |
49 ##INFO=<ID=DBRIPID,Number=1,Type=String,Description="ID of this element in DBRIP"> | |
50 ##INFO=<ID=MATEID,Number=.,Type=String,Description="ID of mate breakends"> | |
51 ##INFO=<ID=PARID,Number=1,Type=String,Description="ID of partner breakend"> | |
52 ##INFO=<ID=EVENT,Number=1,Type=String,Description="ID of event associated to breakend"> | |
53 ##INFO=<ID=CILEN,Number=2,Type=Integer,Description="Confidence interval around the length of the inserted material between breakends"> | |
54 ##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth of segment containing breakend"> | |
55 ##INFO=<ID=DPADJ,Number=.,Type=Integer,Description="Read Depth of adjacency"> | |
56 ##INFO=<ID=CN,Number=1,Type=Integer,Description="Copy number of segment containing breakend"> | |
57 ##INFO=<ID=CNADJ,Number=.,Type=Integer,Description="Copy number of adjacency"> | |
58 ##INFO=<ID=CICN,Number=2,Type=Integer,Description="Confidence interval around copy number for the segment"> | |
59 ##INFO=<ID=CICNADJ,Number=.,Type=Integer,Description="Confidence interval around copy number for the adjacency"> | |
60 6. FORMAT keys used for structural variants | |
61 ##FORMAT=<ID=CN,Number=1,Type=Integer,Description="Copy number genotype for imprecise events"> | |
62 ##FORMAT=<ID=CNQ,Number=1,Type=Float,Description="Copy number genotype quality for imprecise events"> | |
63 ##FORMAT=<ID=CNL,Number=.,Type=Float,Description="Copy number genotype likelihood for imprecise events"> | |
64 ##FORMAT=<ID=NQ,Number=1,Type=Integer,Description="Phred style probability score that the variant is novel with respect to the genome's ancestor"> | |
65 ##FORMAT=<ID=HAP,Number=1,Type=Integer,Description="Unique haplotype identifier"> | |
66 ##FORMAT=<ID=AHAP,Number=1,Type=Integer,Description="Unique identifier of ancestral haplotype"> | |
67 These keys are analogous to GT/GQ/GL and are provided for genotyping imprecise events by copy number (either because there is an unknown number of alternate alleles or because the haplotypes cannot be determined). CN specifies the integer copy number of the variant in this sample. CNQ is encoded as a phred quality -10log_10p(copy number genotype call is wrong). CNL specifies a list of log10 likelihoods for each potential copy number, starting from zero. When possible, GT/GQ/GL should be used instead of (or in addition to) these keys. | |
68 | |
69 Specifying Complex Rearrangements with Breakends | |
70 An arbitrary rearrangement event can be summarized as a set of novel adjacencies. | |
71 Each adjacency ties together 2 breakends. The two breakends at either end of a novel adjacency are called mates. | |
72 There is one line of VCF (i.e. one record) for each of the two breakends in a novel adjacency. A breakend record is identified with the tag SYTYPE=BND" in the INFO field. The REF field of a breakend record indicates a base or sequence s of bases beginning at position POS, as in all VCF records. The ALT field of a breakend record indicates a replacement for s. This "breakend replacement" has three parts: | |
73 the string t that replaces places s. The string t may be an extended version of s if some novel bases are inserted during the formation of the novel adjacency. | |
74 The position p of the mate breakend, indicated by a string of the form "chr:pos". This is the location of the first mapped base in the piece being joined at this novel adjacency. | |
75 The direction that the joined sequence continues in, starting from p. This is indicated by the orientation of square brackets surrounding p. | |
76 These 3 elements are combined in 4 possible ways to create the ALT. In each of the 4 cases, the assertion is that s is replaced with t, and then some piece starting at position p is joined to t. The cases are: | |
77 REF ALT Meaning | |
78 s t[p[ piece extending to the right of p is joined after t | |
79 s t]p] reverse comp piece extending left of p is joined after t | |
80 s ]p]t piece extending to the left of p is joined before t | |
81 s [p[t reverse comp piece extending right of p is joined before t | |
82 | |
83 Examples: | |
84 #CHROM POS ID REF ALT QUAL FILT INFO | |
85 2 321681 bnd_W G G]17:198982] 6 PASS SVTYPE=BND;MATEID=bnd_Y | |
86 2 321682 bnd_V T ]13:123456]T 6 PASS SVTYPE=BND;MATEID=bnd_U | |
87 13 123456 bnd_U C C[2:321682[ 6 PASS SVTYPE=BND;MATEID=bnd_V | |
88 13 123457 bnd_X A [17:198983[A 6 PASS SVTYPE=BND;MATEID=bnd_Z | |
89 17 198982 bnd_Y A A]2:321681] 6 PASS SVTYPE=BND;MATEID=bnd_W | |
90 17 198983 bnd_Z C [13:123457[C 6 PASS SVTYPE=BND;MATEID=bnd_X | |
91 """ | |
92 | |
93 vcf_header = """\ | |
94 ##fileformat=VCFv4.1 | |
95 ##source=defuse | |
96 ##reference=%s | |
97 ##INFO=<ID=SVLEN,Number=.,Type=Integer,Description="Difference in length between REF and ALT alleles"> | |
98 ##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant"> | |
99 ##INFO=<ID=MATEID,Number=1,Type=String,Description="ID of the BND mate"> | |
100 ##INFO=<ID=DP,Number=1,Type=Integer,Description="Read Depth of segment containing breakend"> | |
101 ##INFO=<ID=SPLITCNT,Number=1,Type=Integer,Description="number of split reads supporting the prediction"> | |
102 ##INFO=<ID=SPANCNT,Number=1,Type=Integer,Description="number of spanning reads supporting the fusion"> | |
103 ##INFO=<ID=HOMLEN,Number=1,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints"> | |
104 ##INFO=<ID=SPLICESCORE,Number=1,Type=Integer,Description="number of nucleotides similar to GTAG at fusion splice"> | |
105 ##INFO=<ID=GENE,Number=2,Type=String,Description="Gene Names at each breakend"> | |
106 ##INFO=<ID=GENEID,Number=2,Type=String,Description="Gene IDs at each breakend"> | |
107 ##INFO=<ID=ORF,Number=0,Type=Flag,Description="fusion combines genes in a way that preserves a reading frame"> | |
108 ##INFO=<ID=EXONBND,Number=0,Type=Flag,Description="fusion splice at exon boundaries"> | |
109 #CHROM POS ID REF ALT QUAL FILTER INFO\ | |
110 """ | |
111 | |
112 def cmp_alphanumeric(s1,s2): | |
113 if s1 == s2: | |
114 return 0 | |
115 a1 = re.findall("\d+|[a-zA-Z]+",s1) | |
116 a2 = re.findall("\d+|[a-zA-Z]+",s2) | |
117 for i in range(min(len(a1),len(a2))): | |
118 if a1[i] == a2[i]: | |
119 continue | |
120 if a1[i].isdigit() and a2[i].isdigit(): | |
121 return int(a1[i]) - int(a2[i]) | |
122 return 1 if a1[i] > a2[i] else -1 | |
123 return len(a1) - len(a2) | |
124 | |
125 def __main__(): | |
126 # VCF functions | |
127 chr_dict = dict() | |
128 def add_vcf_line(chr,pos,id,line): | |
129 if chr not in chr_dict: | |
130 pos_dict = dict() | |
131 chr_dict[chr] = pos_dict | |
132 if pos not in chr_dict[chr]: | |
133 id_dict = dict() | |
134 chr_dict[chr][pos] = id_dict | |
135 chr_dict[chr][pos][id] = line | |
136 | |
137 def write_vcf(): | |
138 print >> outputFile, vcf_header % (refname) | |
139 for chr in sorted(chr_dict.keys(),cmp=cmp_alphanumeric): | |
140 for pos in sorted(chr_dict[chr].keys()): | |
141 for id in chr_dict[chr][pos]: | |
142 print >> outputFile, chr_dict[chr][pos][id] | |
143 #Parse Command Line | |
144 parser = optparse.OptionParser() | |
145 # files | |
146 parser.add_option( '-i', '--input', dest='input', help='The input defuse results.tsv file (else read from stdin)' ) | |
147 parser.add_option( '-o', '--output', dest='output', help='The output vcf file (else write to stdout)' ) | |
148 parser.add_option( '-r', '--reference', dest='reference', default=None, help='The genomic reference id' ) | |
149 (options, args) = parser.parse_args() | |
150 | |
151 # results.tsv input | |
152 if options.input != None: | |
153 try: | |
154 inputPath = os.path.abspath(options.input) | |
155 inputFile = open(inputPath, 'r') | |
156 except Exception, e: | |
157 print >> sys.stderr, "failed: %s" % e | |
158 exit(2) | |
159 else: | |
160 inputFile = sys.stdin | |
161 # vcf output | |
162 if options.output != None: | |
163 try: | |
164 outputPath = os.path.abspath(options.output) | |
165 outputFile = open(outputPath, 'w') | |
166 except Exception, e: | |
167 print >> sys.stderr, "failed: %s" % e | |
168 exit(3) | |
169 else: | |
170 outputFile = sys.stdout | |
171 | |
172 refname = options.reference if options.reference else 'unknown' | |
173 | |
174 svtype = 'SVTYPE=BND' | |
175 filt = 'PASS' | |
176 columns = [] | |
177 try: | |
178 for linenum,line in enumerate(inputFile): | |
179 ## print >> sys.stderr, "%d: %s\n" % (linenum,line) | |
180 fields = line.strip().split('\t') | |
181 if line.startswith('cluster_id'): | |
182 columns = fields | |
183 ## print >> sys.stderr, "columns: %s\n" % columns | |
184 continue | |
185 cluster_id = fields[columns.index('cluster_id')] | |
186 gene_chromosome1 = fields[columns.index('gene_chromosome1')] | |
187 gene_chromosome2 = fields[columns.index('gene_chromosome2')] | |
188 genomic_strand1 = fields[columns.index('genomic_strand1')] | |
189 genomic_strand2 = fields[columns.index('genomic_strand2')] | |
190 gene1 = fields[columns.index('gene1')] | |
191 gene2 = fields[columns.index('gene2')] | |
192 gene_info = 'GENEID=%s,%s' % (gene1,gene2) | |
193 gene_name1 = fields[columns.index('gene_name1')] | |
194 gene_name2 = fields[columns.index('gene_name2')] | |
195 gene_name_info = 'GENE=%s,%s' % (gene_name1,gene_name2) | |
196 genomic_break_pos1 = int(fields[columns.index('genomic_break_pos1')]) | |
197 genomic_break_pos2 = int(fields[columns.index('genomic_break_pos2')]) | |
198 breakpoint_homology = int(fields[columns.index('breakpoint_homology')]) | |
199 homlen = 'HOMLEN=%s' % breakpoint_homology | |
200 orf = fields[columns.index('orf')] == 'Y' | |
201 exonboundaries = fields[columns.index('exonboundaries')] == 'Y' | |
202 read_through = fields[columns.index('read_through')] == 'Y' | |
203 span_count = int(fields[columns.index('span_count')]) | |
204 splitr_count = int(fields[columns.index('splitr_count')]) | |
205 splice_score = int(fields[columns.index('splice_score')]) | |
206 probability = fields[columns.index('probability')] if columns.index('probability') else '.' | |
207 splitr_sequence = fields[columns.index('splitr_sequence')] | |
208 split_seqs = splitr_sequence.split('|') | |
209 mate_id1 = "bnd_%s_1" % cluster_id | |
210 mate_id2 = "bnd_%s_2" % cluster_id | |
211 ref1 = split_seqs[0][-1] | |
212 ref2 = split_seqs[1][0] | |
213 b1 = '[' if genomic_strand1 == '+' else ']' | |
214 b2 = '[' if genomic_strand2 == '+' else ']' | |
215 alt1 = "%s%s%s:%d%s" % (ref1,b2,gene_chromosome2,genomic_break_pos2,b2) | |
216 alt2 = "%s%s:%d%s%s" % (b1,gene_chromosome1,genomic_break_pos1,b1,ref2) | |
217 #TODO evaluate what should be included in the INFO field | |
218 info = ['DP=%d' % (span_count + splitr_count),'SPLITCNT=%d' % splitr_count,'SPANCNT=%d' % span_count,gene_name_info,gene_info,homlen,'SPLICESCORE=%d' % splice_score] | |
219 if orf: | |
220 info.append('ORF') | |
221 if exonboundaries: | |
222 info.append('EXONBND') | |
223 info1 = [svtype,'MATEID=%s' % mate_id2] + info | |
224 info2 = [svtype,'MATEID=%s' % mate_id1] + info | |
225 qual = int(float(fields[columns.index('probability')]) * 255) if columns.index('probability') else '.' | |
226 vcf1 = '%s\t%d\t%s\t%s\t%s\t%s\t%s\t%s'% (gene_chromosome1,genomic_break_pos1, mate_id1, ref1, alt1, qual, filt, ';'.join(info1) ) | |
227 vcf2 = '%s\t%d\t%s\t%s\t%s\t%s\t%s\t%s'% (gene_chromosome2,genomic_break_pos2, mate_id2, ref2, alt2, qual, filt, ';'.join(info2) ) | |
228 add_vcf_line(gene_chromosome1,genomic_break_pos1,mate_id1,vcf1) | |
229 add_vcf_line(gene_chromosome2,genomic_break_pos2,mate_id2,vcf2) | |
230 write_vcf() | |
231 except Exception, e: | |
232 print >> sys.stderr, "failed: %s" % e | |
233 exit(1) | |
234 | |
235 if __name__ == "__main__" : __main__() | |
236 |