annotate igparse.pl @ 3:8b026af4b298 draft default tip

Uploaded
author davidvanzessen
date Tue, 11 Mar 2014 10:39:04 -0400
parents 3cce6e04b1e8
children
Ignore whitespace changes - Everywhere: Within whitespace: At end of lines:
rev   line source
0
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1 #!/usr/bin/perl
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
2 =head1 IGBLAST_simple.pl
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
3
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
4 This version (1.4) has been heavily adapted since the original program was first created back in October 2012.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
5 Bas Horsman (EMC, Rotterdam, The Netherlands) has contributed with minor - though important - code changes.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
6
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
7 From V 1.2 onwards a 'Change Log' is included at the end of the program
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
8
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
9 =head2 Usage
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
10
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
11 Requires no modules in general use; the Data::Dumper (supplied as part of the Perl Core module set) might be useful for debugging/adjustment
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
12 as it allows inspection of the data stores.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
13
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
14 The program takes a text file of the
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
15
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
16 ./IGBLAST_simple.pl igBLASTOutput.txt <-optional: index of record to process->
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
17
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
18 Supply the text version of the igBLAST report in the format as in the example below.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
19 The extra command line arugment is the record number (aka. BLAST report) to process.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
20 If 0 or absent all are processed, if supplied that record (base 1) is processed and the program dies afterwards.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
21
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
22 =head2 Example Input
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
23
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
24 A standard igBLAST record or set of them in a file; this being typical:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
25
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
26 BLASTN 2.2.27+
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
27
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
28
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
29 Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
30 Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
31 Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
32 protein database search programs", Nucleic Acids Res. 25:3389-3402.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
33
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
34
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
35
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
36 Database: human_gl_V; human_gl_D; human_gl_J
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
37 674 sequences; 179,480 total letters
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
38
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
39
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
40
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
41 Query= HL67IUI01D26LR length=433 xy=1559_1437 region=1
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
42 run=R_2012_04_10_11_57_56_
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
43
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
44 Length=433
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
45 Score E
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
46 Sequences producing significant alignments: (Bits) Value
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
47
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
48 lcl|IGHV3-30*04 330 2e-92
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
49 lcl|IGHV3-30-3*01 330 2e-92
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
50 lcl|IGHV3-30*01 327 2e-91
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
51 lcl|IGHD3-16*01 14.4 11
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
52 lcl|IGHD3-16*02 14.4 11
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
53 lcl|IGHD1-14*01 12.4 43
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
54 lcl|IGHJ4*02 78.3 1e-18
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
55 lcl|IGHJ5*02 70.3 4e-16
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
56 lcl|IGHJ4*01 68.3 2e-15
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
57
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
58
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
59 Domain classification requested: imgt
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
60
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
61
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
62 V(D)J rearrangement summary for query sequence (Top V gene match, Top D gene match, Top J gene match, Chain type, V-J Frame, Strand):
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
63 IGHV3-30*04 IGHD3-16*01 IGHJ4*02 VH In-frame +
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
64
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
65 V(D)J junction details (V end, V-D junction, D region, D-J junction, J start). Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either joining gene segment) are indicated in parentheses (i.e., (TACT)) but are not included under V, D, or J gene itself
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
66 AGAGA TATGAGCCCCATCATGACA ACGTTTG CCGGAA ACTAC
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
67
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
68 Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
69 FWR1 27 38 12 11 1 0 91.7
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
70 CDR1 39 62 24 22 2 0 91.7
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
71 FWR2 63 113 51 50 1 0 98
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
72 CDR2 114 137 24 23 1 0 95.8
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
73 FWR3 138 251 114 109 5 0 95.6
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
74 CDR3 (V region only) 252 259 8 7 1 0 87.5
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
75 Total N/A N/A 233 222 11 0 95.3
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
76
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
77
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
78 Alignments
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
79
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
80 <----FWR1--><----------CDR1--------><-----------------------FWR2------
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
81 W A A S G F T F N T Y A V H W V R Q A P G K G
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
82 Query_1 27 TGGGCAGCCTCTGGATTCACCTTCAATACCTATGCTGTGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGC 96
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
83 V 95.3% (222/233) IGHV3-30*04 64 ..T......................G..G.......A................................. 133
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
84 C A A S G F T F S S Y A M H W V R Q A P G K G
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
85 V 95.7% (221/231) IGHV3-30-3*01 64 ..T......................G..G.......A................................. 133
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
86 V 94.8% (221/233) IGHV3-30*01 64 ..T......................G..G.......A................................. 133
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
87
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
88 ----------------><----------CDR2--------><----------------------------
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
89 L E W V A V I S Y D G S N K N Y A D S V K G R F
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
90 Query_1 97 TGGAGTGGGTGGCAGTTATATCATATGATGGAAGCAATAAAAACTACGCAGACTCCGTGAAGGGCCGATT 166
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
91 V 95.3% (222/233) IGHV3-30*04 134 ..................................T......T............................ 203
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
92 L E W V A V I S Y D G S N K Y Y A D S V K G R F
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
93 V 95.7% (221/231) IGHV3-30-3*01 134 .........................................T............................ 203
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
94 V 94.8% (221/233) IGHV3-30*01 134 .A................................T......T............................ 203
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
95
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
96 ---------------------------FWR3---------------------------------------
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
97 T I S R D N S K N T L Y L Q M N S L R V E D T
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
98 Query_1 167 CACCATCTCCAGAGACAATTCCAAGAACACGTTATATCTGCAAATGAACAGCCTGAGAGTTGAGGACACG 236
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
99 V 95.3% (222/233) IGHV3-30*04 204 ...............................C.G.........................C.......... 273
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
100 T I S R D N S K N T L Y L Q M N S L R A E D T
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
101 V 95.7% (221/231) IGHV3-30-3*01 204 ...............................C.G.........................C.......... 273
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
102 V 94.8% (221/233) IGHV3-30*01 204 ...............................C.G.........................C.......... 273
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
103
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
104 -------------->
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
105 A V Y Y C T R D M S P I M T T F A G N Y W G Q
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
106 Query_1 237 GCTGTTTATTACTGTACGAGAGATATGAGCCCCATCATGACAACGTTTGCCGGAAACTACTGGGGCCAGG 306
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
107 V 95.3% (222/233) IGHV3-30*04 274 .....G.........G.......----------------------------------------------- 296
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
108 A V Y Y C A R
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
109 V 95.7% (221/231) IGHV3-30-3*01 274 .....G.........G.....------------------------------------------------- 294
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
110 V 94.8% (221/233) IGHV3-30*01 274 .....G.........G.......----------------------------------------------- 296
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
111 D 100.0% (7/7) IGHD3-16*01 12 ------------------------------------------.......--------------------- 18
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
112 D 100.0% (7/7) IGHD3-16*02 12 ------------------------------------------.......--------------------- 18
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
113 D 100.0% (6/6) IGHD1-14*01 8 -------------------------------------------------......--------------- 13
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
114 J 100.0% (39/39) IGHJ4*02 10 -------------------------------------------------------............... 24
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
115 J 100.0% (35/35) IGHJ5*02 17 -----------------------------------------------------------........... 27
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
116 J 97.4% (38/39) IGHJ4*01 10 -------------------------------------------------------.............A. 24
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
117
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
118
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
119 G T L V T V S S
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
120 Query_1 307 GAACCCTGGTCACCGTCTCCTCAG 330
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
121 J 100.0% (39/39) IGHJ4*02 25 ........................ 48
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
122 J 100.0% (35/35) IGHJ5*02 28 ........................ 51
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
123 J 97.4% (38/39) IGHJ4*01 25 ........................ 48
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
124
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
125
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
126 Lambda K H
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
127 1.10 0.333 0.549
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
128
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
129 Gapped
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
130 Lambda K H
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
131 1.08 0.280 0.540
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
132
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
133 Effective search space used: 64847385
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
134
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
135
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
136 Query= HL67IUI01EQMLY length=609 xy=1826_1636 region=1
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
137 run=R_2012_04_10_11_57_56_
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
138
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
139
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
140 ...etc...
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
141
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
142 =head2 Example Output
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
143
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
144
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
145 Example output from the data above sent:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
146 $ ./IGBLAST_simple.pl igBLASTOutput.txt 1
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
147 D: Request to process just record '1' received
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
148 D: printOUTPUTData: Running
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
149 D: printOUTPUTData: HEADER Printout requested 'ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How'
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
150 OUTPUT: # ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
151 D: ID is: 'HL67IUI01D26LR'
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
152 D: Minimum base marked-up (27) - aka. $AlignmentStart; maximum: (259)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
153 D: Starting Search for CDR3
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
154 D: markUpCDR3: Passed Parameters '251, 27, TGGGG....GG., WG.G' (& AA & DNA sequence)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
155 D: markUpCDR3: returning: 223, 282, MOTIF_FOUND_IN_BOTH, (3) [NB: offset of :'+ 27'
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
156 D: CDR3 was found by pattern matching: 'MOTIF_FOUND_IN_BOTH' (250, 309)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
157 D: Top Hits (raw)= 'IGHV3-30*04 IGHD3-16*01 IGHJ4*02 VH In-frame +'
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
158 D: Top Hits (parsed)= 'IGHV3-30*04, IGHD3-16*01, IGHJ4*02, VH, In-frame, +'
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
159 D: printOUTPUTData: Running
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
160 OUTPUT: HL67IUI01D26LR In-frame IGHV3-30*04 IGHD3-16*01 IGHJ4*02 GFTFNTYA 23 ISYDGSNK 23 CTRDMSPIMTTFAGNYWGQG 59 MOTIF_FOUND_IN_BOTH
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
161
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
162 =head4 Usage notes:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
163
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
164 Designed to be easy to "grep -v D:" or "grep OUTPUT:" for to select the parts you need:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
165
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
166 ./IGBLAST_simple.pl igBLASTOutput.txt 1 | grep OUTPUT:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
167
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
168 OUTPUT: # ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
169 OUTPUT: HL67IUI01D26LR In-frame IGHV3-30*04 IGHD3-16*01 IGHJ4*02 GFTFNTYA 23 ISYDGSNK 23 CTRDMSPIMTTFAGNYWGQG 59 MOTIF_FOUND_IN_BOTH
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
170 OUTPUT: HL67IUI01EQMLY In-frame IGHV4-39*01 IGHD2-8*01 IGHJ3*02 GGSISSSSYY 29 IYHSGST 20 CARDATYYSNGFDIWGQG 53 MOTIF_FOUND_IN_BOTH
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
171 OUTPUT: HL67IUI01CDCLP Out-of-frame IGHV3-23*01 IGHD3-3*01 IGHJ4*02 FSNYAM 16 SGSGDRTY 23 AKAD*FLEWLFRIGDGERLLGPGN 72 MOTIF_FOUND_IN_DNA
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
172 OUTPUT: HL67IUI01AHRNH N/A IGHV3-33*01 N/A N/A WIHLQ*LW 23 YGMMEVI 23 NOT_FOUND
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
173 OUTPUT: HL67IUI01DZZ1V Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 GFTFDKYA 23 ILASG 20 LYCASEGDIVASELLSTGARV 62 MOTIF_FOUND_IN_DNA
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
174 OUTPUT: HL67IUI01DTR2Y Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 LDSPLTNM 23 LYLPVV 20 TVRVRGT*WLRSF*VLGPG 59 MOTIF_FOUND_IN_DNA
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
175 OUTPUT: HL67IUI01EQL3S In-frame IGHV7-4-1*02 IGHD6-19*01 IGHJ6*02 GYTFRTFT 23 INTNTGTP 23 CAKESGTGSAHFFYGMDVWGQG 65 MOTIF_FOUND_IN_BOTH
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
176 OUTPUT: HL67IUI01AFG46 In-frame IGLV2-34*01 N/A IGHJ4*02 NOT_FOUND
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
177 OUTPUT: HL67IUI01EFFKO In-frame IGHV3-11*01 IGHD6-6*01 IGHJ4*02 GFTFSDYY 23 ISYSGGTI 23 CARASGAARHRPLDYWGQG 56 MOTIF_FOUND_IN_BOTH
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
178 OUTPUT: HL67IUI01B18SG In-frame IGHV3-33*01 IGHD5-12*01 IGHJ4*02 VRQA 11 KYYANSVK 23 RLGGFDYWGQGTLVTVSS 53 MOTIF_FOUND_IN_BOTH
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
179 OUTPUT: HL67IUI01D6LER In-frame IGHV1-24*01 IGHD3-22*01 IGHJ4*02 GYSLNELS 23 PDPEDDE 23 TVQPSRITMMAVVITRIHWGASGARE 76 MOTIF_FOUND_IN_DNA
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
180 OUTPUT: HL67IUI01CYCLF N/A IGHV4-39*01 N/A N/A GGSISSSSYY 29 IYYSGST 20 NOT_FOUND
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
181 OUTPUT: HL67IUI01B4LEE In-frame IGHV7-4-1*02 IGHD6-19*01 IGHJ6*02 GYTFRTFT 23 INTNTGTP 23 CAKESGTGSAHFFYGMDVWGQG 65 MOTIF_FOUND_IN_BOTH
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
182 OUTPUT: HL67IUI01A4KW4 Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 LDSPLTNM 23 LYLPVV 20 TVRVRGT*WLRSF*IWGQG 58 MOTIF_FOUND_IN_BOTH
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
183 OUTPUT: HL67IUI01E05BV In-frame IGHV1-24*01 IGHD3-22*01 IGHJ2*01 GYSLNELS 23 PDPEDDE 23 NOT_FOUND
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
184 OUTPUT: HL67IUI01CVVKY In-frame IGHV1-3*01 IGHD2-15*01 IGHJ1*01 NOT_FOUND
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
185 OUTPUT: HL67IUI01CN5P2 In-frame IGHV7-4-1*02 IGHD2-21*02 IGHJ5*02 GYSITDYG 23 LNTRTGNP 23 CAVKDARDFVSWGQG 44 MOTIF_FOUND_IN_BOTH
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
186 OUTPUT: HL67IUI01DUUJ5 In-frame IGHV3-21*01 IGHD1-7*01 IGHJ4*02 GYTFSTYS 23 ISSSSAYR 23 CARDIRLELRDWGQG 44 MOTIF_FOUND_IN_BOTH
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
187 OUTPUT: HL67IUI01E1AIR Out-of-frame IGHV4-39*01 N/A IGHJ3*01 WGLHRRW**L 29 FVS*RAPR 23 NOT_FOUND
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
188 OUTPUT: HL67IUI01CCZ8D Out-of-frame IGHV3-23*01 IGHD5-12*01 IGHJ4*02 GFTFDKYA 23 ILASGR 20 YCASEGDIVASELLSTGARE 58 MOTIF_FOUND_IN_DNA
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
189 OUTPUT: HL67IUI01BT9IR N/A IGHV3-21*02 N/A N/A NOT_FOUND
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
190 OUTPUT: HL67IUI01COTO0 Out-of-frame IGHV4-39*01 N/A IGHJ3*01 GGFIGGGDNF 29 LYHDGRPA 23 NOT_FOUND
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
191 OUTPUT: HL67IUI01D994O In-frame IGHV7-4-1*02 IGHD2-21*02 IGHJ5*02 GYSITDYG 23 LNTRTGNP 23 CAVKDARDFVSWGQG 44 MOTIF_FOUND_IN_BOTH
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
192 OUTPUT: HL67IUI01A08CJ In-frame IGHV4-39*01 IGHD6-13*01 IGHJ5*02 GGSISSSSYY 29 IYYTWEH 21 CERARRGSSWGQLVRPLGPG 62 MOTIF_FOUN
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
193
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
194
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
195
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
196 OUTPUT: # ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
197 OUTPUT: HL67IUI01D26LR In-frame IGHV3-30*04 IGHD3-16*01 IGHJ4*02 GFTFNTYA 23 ISYDGSNK 23 CTRDMSPIMTTFAGNYWGQG 59 MOTIF_FOUND_IN_BOTH
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
198 ...etc...
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
199
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
200 =head4 Also, combined grep & sed:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
201
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
202 $ ./IGBLAST_simple.pl igBLASTOutput.txt | grep OUTPUT: | sed 's/OUTPUT:\t//'
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
203
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
204 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
205
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
206 =head3 CDR3 Patterns:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
207
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
208 We use these two variables to try to identify the end of the CDR3 region if igBLAST doesn't report it directly:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
209
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
210 my $DNACDR3_Pat = "TGGGG....GG.";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
211 my $AASequenceMotifPattern = "WG.G";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
212
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
213 They are treated as regex's when tested (so use "." to mean any DNA base, rather than 'N' or 'X').
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
214
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
215 [NB: These are original patterns used for testing, check the code for the current ones.]
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
216
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
217 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
218
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
219 my $DNACDR3_Pat = "TGGGG....GG.";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
220 my $AACDR3_Pat = "WG.G";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
221
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
222 use strict;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
223 use Data::Dumper;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
224 # Set this as to number of the result (aka "record") you want to process or 0 for all:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
225 my $ProcessRecord =0;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
226 if (defined $ARGV[1]) { $ProcessRecord = pop @ARGV; } #Also accept from the command line:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
227 if ($ProcessRecord != 0) { print "D: Request to process just record '$ProcessRecord' received\n"; }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
228
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
229 #Adjust the record separator:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
230 $/="Query= ";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
231 my $Record=0; # A simple counter, that we might not use.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
232 #Force-loaded header / version information:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
233 my $Header = <>;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
234 #At the moment we don't use this - so dump it immediately:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
235 $Header = undef;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
236 #print "D: Force-loaded header / version information: '$Header'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
237
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
238 #Print the Header for the output line (we need this once, at the start)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
239 print &printOUTPUTData ({"HEADER" => 1})."\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
240
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
241 while (<>)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
242 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
243 =head4 First check - should we be processing this record at all?
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
244
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
245 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
246 $Record++; #Increment the record counter:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
247 #Do we process this record - or all records?
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
248 if ($ProcessRecord != $Record && $ProcessRecord != 0)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
249 { next; } #We need to increment the record counter before we increment
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
250
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
251 =head4 Setup the output line storage and print the header:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
252
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
253 We enter this initially and work to change it:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
254
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
255 $DomainBoundaries{"CDR3"}{"FoundHow"} = "NOT_FOUND";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
256
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
257 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
258
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
259 my %OUTPUT_Data; #To collect data for the output line in
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
260 #Assume the first and work to find better:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
261 $OUTPUT_Data{"CDR3 Found How"} = "NOT_FOUND";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
262 #The whole record - one per read - is now stored in $_
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
263 my @Lines =split (/[\r\n]+/,$_); # split on windows/linux/mac new lines
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
264
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
265 #If you are interested enable either of the next lines depending on how curious you are as to how the splitting went:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
266 #print "D: Record #$Record\n"; print $_; print "\n---------\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
267 print "D: ''$Lines[0]'\nD: ...etc...'\nD: ############\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
268
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
269 =head3 Get the ID
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
270
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
271 Quite easy: the first field on the first line:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
272
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
273 Query= HL67IUI01DTR2Y length=577 xy=1452_0984 region=1
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
274
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
275 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
276
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
277 (my $ID) = $Lines[0]=~ m/^(\S+)/;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
278 unless (defined $ID && $ID ne "")
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
279 { # So a near total failure...?
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
280 $OUTPUT_Data{"ID"} = "Unknown";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
281 print &printOUTPUTData (\%OUTPUT_Data)."\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
282 next; #No ID is terminal for this record
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
283 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
284 else
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
285 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
286 print "D: ID is: '$ID'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
287 $OUTPUT_Data{"ID"} = $ID;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
288 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
289 =head3 Declare the variables we will need here in the next few sections to store data
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
290
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
291 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
292
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
293 my $CurrentRegion;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
294 my $RegionMarkup;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
295
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
296 #So we can sync the coordinated of the alignment up to the domains found:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
297 my $Query_Start = -1; my $Query_End = -1;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
298
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
299 #Where on the Query Sequence (i.e. the 454 read) does the alignment start & stop?
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
300 my $ThisQueryStart =-1; my $ThisQueryEnd =-1; #Think $ThisQueryEnd isn't used at the moment.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
301 my $DNAQuerySequence =""; #The actual DNA Query sequence...
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
302 my $AAQuerySequence = "";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
303
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
304 #As this changes with the alleles identified:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
305 my $CurrentAASequence;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
306 #The main storage variables
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
307
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
308 my %Alginments; my %Alleles;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
309 my %DomainBoundaries;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
310
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
311 =head2 Stanza 1: Get the general structure of the sequence identified
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
312
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
313 =head3 Method 1: Use the table supplied
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
314
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
315 Technically this valid for the top hit...realistically this is the only information we have reported to us
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
316 so we use this or nothing. This is fine for the top hit which is likely what we are interested in....but for the 2nd or 3rd? Who knows!
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
317
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
318 Targets this block:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
319
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
320 Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
321 FWR1 167 240 75 72 2 1 96
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
322 CDR1 241 264 24 20 4 0 83.3
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
323 FWR2 265 315 51 48 3 0 94.1
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
324 CDR2 316 336 24 15 6 3 62.5
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
325 FWR3 337 450 114 106 8 0 93
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
326 CDR3 (V region only) 451 454 4 4 0 0 100
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
327 Total N/A N/A 292 265 23 4 90.8
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
328
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
329 Then we split out the lines inside it in a second scanning step - less optimal but easier to read:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
330
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
331 FWR1 167 240 75 72 2 1 96
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
332 CDR1 241 264 24 20 4 0 83.3
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
333 FWR2 265 315 51 48 3 0 94.1
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
334 CDR2 316 336 24 15 6 3 62.5
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
335 FWR3 337 450 114 106 8 0 93
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
336 CDR3 (V region only) 451 454 4 4 0 0 100
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
337
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
338 into:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
339
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
340 (Section, from, to, length, matches, mismatches, gaps, percent identity)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
341
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
342 =head3 Method 2: Use the table supplied
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
343
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
344 The other way to do this is to split the graphical markup out of the alignment.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
345 This works for _any_ reported alignment, not just the top hits:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
346
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
347 In the main alignment table processing section collect the information, collect the information:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
348
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
349 #Is region mark-up:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
350 if ($#InfoColumns == -1 && $#AlignmentColumns ==0)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
351 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
352 # print ": Region Markup detected\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
353 $RegionMarkup = $RegionMarkup.$AlignmentPanel; #Collect the information, then re-synthesise it at the end of record
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
354 next;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
355 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
356
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
357 Then afterwards when all the region was collected, process it like this:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
358 #Pad the CDER3 region:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
359
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
360 #Remove the trailing spaces:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
361 $RegionMarkup =~ s/ *$//g;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
362 #Calculate the length of the CDR3 region so we can add it in:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
363 my $CDR3PaddingNeeded = ($Query_End-$Query_Start)-length ($RegionMarkup) -length ("<-CDR3>")+1;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
364 #Build up the CDR3 region, the 'x' operator is very helpful here (implict foreach loop):
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
365 $RegionMarkup = $RegionMarkup."<-CDR3"."-" x $CDR3PaddingNeeded. ">";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
366 #print "D: Need to pad with:'$CDR3PaddingNeeded' characters\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
367
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
368 #Now really process it:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
369 my $C_Pos = 0;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
370 my @Domains = split (/(<*-*...[123]-*>*)/,$RegionMarkup); #
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
371 foreach my $C_Domain (@Domains)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
372 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
373 if (length ($C_Domain) <=0) {next;}
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
374 my $DomainStart= $C_Pos;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
375 my $DomainEnd = $DomainStart + length ($C_Domain)-1;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
376 my ($DomainType) = $C_Domain =~ m/(...[123])/;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
377 # print "D: $DomainType \t($DomainStart-$DomainEnd=",$DomainEnd-$DomainStart,"):\t$C_Domain\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
378 $DomainBoundaries{$DomainType}{"Start"} = $DomainStart;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
379 $DomainBoundaries{$DomainType}{"End"} = $DomainEnd;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
380 $C_Pos = $DomainEnd+1;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
381 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
382
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
383 The two pieces of code are interchangable; the table version as used below, is neater, easier to understand and works nicely.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
384 Why stress?
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
385
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
386
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
387 =head3 The end of the FWR3 is the start of CDR3?
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
388
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
389 This is an assumption made. Hence the two variables:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
390
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
391 my $MaxDomainReported =0 ; # In nts / bps
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
392 my $FWR3_Found_Flag = 0; # Did we find the end of the FWR3 - which is the start of the CDR3. Set to 'false' initially.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
393
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
394 $MaxDomainBaseFound
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
395
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
396 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
397 my $MaxDomainBaseFound =0 ; # In nts / bps
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
398 my $AlignmentStart ; # In nts /bp #Alternative name would be: '$MinDomainBaseFound'; set to null until primed
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
399 # my $FWR3_Found_Flag = 0; # Did we find the end of the FWR3 - which is the start of the CDR3. Set to 'false' initially.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
400
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
401 (my @StructureSummaryTable) = returnLinesBetween (\@Lines, "Alignment summary", "Total" );
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
402 #Enable the next line if you want the raw data we are going to parse in this section:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
403 #print Dumper @StructureSummaryTable;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
404 foreach my $C_Section (@StructureSummaryTable)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
405 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
406 my ($DomainType, $DomainStart, $DomainEnd, $SectLength, $Matches, $Mismatches, $Gaps, $PID) = split (/\t+/,$C_Section);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
407 #print "D: Domain type: '$DomainType'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
408 #$DomainType =~ s/ .*$//g;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
409 $DomainBoundaries{$DomainType}{"Start"} = $DomainStart;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
410 $DomainBoundaries{$DomainType}{"End"} = $DomainEnd;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
411
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
412 #So we can do a reality check on the length / start of the CDR3 if we have to go looking:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
413 if ($MaxDomainBaseFound <= $DomainEnd)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
414 { $MaxDomainBaseFound = $DomainEnd; } #Store the maximum base found
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
415 if ($AlignmentStart eq undef or $AlignmentStart >= $DomainStart)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
416 { $AlignmentStart = $DomainStart; }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
417 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
418 #print Dumper %DomainBoundaries;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
419 #die "HIT BLOCK\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
420
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
421 =head3 Did we find the CDR3 region specifically?
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
422
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
423 If we did fine; otherwise try to find it using the FWR3 region if we found that; otherwise give up.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
424
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
425 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
426 print "D: Minimum base marked-up ($AlignmentStart) - aka. \$AlignmentStart; maximum: ($MaxDomainBaseFound)\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
427
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
428 #my @WantedSections = qw (V D J);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
429
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
430 =head2 Second Stanza: Parse the main Alignment Table
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
431
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
432 =head3 Get the table, then determine the character at which to split the 'Info' & 'Alignment' panels.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
433
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
434 As this is a little involved and comparamentalises nicely we sub-contract this to two functions""
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
435
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
436 (my @Table) = returnLinesBetween (\@Lines, "Alignment", "Lambda" );
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
437 my $PanelSplitPoint = findSplitPoint (\@Table); #Why can't they just use a fixed field width or a tab as a delimiter?
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
438
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
439 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
440 (my @Table) = returnLinesBetween (\@Lines, "Alignment", "Lambda" );
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
441 my $PanelSplitPoint = findSplitPoint (\@Table); #Why can't they just use a fixed field width or a tab as a delimiter?
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
442 #If you are interested, enable this line:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
443 # print "D: The info panel was detected at: '$splitPoint'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
444
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
445 =head3
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
446
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
447 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
448
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
449
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
450 foreach my $C_Line (0..$#Table)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
451 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
452
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
453 =head3 Call the line type we find: There are 4:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
454
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
455 These are distinguished by the number of fields (one or mores spacer is a field separator) in the Info & Alignment Panels (see values in brackets)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
456
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
457 | <- This split is ~40 chars. from the start of the line
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
458 * InfoPanel * | * Alignment Panel *
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
459 : is a "Blank" line (-1,-1)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
460 <----FWR1--><----------CDR1--------><-----------------------FWR2------ : is "Region Markup" (-1,0)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
461 W A A S G F T F N T Y A V H W V R Q A P G K G : is "AA Sequence" (-1, >=0)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
462 Query_1 27 TGGGCAGCCTCTGGATTCACCTTCAATACCTATGCTGTGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGC 96 : is "DNA Sequence" (2,1)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
463 V 95.3% (222/233) IGHV3-30*04 64 ..T......................G..G.......A................................. 133 : is "" "
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
464
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
465 So we split 40 chars in and then the two parts on spaces.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
466
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
467
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
468 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
469
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
470 # print "D: (sub) Line in parsed table: '$C_Line': \n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
471
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
472 my ($InfoPanel, $AlignmentPanel) = $Table[$C_Line] =~ /^(.{$PanelSplitPoint})(.*)$/;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
473
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
474 my @InfoColumns = split (/\s+/,$InfoPanel);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
475 my @AlignmentColumns = split (/\s+/,$AlignmentPanel);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
476
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
477 #If you want to see how the line is being split enable either of these next two lines; the 2nd is more detailed than the first
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
478 # print "D: Line: $C_Line/t Number of Columns (Info, Alignment): \t$#InfoColumns \t $#AlignmentColumns\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
479 # print "D: For '$C_Line' \t line in the table there are parts: '$InfoPanel' [$#InfoColumns], '$AlignmentPanel [$#AlignmentColumns]'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
480
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
481 #Populate this so we can step through it
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
482
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
483 =head4 Is a blank line:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
484 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
485 if ($#InfoColumns == -1 && $#AlignmentColumns == -1)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
486 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
487 # print ": Blank\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
488 next;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
489 } #For now I think we just skip - is not needed (though might be implict mark-up)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
490
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
491 =head4 Is region mark-up:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
492 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
493 if ($#InfoColumns == -1 && $#AlignmentColumns ==0)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
494 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
495 # print ": Region Markup detected\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
496 $RegionMarkup = $RegionMarkup.$AlignmentPanel; #Collect the information, then re-synthesise it at the end of record
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
497 next;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
498 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
499 =head4 Is query DNA Sequence:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
500 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
501 if ($#InfoColumns == 2 && $#AlignmentColumns ==1)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
502 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
503 # print ": DNA Query Sequence\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
504 #Detect the two coordinatates of alignment against the query sequence: (last two numbers of the two 'panels')
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
505 ($ThisQueryStart) = $InfoPanel =~ / (\d+) *$/;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
506 ($ThisQueryEnd) = $AlignmentPanel =~ / (\d+) *$/;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
507 my ($ThisDNASeq) = $AlignmentPanel =~ /^(.*?) /;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
508 #If you want to know what we just found:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
509 #print "D: This DNA Sequence: '$ThisDNASeq'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
510 $DNAQuerySequence = $DNAQuerySequence. $ThisDNASeq; #Add it on to whatever we already have.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
511 #Move the needle if there are smaller / greater; otherwise prime the 'needles':
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
512 if ($ThisQueryStart < $Query_Start or $Query_Start == -1)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
513 { $Query_Start = $ThisQueryStart; }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
514 if ($ThisQueryEnd > $Query_End or $Query_End == -1)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
515 { $Query_End = $ThisQueryEnd; }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
516 # print ": Query DNA Sequence detected This line: ($ThisQueryStart, $ThisQueryEnd) & Maximally: ($Query_Start, $Query_End)\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
517 next;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
518 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
519 =head4 Is AA Sequence:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
520
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
521 This is complicated as it Need to decide whether this is the sequence of the read or that of the original V / D / J regions:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
522 -------------->
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
523 A V Y Y C T R D M S P I M T T F A G N Y W G Q << Want this
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
524 Query_1 237 GCTGTTTATTACTGTACGAGAGATATGAGCCCCATCATGACAACGTTTGCCGGAAACTACTGGGGCCAGG 306
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
525 V 95.3% (222/233) IGHV3-30*04 274 .....G.........G.......----------------------------------------------- 296
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
526 A V Y Y C A R
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
527 V 95.7% (221/231) IGHV3-30-3*01 274 .....G.........G.....------------------------------------------------- 294
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
528
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
529 ...etc...
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
530 G T L V T V S S << Want this
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
531 Query_1 307 GAACCCTGGTCACCGTCTCCTCAG 330
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
532
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
533 To solve this we peak at the next line that it has the tag "Query" in it (we assume the line exists...)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
534
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
535 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
536
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
537 if ($#InfoColumns == -1 && $#AlignmentColumns >=-1)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
538 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
539 unless ($Table[$C_Line+1] =~ /Query/) { next; } #Is the next line the DNA sequence ?
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
540 #
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
541 # print ": AA sequence\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
542
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
543
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
544 $CurrentAASequence = $AlignmentPanel;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
545 #print "D: Panel Split Point = $PanelSplitPoint, '$AlignmentPanel'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
546 $CurrentAASequence =~ s/^ {$PanelSplitPoint}//;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
547 #print "D: '$AAQuerySequence'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
548 # print "D: Current AA Sequence: \t'$CurrentAASequence'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
549 $AAQuerySequence = $AAQuerySequence.$CurrentAASequence; #Store the elongating AA Sequence as well
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
550 next;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
551 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
552 =head4 Is Alignment:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
553 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
554 if ($#InfoColumns == 4 && $#AlignmentColumns ==1)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
555 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
556 #Not acutally interesting to us for this version of the parser. Delete ultimately?
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
557 next;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
558 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
559
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
560 #Is weird! Don't recognise it!
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
561
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
562 warn "Weird! Don't recongnise this: '$ID' [$#InfoColumns,$#AlignmentColumns]// '",$Lines[$C_Line],15,"...'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
563 } #End main iteration loop for alignment parsing.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
564
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
565
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
566 =head2 The CDR3 is noted as problematic. Can we identify it?
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
567
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
568 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
569 print "D: Starting Search for CDR3\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
570 #Do have the end of the FWR3 but not the CDR3? If so then it is worth trying to find the CDR3, otherwise...nothing we can do at this point
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
571 if (exists ($DomainBoundaries{"FWR3"}{"End"})
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
572 && $AlignmentStart !=0
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
573 && not (exists $DomainBoundaries{"CDR3"}{"End"}) ) #Guess we need to go looking for the end then...
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
574 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
575 #print "D: Placing call to markUpCDR3\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
576 my ($CDR3_Start, my $CDR3_End, my $CDR3_Found_Tag) = markUpCDR3 ($DNAQuerySequence, $AAQuerySequence,
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
577 $DomainBoundaries{"FWR3"}{"End"}, $AlignmentStart,
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
578 $DNACDR3_Pat, $AACDR3_Pat);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
579 if ($CDR3_Start !=0 && $CDR3_End !=0)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
580 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
581 $DomainBoundaries{"CDR3"}{"Start"} = $CDR3_Start;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
582 $DomainBoundaries{"CDR3"}{"End"} = $CDR3_End ;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
583 $DomainBoundaries{"CDR3"}{"FoundHow"} = $CDR3_Found_Tag;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
584 print "D: CDR3 was found by pattern matching: '$CDR3_Found_Tag' ($CDR3_Start, $CDR3_End)\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
585 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
586 else
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
587 { print "D: CDR3 was not found [either by igBLAST or by pattern matching]\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
588 $DomainBoundaries{"CDR3"}{"FoundHow"} = "NOT_FOUND";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
589 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
590 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
591 else
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
592 { #Was reported by igBLAST
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
593 print "D: Found the FWR3 from the Domain Boundary Table\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
594 $DomainBoundaries{"CDR3"}{"FoundHow"} = "IGBLAST_NATIVE";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
595 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
596
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
597 #print Dumper %DomainBoundaries;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
598
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
599 =head2 Get the top VDJ regions:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
600
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
601 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
602
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
603 =head2 Extract General Features:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
604
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
605 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
606 (my $TopHit) = $_ =~ m/V-J Frame, Strand\):\n(.*?)\n/s;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
607 print "D: Top Hits (raw)= '$TopHit' \n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
608 my ($Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match, $Chain, $VJFrame, $Strand) = split (/\t/,$TopHit);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
609 print "D: Top Hits (parsed)= '$Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match, $Chain, $VJFrame, $Strand'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
610
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
611 =head2 Store the V / D / J Genes used
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
612
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
613 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
614
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
615 if (defined $Top_V_gene_match && $Top_V_gene_match ne "")
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
616 { $OUTPUT_Data{"Top V Gene"} = $Top_V_gene_match; }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
617
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
618 if (defined $Top_D_gene_match && $Top_D_gene_match ne "")
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
619 { $OUTPUT_Data{"Top D Gene"} = $Top_D_gene_match; }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
620
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
621 if (defined $Top_J_gene_match && $Top_J_gene_match ne "")
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
622 { $OUTPUT_Data{"Top J Gene"} = $Top_J_gene_match; }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
623
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
624 if (defined $Strand && $Strand ne "")
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
625 { $OUTPUT_Data{"Strand"} = $Strand;}
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
626
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
627 =head4 Preamble: ID, Frame, and V / D / J used:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
628
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
629 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
630 #Do a reality check: if we didn't get an ID, then skip:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
631 unless (defined (defined $ID) && $ID ne "" &&
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
632 defined $VJFrame && $VJFrame ne "")
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
633 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
634 print &printOUTPUTData (\%OUTPUT_Data)."\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
635 next;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
636 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
637
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
638 #Ok, so we have data...most likely:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
639 #print "OUTPUT:\t",join ("\t", $ID, $VJFrame, $Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
640
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
641 if (defined $VJFrame && defined $ID && $VJFrame ne "" && $ID ne "")
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
642 { $OUTPUT_Data{"VDJ Frame"} = $VJFrame;}
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
643 else
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
644 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
645 print &printOUTPUTData (\%OUTPUT_Data)."\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
646 next;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
647 }#REALLY? We didn't find anything? Oh well, move to next record
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
648
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
649 =head4 CDR1
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
650
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
651 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
652 #Remember that the alignment starts at the FWR1 start, not nt =0 on the read, hence we substract this off all future AA (& DNA coordinates)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
653
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
654 my $AlignmentOffset = $DomainBoundaries{"FWR1"}{"Start"};
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
655
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
656 # print "D: AA Seqeunce is: '$AAQuerySequence'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
657 if (exists $DomainBoundaries{"CDR1"}{"Start"}) #It is very possible that it doesn't; assume the End does though if we find the Start
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
658 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
659 # my $VRegion = $Alginments{"V"}{$C_VRegion}; #Convenience....
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
660 my $CDR1Start = $DomainBoundaries{"CDR1"}{"Start"};
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
661 my $CDR1End = $DomainBoundaries{"CDR1"}{"End"};
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
662 my $CDR1_Length = $CDR1End - $CDR1Start;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
663 # print "D: CDR1 $CDR1Start $CDR1End = $CDR1_Length\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
664 #Remember that the alignment starts at the FWR1 start, not nt =0 on the read
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
665 my $CDR1_Seq_AA = substr ($AAQuerySequence, $CDR1Start - $AlignmentOffset, $CDR1_Length);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
666 # print "D: '$CDR1_Seq_AA'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
667 $CDR1_Seq_AA =~ s/ //g;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
668 my $CDR1_Seq_AA_Length = length ($CDR1_Seq_AA);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
669 #Add this data to the output store specifically:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
670 $OUTPUT_Data{"CDR1 Seq"} = $CDR1_Seq_AA;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
671 $OUTPUT_Data{"CDR1 Length"} = $CDR1_Length;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
672 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
673 #What happens if there is no CDR1 found? Leave blank - the output routine can handle this
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
674
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
675 =head4 CDR2
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
676
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
677 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
678
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
679 if (exists $DomainBoundaries{"CDR2"}{"Start"}) #It is very possible that it doesn't; assume the End does though if we find the Start
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
680 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
681 # my $VRegion = $Alginments{"V"}{$C_VRegion}; #Convenience....
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
682 my $CDR2Start = $DomainBoundaries{"CDR2"}{"Start"};
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
683 my $CDR2End = $DomainBoundaries{"CDR2"}{"End"};
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
684 my $CDR2_Length = $CDR2End - $CDR2Start;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
685 my $CDR2_Seq_AA = substr ($AAQuerySequence, $CDR2Start - $AlignmentOffset , $CDR2_Length);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
686 $CDR2_Seq_AA =~ s/ //g;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
687 my $CDR2_Seq_AA_Length = length ($CDR2_Seq_AA);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
688 #Add this data to the output store specifically:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
689 $OUTPUT_Data{"CDR2 Seq"} = $CDR2_Seq_AA;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
690 $OUTPUT_Data{"CDR2 Length"} = $CDR2_Length;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
691 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
692 #What happens if there is no CDR2 found? Leave blank - the output routine can handle this.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
693
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
694 =head4 CDR3
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
695
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
696 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
697 if (exists $DomainBoundaries{"CDR3"}{"Start"}) #It is very possible that it doesn't; assume the End does though if we find the Start
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
698 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
699 # my $VRegion = $Alginments{"V"}{$C_VRegion}; #Convenience....
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
700 my $CDR3Start = $DomainBoundaries{"CDR3"}{"Start"};
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
701 my $CDR3End = $DomainBoundaries{"CDR3"}{"End"};
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
702 my $CDR3_Length = $CDR3End - $CDR3Start; # This variable isn't used - delete it when safe to do so
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
703 my $CDR3_Seq_AA = substr ($AAQuerySequence, $CDR3Start - $AlignmentOffset, $CDR3_Length);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
704 my $CDR3_Seq_DNA = substr ($DNAQuerySequence, $CDR3Start - $AlignmentOffset, $CDR3_Length);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
705 $CDR3_Seq_AA =~ s/ //g;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
706 $CDR3_Seq_DNA =~ s/ //g;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
707 my $CDR3_Seq_AA_Length = length ($CDR3_Seq_AA);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
708 my $CDR3_Seq_DNA_Length = length ($CDR3_Seq_DNA);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
709 #Add this data to the output store specifically:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
710 $OUTPUT_Data{"CDR3 Seq"} = $CDR3_Seq_AA;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
711 $OUTPUT_Data{"CDR3 Length"} = $CDR3_Seq_AA_Length;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
712 $OUTPUT_Data{"CDR3 Seq DNA"} = $CDR3_Seq_DNA;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
713 $OUTPUT_Data{"CDR3 Length DNA"} = $CDR3_Seq_DNA_Length;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
714 #And in the case of the CDR3 how we found it:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
715 $OUTPUT_Data{"CDR3 Found How"} = $DomainBoundaries{"CDR3"}{"FoundHow"};
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
716 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
717 #What happens if there is no CDR3 found? Leave blank - the output routine can handle this.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
718 #die "HIT BLOCK\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
719 #End of the record; output the data we have collected and move on.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
720 print &printOUTPUTData (\%OUTPUT_Data)."\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
721 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
722
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
723
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
724
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
725 ############
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
726 sub returnLinesBetween {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
727 =head3 SUB: returnLinesBetween ({reference to array Index array}, {regex for top of section}, {regex for bottom of section})
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
728
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
729 When passed a reference to an array and two strings - interpreted as REGEX's - will return the lines of the Array
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
730 that are bounded by these tags.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
731
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
732 If either of the tags are not found - or are found in the wrong order - then a null list is returned.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
733
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
734 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
735
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
736 my ($Text_ref, $TopTag, $BotTag) = @_;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
737
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
738 my @Table;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
739 #The two boundary conditions at which we will cut the table:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
740 #print "D: [returnLinesBetween]: '$TopTag, $BotTag'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
741 #How we record these:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
742 my $AlignmentLine_Top=0; my $AlignmentLine_Bot=0;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
743
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
744 my $LineIndex=-1; #-1 As the loop increments this line counter first, then does its checks.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
745 #If you care:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
746 #print "D: Lines of text passed: $$#Lines\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
747
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
748 #Iterate through until we find what we are looking for or run out of text to search:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
749 while (($AlignmentLine_Bot ==0 or $AlignmentLine_Top==0) && $LineIndex <=$#{$Text_ref})
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
750 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
751 $LineIndex++;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
752 #Enable if you need to care:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
753 # print "D: Line Index = $LineIndex\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
754
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
755 if ($$Text_ref[$LineIndex] =~ m/$TopTag/)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
756 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
757 $AlignmentLine_Top = $LineIndex;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
758 # print "D: [returnLinesBetween]: TopTag found in Line: '$$Text_ref[$LineIndex]'\n"; #Enable if you are interested
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
759 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
760 if ($$Text_ref[$LineIndex] =~ m/$BotTag/)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
761 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
762 $AlignmentLine_Bot = $LineIndex;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
763 # print "D: [returnLinesBetween]: Bottom Tag found in Line: '$$Text_ref[$LineIndex]'\n"; #Enable if you are interested
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
764 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
765 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
766 #Reality check: did we find anything? If not then we return null.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
767 if ($AlignmentLine_Top ==0 && $AlignmentLine_Bot ==0)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
768 { return; }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
769 #Again, enable if you care:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
770 #print "D: [returnLinesBetween] Lines for section table: '$AlignmentLine_Top to $AlignmentLine_Bot'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
771
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
772 #We want the lines one down and one up - so polish these.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
773 $AlignmentLine_Top++; $AlignmentLine_Bot--;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
774
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
775 #Return as an array slice:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
776 return (@$Text_ref[$AlignmentLine_Top .. $AlignmentLine_Bot]);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
777 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
778 ############
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
779
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
780 sub findSplitPoint
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
781 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
782 =head2 sub: $PanelBoundaryCahracter = findSplitPoint (\@Table)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
783
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
784 When passed a table with the alignment in it makes an educated guess as to the precise split point to
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
785 spearate the 'info' and 'alignment' panels.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
786 This is a right olde faff because the field / panel boundaries change.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
787
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
788 ' Query_6 167 GAGGTGCAGTTGTTGGAGTCTGGGGGAGGCTTGGCACAGCC-GGGGGGTCCCTGAGACTCTCCTGTGCAG 235'
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
789 ' Query_6 236 CCTCTGGATTCACCTTTGACAAATATGCCATGACCTGGGTCCGCCAGGCTCCAGGGAAGGGTCTGGAGTG 305'
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
790 ' Query_6 306 GGTCTCAACTATACTTGCCAGTGGTCG---CACAGACGACGCAGACTCCGTGAAGGGCCGGTTTGCCATC 372'
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
791 ' Query_6 373 TCCAGAGACAATTCCAAGAACACTCTGTATCTGCAAATGAACAGCCTGAGAGTCGAGGACACGGCCCTTT 442'
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
792 ' Query_6 443 ATTACTGTGCGAGTGAGGGGGACATAGTGGCTTCGGAGCTTTTGAGTACTGGGGCCAGGGAAACCTGGTC 512'
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
793 MOTIF_FOUND_IN_AA
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
794 i.e to contain just ATGC + "X" bases & the gap "-" character but not the "." character (found in the alingment proper) and have 4 fields in total
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
795
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
796 Returns either -1 or the location of the panel boundary, issues a warning and returns -1 if is the most frequent boundary
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
797 because the pattern match has been failing more often that it suceeded.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
798
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
799 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
800 #A rough guess is 38 for normal sequences, 48 for reversed ones:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
801
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
802 my $SplitPos = 0;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
803
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
804 (my $Table_ref) = @_; #Get the reference to the table
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
805 my @DNALines; #We populate this for mining in the next section
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
806 foreach my $C_Line (@{$Table_ref})
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
807 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
808 #print "D: $C_Line\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
809 # (my $SplitLine) = $C_Line;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
810 #Split on consecutive tabs or spaces:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
811 my @LineFields = split (/[\t\s]+/,$C_Line);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
812 #print "D: Split Line: '",join (",",@LineFields),"' : $#LineFields\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
813 unless ( $LineFields[3] =~ m/[^\.]/
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
814 && $LineFields[3] =~ m/[ATGCX]{20,}/
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
815 && $#LineFields==4)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
816 { next; }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
817 #Enable if you want to know the lines we think are the DNA Query strings:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
818 #print "D: DNA Line: '$C_Line'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
819 push @DNALines, $C_Line; #Note it down
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
820 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
821
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
822 my %PanelBounds; #Will contain the positions of the panel boundaries
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
823
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
824 foreach my $C_DNALine (@DNALines)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
825 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
826 #print "D: '$C_DNALine'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
827 $C_DNALine =~ m/[ATGC-]+ \d+$/; #Match the DNA string and the indexingMOTIF_FOUND_IN_AA numbers afterwards, allow gap characters.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
828 my $MatchPos = $-[0]; #This is the position of the start of the last match because we can't get the index() function to work
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
829 #(my $MatchPos) = index ($C_DNALine, / [ATGCX-]{20}/,0);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
830 #print "D: '$C_DNALine' DNA panel starts at:'$MatchPos'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
831 $PanelBounds{$MatchPos}++;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
832 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
833 #Sort the hash values in order and then return the most frequent (will offer some resistance to the occasion pattern failure)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
834 #The brackets around "($SplitPos)" are really necessary it seems.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
835 ($SplitPos) = (sort { $a <=> $b } keys %PanelBounds);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
836 #If you want
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
837 #print Dumper %PanelBounds;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
838 #Tell people if we are having difficultlty:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
839 if ($SplitPos == -1) { warn "Couldn't identify the panel boundaries\n"; }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
840 #print "D: $SplitPos: Returning the split position of: '$SplitPos'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
841 return $SplitPos;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
842 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
843
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
844
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
845 ##
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
846 #
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
847 #
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
848 ###
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
849
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
850
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
851
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
852
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
853
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
854 #####
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
855 #
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
856 #
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
857 #####
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
858 sub markUpCDR3
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
859 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
860 =head3 Sub: (Start, End, Found How) = markUpCDR3 (DNASeq, AASeq, FWR3 End, FWR1 Offset, DNA Regex, AA Regex)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
861
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
862 Tries to identify the end of the CDR3 using the DNA and RNA Sequence patterns MOTIF_FOUND_IN_AAsupplied. The CDR3 is assumed to start
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
863 at the end of the FWR3.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
864 To reduce FP matches only the sequences (DNA & AA) after the FWR3 are tested with the pattern.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
865 The position of the first matching pattern is reported.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
866
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
867 =head4 Fuller Usage:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
868
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
869 my ($CDR3_Start, my $CDR3_End) = markUpCDR3 ($DNAQuerySequence, $AAQuerySequence,
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
870 $DomainBoundaries{"FWR3"}{"End"}, $DomainBoundaries{"FWR1"}{"Start"},
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
871 $DNACDR3_Pat, $AACDR3_Pat);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
872
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
873
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
874
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
875 =head4 Returned Values
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
876
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
877 If the CDR3 was found then we we signal like this:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
878
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
879 $MotifFound ==0 : Nope, didn't find either motif
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
880 $MotifFound ==1 : Found at the DNA level, not the AA level
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
881 $MotifFound ==2 : Found at the the AA level, not the DNA level
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
882 $MotifFound ==3 : Found at the the AA level & the DNA level
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
883
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
884 (Also remember that if the FWR3 region couldn't be identified in the sequence there is a 4th option: not tested; this routine isn't called therefore)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
885
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
886 The Start and Ends returned are from the first sucessful match (MotifFound==3): though hopefully they are the same.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
887 Formally the test order is:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
888
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
889 1) DNA
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
890 2) AA
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
891
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
892 i.e. DNA bp locations have priority.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
893
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
894 Technically the locations are determined by a regex match then the $+[0] array (i.e. the end of the pattern match).
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
895 See pages like this: http://stackoverflow.com/questions/87380/how-can-i-find-the-location-of-a-regex-match-in-perl for an explanation.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
896
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
897 =head3 Manipulation of AA patternsMOTIF_FOUND_IN_AA
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
898
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
899 Note that patterns are assumed to require white space inserting in them between the letters.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
900 This could be a serious limitation
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
901
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
902
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
903 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
904
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
905 #Get the parameters passed:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
906 my ($DNA, $AA, $FWR3_End, $FWR1_Start, $DNAPat, $AAPat) = @_;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
907 print "D: markUpCDR3: Passed Parameters '$FWR3_End, $FWR1_Start, $DNAPat, $AAPat' (& AA & DNA sequence)\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
908
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
909
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
910 #Setup our return values:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
911 my $Start = 0; my $End =0; my $MotifFound = 0;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
912 my $How; #Literally How the motif was found (or not if blank)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
913
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
914
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
915 =head4 Prepare the sequences and the patterns for use
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
916
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
917 Specifically: trim off the start of the AA & DNA string already allocated to other CDRs or FWRs
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
918
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
919 Add in spaces into the AA regex pattern because we can't get regex-ex freespacing mode i.e. "$Var =~ m/$AAPat/x" working.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
920
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
921
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
922 We take the "-1" as the CropPoint position to include the previous 3 nucleotides / AAs; remember to add this back on
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
923 in position calculations.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
924
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
925
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
926 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
927
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
928 #Because igBLAST doesn't always report from the start of the read (primers and things are upstream):
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
929
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
930 my $CropPoint = $FWR3_End - $FWR1_Start - 1 ;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
931 #print "D: markUpCDR3: Crop point is: '$CropPoint'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
932
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
933 #print "D: markUpCDR3: Cropping point is: '$CropPoint' characters from start\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
934 #We trim off the parts we expect to find the CDR3 motifs in leaving at extra 3nts on to allow for base miss-calling:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
935
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
936 my $AA_Trimmed = substr ($AA, $CropPoint);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
937 my $DNA_Trimmed = substr ($DNA ,$CropPoint);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
938 #print "D: markUpCDR3: AA = '$AA' (untrimmed)\nD: markUpCDR3: TR = '$AA_Trimmed' (Trimmed) ", length ($AA_Trimmed)," nts long\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
939 #print "D: markUpCDR3: Testing: AA = '$AA_Trimmed', DNA = '$DNA_Trimmed'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
940
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
941 #This lovely hack is to account for the spaces in the AA sequence and we can't get the "$Var =~ m/$AAPat/x" working
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
942 my $AAPat_Spaced;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
943 foreach my $C_Char (0..length($AAPat)-1) #The -1 is because we don't want trailing spaces until the next nt -> AA translation.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
944 { $AAPat_Spaced = $AAPat_Spaced.'\s+'.substr ($AAPat,$C_Char,1); }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
945 #And write this back into the main pattern we were passed:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
946 $AAPat = $AAPat_Spaced;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
947
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
948 #temp hack:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
949 #$AA_Trimmed = $AA;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
950 my $MotifFound=0; #So we can record which patterns we found
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
951 my $MotifPositionDNA =-1;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
952 my $MotifPositionAA =-1;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
953
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
954 #print "D: markUpCDR3: Pattern: '$AAPat_Spaced'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
955 =head4 At DNA level: "TGG GGx xxx GGx" [+1]
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
956
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
957 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
958
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
959 #print "D: markUpCDR3: '$DNA_Trimmed' (Trimmed DNA string)\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
960
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
961 if ($DNA_Trimmed =~ m/$DNAPat/)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
962 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
963 $MotifPositionDNA = $+[0]; #Just the easiest way to do this in Perl
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
964 # print "D: markUpCDR3:: Found Motif match on DNA at bp: '$MotifPositionDNA'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
965 $MotifFound = $MotifFound + 1;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
966 #Any more matches further on?
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
967 my $LaterString = substr ($DNA_Trimmed, $MotifPositionDNA);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
968 # print "D: markUpCDR3: '$AA_Trimmed' (AA Trimmed string)\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
969 # print "D: markUpCDR3: '", substr ($DNA_Trimmed,0, $MotifPositionDNA)," (DNA until pattern match string)\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
970 # print "D: markUpCDR3: '$DNA_Trimmed' (Trimmed DNA string)\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
971 # print "D: markUpCDR3: '$LaterString' (Later part of DNA string)\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
972 if ($LaterString =~ m/$DNAPat/)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
973 { print "D: markUPCDR3: Also got a match further down the DNA String: at ", $-[0] ," to ", $+ [0], " - which might be worrying\n"; }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
974 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
975
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
976 =head4 At AA level: "WGxG" [+2]
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
977
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
978 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
979
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
980 if ($AA_Trimmed=~ m/$AAPat/)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
981 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
982 $MotifPositionAA = $+[0]; #Just the easiest way to do this in Perl
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
983 $MotifFound = $MotifFound + 2;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
984 # print "D: markUpCDR3: Found Motif match on AA at position (on DNA remember): '$MotifPositionAA' (ie.)\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
985 (my $CDR3_seq) = substr ($AA_Trimmed, 0, $MotifPositionAA);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
986 # print "D: markUpCDR3: Seq ='$CDR3_seq' - as detected\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
987
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
988 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
989
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
990 =head4 Assess the results of motif position finding
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
991
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
992 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
993
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
994 #print "D: markUpCDR3: MotifFound = '$MotifFound'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
995
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
996 if ($MotifFound ==0)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
997 { return ($Start, $End, $MotifFound); } #The easy one really: return we didn't find the CDR3
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
998
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
999 #
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1000 $Start = $FWR3_End; #We assume the end of the FWR3 is the start of CDR3:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1001 #Just found in DNA:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1002 if ($MotifFound ==1)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1003 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1004 $Start = $FWR3_End; #We assume the end of the FWR3 is the start of CDR3:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1005 $End = $MotifPositionDNA;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1006 $How = "MOTIF_FOUND_IN_DNA";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1007 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1008 #Just found in AA:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1009 if ($MotifFound ==2)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1010 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1011 $End = $MotifPositionAA;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1012 $How = "MOTIF_FOUND_IN_AA";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1013 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1014
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1015 #Found in both, DNA has priority:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1016 if ($MotifFound ==3)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1017 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1018 $Start = $FWR3_End ; #We assume the end of the FWR3 is the start of CDR3:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1019 $End = $MotifPositionDNA;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1020 $How = "MOTIF_FOUND_IN_BOTH";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1021 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1022
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1023 #print "D: markUpCDR3: Motif found = $MotifFound\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1024
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1025 =head4 These next few lines are for testing / diagnostics only - disable for general use
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1026
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1027 If you are interested in getting the CDR3 directly then remember the main coordinate system is defined such that
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1028 the start of FWR1 is unlikely to be at nt 1.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1029
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1030 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1031
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1032 $Start = $FWR3_End - $FWR1_Start -1;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1033 $End = $End + $CropPoint;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1034 my $CDR3_RegionLength = $End - $Start;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1035 #print "D: markUpCDR3: CDR3 Length= $Start - $End = '$CDR3_RegionLength'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1036 (my $CDR3_seq) = substr ($AA, $Start, $CDR3_RegionLength);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1037
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1038 #Add onto the coordinates what we trimmed off:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1039
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1040
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1041 #print "D: markUpCDR3: Seq ='$CDR3_seq'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1042
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1043 print "D: markUpCDR3: returning: $Start, $End, $How, ($MotifFound) [NB: offset of :'+ $FWR1_Start'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1044 #die "HIT BLOCK\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1045 return ($Start + $FWR1_Start, $End + $FWR1_Start, $How);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1046 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1047
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1048
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1049 sub printOUTPUTData {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1050 =head2 sub: $OutputDataString = printOUTPUTData {\%OutputData}
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1051
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1052 When passed an array containing the appropriate CDR, Top V / D/ J genes and the seqeunce ID.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1053 This prepared and then returned as a text string that can then be printed to STDOUT:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1054
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1055 print (printOUTPUTData (\%OutputData));
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1056
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1057 Any missing data in the Hash array it polietly ignored and a null string printed in place.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1058 The text field is tab delimited; there are no extra trailing tabs or carriage returns in place.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1059
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1060 Actually the fields printed out are stored in an index array.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1061
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1062 =head3 Header output
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1063
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1064 If the routine is passed a key 'HEADER' then the header columns are returned as that string.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1065 This is tested first - so don't add this unless you mean to.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1066
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1067 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1068
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1069 my @HeaderFields = ("ID", "VDJ Frame", "Top V Gene", "Top D Gene", "Top J Gene",
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1070 "CDR1 Seq", "CDR1 Length",
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1071 "CDR2 Seq", "CDR2 Length",
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1072 "CDR3 Seq", "CDR3 Length", "CDR3 Seq DNA", "CDR3 Length DNA", "Strand",
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1073 "CDR3 Found How");
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1074
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1075 my $OutputString = "OUTPUT:"; #What we are going to build the output into.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1076
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1077 =head4 Print Header & Exit?
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1078
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1079 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1080
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1081 my ($Data_ref) = @_;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1082 #print "D: printOUTPUTData: Running\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1083
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1084 if (exists $$Data_ref {"HEADER"})
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1085 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1086 $OutputString .= "\t";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1087 for(my $n = 0; $n <= $#HeaderFields; $n++)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1088 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1089 $OutputString .= $HeaderFields[$n];
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1090 $OutputString .= "\t" if($n < $#HeaderFields);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1091 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1092
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1093 # foreach my $C_Header (@HeaderFields)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1094 # { $OutputString .= "$C_Header"; } #
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1095
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1096 print "D: printOUTPUTData: HEADER Printout requested '@HeaderFields'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1097 return ($OutputString);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1098 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1099
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1100 =head3 Assemble whatever data we have - and tab delimit the null fields
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1101
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1102 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1103 #print "D: printOUTPUTData: Will pretty print this:\n", Dumper $Data_ref;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1104 foreach my $C_Header (@HeaderFields)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1105 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1106
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1107 if (exists ($$Data_ref {$C_Header}))
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1108 { $OutputString .= "\t". $$Data_ref{$C_Header}; } #We have data to print out
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1109 else
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1110 { $OutputString .="\t"; } #Add a trailing space
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1111 } #
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1112
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1113 return ($OutputString);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1114 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1115
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1116
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1117 ######################################### Code Junk ########################
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1118
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1119
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1120 =head2 Code Junk Attic
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1121
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1122 =head3 Demonstrates how to reverse translate an amino acid sequence into DNA:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1123
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1124 use Bio::Tools::CodonTable;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1125 use Bio::Seq;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1126
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1127 # print possible codon tables
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1128 my $tables = Bio::Tools::CodonTable->tables;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1129 while ( (my $id, my $name) = each %{$tables} ) {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1130 print "$id = $name\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1131 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1132 my $CodonTable = Bio::Tools::CodonTable->new();
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1133
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1134 my $ExampleSeq = Bio::PrimarySeq->new(-seq=>"WGxG", -alphabet => 'protein') or die "Cannot create sequence object\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1135
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1136
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1137 my $rvSeq = $CodonTable->reverse_translate_all($ExampleSeq);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1138 print "D: '$rvSeq'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1139 die "TEST OVER\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1140
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1141 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1142
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1143
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1144 =head3 For processing the 'Alignment lines' section of the alginment table
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1145
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1146 #If we are ever interested; then enable the code below:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1147 # print ": Alignment\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1148 # $InfoPanel =~ s/^ +//; $InfoPanel =~ s/ +$//; #Clean off trailing spaces
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1149 # my ($Germclass, $PID, $PID_Counts, $Allele) = split (/\s+/,$InfoPanel); #Split on spaces
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1150 ##Enable if you need to know what we just found:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1151 # #print "D: Fields are (Germclass, PID, PID_Counts, Allele) \t$Germclass, $PID, $PID_Counts, $Allele\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1152 # #A reality check: we should have an Allele - or some text here.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1153 # unless (defined $Allele && $Allele ne "")
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1154 # { warn "Cannot get Allele for Line '$C_Line' - implies improper parsing: '",substr ($Lines[$C_Line],0,15),"...'\n"; }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1155 # if (exists ($Alginments {$Germclass}{$Allele}))
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1156 # { $Alginments {$Germclass}{$Allele} = $Alginments {$Germclass}{$Allele}.$CurrentAASequence; } #Carry on adding
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1157 # else #more work needed as we need to 'pad' the sequence with fake gap characters)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1158 # {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1159 ##Do we still need this padding? I don't think so
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1160 #
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1161 #
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1162 # my $PaddingChars = ($ThisQueryStart-$Query_Start);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1163 # print "D: New gene found: need to pad it with ($ThisQueryStart-$Query_Start) i.e. '$PaddingChars' characters\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1164 # #To help testing, calculate this first:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1165 # my $PaddingString = " "x $PaddingChars;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1166 # $Alginments {$Germclass}{$Allele} = $CurrentAASequence;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1167 # }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1168 # next
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1169
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1170 =head3 Demonstration of Pattern match positions
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1171
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1172 my $Text = "12345TTT TTAAAAA";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1173 my $TestPat = "TTT\\s+TT";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1174 (my $Result)= $Text =~ m/$TestPat/;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1175 print "D: Two vars are: - = ",$-[0], " & + =", $+[0]," for test pattern '$TestPat'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1176
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1177 sub printCDR3 {
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1178
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1179 =head3 Subroutine: printCDR3 ($CDR3_Start, $CDR3_End, "SUMMARY_TABLE", $AAQuerySequence, $DNAQuerySequence);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1180
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1181 ???? IS THIS FUNCTION IN USE ?????
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1182
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1183 Handles the printing of the output when passed information about the CDR3 region.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1184
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1185
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1186 The result is sent returned as a text string in this version hence use it like this if you want to send it to STDOUT:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1187
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1188 print printCDR3 ($CDR3_Start, $CDR3_End, "SUMMARY_TABLE", $AAQuerySequence, $DNAQuerySequence), "\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1189
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1190 #=cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1191
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1192 #Despite the similarity in names, these are all local copies passed to us:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1193
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1194 my ($Start, $End, $Tag, $FullAAQuerySequence, $FullDNAQuerySequence) = @_;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1195
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1196 #For DNA:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1197 my ($CDR_DNA_Seq) = substr ($FullDNAQuerySequence, $Start, $Start+$End);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1198 my ($CDR_DNA_Length) = length ($CDR_DNA_Seq);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1199
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1200 #For AA:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1201 my ($CDR_AA_Seq) = substr ($FullAAQuerySequence, $Start, $Start+$End);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1202 my ($CDR_AA_Length) = length ($CDR_AA_Seq);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1203
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1204 my $ReturnString = join ("\t", $CDR_DNA_Seq, $CDR_DNA_Length, $CDR_AA_Seq, $CDR_AA_Length, $Tag); #Create here so we can inspect it / post process it if needed:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1205 print "D: SUB: printCDR3: As returned: '$ReturnString'\n";
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1206 return ($ReturnString);
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1207
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1208 }
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1209
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1210 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1211
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1212
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1213
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1214 =head2 Change Log
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1215
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1216 =head3 Version 1.2
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1217
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1218 1) Fixed the 'Process recrod request' feature' [was failed increment in $Record]
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1219 2) Deleted / Deactivated the function 'printCDR3' [wasn't in used; kept if useful for parts].
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1220 This function is replaced by the more general printOUTPUTData()
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1221 3) A tag for the CDR3 status is now output for every record / read.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1222 Initially this is set to "NOT_FOUND" and changed if evidence for the CDR3 is found.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1223
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1224 =head4 Version 1.3
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1225
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1226 1) The tophit line was split on whitespace, however sometimes the VJFrame is something like “In-frame with stop codon”,
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1227 which means the line is also split on the spaces therein. It now splits on tabs only, and this seems to work properly.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1228 - found by Bas Horsman.
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1229
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1230 =head4 Version 1.3a
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1231
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1232 1) "MOTIF_FOUND_IN_AA" reported correctly (was impossible previously due to addition error to the $MotifFound var (never could == 3)
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1233
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1234 =cut
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1235
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1236 =head4 Version 1.4
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1237
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1238 1) Now processes files using Mac/Unix/MS-DOS newline characters:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1239
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1240 $_ =~ s/\r\n/\n/g; #In case line ends are MS-DOS
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1241 $_ =~ s/\r/\n/g; #In case line ends are Mac
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1242 #The whole record - one per read - is now stored in $_
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1243 my @Lines =split (/\R/,$_); #Split on new lines
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1244
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1245 =head4 Version 1.4a
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1246
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1247 1) Fixed the length of the CDR3 AA string being reported correctly:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1248
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1249 $OUTPUT_Data{"CDR3 Length"} = $CDR3_Length;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1250 to:
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1251 $OUTPUT_Data{"CDR3 Length"} = $CDR3_Seq_AA_Length;
3cce6e04b1e8 Uploaded
davidvanzessen
parents:
diff changeset
1252