| 
0
 | 
     1 #!/usr/bin/perl
 | 
| 
 | 
     2 =head1 IGBLAST_simple.pl
 | 
| 
 | 
     3 
 | 
| 
 | 
     4 This version (1.4) has been heavily adapted since the original program was first created back in October 2012.
 | 
| 
 | 
     5 Bas Horsman (EMC, Rotterdam, The Netherlands) has contributed with minor - though important - code changes.
 | 
| 
 | 
     6 
 | 
| 
 | 
     7 From V 1.2 onwards a 'Change Log' is included at the end of the program
 | 
| 
 | 
     8 
 | 
| 
 | 
     9 =head2 Usage
 | 
| 
 | 
    10 
 | 
| 
 | 
    11 Requires no modules in general use; the Data::Dumper (supplied as part of the Perl Core module set) might be useful for debugging/adjustment 
 | 
| 
 | 
    12 as it allows inspection of the data stores.
 | 
| 
 | 
    13 
 | 
| 
 | 
    14 The program takes a text file of the 
 | 
| 
 | 
    15 
 | 
| 
 | 
    16  ./IGBLAST_simple.pl igBLASTOutput.txt <-optional: index of record to process->
 | 
| 
 | 
    17  
 | 
| 
 | 
    18 Supply the text version of the igBLAST report in the format as in the example below.
 | 
| 
 | 
    19 The extra command line arugment is the record number (aka. BLAST report) to process.  
 | 
| 
 | 
    20 If 0 or absent all are processed, if supplied that record (base 1) is processed and the program dies afterwards.  
 | 
| 
 | 
    21 
 | 
| 
 | 
    22 =head2 Example Input
 | 
| 
 | 
    23 
 | 
| 
 | 
    24 A standard igBLAST record or set of them in a file; this being typical:
 | 
| 
 | 
    25 
 | 
| 
 | 
    26  BLASTN 2.2.27+
 | 
| 
 | 
    27 
 | 
| 
 | 
    28 
 | 
| 
 | 
    29 Reference: Stephen F. Altschul, Thomas L. Madden, Alejandro A.
 | 
| 
 | 
    30 Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J.
 | 
| 
 | 
    31 Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of
 | 
| 
 | 
    32 protein database search programs", Nucleic Acids Res. 25:3389-3402.
 | 
| 
 | 
    33 
 | 
| 
 | 
    34 
 | 
| 
 | 
    35 
 | 
| 
 | 
    36 Database: human_gl_V; human_gl_D; human_gl_J
 | 
| 
 | 
    37            674 sequences; 179,480 total letters
 | 
| 
 | 
    38 
 | 
| 
 | 
    39 
 | 
| 
 | 
    40 
 | 
| 
 | 
    41 Query= HL67IUI01D26LR length=433 xy=1559_1437 region=1
 | 
| 
 | 
    42 run=R_2012_04_10_11_57_56_
 | 
| 
 | 
    43 
 | 
| 
 | 
    44 Length=433
 | 
| 
 | 
    45                                                                       Score     E
 | 
| 
 | 
    46 Sequences producing significant alignments:                          (Bits)  Value
 | 
| 
 | 
    47 
 | 
| 
 | 
    48 lcl|IGHV3-30*04                                                        330    2e-92
 | 
| 
 | 
    49 lcl|IGHV3-30-3*01                                                      330    2e-92
 | 
| 
 | 
    50 lcl|IGHV3-30*01                                                        327    2e-91
 | 
| 
 | 
    51 lcl|IGHD3-16*01                                                       14.4       11
 | 
| 
 | 
    52 lcl|IGHD3-16*02                                                       14.4       11
 | 
| 
 | 
    53 lcl|IGHD1-14*01                                                       12.4       43
 | 
| 
 | 
    54 lcl|IGHJ4*02                                                          78.3    1e-18
 | 
| 
 | 
    55 lcl|IGHJ5*02                                                          70.3    4e-16
 | 
| 
 | 
    56 lcl|IGHJ4*01                                                          68.3    2e-15
 | 
| 
 | 
    57 
 | 
| 
 | 
    58 
 | 
| 
 | 
    59 Domain classification requested: imgt
 | 
| 
 | 
    60 
 | 
| 
 | 
    61 
 | 
| 
 | 
    62 V(D)J rearrangement summary for query sequence (Top V gene match, Top D gene match, Top J gene match, Chain type, V-J Frame, Strand):
 | 
| 
 | 
    63 IGHV3-30*04	IGHD3-16*01	IGHJ4*02	VH	In-frame	+
 | 
| 
 | 
    64 
 | 
| 
 | 
    65 V(D)J junction details (V end, V-D junction, D region, D-J junction, J start).  Note that possible overlapping nucleotides at VDJ junction (i.e, nucleotides that could be assigned to either joining gene segment) are indicated in parentheses (i.e., (TACT)) but are not included under V, D, or J gene itself
 | 
| 
 | 
    66 AGAGA	TATGAGCCCCATCATGACA	ACGTTTG	CCGGAA	ACTAC	
 | 
| 
 | 
    67 
 | 
| 
 | 
    68 Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
 | 
| 
 | 
    69 FWR1	27	38	12	11	1	0	91.7
 | 
| 
 | 
    70 CDR1	39	62	24	22	2	0	91.7
 | 
| 
 | 
    71 FWR2	63	113	51	50	1	0	98
 | 
| 
 | 
    72 CDR2	114	137	24	23	1	0	95.8
 | 
| 
 | 
    73 FWR3	138	251	114	109	5	0	95.6
 | 
| 
 | 
    74 CDR3 (V region only)	252	259	8	7	1	0	87.5
 | 
| 
 | 
    75 Total	N/A	N/A	233	222	11	0	95.3
 | 
| 
 | 
    76 
 | 
| 
 | 
    77 
 | 
| 
 | 
    78 Alignments
 | 
| 
 | 
    79 
 | 
| 
 | 
    80                                         <----FWR1--><----------CDR1--------><-----------------------FWR2------
 | 
| 
 | 
    81                                          W  A  A  S  G  F  T  F  N  T  Y  A  V  H  W  V  R  Q  A  P  G  K  G  
 | 
| 
 | 
    82                     Query_1        27   TGGGCAGCCTCTGGATTCACCTTCAATACCTATGCTGTGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGC  96
 | 
| 
 | 
    83 V  95.3% (222/233)  IGHV3-30*04    64   ..T......................G..G.......A.................................  133
 | 
| 
 | 
    84                                          C  A  A  S  G  F  T  F  S  S  Y  A  M  H  W  V  R  Q  A  P  G  K  G  
 | 
| 
 | 
    85 V  95.7% (221/231)  IGHV3-30-3*01  64   ..T......................G..G.......A.................................  133
 | 
| 
 | 
    86 V  94.8% (221/233)  IGHV3-30*01    64   ..T......................G..G.......A.................................  133
 | 
| 
 | 
    87 
 | 
| 
 | 
    88                                         ----------------><----------CDR2--------><----------------------------
 | 
| 
 | 
    89                                         L  E  W  V  A  V  I  S  Y  D  G  S  N  K  N  Y  A  D  S  V  K  G  R  F
 | 
| 
 | 
    90                     Query_1        97   TGGAGTGGGTGGCAGTTATATCATATGATGGAAGCAATAAAAACTACGCAGACTCCGTGAAGGGCCGATT  166
 | 
| 
 | 
    91 V  95.3% (222/233)  IGHV3-30*04    134  ..................................T......T............................  203
 | 
| 
 | 
    92                                         L  E  W  V  A  V  I  S  Y  D  G  S  N  K  Y  Y  A  D  S  V  K  G  R  F
 | 
| 
 | 
    93 V  95.7% (221/231)  IGHV3-30-3*01  134  .........................................T............................  203
 | 
| 
 | 
    94 V  94.8% (221/233)  IGHV3-30*01    134  .A................................T......T............................  203
 | 
| 
 | 
    95 
 | 
| 
 | 
    96                                         ---------------------------FWR3---------------------------------------
 | 
| 
 | 
    97                                           T  I  S  R  D  N  S  K  N  T  L  Y  L  Q  M  N  S  L  R  V  E  D  T 
 | 
| 
 | 
    98                     Query_1        167  CACCATCTCCAGAGACAATTCCAAGAACACGTTATATCTGCAAATGAACAGCCTGAGAGTTGAGGACACG  236
 | 
| 
 | 
    99 V  95.3% (222/233)  IGHV3-30*04    204  ...............................C.G.........................C..........  273
 | 
| 
 | 
   100                                           T  I  S  R  D  N  S  K  N  T  L  Y  L  Q  M  N  S  L  R  A  E  D  T 
 | 
| 
 | 
   101 V  95.7% (221/231)  IGHV3-30-3*01  204  ...............................C.G.........................C..........  273
 | 
| 
 | 
   102 V  94.8% (221/233)  IGHV3-30*01    204  ...............................C.G.........................C..........  273
 | 
| 
 | 
   103 
 | 
| 
 | 
   104                                         -------------->                                                       
 | 
| 
 | 
   105                                          A  V  Y  Y  C  T  R  D  M  S  P  I  M  T  T  F  A  G  N  Y  W  G  Q  
 | 
| 
 | 
   106                     Query_1        237  GCTGTTTATTACTGTACGAGAGATATGAGCCCCATCATGACAACGTTTGCCGGAAACTACTGGGGCCAGG  306
 | 
| 
 | 
   107 V  95.3% (222/233)  IGHV3-30*04    274  .....G.........G.......-----------------------------------------------  296
 | 
| 
 | 
   108                                          A  V  Y  Y  C  A  R                                                  
 | 
| 
 | 
   109 V  95.7% (221/231)  IGHV3-30-3*01  274  .....G.........G.....-------------------------------------------------  294
 | 
| 
 | 
   110 V  94.8% (221/233)  IGHV3-30*01    274  .....G.........G.......-----------------------------------------------  296
 | 
| 
 | 
   111 D  100.0% (7/7)     IGHD3-16*01    12   ------------------------------------------.......---------------------  18
 | 
| 
 | 
   112 D  100.0% (7/7)     IGHD3-16*02    12   ------------------------------------------.......---------------------  18
 | 
| 
 | 
   113 D  100.0% (6/6)     IGHD1-14*01    8    -------------------------------------------------......---------------  13
 | 
| 
 | 
   114 J  100.0% (39/39)   IGHJ4*02       10   -------------------------------------------------------...............  24
 | 
| 
 | 
   115 J  100.0% (35/35)   IGHJ5*02       17   -----------------------------------------------------------...........  27
 | 
| 
 | 
   116 J  97.4% (38/39)    IGHJ4*01       10   -------------------------------------------------------.............A.  24
 | 
| 
 | 
   117 
 | 
| 
 | 
   118                                                                 
 | 
| 
 | 
   119                                         G  T  L  V  T  V  S  S  
 | 
| 
 | 
   120                     Query_1        307  GAACCCTGGTCACCGTCTCCTCAG  330
 | 
| 
 | 
   121 J  100.0% (39/39)   IGHJ4*02       25   ........................  48
 | 
| 
 | 
   122 J  100.0% (35/35)   IGHJ5*02       28   ........................  51
 | 
| 
 | 
   123 J  97.4% (38/39)    IGHJ4*01       25   ........................  48
 | 
| 
 | 
   124 
 | 
| 
 | 
   125 
 | 
| 
 | 
   126 Lambda      K        H
 | 
| 
 | 
   127     1.10    0.333    0.549 
 | 
| 
 | 
   128 
 | 
| 
 | 
   129 Gapped
 | 
| 
 | 
   130 Lambda      K        H
 | 
| 
 | 
   131     1.08    0.280    0.540 
 | 
| 
 | 
   132 
 | 
| 
 | 
   133 Effective search space used: 64847385
 | 
| 
 | 
   134 
 | 
| 
 | 
   135 
 | 
| 
 | 
   136 Query= HL67IUI01EQMLY length=609 xy=1826_1636 region=1
 | 
| 
 | 
   137 run=R_2012_04_10_11_57_56_
 | 
| 
 | 
   138 
 | 
| 
 | 
   139 
 | 
| 
 | 
   140 ...etc...
 | 
| 
 | 
   141 
 | 
| 
 | 
   142 =head2 Example Output
 | 
| 
 | 
   143 
 | 
| 
 | 
   144 
 | 
| 
 | 
   145 Example output from the data above sent:
 | 
| 
 | 
   146  $ ./IGBLAST_simple.pl igBLASTOutput.txt 1
 | 
| 
 | 
   147  D: Request to process just record '1' received
 | 
| 
 | 
   148  D: printOUTPUTData: Running
 | 
| 
 | 
   149  D: printOUTPUTData: HEADER Printout requested 'ID VDJ Frame Top V Gene Top D Gene Top J Gene CDR1 Seq CDR1 Length CDR2 Seq CDR2 Length CDR3 Seq CDR3 Length CDR3 Found How'
 | 
| 
 | 
   150  OUTPUT: # ID    VDJ Frame       Top V Gene      Top D Gene      Top J Gene      CDR1 Seq        CDR1 Length     CDR2 Seq        CDR2 Length     CDR3 Seq        CDR3 Length     CDR3 Found How
 | 
| 
 | 
   151  D: ID is: 'HL67IUI01D26LR'
 | 
| 
 | 
   152  D: Minimum base marked-up (27) - aka. $AlignmentStart; maximum: (259)
 | 
| 
 | 
   153  D: Starting Search for CDR3
 | 
| 
 | 
   154  D: markUpCDR3: Passed Parameters '251, 27, TGGGG....GG., WG.G' (& AA & DNA sequence)
 | 
| 
 | 
   155  D: markUpCDR3: returning: 223, 282, MOTIF_FOUND_IN_BOTH, (3) [NB: offset of :'+ 27'
 | 
| 
 | 
   156  D: CDR3 was found by pattern matching: 'MOTIF_FOUND_IN_BOTH' (250, 309)
 | 
| 
 | 
   157  D: Top Hits (raw)= 'IGHV3-30*04 IGHD3-16*01     IGHJ4*02        VH      In-frame        +' 
 | 
| 
 | 
   158  D: Top Hits (parsed)= 'IGHV3-30*04, IGHD3-16*01, IGHJ4*02, VH, In-frame, +'
 | 
| 
 | 
   159  D: printOUTPUTData: Running
 | 
| 
 | 
   160  OUTPUT: HL67IUI01D26LR  In-frame        IGHV3-30*04     IGHD3-16*01     IGHJ4*02        GFTFNTYA        23      ISYDGSNK        23      CTRDMSPIMTTFAGNYWGQG    59      MOTIF_FOUND_IN_BOTH
 | 
| 
 | 
   161 
 | 
| 
 | 
   162 =head4 Usage notes:
 | 
| 
 | 
   163 
 | 
| 
 | 
   164 Designed to be easy to "grep -v D:" or "grep OUTPUT:" for to select the parts you need:
 | 
| 
 | 
   165 
 | 
| 
 | 
   166  ./IGBLAST_simple.pl igBLASTOutput.txt 1 | grep OUTPUT:
 | 
| 
 | 
   167  
 | 
| 
 | 
   168  OUTPUT: # ID    VDJ Frame       Top V Gene      Top D Gene      Top J Gene      CDR1 Seq        CDR1 Length     CDR2 Seq        CDR2 Length     CDR3 Seq        CDR3 Length     CDR3 Found How
 | 
| 
 | 
   169  OUTPUT: HL67IUI01D26LR  In-frame        IGHV3-30*04     IGHD3-16*01     IGHJ4*02        GFTFNTYA        23      ISYDGSNK        23      CTRDMSPIMTTFAGNYWGQG    59      MOTIF_FOUND_IN_BOTH
 | 
| 
 | 
   170  OUTPUT: HL67IUI01EQMLY  In-frame        IGHV4-39*01     IGHD2-8*01      IGHJ3*02        GGSISSSSYY      29      IYHSGST 20      CARDATYYSNGFDIWGQG      53      MOTIF_FOUND_IN_BOTH
 | 
| 
 | 
   171  OUTPUT: HL67IUI01CDCLP  Out-of-frame    IGHV3-23*01     IGHD3-3*01      IGHJ4*02        FSNYAM  16      SGSGDRTY        23      AKAD*FLEWLFRIGDGERLLGPGN        72      MOTIF_FOUND_IN_DNA
 | 
| 
 | 
   172  OUTPUT: HL67IUI01AHRNH  N/A     IGHV3-33*01     N/A     N/A     WIHLQ*LW        23      YGMMEVI 23                      NOT_FOUND
 | 
| 
 | 
   173  OUTPUT: HL67IUI01DZZ1V  Out-of-frame    IGHV3-23*01     IGHD5-12*01     IGHJ4*02        GFTFDKYA        23      ILASG   20      LYCASEGDIVASELLSTGARV   62      MOTIF_FOUND_IN_DNA
 | 
| 
 | 
   174  OUTPUT: HL67IUI01DTR2Y  Out-of-frame    IGHV3-23*01     IGHD5-12*01     IGHJ4*02        LDSPLTNM        23      LYLPVV  20      TVRVRGT*WLRSF*VLGPG     59      MOTIF_FOUND_IN_DNA
 | 
| 
 | 
   175  OUTPUT: HL67IUI01EQL3S  In-frame        IGHV7-4-1*02    IGHD6-19*01     IGHJ6*02        GYTFRTFT        23      INTNTGTP        23      CAKESGTGSAHFFYGMDVWGQG  65      MOTIF_FOUND_IN_BOTH
 | 
| 
 | 
   176  OUTPUT: HL67IUI01AFG46  In-frame        IGLV2-34*01     N/A     IGHJ4*02                                                        NOT_FOUND
 | 
| 
 | 
   177  OUTPUT: HL67IUI01EFFKO  In-frame        IGHV3-11*01     IGHD6-6*01      IGHJ4*02        GFTFSDYY        23      ISYSGGTI        23      CARASGAARHRPLDYWGQG     56      MOTIF_FOUND_IN_BOTH
 | 
| 
 | 
   178  OUTPUT: HL67IUI01B18SG  In-frame        IGHV3-33*01     IGHD5-12*01     IGHJ4*02        VRQA    11      KYYANSVK        23      RLGGFDYWGQGTLVTVSS      53      MOTIF_FOUND_IN_BOTH
 | 
| 
 | 
   179  OUTPUT: HL67IUI01D6LER  In-frame        IGHV1-24*01     IGHD3-22*01     IGHJ4*02        GYSLNELS        23      PDPEDDE 23      TVQPSRITMMAVVITRIHWGASGARE      76      MOTIF_FOUND_IN_DNA
 | 
| 
 | 
   180  OUTPUT: HL67IUI01CYCLF  N/A     IGHV4-39*01     N/A     N/A     GGSISSSSYY      29      IYYSGST 20                      NOT_FOUND
 | 
| 
 | 
   181  OUTPUT: HL67IUI01B4LEE  In-frame        IGHV7-4-1*02    IGHD6-19*01     IGHJ6*02        GYTFRTFT        23      INTNTGTP        23      CAKESGTGSAHFFYGMDVWGQG  65      MOTIF_FOUND_IN_BOTH
 | 
| 
 | 
   182  OUTPUT: HL67IUI01A4KW4  Out-of-frame    IGHV3-23*01     IGHD5-12*01     IGHJ4*02        LDSPLTNM        23      LYLPVV  20      TVRVRGT*WLRSF*IWGQG     58      MOTIF_FOUND_IN_BOTH
 | 
| 
 | 
   183  OUTPUT: HL67IUI01E05BV  In-frame        IGHV1-24*01     IGHD3-22*01     IGHJ2*01        GYSLNELS        23      PDPEDDE 23                      NOT_FOUND
 | 
| 
 | 
   184  OUTPUT: HL67IUI01CVVKY  In-frame        IGHV1-3*01      IGHD2-15*01     IGHJ1*01                                                        NOT_FOUND
 | 
| 
 | 
   185  OUTPUT: HL67IUI01CN5P2  In-frame        IGHV7-4-1*02    IGHD2-21*02     IGHJ5*02        GYSITDYG        23      LNTRTGNP        23      CAVKDARDFVSWGQG 44      MOTIF_FOUND_IN_BOTH
 | 
| 
 | 
   186  OUTPUT: HL67IUI01DUUJ5  In-frame        IGHV3-21*01     IGHD1-7*01      IGHJ4*02        GYTFSTYS        23      ISSSSAYR        23      CARDIRLELRDWGQG 44      MOTIF_FOUND_IN_BOTH
 | 
| 
 | 
   187  OUTPUT: HL67IUI01E1AIR  Out-of-frame    IGHV4-39*01     N/A     IGHJ3*01        WGLHRRW**L      29      FVS*RAPR        23                      NOT_FOUND
 | 
| 
 | 
   188  OUTPUT: HL67IUI01CCZ8D  Out-of-frame    IGHV3-23*01     IGHD5-12*01     IGHJ4*02        GFTFDKYA        23      ILASGR  20      YCASEGDIVASELLSTGARE    58      MOTIF_FOUND_IN_DNA
 | 
| 
 | 
   189  OUTPUT: HL67IUI01BT9IR  N/A     IGHV3-21*02     N/A     N/A                                                     NOT_FOUND
 | 
| 
 | 
   190  OUTPUT: HL67IUI01COTO0  Out-of-frame    IGHV4-39*01     N/A     IGHJ3*01        GGFIGGGDNF      29      LYHDGRPA        23                      NOT_FOUND
 | 
| 
 | 
   191  OUTPUT: HL67IUI01D994O  In-frame        IGHV7-4-1*02    IGHD2-21*02     IGHJ5*02        GYSITDYG        23      LNTRTGNP        23      CAVKDARDFVSWGQG 44      MOTIF_FOUND_IN_BOTH
 | 
| 
 | 
   192  OUTPUT: HL67IUI01A08CJ  In-frame        IGHV4-39*01     IGHD6-13*01     IGHJ5*02        GGSISSSSYY      29      IYYTWEH 21      CERARRGSSWGQLVRPLGPG    62      MOTIF_FOUN
 | 
| 
 | 
   193  
 | 
| 
 | 
   194  
 | 
| 
 | 
   195  
 | 
| 
 | 
   196  OUTPUT: # ID    VDJ Frame       Top V Gene      Top D Gene      Top J Gene      CDR1 Seq        CDR1 Length     CDR2 Seq        CDR2 Length     CDR3 Seq        CDR3 Length     CDR3 Found How
 | 
| 
 | 
   197  OUTPUT: HL67IUI01D26LR  In-frame        IGHV3-30*04     IGHD3-16*01     IGHJ4*02        GFTFNTYA        23      ISYDGSNK        23      CTRDMSPIMTTFAGNYWGQG    59      MOTIF_FOUND_IN_BOTH
 | 
| 
 | 
   198  ...etc...
 | 
| 
 | 
   199 
 | 
| 
 | 
   200 =head4 Also, combined grep & sed:  
 | 
| 
 | 
   201 
 | 
| 
 | 
   202  $ ./IGBLAST_simple.pl igBLASTOutput.txt  | grep OUTPUT: | sed 's/OUTPUT:\t//'
 | 
| 
 | 
   203 
 | 
| 
 | 
   204 =cut
 | 
| 
 | 
   205 
 | 
| 
 | 
   206 =head3 CDR3 Patterns:
 | 
| 
 | 
   207 
 | 
| 
 | 
   208 We use these two variables to try to identify the end of the CDR3 region if igBLAST doesn't report it directly:
 | 
| 
 | 
   209 
 | 
| 
 | 
   210  my $DNACDR3_Pat = "TGGGG....GG.";
 | 
| 
 | 
   211  my $AASequenceMotifPattern = "WG.G";
 | 
| 
 | 
   212 
 | 
| 
 | 
   213 They are treated as regex's when tested (so use "." to mean any DNA base, rather than 'N' or 'X').
 | 
| 
 | 
   214 
 | 
| 
 | 
   215 [NB: These are original patterns used for testing, check the code for the current ones.]
 | 
| 
 | 
   216 
 | 
| 
 | 
   217 =cut
 | 
| 
 | 
   218 
 | 
| 
 | 
   219 my $DNACDR3_Pat = "TGGGG....GG.";
 | 
| 
 | 
   220 my $AACDR3_Pat =  "WG.G";
 | 
| 
 | 
   221 
 | 
| 
 | 
   222 use strict; 
 | 
| 
 | 
   223 use Data::Dumper;
 | 
| 
 | 
   224 # Set this as to number of the result (aka "record") you want to process or 0 for all:
 | 
| 
 | 
   225 my $ProcessRecord =0;
 | 
| 
 | 
   226 if (defined $ARGV[1])	{	$ProcessRecord = pop @ARGV;	}	#Also accept from the command line:
 | 
| 
 | 
   227 if ($ProcessRecord != 0)	{	print "D: Request to process just record '$ProcessRecord' received\n";	}
 | 
| 
 | 
   228 
 | 
| 
 | 
   229 #Adjust the record separator:
 | 
| 
 | 
   230 $/="Query= ";
 | 
| 
 | 
   231 my $Record=0;	# A simple counter, that we might not use.
 | 
| 
 | 
   232 #Force-loaded header / version information:
 | 
| 
 | 
   233 my $Header = <>; 
 | 
| 
 | 
   234 #At the moment we don't use this - so dump it immediately:
 | 
| 
 | 
   235 $Header = undef;
 | 
| 
 | 
   236 #print "D: Force-loaded header / version information: '$Header'\n";
 | 
| 
 | 
   237 
 | 
| 
 | 
   238 #Print the Header for the output line (we need this once, at the start) 
 | 
| 
 | 
   239 print &printOUTPUTData ({"HEADER" => 1})."\n"; 
 | 
| 
 | 
   240 
 | 
| 
 | 
   241 while (<>)
 | 
| 
 | 
   242 	{
 | 
| 
 | 
   243 =head4 First check - should we be processing this record at all?
 | 
| 
 | 
   244 
 | 
| 
 | 
   245 =cut 
 | 
| 
 | 
   246 	$Record++;	#Increment the record counter:
 | 
| 
 | 
   247 	#Do we process this record - or all records?	
 | 
| 
 | 
   248 	if ($ProcessRecord != $Record && $ProcessRecord != 0)	
 | 
| 
 | 
   249 		{	next; }	#We need to increment the record counter before we increment
 | 
| 
 | 
   250 
 | 
| 
 | 
   251 =head4 Setup the output line storage and print the header:
 | 
| 
 | 
   252 
 | 
| 
 | 
   253 We enter this initially and work to change it:
 | 
| 
 | 
   254 
 | 
| 
 | 
   255  $DomainBoundaries{"CDR3"}{"FoundHow"} = "NOT_FOUND";
 | 
| 
 | 
   256 
 | 
| 
 | 
   257 =cut
 | 
| 
 | 
   258 
 | 
| 
 | 
   259 	my %OUTPUT_Data;	#To collect data for the output line in
 | 
| 
 | 
   260 	#Assume the first and work to find better:
 | 
| 
 | 
   261 	$OUTPUT_Data{"CDR3 Found How"} = "NOT_FOUND";
 | 
| 
 | 
   262 	#The whole record - one per read - is now stored in $_
 | 
| 
 | 
   263 	my @Lines =split (/[\r\n]+/,$_);	# split on windows/linux/mac new lines
 | 
| 
 | 
   264 
 | 
| 
 | 
   265  #If you are interested enable either of the next lines depending on how curious you are as to how the splitting went:
 | 
| 
 | 
   266 	#print "D: Record #$Record\n";	print $_;	print "\n---------\n";
 | 
| 
 | 
   267 	print "D:  ''$Lines[0]'\nD:  ...etc...'\nD:  ############\n";
 | 
| 
 | 
   268 
 | 
| 
 | 
   269 =head3 Get the ID
 | 
| 
 | 
   270 
 | 
| 
 | 
   271 Quite easy: the first field on the first line:
 | 
| 
 | 
   272 
 | 
| 
 | 
   273  Query= HL67IUI01DTR2Y length=577 xy=1452_0984 region=1
 | 
| 
 | 
   274 
 | 
| 
 | 
   275 =cut 
 | 
| 
 | 
   276  
 | 
| 
 | 
   277 	(my $ID) = $Lines[0]=~ m/^(\S+)/;
 | 
| 
 | 
   278 	unless (defined $ID && $ID ne "")
 | 
| 
 | 
   279 		{			# So a near total failure...?
 | 
| 
 | 
   280 		$OUTPUT_Data{"ID"} = "Unknown";
 | 
| 
 | 
   281 		print &printOUTPUTData (\%OUTPUT_Data)."\n";
 | 
| 
 | 
   282 		next;	#No ID is terminal for this record 			
 | 
| 
 | 
   283 		}	 
 | 
| 
 | 
   284 	else
 | 
| 
 | 
   285 		{
 | 
| 
 | 
   286 		print "D: ID is: '$ID'\n";
 | 
| 
 | 
   287 		$OUTPUT_Data{"ID"} = $ID;
 | 
| 
 | 
   288 		}
 | 
| 
 | 
   289 =head3 Declare the variables we will need here in the next few sections to store data
 | 
| 
 | 
   290 
 | 
| 
 | 
   291 =cut 
 | 
| 
 | 
   292 
 | 
| 
 | 
   293 	my $CurrentRegion;
 | 
| 
 | 
   294 	my $RegionMarkup;
 | 
| 
 | 
   295 
 | 
| 
 | 
   296 	#So we can sync the coordinated of the alignment up to the domains found:  
 | 
| 
 | 
   297 	my $Query_Start = -1; my $Query_End = -1;
 | 
| 
 | 
   298 	
 | 
| 
 | 
   299 	#Where on the Query Sequence (i.e. the 454 read) does the alignment start & stop? 
 | 
| 
 | 
   300 	my $ThisQueryStart =-1; my $ThisQueryEnd =-1;	#Think $ThisQueryEnd isn't used at the moment.
 | 
| 
 | 
   301 	my $DNAQuerySequence ="";		#The actual DNA Query sequence...
 | 
| 
 | 
   302 	my $AAQuerySequence = "";
 | 
| 
 | 
   303 	
 | 
| 
 | 
   304 	#As this changes with the alleles identified:
 | 
| 
 | 
   305 	my $CurrentAASequence;
 | 
| 
 | 
   306 	#The main storage variables 	
 | 
| 
 | 
   307 	
 | 
| 
 | 
   308 	my %Alginments;  my %Alleles;
 | 
| 
 | 
   309 	my %DomainBoundaries; 
 | 
| 
 | 
   310 
 | 
| 
 | 
   311 =head2 Stanza 1: Get the general structure of the sequence identified
 | 
| 
 | 
   312 
 | 
| 
 | 
   313 =head3 Method 1: Use the table supplied
 | 
| 
 | 
   314 
 | 
| 
 | 
   315 Technically this valid for the top hit...realistically this is the only information we have reported to us
 | 
| 
 | 
   316 so we use this or nothing.  This is fine for the top hit which is likely what we are interested in....but for the 2nd or 3rd?  Who knows!
 | 
| 
 | 
   317 
 | 
| 
 | 
   318 Targets this block:
 | 
| 
 | 
   319 
 | 
| 
 | 
   320  Alignment summary between query and top germline V gene hit (from, to, length, matches, mismatches, gaps, percent identity)
 | 
| 
 | 
   321  FWR1	167	240	75	72	2	1	96
 | 
| 
 | 
   322  CDR1	241	264	24	20	4	0	83.3
 | 
| 
 | 
   323  FWR2	265	315	51	48	3	0	94.1
 | 
| 
 | 
   324  CDR2	316	336	24	15	6	3	62.5
 | 
| 
 | 
   325  FWR3	337	450	114	106	8	0	93
 | 
| 
 | 
   326  CDR3 (V region only)	451	454	4	4	0	0	100
 | 
| 
 | 
   327  Total	N/A	N/A	292	265	23	4	90.8
 | 
| 
 | 
   328 
 | 
| 
 | 
   329 Then we split out the lines inside it in a second scanning step - less optimal but easier to read:
 | 
| 
 | 
   330 
 | 
| 
 | 
   331  FWR1	167	240	75	72	2	1	96
 | 
| 
 | 
   332  CDR1	241	264	24	20	4	0	83.3
 | 
| 
 | 
   333  FWR2	265	315	51	48	3	0	94.1
 | 
| 
 | 
   334  CDR2	316	336	24	15	6	3	62.5
 | 
| 
 | 
   335  FWR3	337	450	114	106	8	0	93
 | 
| 
 | 
   336  CDR3 (V region only)	451	454	4	4	0	0	100
 | 
| 
 | 
   337 
 | 
| 
 | 
   338 into:
 | 
| 
 | 
   339 
 | 
| 
 | 
   340  (Section, from, to, length, matches, mismatches, gaps, percent identity)
 | 
| 
 | 
   341 
 | 
| 
 | 
   342 =head3 Method 2: Use the table supplied
 | 
| 
 | 
   343 
 | 
| 
 | 
   344 The other way to do this is to split the graphical markup out of the alignment.
 | 
| 
 | 
   345 This works for _any_ reported alignment, not just the top hits:
 | 
| 
 | 
   346 
 | 
| 
 | 
   347 In the main alignment table processing section collect the information, collect the information:
 | 
| 
 | 
   348 
 | 
| 
 | 
   349  #Is region mark-up:
 | 
| 
 | 
   350 	if ($#InfoColumns == -1 && $#AlignmentColumns ==0) 	
 | 
| 
 | 
   351 		{					
 | 
| 
 | 
   352 #		print ": Region Markup detected\n";
 | 
| 
 | 
   353 		$RegionMarkup = $RegionMarkup.$AlignmentPanel;	#Collect the information, then re-synthesise it at the end of record		
 | 
| 
 | 
   354 		next;
 | 
| 
 | 
   355 		}
 | 
| 
 | 
   356  
 | 
| 
 | 
   357 Then afterwards when all the region was collected, process it like this:
 | 
| 
 | 
   358 #Pad the CDER3 region:
 | 
| 
 | 
   359 
 | 
| 
 | 
   360  #Remove the trailing spaces:
 | 
| 
 | 
   361 	$RegionMarkup =~ s/ *$//g;		 
 | 
| 
 | 
   362  #Calculate the length of the CDR3 region so we can add it in: 
 | 
| 
 | 
   363 	my $CDR3PaddingNeeded = ($Query_End-$Query_Start)-length ($RegionMarkup) -length ("<-CDR3>")+1;
 | 
| 
 | 
   364  #Build up the CDR3 region, the 'x' operator is very helpful here (implict foreach loop):
 | 
| 
 | 
   365 	$RegionMarkup = $RegionMarkup."<-CDR3"."-" x $CDR3PaddingNeeded. ">";	
 | 
| 
 | 
   366  #print "D: Need to pad with:'$CDR3PaddingNeeded' characters\n"; 
 | 
| 
 | 
   367 
 | 
| 
 | 
   368  #Now really process it:
 | 
| 
 | 
   369  	my $C_Pos = 0;
 | 
| 
 | 
   370 	my @Domains = split (/(<*-*...[123]-*>*)/,$RegionMarkup);	#
 | 
| 
 | 
   371 	foreach my $C_Domain (@Domains)
 | 
| 
 | 
   372 		{
 | 
| 
 | 
   373 			if (length ($C_Domain) <=0)	{next;}
 | 
| 
 | 
   374 		my $DomainStart= $C_Pos;
 | 
| 
 | 
   375 		my $DomainEnd = $DomainStart + length ($C_Domain)-1; 
 | 
| 
 | 
   376 		my ($DomainType) = $C_Domain =~ m/(...[123])/;
 | 
| 
 | 
   377 #		print "D: $DomainType \t($DomainStart-$DomainEnd=",$DomainEnd-$DomainStart,"):\t$C_Domain\n";
 | 
| 
 | 
   378 		$DomainBoundaries{$DomainType}{"Start"} 	= $DomainStart;
 | 
| 
 | 
   379 		$DomainBoundaries{$DomainType}{"End"} 		= $DomainEnd;
 | 
| 
 | 
   380 		$C_Pos = $DomainEnd+1;
 | 
| 
 | 
   381 		}
 | 
| 
 | 
   382 
 | 
| 
 | 
   383 The two pieces of code are interchangable; the table version as used below, is neater, easier to understand and works nicely.
 | 
| 
 | 
   384 Why stress? 
 | 
| 
 | 
   385 
 | 
| 
 | 
   386 
 | 
| 
 | 
   387 =head3 The end of the FWR3 is the start of CDR3?
 | 
| 
 | 
   388 
 | 
| 
 | 
   389 This is an assumption made.  Hence the two variables:
 | 
| 
 | 
   390 
 | 
| 
 | 
   391 	my $MaxDomainReported =0 ;	# In nts / bps 
 | 
| 
 | 
   392 	my $FWR3_Found_Flag = 0;	# Did we find the end of the FWR3 - which is the start of the CDR3.  Set to 'false' initially.
 | 
| 
 | 
   393 
 | 
| 
 | 
   394 	$MaxDomainBaseFound
 | 
| 
 | 
   395 
 | 
| 
 | 
   396 =cut
 | 
| 
 | 
   397 	my $MaxDomainBaseFound 	=0 	;	# In nts / bps
 | 
| 
 | 
   398 	my $AlignmentStart			;	# In nts /bp	#Alternative name would be: '$MinDomainBaseFound'; set to null until primed 
 | 
| 
 | 
   399 #	my $FWR3_Found_Flag = 0;	# Did we find the end of the FWR3 - which is the start of the CDR3.  Set to 'false' initially.
 | 
| 
 | 
   400 	
 | 
| 
 | 
   401 	(my @StructureSummaryTable) = returnLinesBetween (\@Lines, "Alignment summary", "Total" );
 | 
| 
 | 
   402 #Enable the next line if you want the raw data we are going to parse in this section: 
 | 
| 
 | 
   403 	#print Dumper @StructureSummaryTable;
 | 
| 
 | 
   404 	foreach my $C_Section (@StructureSummaryTable)
 | 
| 
 | 
   405 		{
 | 
| 
 | 
   406 		my ($DomainType, $DomainStart, $DomainEnd, $SectLength, $Matches, $Mismatches, $Gaps, $PID) = split (/\t+/,$C_Section);
 | 
| 
 | 
   407 		#print "D: Domain type: '$DomainType'\n";
 | 
| 
 | 
   408 		#$DomainType =~ s/ .*$//g;
 | 
| 
 | 
   409 		$DomainBoundaries{$DomainType}{"Start"} = $DomainStart;
 | 
| 
 | 
   410 		$DomainBoundaries{$DomainType}{"End"} 	= $DomainEnd;
 | 
| 
 | 
   411 
 | 
| 
 | 
   412 #So we can do a reality check on the length / start of the CDR3 if we have to go looking:		
 | 
| 
 | 
   413 		if ($MaxDomainBaseFound <= $DomainEnd)
 | 
| 
 | 
   414 			{ $MaxDomainBaseFound = $DomainEnd; }	#Store the maximum base found
 | 
| 
 | 
   415 		if ($AlignmentStart eq undef or $AlignmentStart >= $DomainStart)
 | 
| 
 | 
   416 			{	$AlignmentStart = $DomainStart;	}
 | 
| 
 | 
   417 		}
 | 
| 
 | 
   418 #print Dumper %DomainBoundaries;
 | 
| 
 | 
   419 #die "HIT BLOCK\n";
 | 
| 
 | 
   420 
 | 
| 
 | 
   421 =head3 Did we find the CDR3 region specifically?
 | 
| 
 | 
   422 
 | 
| 
 | 
   423 If we did fine; otherwise try to find it using the FWR3 region if we found that; otherwise give up.
 | 
| 
 | 
   424 
 | 
| 
 | 
   425 =cut
 | 
| 
 | 
   426 	print "D: Minimum base marked-up ($AlignmentStart) - aka. \$AlignmentStart; maximum: ($MaxDomainBaseFound)\n";
 | 
| 
 | 
   427 
 | 
| 
 | 
   428 #my @WantedSections = qw (V D J);
 | 
| 
 | 
   429 
 | 
| 
 | 
   430 =head2 Second Stanza: Parse the main Alignment Table
 | 
| 
 | 
   431 
 | 
| 
 | 
   432 =head3 Get the table, then determine the character at which to split the 'Info' & 'Alignment' panels.
 | 
| 
 | 
   433 
 | 
| 
 | 
   434 As this is a little involved and comparamentalises nicely we sub-contract this to two functions""
 | 
| 
 | 
   435 
 | 
| 
 | 
   436 	(my @Table) = returnLinesBetween (\@Lines, "Alignment", "Lambda" );
 | 
| 
 | 
   437 	my $PanelSplitPoint = findSplitPoint (\@Table);	#Why can't they just use a fixed field width or a tab as a delimiter?
 | 
| 
 | 
   438 
 | 
| 
 | 
   439 =cut 
 | 
| 
 | 
   440 	(my @Table) = returnLinesBetween (\@Lines, "Alignment", "Lambda" );
 | 
| 
 | 
   441 	my $PanelSplitPoint = findSplitPoint (\@Table);	#Why can't they just use a fixed field width or a tab as a delimiter? 
 | 
| 
 | 
   442 #If you are interested, enable this line:
 | 
| 
 | 
   443 #	print "D: The info panel was detected at: '$splitPoint'\n";
 | 
| 
 | 
   444 
 | 
| 
 | 
   445 =head3 
 | 
| 
 | 
   446  
 | 
| 
 | 
   447 =cut  
 | 
| 
 | 
   448 
 | 
| 
 | 
   449 
 | 
| 
 | 
   450 foreach my $C_Line (0..$#Table)
 | 
| 
 | 
   451 	{
 | 
| 
 | 
   452 
 | 
| 
 | 
   453 =head3 Call the line type we find: There are 4:
 | 
| 
 | 
   454 
 | 
| 
 | 
   455 These are distinguished by the number of fields (one or mores spacer is a field separator) in the Info & Alignment Panels (see values in brackets)
 | 
| 
 | 
   456 
 | 
| 
 | 
   457 									     | <- This split is ~40 chars. from the start of the line
 | 
| 
 | 
   458 	* InfoPanel *						 |    * Alignment Panel * 
 | 
| 
 | 
   459 																														: is a "Blank" line  		(-1,-1)
 | 
| 
 | 
   460                                          <----FWR1--><----------CDR1--------><-----------------------FWR2------			: is "Region Markup"		(-1,0)
 | 
| 
 | 
   461                                           W  A  A  S  G  F  T  F  N  T  Y  A  V  H  W  V  R  Q  A  P  G  K  G  			: is "AA Sequence"			(-1, >=0)
 | 
| 
 | 
   462                      Query_1        27   TGGGCAGCCTCTGGATTCACCTTCAATACCTATGCTGTGCACTGGGTCCGCCAGGCTCCAGGCAAGGGGC  96		: is "DNA Sequence"			(2,1)
 | 
| 
 | 
   463  V  95.3% (222/233)  IGHV3-30*04    64   ..T......................G..G.......A.................................  133	: is "" "
 | 
| 
 | 
   464                                          
 | 
| 
 | 
   465 So we split 40 chars in and then the two parts on spaces. 
 | 
| 
 | 
   466 
 | 
| 
 | 
   467 
 | 
| 
 | 
   468 =cut 
 | 
| 
 | 
   469 	
 | 
| 
 | 
   470 #	print "D: (sub) Line in parsed table: '$C_Line': \n";
 | 
| 
 | 
   471 	
 | 
| 
 | 
   472 	my ($InfoPanel, $AlignmentPanel) 	=	$Table[$C_Line] =~ /^(.{$PanelSplitPoint})(.*)$/;
 | 
| 
 | 
   473 	
 | 
| 
 | 
   474 	my @InfoColumns = split (/\s+/,$InfoPanel);
 | 
| 
 | 
   475 	my @AlignmentColumns = split (/\s+/,$AlignmentPanel);
 | 
| 
 | 
   476 
 | 
| 
 | 
   477 #If you want to see how the line is being split enable either of these next two lines; the 2nd is more detailed than the first
 | 
| 
 | 
   478 #	print "D: Line: $C_Line/t Number of Columns (Info, Alignment): \t$#InfoColumns \t $#AlignmentColumns\n";
 | 
| 
 | 
   479 #	print "D: For '$C_Line' \t line in the table there are parts: '$InfoPanel' [$#InfoColumns], '$AlignmentPanel [$#AlignmentColumns]'\n";
 | 
| 
 | 
   480 
 | 
| 
 | 
   481 #Populate this so we can step through it  
 | 
| 
 | 
   482 
 | 
| 
 | 
   483 =head4 Is a blank line:
 | 
| 
 | 
   484 =cut 
 | 
| 
 | 
   485 		if ($#InfoColumns == -1 && $#AlignmentColumns == -1) 	
 | 
| 
 | 
   486 		{		
 | 
| 
 | 
   487 #		print ": Blank\n";		
 | 
| 
 | 
   488 		next;		
 | 
| 
 | 
   489 		}	#For now I think we just skip - is not needed (though might be implict mark-up)
 | 
| 
 | 
   490 		
 | 
| 
 | 
   491 =head4 Is region mark-up:
 | 
| 
 | 
   492 =cut 
 | 
| 
 | 
   493 	if ($#InfoColumns == -1 && $#AlignmentColumns ==0) 	
 | 
| 
 | 
   494 		{					
 | 
| 
 | 
   495 #		print ": Region Markup detected\n";
 | 
| 
 | 
   496 		$RegionMarkup = $RegionMarkup.$AlignmentPanel;	#Collect the information, then re-synthesise it at the end of record		
 | 
| 
 | 
   497 		next;
 | 
| 
 | 
   498 		}
 | 
| 
 | 
   499 =head4 Is query DNA Sequence:
 | 
| 
 | 
   500 =cut 
 | 
| 
 | 
   501 	if ($#InfoColumns == 2 && $#AlignmentColumns ==1) 	
 | 
| 
 | 
   502 		{
 | 
| 
 | 
   503 #		print ": DNA Query Sequence\n"; 
 | 
| 
 | 
   504 		#Detect the two coordinatates of alignment against the query sequence: (last two numbers of the two 'panels') 
 | 
| 
 | 
   505 		($ThisQueryStart)  	= $InfoPanel 			=~ / (\d+) *$/;
 | 
| 
 | 
   506 		($ThisQueryEnd)	 	= $AlignmentPanel   	=~ / (\d+) *$/;
 | 
| 
 | 
   507 		my ($ThisDNASeq) 	= $AlignmentPanel 		=~ /^(.*?) /;
 | 
| 
 | 
   508 	#If you want to know what we just found: 
 | 
| 
 | 
   509 		#print "D: This DNA Sequence: '$ThisDNASeq'\n";
 | 
| 
 | 
   510 		$DNAQuerySequence = $DNAQuerySequence. $ThisDNASeq;	#Add it on to whatever we already have.
 | 
| 
 | 
   511 		#Move the needle if there are smaller / greater; otherwise prime the 'needles':
 | 
| 
 | 
   512 		if ($ThisQueryStart < $Query_Start or $Query_Start == -1)
 | 
| 
 | 
   513 			{	$Query_Start = $ThisQueryStart;	}
 | 
| 
 | 
   514 		if ($ThisQueryEnd > $Query_End or $Query_End == -1)
 | 
| 
 | 
   515 			{	$Query_End = $ThisQueryEnd;	}
 | 
| 
 | 
   516 #		print ": Query DNA Sequence detected This line: ($ThisQueryStart, $ThisQueryEnd) & Maximally: ($Query_Start, $Query_End)\n";		
 | 
| 
 | 
   517 		next; 
 | 
| 
 | 
   518 		}
 | 
| 
 | 
   519 =head4 Is AA Sequence:
 | 
| 
 | 
   520 
 | 
| 
 | 
   521 This is complicated as it Need to decide whether this is the sequence of the read or that of the original V / D / J regions:
 | 
| 
 | 
   522                                        -------------->                                                       
 | 
| 
 | 
   523                                          A  V  Y  Y  C  T  R  D  M  S  P  I  M  T  T  F  A  G  N  Y  W  G  Q  					<< Want this
 | 
| 
 | 
   524                     Query_1        237  GCTGTTTATTACTGTACGAGAGATATGAGCCCCATCATGACAACGTTTGCCGGAAACTACTGGGGCCAGG  306
 | 
| 
 | 
   525  V  95.3% (222/233)  IGHV3-30*04    274  .....G.........G.......-----------------------------------------------  296
 | 
| 
 | 
   526                                          A  V  Y  Y  C  A  R                                                  
 | 
| 
 | 
   527  V  95.7% (221/231)  IGHV3-30-3*01  274  .....G.........G.....-------------------------------------------------  294
 | 
| 
 | 
   528  
 | 
| 
 | 
   529  ...etc...
 | 
| 
 | 
   530                                         G  T  L  V  T  V  S  S  																<< Want this
 | 
| 
 | 
   531                     Query_1        307  GAACCCTGGTCACCGTCTCCTCAG  330
 | 
| 
 | 
   532 
 | 
| 
 | 
   533 To solve this we peak at the next line that it has the tag "Query" in it (we assume the line exists...)
 | 
| 
 | 
   534 
 | 
| 
 | 
   535 =cut
 | 
| 
 | 
   536 
 | 
| 
 | 
   537 		if ($#InfoColumns == -1 && $#AlignmentColumns >=-1) 	
 | 
| 
 | 
   538 		{
 | 
| 
 | 
   539 		unless ($Table[$C_Line+1]	=~	/Query/)	{	next;	}	#Is the next line the DNA sequence ? 
 | 
| 
 | 
   540 		#
 | 
| 
 | 
   541 #		print ": AA sequence\n";		
 | 
| 
 | 
   542 		
 | 
| 
 | 
   543 		
 | 
| 
 | 
   544 		$CurrentAASequence = $AlignmentPanel;
 | 
| 
 | 
   545 		#print "D: Panel Split Point = $PanelSplitPoint, '$AlignmentPanel'\n";
 | 
| 
 | 
   546 		$CurrentAASequence =~ s/^ {$PanelSplitPoint}//;
 | 
| 
 | 
   547 		#print "D: '$AAQuerySequence'\n";
 | 
| 
 | 
   548 #		print "D: Current AA Sequence: \t'$CurrentAASequence'\n";
 | 
| 
 | 
   549 		$AAQuerySequence	=	$AAQuerySequence.$CurrentAASequence;	#Store the elongating AA Sequence as well
 | 
| 
 | 
   550 		next;
 | 
| 
 | 
   551 		}
 | 
| 
 | 
   552 =head4 Is Alignment:
 | 
| 
 | 
   553 =cut 
 | 
| 
 | 
   554 		if ($#InfoColumns == 4 && $#AlignmentColumns ==1) 	
 | 
| 
 | 
   555 		{
 | 
| 
 | 
   556 		#Not acutally interesting to us for this version of the parser.  Delete ultimately?
 | 
| 
 | 
   557 		next;
 | 
| 
 | 
   558 		}
 | 
| 
 | 
   559 
 | 
| 
 | 
   560 #Is weird!  Don't recognise it!
 | 
| 
 | 
   561 
 | 
| 
 | 
   562 	warn "Weird!  Don't recongnise this: '$ID' [$#InfoColumns,$#AlignmentColumns]// '",$Lines[$C_Line],15,"...'\n"; 
 | 
| 
 | 
   563 	}	#End main iteration loop for alignment parsing.
 | 
| 
 | 
   564 	
 | 
| 
 | 
   565 	
 | 
| 
 | 
   566 =head2  The CDR3 is noted as problematic.  Can we identify it? 
 | 
| 
 | 
   567 
 | 
| 
 | 
   568 =cut
 | 
| 
 | 
   569 	print "D: Starting Search for CDR3\n";
 | 
| 
 | 
   570 	#Do have the end of the FWR3 but not the CDR3?  If so then it is worth trying to find the CDR3, otherwise...nothing we can do at this point
 | 
| 
 | 
   571 	if (exists ($DomainBoundaries{"FWR3"}{"End"}) 
 | 
| 
 | 
   572 		&& $AlignmentStart !=0  
 | 
| 
 | 
   573 		&& not (exists $DomainBoundaries{"CDR3"}{"End"}) )	#Guess we need to go looking for the end then...
 | 
| 
 | 
   574 		{
 | 
| 
 | 
   575 		#print "D: Placing call to markUpCDR3\n"; 
 | 
| 
 | 
   576 		my ($CDR3_Start, my $CDR3_End, my $CDR3_Found_Tag) = markUpCDR3 ($DNAQuerySequence, $AAQuerySequence, 
 | 
| 
 | 
   577 			$DomainBoundaries{"FWR3"}{"End"}, $AlignmentStart, 
 | 
| 
 | 
   578 			$DNACDR3_Pat, $AACDR3_Pat);
 | 
| 
 | 
   579 		if ($CDR3_Start !=0 && $CDR3_End !=0)
 | 
| 
 | 
   580 			{
 | 
| 
 | 
   581 			$DomainBoundaries{"CDR3"}{"Start"} = $CDR3_Start;
 | 
| 
 | 
   582 			$DomainBoundaries{"CDR3"}{"End"} = $CDR3_End 	;
 | 
| 
 | 
   583 			$DomainBoundaries{"CDR3"}{"FoundHow"} = $CDR3_Found_Tag;
 | 
| 
 | 
   584 			print "D: CDR3 was found by pattern matching: '$CDR3_Found_Tag' ($CDR3_Start, $CDR3_End)\n";
 | 
| 
 | 
   585 			}
 | 
| 
 | 
   586 		else
 | 
| 
 | 
   587 			{	print "D: CDR3 was not found [either by igBLAST or by pattern matching]\n";		
 | 
| 
 | 
   588 				$DomainBoundaries{"CDR3"}{"FoundHow"} = "NOT_FOUND";
 | 
| 
 | 
   589 			}	
 | 
| 
 | 
   590 		}
 | 
| 
 | 
   591 	else
 | 
| 
 | 
   592 		{	#Was reported by igBLAST 
 | 
| 
 | 
   593 		print "D: Found the FWR3 from the Domain Boundary Table\n";
 | 
| 
 | 
   594 		$DomainBoundaries{"CDR3"}{"FoundHow"} = "IGBLAST_NATIVE";	
 | 
| 
 | 
   595 		}
 | 
| 
 | 
   596 	 
 | 
| 
 | 
   597 #print Dumper %DomainBoundaries;
 | 
| 
 | 
   598 
 | 
| 
 | 
   599 =head2 Get the top VDJ regions:
 | 
| 
 | 
   600 
 | 
| 
 | 
   601 =cut
 | 
| 
 | 
   602 
 | 
| 
 | 
   603 =head2 Extract General Features: 
 | 
| 
 | 
   604 
 | 
| 
 | 
   605 =cut
 | 
| 
 | 
   606 	(my $TopHit) = $_ =~  m/V-J Frame, Strand\):\n(.*?)\n/s;
 | 
| 
 | 
   607 	print "D: Top Hits (raw)= '$TopHit' \n"; 
 | 
| 
 | 
   608 	my ($Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match, $Chain, $VJFrame, $Strand) = split (/\t/,$TopHit);
 | 
| 
 | 
   609 	print "D: Top Hits (parsed)= '$Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match, $Chain, $VJFrame, $Strand'\n";
 | 
| 
 | 
   610 
 | 
| 
 | 
   611 =head2 Store the V / D / J Genes used
 | 
| 
 | 
   612 
 | 
| 
 | 
   613 =cut
 | 
| 
 | 
   614 
 | 
| 
 | 
   615 	if (defined $Top_V_gene_match && $Top_V_gene_match ne "")
 | 
| 
 | 
   616 	{	$OUTPUT_Data{"Top V Gene"} = $Top_V_gene_match;		} 
 | 
| 
 | 
   617 	
 | 
| 
 | 
   618 	if (defined $Top_D_gene_match && $Top_D_gene_match ne "")
 | 
| 
 | 
   619 	{	$OUTPUT_Data{"Top D Gene"} = $Top_D_gene_match;		} 
 | 
| 
 | 
   620 
 | 
| 
 | 
   621 	if (defined $Top_J_gene_match && $Top_J_gene_match ne "")
 | 
| 
 | 
   622 	{	$OUTPUT_Data{"Top J Gene"} = $Top_J_gene_match;		} 
 | 
| 
 | 
   623 	
 | 
| 
 | 
   624 	if (defined $Strand && $Strand ne "")
 | 
| 
 | 
   625 	{	$OUTPUT_Data{"Strand"} = $Strand;}
 | 
| 
 | 
   626 	
 | 
| 
 | 
   627 =head4 Preamble: ID, Frame, and V / D / J used:
 | 
| 
 | 
   628 
 | 
| 
 | 
   629 =cut 
 | 
| 
 | 
   630 	#Do a reality check: if we didn't get an ID, then skip:
 | 
| 
 | 
   631 	unless (defined (defined $ID) && $ID ne "" && 
 | 
| 
 | 
   632 			defined $VJFrame && $VJFrame ne "")			
 | 
| 
 | 
   633 			{	
 | 
| 
 | 
   634 			print &printOUTPUTData (\%OUTPUT_Data)."\n";
 | 
| 
 | 
   635 			next;
 | 
| 
 | 
   636 			}  
 | 
| 
 | 
   637 
 | 
| 
 | 
   638 #Ok, so we have data...most likely:
 | 
| 
 | 
   639 	#print "OUTPUT:\t",join ("\t", $ID, $VJFrame, $Top_V_gene_match, $Top_D_gene_match, $Top_J_gene_match);
 | 
| 
 | 
   640 	
 | 
| 
 | 
   641 	if (defined $VJFrame && defined $ID && $VJFrame ne "" && $ID ne "")
 | 
| 
 | 
   642 		{	$OUTPUT_Data{"VDJ Frame"} = $VJFrame;}
 | 
| 
 | 
   643 		else
 | 
| 
 | 
   644 		{	
 | 
| 
 | 
   645 		print &printOUTPUTData (\%OUTPUT_Data)."\n";
 | 
| 
 | 
   646 		 next;	
 | 
| 
 | 
   647 		}#REALLY?  We didn't find anything?  Oh well, move to next record
 | 
| 
 | 
   648 
 | 
| 
 | 
   649 =head4 CDR1
 | 
| 
 | 
   650 
 | 
| 
 | 
   651 =cut 
 | 
| 
 | 
   652 	#Remember that the alignment starts at the FWR1 start, not nt =0 on the read, hence we substract this off all future AA (& DNA coordinates)
 | 
| 
 | 
   653 	
 | 
| 
 | 
   654 	my $AlignmentOffset = $DomainBoundaries{"FWR1"}{"Start"};
 | 
| 
 | 
   655 	
 | 
| 
 | 
   656 #	print "D: AA Seqeunce is: '$AAQuerySequence'\n";
 | 
| 
 | 
   657 	if (exists $DomainBoundaries{"CDR1"}{"Start"})	#It is very possible that it doesn't; assume the End does though if we find the Start
 | 
| 
 | 
   658 		{
 | 
| 
 | 
   659 #		my $VRegion = $Alginments{"V"}{$C_VRegion};	#Convenience....
 | 
| 
 | 
   660 		my $CDR1Start 		= $DomainBoundaries{"CDR1"}{"Start"};
 | 
| 
 | 
   661 		my $CDR1End 		= $DomainBoundaries{"CDR1"}{"End"};
 | 
| 
 | 
   662 		my $CDR1_Length 	= $CDR1End - $CDR1Start; 
 | 
| 
 | 
   663 #		print "D: CDR1 $CDR1Start $CDR1End = $CDR1_Length\n";
 | 
| 
 | 
   664 	#Remember that the alignment starts at the FWR1 start, not nt =0 on the read
 | 
| 
 | 
   665 		my $CDR1_Seq_AA 	= substr ($AAQuerySequence, $CDR1Start - $AlignmentOffset, $CDR1_Length);
 | 
| 
 | 
   666 #		print "D: '$CDR1_Seq_AA'\n";
 | 
| 
 | 
   667 		$CDR1_Seq_AA 	=~ s/ //g;
 | 
| 
 | 
   668 		my $CDR1_Seq_AA_Length = length ($CDR1_Seq_AA);
 | 
| 
 | 
   669 		#Add this data to the output store specifically:
 | 
| 
 | 
   670 		$OUTPUT_Data{"CDR1 Seq"} 	= $CDR1_Seq_AA;
 | 
| 
 | 
   671 		$OUTPUT_Data{"CDR1 Length"} = $CDR1_Length;
 | 
| 
 | 
   672 		}	
 | 
| 
 | 
   673 	#What happens if there is no CDR1 found?  Leave blank - the output routine can handle this
 | 
| 
 | 
   674 	
 | 
| 
 | 
   675 =head4 CDR2
 | 
| 
 | 
   676 
 | 
| 
 | 
   677 =cut 
 | 
| 
 | 
   678 
 | 
| 
 | 
   679 	if (exists $DomainBoundaries{"CDR2"}{"Start"})	#It is very possible that it doesn't; assume the End does though if we find the Start
 | 
| 
 | 
   680 		{
 | 
| 
 | 
   681 #		my $VRegion = $Alginments{"V"}{$C_VRegion};	#Convenience....
 | 
| 
 | 
   682 		my $CDR2Start 		= $DomainBoundaries{"CDR2"}{"Start"};
 | 
| 
 | 
   683 		my $CDR2End 		= $DomainBoundaries{"CDR2"}{"End"};
 | 
| 
 | 
   684 		my $CDR2_Length 	= $CDR2End - $CDR2Start; 
 | 
| 
 | 
   685 		my $CDR2_Seq_AA 	= substr ($AAQuerySequence, $CDR2Start - $AlignmentOffset , $CDR2_Length);
 | 
| 
 | 
   686 		 $CDR2_Seq_AA 	=~ s/ //g;
 | 
| 
 | 
   687 		my $CDR2_Seq_AA_Length = length ($CDR2_Seq_AA); 
 | 
| 
 | 
   688 		#Add this data to the output store specifically:	
 | 
| 
 | 
   689 		$OUTPUT_Data{"CDR2 Seq"} 	= $CDR2_Seq_AA;
 | 
| 
 | 
   690 		$OUTPUT_Data{"CDR2 Length"} = $CDR2_Length;
 | 
| 
 | 
   691 		}	
 | 
| 
 | 
   692 	#What happens if there is no CDR2 found?  Leave blank - the output routine can handle this.
 | 
| 
 | 
   693 
 | 
| 
 | 
   694 =head4 CDR3
 | 
| 
 | 
   695 
 | 
| 
 | 
   696 =cut 
 | 
| 
 | 
   697 	if (exists $DomainBoundaries{"CDR3"}{"Start"})	#It is very possible that it doesn't; assume the End does though if we find the Start
 | 
| 
 | 
   698 		{
 | 
| 
 | 
   699 #		my $VRegion = $Alginments{"V"}{$C_VRegion};	#Convenience....
 | 
| 
 | 
   700 		my $CDR3Start 		= $DomainBoundaries{"CDR3"}{"Start"};
 | 
| 
 | 
   701 		my $CDR3End 		= $DomainBoundaries{"CDR3"}{"End"};
 | 
| 
 | 
   702 		my $CDR3_Length 	= $CDR3End - $CDR3Start; # This variable isn't used - delete it when safe to do so
 | 
| 
 | 
   703 		my $CDR3_Seq_AA 	= substr ($AAQuerySequence, $CDR3Start - $AlignmentOffset, $CDR3_Length);
 | 
| 
 | 
   704 		my $CDR3_Seq_DNA	= substr ($DNAQuerySequence, $CDR3Start - $AlignmentOffset, $CDR3_Length);
 | 
| 
 | 
   705 		$CDR3_Seq_AA 	=~ s/ //g;
 | 
| 
 | 
   706 		$CDR3_Seq_DNA 	=~ s/ //g;
 | 
| 
 | 
   707 		my $CDR3_Seq_AA_Length = length ($CDR3_Seq_AA);
 | 
| 
 | 
   708 		my $CDR3_Seq_DNA_Length = length ($CDR3_Seq_DNA);
 | 
| 
 | 
   709 		#Add this data to the output store specifically:
 | 
| 
 | 
   710 		$OUTPUT_Data{"CDR3 Seq"} 	= $CDR3_Seq_AA;
 | 
| 
 | 
   711 		$OUTPUT_Data{"CDR3 Length"} = $CDR3_Seq_AA_Length;
 | 
| 
 | 
   712 		$OUTPUT_Data{"CDR3 Seq DNA"}     = $CDR3_Seq_DNA;
 | 
| 
 | 
   713 		$OUTPUT_Data{"CDR3 Length DNA"} = $CDR3_Seq_DNA_Length;
 | 
| 
 | 
   714 		#And in the case of the CDR3 how we found it:
 | 
| 
 | 
   715 		$OUTPUT_Data{"CDR3 Found How"} = $DomainBoundaries{"CDR3"}{"FoundHow"};
 | 
| 
 | 
   716 		}	
 | 
| 
 | 
   717 	#What happens if there is no CDR3 found?  Leave blank - the output routine can handle this.
 | 
| 
 | 
   718 #die "HIT BLOCK\n";
 | 
| 
 | 
   719 #End of the record; output the data we have collected and move on.
 | 
| 
 | 
   720 print &printOUTPUTData (\%OUTPUT_Data)."\n";
 | 
| 
 | 
   721 }
 | 
| 
 | 
   722 
 | 
| 
 | 
   723 
 | 
| 
 | 
   724 	
 | 
| 
 | 
   725 ############
 | 
| 
 | 
   726 sub returnLinesBetween {
 | 
| 
 | 
   727 =head3 SUB: returnLinesBetween ({reference to array Index array}, {regex for top of section}, {regex for bottom of section})
 | 
| 
 | 
   728 
 | 
| 
 | 
   729 When passed a reference to an array and two strings - interpreted as REGEX's - will return the lines of the Array 
 | 
| 
 | 
   730 that are bounded by these tags.
 | 
| 
 | 
   731 
 | 
| 
 | 
   732 If either of the tags are not found - or are found in the wrong order - then a null list is returned.
 | 
| 
 | 
   733 
 | 
| 
 | 
   734 =cut
 | 
| 
 | 
   735 
 | 
| 
 | 
   736 my ($Text_ref, $TopTag, $BotTag) = @_;
 | 
| 
 | 
   737 
 | 
| 
 | 
   738 my @Table;
 | 
| 
 | 
   739 #The two boundary conditions at which we will cut the table:
 | 
| 
 | 
   740 #print "D: [returnLinesBetween]: '$TopTag, $BotTag'\n";
 | 
| 
 | 
   741 #How we record these:
 | 
| 
 | 
   742 my $AlignmentLine_Top=0;  my $AlignmentLine_Bot=0;
 | 
| 
 | 
   743 
 | 
| 
 | 
   744 my $LineIndex=-1;	#-1 As the loop increments this line counter first, then does its checks. 	
 | 
| 
 | 
   745 #If you care:
 | 
| 
 | 
   746 #print "D: Lines of text passed: $$#Lines\n";
 | 
| 
 | 
   747 
 | 
| 
 | 
   748 #Iterate through until we find what we are looking for or run out of text to search:
 | 
| 
 | 
   749 while (($AlignmentLine_Bot ==0 or $AlignmentLine_Top==0) && $LineIndex <=$#{$Text_ref})
 | 
| 
 | 
   750 	{
 | 
| 
 | 
   751 	$LineIndex++;	
 | 
| 
 | 
   752 	#Enable if you need to care:
 | 
| 
 | 
   753 #	print "D: Line Index = $LineIndex\n";	
 | 
| 
 | 
   754 
 | 
| 
 | 
   755 	if ($$Text_ref[$LineIndex] =~ m/$TopTag/)
 | 
| 
 | 
   756 		{
 | 
| 
 | 
   757 		$AlignmentLine_Top = $LineIndex; 
 | 
| 
 | 
   758 #		print "D: [returnLinesBetween]: TopTag found in Line: '$$Text_ref[$LineIndex]'\n";	#Enable if you are interested
 | 
| 
 | 
   759 		}
 | 
| 
 | 
   760 	if ($$Text_ref[$LineIndex] =~ m/$BotTag/)
 | 
| 
 | 
   761 		{
 | 
| 
 | 
   762 		$AlignmentLine_Bot = $LineIndex; 
 | 
| 
 | 
   763 #		print "D: [returnLinesBetween]: Bottom Tag found in Line: '$$Text_ref[$LineIndex]'\n";	#Enable if you are interested
 | 
| 
 | 
   764 		}
 | 
| 
 | 
   765 	}
 | 
| 
 | 
   766 #Reality check: did we find anything?  If not then we return null.
 | 
| 
 | 
   767 if ($AlignmentLine_Top ==0 && $AlignmentLine_Bot ==0)
 | 
| 
 | 
   768 		{	return;	}
 | 
| 
 | 
   769 #Again, enable if you care:
 | 
| 
 | 
   770 #print "D: [returnLinesBetween] Lines for section table: '$AlignmentLine_Top to $AlignmentLine_Bot'\n";
 | 
| 
 | 
   771 
 | 
| 
 | 
   772 #We want the lines one down and one up - so polish these. 
 | 
| 
 | 
   773 $AlignmentLine_Top++; $AlignmentLine_Bot--;
 | 
| 
 | 
   774 
 | 
| 
 | 
   775 #Return as an array slice:  
 | 
| 
 | 
   776 return 	(@$Text_ref[$AlignmentLine_Top .. $AlignmentLine_Bot]);
 | 
| 
 | 
   777 }
 | 
| 
 | 
   778 ############
 | 
| 
 | 
   779 
 | 
| 
 | 
   780 sub findSplitPoint 
 | 
| 
 | 
   781 {
 | 
| 
 | 
   782 =head2 sub: $PanelBoundaryCahracter = findSplitPoint (\@Table)
 | 
| 
 | 
   783 
 | 
| 
 | 
   784 When passed a table with the alignment in it makes an educated guess as to the precise split point to
 | 
| 
 | 
   785 spearate the 'info' and 'alignment' panels.
 | 
| 
 | 
   786 This is a right olde faff because the field / panel boundaries change. 
 | 
| 
 | 
   787  
 | 
| 
 | 
   788  '                    Query_6      167  GAGGTGCAGTTGTTGGAGTCTGGGGGAGGCTTGGCACAGCC-GGGGGGTCCCTGAGACTCTCCTGTGCAG  235'
 | 
| 
 | 
   789  '                    Query_6      236  CCTCTGGATTCACCTTTGACAAATATGCCATGACCTGGGTCCGCCAGGCTCCAGGGAAGGGTCTGGAGTG  305'
 | 
| 
 | 
   790  '                    Query_6      306  GGTCTCAACTATACTTGCCAGTGGTCG---CACAGACGACGCAGACTCCGTGAAGGGCCGGTTTGCCATC  372'
 | 
| 
 | 
   791  '                    Query_6      373  TCCAGAGACAATTCCAAGAACACTCTGTATCTGCAAATGAACAGCCTGAGAGTCGAGGACACGGCCCTTT  442'
 | 
| 
 | 
   792  '                    Query_6      443  ATTACTGTGCGAGTGAGGGGGACATAGTGGCTTCGGAGCTTTTGAGTACTGGGGCCAGGGAAACCTGGTC  512'
 | 
| 
 | 
   793 MOTIF_FOUND_IN_AA
 | 
| 
 | 
   794 i.e to contain just ATGC + "X" bases & the gap "-" character but not the "." character (found in the alingment proper) and have 4 fields in total
 | 
| 
 | 
   795 
 | 
| 
 | 
   796 Returns either -1 or the location of the panel boundary, issues a warning and returns -1 if is the most frequent boundary
 | 
| 
 | 
   797 because the pattern match has been failing more often that it suceeded.  
 | 
| 
 | 
   798 
 | 
| 
 | 
   799 =cut 	
 | 
| 
 | 
   800 #A rough guess is 38 for normal sequences, 48 for reversed ones:
 | 
| 
 | 
   801 
 | 
| 
 | 
   802 my $SplitPos = 0;
 | 
| 
 | 
   803 
 | 
| 
 | 
   804 (my $Table_ref) = @_;	#Get the reference to the table 
 | 
| 
 | 
   805 my @DNALines;	#We populate this for mining in the next section
 | 
| 
 | 
   806 foreach  my $C_Line 	(@{$Table_ref})
 | 
| 
 | 
   807 	{
 | 
| 
 | 
   808 	#print "D: $C_Line\n";
 | 
| 
 | 
   809 #	(my $SplitLine) = $C_Line;
 | 
| 
 | 
   810 	#Split on consecutive tabs or spaces:
 | 
| 
 | 
   811 	my @LineFields = split (/[\t\s]+/,$C_Line);
 | 
| 
 | 
   812 	#print "D: Split Line: '",join (",",@LineFields),"' : $#LineFields\n";
 | 
| 
 | 
   813 	unless (	$LineFields[3] =~ m/[^\.]/ 
 | 
| 
 | 
   814 			&& 	$LineFields[3] =~ m/[ATGCX]{20,}/ 
 | 
| 
 | 
   815 			&& 	$#LineFields==4)	
 | 
| 
 | 
   816 				{	next;	}
 | 
| 
 | 
   817 #Enable if you want to know the lines we think are the DNA Query strings: 
 | 
| 
 | 
   818 	#print "D: DNA Line:          '$C_Line'\n";
 | 
| 
 | 
   819 	push @DNALines, $C_Line;		#Note it down
 | 
| 
 | 
   820 	}
 | 
| 
 | 
   821 
 | 
| 
 | 
   822 my %PanelBounds;	#Will contain the positions of the panel boundaries 
 | 
| 
 | 
   823 
 | 
| 
 | 
   824 foreach my $C_DNALine (@DNALines)
 | 
| 
 | 
   825 	{
 | 
| 
 | 
   826 	#print "D: '$C_DNALine'\n";
 | 
| 
 | 
   827 	$C_DNALine =~ m/[ATGC-]+  \d+$/;	#Match the DNA string and the indexingMOTIF_FOUND_IN_AA numbers afterwards, allow gap characters.
 | 
| 
 | 
   828 	my $MatchPos = $-[0];				#This is the position of the start of the last match because we can't get the index() function to work
 | 
| 
 | 
   829 	#(my $MatchPos) = index ($C_DNALine, / [ATGCX-]{20}/,0);
 | 
| 
 | 
   830 	#print "D: '$C_DNALine' DNA panel starts at:'$MatchPos'\n";
 | 
| 
 | 
   831 	$PanelBounds{$MatchPos}++;
 | 
| 
 | 
   832 	}
 | 
| 
 | 
   833 #Sort the hash values in order and then return the most frequent (will offer some resistance to the occasion pattern failure)
 | 
| 
 | 
   834 #The brackets around "($SplitPos)" are really necessary it seems. 
 | 
| 
 | 
   835 ($SplitPos) = (sort { $a <=> $b } keys %PanelBounds);
 | 
| 
 | 
   836 #If you want 
 | 
| 
 | 
   837 #print Dumper  %PanelBounds;
 | 
| 
 | 
   838 #Tell people if we are having difficultlty:
 | 
| 
 | 
   839 if ($SplitPos == -1)	{	warn "Couldn't identify the panel boundaries\n";	}
 | 
| 
 | 
   840 #print "D: $SplitPos: Returning the split position of: '$SplitPos'\n";
 | 
| 
 | 
   841 return $SplitPos;	
 | 
| 
 | 
   842 }
 | 
| 
 | 
   843 
 | 
| 
 | 
   844 
 | 
| 
 | 
   845 ##
 | 
| 
 | 
   846 #
 | 
| 
 | 
   847 #
 | 
| 
 | 
   848 ###
 | 
| 
 | 
   849 
 | 
| 
 | 
   850 
 | 
| 
 | 
   851 
 | 
| 
 | 
   852 
 | 
| 
 | 
   853 
 | 
| 
 | 
   854 #####
 | 
| 
 | 
   855 #
 | 
| 
 | 
   856 #
 | 
| 
 | 
   857 #####
 | 
| 
 | 
   858 sub markUpCDR3 
 | 
| 
 | 
   859 {
 | 
| 
 | 
   860 =head3 Sub: (Start, End, Found How) = markUpCDR3 (DNASeq, AASeq, FWR3 End, FWR1 Offset, DNA Regex, AA Regex)
 | 
| 
 | 
   861 
 | 
| 
 | 
   862 Tries to identify the end of the CDR3 using the DNA and RNA Sequence patterns MOTIF_FOUND_IN_AAsupplied.  The CDR3 is assumed to start 
 | 
| 
 | 
   863 at the end of the FWR3.
 | 
| 
 | 
   864 To reduce FP matches only the sequences (DNA & AA) after the FWR3 are tested with the pattern.  
 | 
| 
 | 
   865 The position of the first matching pattern is reported. 
 | 
| 
 | 
   866 
 | 
| 
 | 
   867 =head4 Fuller Usage: 
 | 
| 
 | 
   868 
 | 
| 
 | 
   869 my ($CDR3_Start, my $CDR3_End) = markUpCDR3 ($DNAQuerySequence, $AAQuerySequence, 
 | 
| 
 | 
   870 			$DomainBoundaries{"FWR3"}{"End"}, $DomainBoundaries{"FWR1"}{"Start"}, 
 | 
| 
 | 
   871 			$DNACDR3_Pat, $AACDR3_Pat);
 | 
| 
 | 
   872 
 | 
| 
 | 
   873 
 | 
| 
 | 
   874 
 | 
| 
 | 
   875 =head4 Returned Values
 | 
| 
 | 
   876 
 | 
| 
 | 
   877 If the CDR3 was found then we we signal like this: 
 | 
| 
 | 
   878 
 | 
| 
 | 
   879  $MotifFound ==0 	: Nope, didn't find either motif
 | 
| 
 | 
   880  $MotifFound ==1 	: Found at the DNA level, not the AA level
 | 
| 
 | 
   881  $MotifFound ==2 	: Found at the the AA level, not the DNA level 
 | 
| 
 | 
   882  $MotifFound ==3 	: Found at the the AA level & the DNA level
 | 
| 
 | 
   883 
 | 
| 
 | 
   884 (Also remember that if the FWR3 region couldn't be identified in the sequence there is a 4th option: not tested; this routine isn't called therefore)
 | 
| 
 | 
   885 
 | 
| 
 | 
   886 The Start and Ends returned are from the first sucessful match (MotifFound==3): though hopefully they are the same.
 | 
| 
 | 
   887 Formally the test order is:
 | 
| 
 | 
   888 
 | 
| 
 | 
   889  1) DNA
 | 
| 
 | 
   890  2) AA 
 | 
| 
 | 
   891 
 | 
| 
 | 
   892 i.e. DNA bp locations have priority.
 | 
| 
 | 
   893 
 | 
| 
 | 
   894 Technically the locations are determined by a regex match then the $+[0] array (i.e. the end of the pattern match).  
 | 
| 
 | 
   895 See pages like this: http://stackoverflow.com/questions/87380/how-can-i-find-the-location-of-a-regex-match-in-perl for an explanation. 
 | 
| 
 | 
   896 
 | 
| 
 | 
   897 =head3 Manipulation of AA patternsMOTIF_FOUND_IN_AA
 | 
| 
 | 
   898 
 | 
| 
 | 
   899 Note that patterns are assumed to require white space inserting in them between the letters.
 | 
| 
 | 
   900 This could be a serious limitation  
 | 
| 
 | 
   901 
 | 
| 
 | 
   902 
 | 
| 
 | 
   903 =cut	
 | 
| 
 | 
   904 
 | 
| 
 | 
   905 #Get the parameters passed:
 | 
| 
 | 
   906 my ($DNA, $AA, $FWR3_End, $FWR1_Start, $DNAPat, $AAPat)	=	@_;
 | 
| 
 | 
   907 print "D: markUpCDR3: Passed Parameters '$FWR3_End, $FWR1_Start, $DNAPat, $AAPat' (& AA & DNA sequence)\n";
 | 
| 
 | 
   908 
 | 
| 
 | 
   909 
 | 
| 
 | 
   910 #Setup our return values:
 | 
| 
 | 
   911 my $Start = 0; my  $End =0;	my $MotifFound = 0;
 | 
| 
 | 
   912 my $How;	#Literally How the motif was found (or not if blank)
 | 
| 
 | 
   913 
 | 
| 
 | 
   914 
 | 
| 
 | 
   915 =head4 Prepare the sequences and the patterns for use
 | 
| 
 | 
   916 
 | 
| 
 | 
   917 Specifically: trim off the start of the AA & DNA string already allocated to other CDRs or FWRs  
 | 
| 
 | 
   918 
 | 
| 
 | 
   919 Add in spaces into the AA regex pattern because we can't get regex-ex freespacing mode i.e. "$Var =~ m/$AAPat/x" working.
 | 
| 
 | 
   920 
 | 
| 
 | 
   921 
 | 
| 
 | 
   922 We take the "-1" as the CropPoint position to include the previous 3 nucleotides / AAs; remember to add this back on 
 | 
| 
 | 
   923 in position calculations.
 | 
| 
 | 
   924 
 | 
| 
 | 
   925 
 | 
| 
 | 
   926 =cut
 | 
| 
 | 
   927 
 | 
| 
 | 
   928 #Because igBLAST doesn't always report from the start of the read (primers and things are upstream):
 | 
| 
 | 
   929  
 | 
| 
 | 
   930 my $CropPoint = $FWR3_End - $FWR1_Start - 1 ;
 | 
| 
 | 
   931 #print "D: markUpCDR3: Crop point is: '$CropPoint'\n";
 | 
| 
 | 
   932 
 | 
| 
 | 
   933 #print "D: markUpCDR3: Cropping point is: '$CropPoint' characters from start\n";
 | 
| 
 | 
   934 #We trim off the parts we expect to find the CDR3 motifs in leaving at extra 3nts on to allow for base miss-calling:
 | 
| 
 | 
   935 
 | 
| 
 | 
   936 my $AA_Trimmed 		= substr ($AA, $CropPoint);
 | 
| 
 | 
   937 my $DNA_Trimmed 	= substr ($DNA ,$CropPoint);
 | 
| 
 | 
   938 #print "D: markUpCDR3: AA = '$AA' (untrimmed)\nD: markUpCDR3: TR = '$AA_Trimmed' (Trimmed) ", length ($AA_Trimmed)," nts long\n";
 | 
| 
 | 
   939 #print "D: markUpCDR3: Testing: AA = '$AA_Trimmed', DNA = '$DNA_Trimmed'\n";
 | 
| 
 | 
   940 
 | 
| 
 | 
   941 #This lovely hack is to account for the spaces in the AA sequence and we can't get the "$Var =~ m/$AAPat/x" working 
 | 
| 
 | 
   942 my $AAPat_Spaced;
 | 
| 
 | 
   943 foreach my $C_Char (0..length($AAPat)-1)	#The -1 is because we don't want trailing spaces until the next nt ->  AA translation.
 | 
| 
 | 
   944 	{	$AAPat_Spaced = $AAPat_Spaced.'\s+'.substr ($AAPat,$C_Char,1);	}
 | 
| 
 | 
   945 #And write this back into the main pattern we were passed:
 | 
| 
 | 
   946 $AAPat = $AAPat_Spaced;
 | 
| 
 | 
   947 
 | 
| 
 | 
   948 #temp hack:
 | 
| 
 | 
   949 #$AA_Trimmed = $AA;
 | 
| 
 | 
   950 my $MotifFound=0;		#So we can record which patterns we found
 | 
| 
 | 
   951 my $MotifPositionDNA	=-1;
 | 
| 
 | 
   952 my $MotifPositionAA		=-1;
 | 
| 
 | 
   953 
 | 
| 
 | 
   954 #print "D: markUpCDR3: Pattern: '$AAPat_Spaced'\n";
 | 
| 
 | 
   955 =head4 At DNA level: "TGG GGx xxx GGx" [+1]
 | 
| 
 | 
   956 
 | 
| 
 | 
   957 =cut
 | 
| 
 | 
   958 
 | 
| 
 | 
   959 #print "D: markUpCDR3: '$DNA_Trimmed' (Trimmed DNA string)\n";
 | 
| 
 | 
   960 	
 | 
| 
 | 
   961 if ($DNA_Trimmed =~ m/$DNAPat/)
 | 
| 
 | 
   962 	{
 | 
| 
 | 
   963 	$MotifPositionDNA = $+[0];	#Just the easiest way to do this in Perl  
 | 
| 
 | 
   964 #	print "D: markUpCDR3:: Found Motif match on DNA at bp: '$MotifPositionDNA'\n";
 | 
| 
 | 
   965 	$MotifFound = $MotifFound + 1;
 | 
| 
 | 
   966 	#Any more matches further on?
 | 
| 
 | 
   967 	my $LaterString = substr ($DNA_Trimmed, $MotifPositionDNA);
 | 
| 
 | 
   968 #	print "D: markUpCDR3: '$AA_Trimmed' (AA Trimmed string)\n";
 | 
| 
 | 
   969 #	print "D: markUpCDR3: '", substr ($DNA_Trimmed,0, $MotifPositionDNA)," (DNA until pattern match string)\n";
 | 
| 
 | 
   970 #	print "D: markUpCDR3: '$DNA_Trimmed' (Trimmed DNA string)\n";
 | 
| 
 | 
   971 #	print "D: markUpCDR3: '$LaterString' (Later part of DNA string)\n";
 | 
| 
 | 
   972 	if ($LaterString =~ m/$DNAPat/)
 | 
| 
 | 
   973 		{	print "D: markUPCDR3: Also got a match further down the DNA String: at ", $-[0] ," to ", $+ [0], " - which might be worrying\n";		}
 | 
| 
 | 
   974 	}
 | 
| 
 | 
   975 	
 | 
| 
 | 
   976 =head4 At AA level: "WGxG" [+2]
 | 
| 
 | 
   977 
 | 
| 
 | 
   978 =cut
 | 
| 
 | 
   979 		
 | 
| 
 | 
   980 if ($AA_Trimmed=~ m/$AAPat/)	
 | 
| 
 | 
   981 	{
 | 
| 
 | 
   982 	$MotifPositionAA = $+[0];	#Just the easiest way to do this in Perl
 | 
| 
 | 
   983 	$MotifFound = $MotifFound + 2;
 | 
| 
 | 
   984 #	print "D: markUpCDR3: Found Motif match on AA at position (on DNA remember): '$MotifPositionAA' (ie.)\n";
 | 
| 
 | 
   985 	(my $CDR3_seq) = substr ($AA_Trimmed, 0, $MotifPositionAA);
 | 
| 
 | 
   986 #	print "D: markUpCDR3: Seq ='$CDR3_seq' - as detected\n";
 | 
| 
 | 
   987 
 | 
| 
 | 
   988 	}
 | 
| 
 | 
   989 
 | 
| 
 | 
   990 =head4 Assess the results of motif position finding
 | 
| 
 | 
   991 
 | 
| 
 | 
   992 =cut
 | 
| 
 | 
   993 
 | 
| 
 | 
   994 #print "D: markUpCDR3: MotifFound  = '$MotifFound'\n";
 | 
| 
 | 
   995 
 | 
| 
 | 
   996 if ($MotifFound ==0)
 | 
| 
 | 
   997 	{	return ($Start, $End, $MotifFound);		}	#The easy one really: return we didn't find the CDR3
 | 
| 
 | 
   998 
 | 
| 
 | 
   999 #
 | 
| 
 | 
  1000 $Start = $FWR3_End;	#We assume the end of the FWR3 is the start of CDR3:
 | 
| 
 | 
  1001 #Just found in DNA:
 | 
| 
 | 
  1002 if ($MotifFound ==1)
 | 
| 
 | 
  1003 	{
 | 
| 
 | 
  1004 	$Start = $FWR3_End;	#We assume the end of the FWR3 is the start of CDR3:
 | 
| 
 | 
  1005 	$End = $MotifPositionDNA;
 | 
| 
 | 
  1006 	$How = "MOTIF_FOUND_IN_DNA";		
 | 
| 
 | 
  1007 	}
 | 
| 
 | 
  1008 #Just found in AA:
 | 
| 
 | 
  1009 if ($MotifFound ==2)
 | 
| 
 | 
  1010 	{
 | 
| 
 | 
  1011 	$End = $MotifPositionAA;
 | 
| 
 | 
  1012 	$How = "MOTIF_FOUND_IN_AA";		
 | 
| 
 | 
  1013 	}
 | 
| 
 | 
  1014 
 | 
| 
 | 
  1015 #Found in both, DNA has priority:
 | 
| 
 | 
  1016 if ($MotifFound ==3)
 | 
| 
 | 
  1017 	{	
 | 
| 
 | 
  1018 	$Start = $FWR3_End ;	#We assume the end of the FWR3 is the start of CDR3:
 | 
| 
 | 
  1019 	$End = $MotifPositionDNA;
 | 
| 
 | 
  1020 	$How = "MOTIF_FOUND_IN_BOTH";
 | 
| 
 | 
  1021 	}	
 | 
| 
 | 
  1022 
 | 
| 
 | 
  1023 #print "D: markUpCDR3: Motif found = $MotifFound\n";
 | 
| 
 | 
  1024 
 | 
| 
 | 
  1025 =head4 These next few lines are for testing / diagnostics only - disable for general use
 | 
| 
 | 
  1026 
 | 
| 
 | 
  1027 If you are interested in getting the CDR3 directly then remember the main coordinate system is defined such that 
 | 
| 
 | 
  1028 the start of FWR1 is unlikely to be at nt 1.
 | 
| 
 | 
  1029 
 | 
| 
 | 
  1030 =cut
 | 
| 
 | 
  1031 
 | 
| 
 | 
  1032 $Start = $FWR3_End - $FWR1_Start -1;
 | 
| 
 | 
  1033 $End = $End + $CropPoint;
 | 
| 
 | 
  1034 my $CDR3_RegionLength = $End - $Start;
 | 
| 
 | 
  1035 #print "D: markUpCDR3: CDR3 Length=  $Start - $End = '$CDR3_RegionLength'\n";
 | 
| 
 | 
  1036 (my $CDR3_seq) = substr ($AA, $Start, $CDR3_RegionLength);
 | 
| 
 | 
  1037 
 | 
| 
 | 
  1038 #Add onto the coordinates what we trimmed off:
 | 
| 
 | 
  1039 
 | 
| 
 | 
  1040 
 | 
| 
 | 
  1041 #print "D: markUpCDR3: Seq ='$CDR3_seq'\n"; 
 | 
| 
 | 
  1042 
 | 
| 
 | 
  1043 print "D: markUpCDR3: returning: $Start, $End, $How, ($MotifFound) [NB: offset of :'+ $FWR1_Start'\n";
 | 
| 
 | 
  1044 #die "HIT BLOCK\n";
 | 
| 
 | 
  1045 return ($Start + $FWR1_Start, $End + $FWR1_Start, $How);
 | 
| 
 | 
  1046 }
 | 
| 
 | 
  1047 
 | 
| 
 | 
  1048 
 | 
| 
 | 
  1049 sub printOUTPUTData {
 | 
| 
 | 
  1050 =head2 sub: $OutputDataString = printOUTPUTData {\%OutputData}
 | 
| 
 | 
  1051 
 | 
| 
 | 
  1052 When passed an array containing the appropriate CDR, Top V / D/ J genes and the seqeunce ID.
 | 
| 
 | 
  1053 This prepared and then returned as a text string that can then be printed to STDOUT:
 | 
| 
 | 
  1054 
 | 
| 
 | 
  1055 	print (printOUTPUTData (\%OutputData)); 
 | 
| 
 | 
  1056 
 | 
| 
 | 
  1057 Any missing data in the Hash array it polietly ignored and a null string printed in place.
 | 
| 
 | 
  1058 The text field is tab delimited; there are no extra trailing tabs or carriage returns in place. 
 | 
| 
 | 
  1059 
 | 
| 
 | 
  1060 Actually the fields printed out are stored in an index array.  
 | 
| 
 | 
  1061 
 | 
| 
 | 
  1062 =head3 Header output
 | 
| 
 | 
  1063 
 | 
| 
 | 
  1064 If the routine is passed a key 'HEADER' then the header columns are returned as that string.
 | 
| 
 | 
  1065 This is tested first - so don't add this unless you mean to.
 | 
| 
 | 
  1066 
 | 
| 
 | 
  1067 =cut
 | 
| 
 | 
  1068 
 | 
| 
 | 
  1069 my @HeaderFields = ("ID", "VDJ Frame", "Top V Gene", "Top D Gene", "Top J Gene", 
 | 
| 
 | 
  1070 					"CDR1 Seq", "CDR1 Length",
 | 
| 
 | 
  1071 					"CDR2 Seq", "CDR2 Length",
 | 
| 
 | 
  1072 					"CDR3 Seq", "CDR3 Length", "CDR3 Seq DNA", "CDR3 Length DNA", "Strand",
 | 
| 
 | 
  1073 					"CDR3 Found How");
 | 
| 
 | 
  1074 	
 | 
| 
 | 
  1075 my $OutputString = "OUTPUT:";	#What we are going to build the output into. 
 | 
| 
 | 
  1076 
 | 
| 
 | 
  1077 =head4 Print Header & Exit?
 | 
| 
 | 
  1078 
 | 
| 
 | 
  1079 =cut
 | 
| 
 | 
  1080 
 | 
| 
 | 
  1081 my ($Data_ref) = @_;
 | 
| 
 | 
  1082 #print "D: printOUTPUTData: Running\n";
 | 
| 
 | 
  1083 
 | 
| 
 | 
  1084 if (exists $$Data_ref {"HEADER"})
 | 
| 
 | 
  1085 	{
 | 
| 
 | 
  1086 	$OutputString .= "\t";
 | 
| 
 | 
  1087 	for(my $n = 0; $n <= $#HeaderFields; $n++)
 | 
| 
 | 
  1088 	{
 | 
| 
 | 
  1089 		$OutputString .= $HeaderFields[$n];
 | 
| 
 | 
  1090 		$OutputString .= "\t" if($n < $#HeaderFields);
 | 
| 
 | 
  1091 	}
 | 
| 
 | 
  1092 	
 | 
| 
 | 
  1093 	# foreach my $C_Header (@HeaderFields)
 | 
| 
 | 
  1094 		# {	$OutputString .= "$C_Header";		}	#
 | 
| 
 | 
  1095 
 | 
| 
 | 
  1096 	print "D: printOUTPUTData: HEADER Printout requested '@HeaderFields'\n";
 | 
| 
 | 
  1097 	return ($OutputString);
 | 
| 
 | 
  1098 	}
 | 
| 
 | 
  1099 
 | 
| 
 | 
  1100 =head3 Assemble whatever data we have - and tab delimit the null fields
 | 
| 
 | 
  1101 
 | 
| 
 | 
  1102 =cut 
 | 
| 
 | 
  1103 #print "D: printOUTPUTData: Will pretty print this:\n", Dumper $Data_ref;
 | 
| 
 | 
  1104 foreach my $C_Header (@HeaderFields)
 | 
| 
 | 
  1105 	{	
 | 
| 
 | 
  1106 	
 | 
| 
 | 
  1107 	if (exists ($$Data_ref {$C_Header}))
 | 
| 
 | 
  1108 		{	$OutputString .= "\t". $$Data_ref{$C_Header};	}	#We have data to print out
 | 
| 
 | 
  1109 		else
 | 
| 
 | 
  1110 		{	$OutputString .="\t";	}	#Add a trailing space
 | 
| 
 | 
  1111 	}	#
 | 
| 
 | 
  1112 	
 | 
| 
 | 
  1113 return ($OutputString);
 | 
| 
 | 
  1114 }
 | 
| 
 | 
  1115 
 | 
| 
 | 
  1116 
 | 
| 
 | 
  1117 ######################################### Code Junk ########################
 | 
| 
 | 
  1118 
 | 
| 
 | 
  1119 
 | 
| 
 | 
  1120 =head2 Code Junk Attic
 | 
| 
 | 
  1121 
 | 
| 
 | 
  1122 =head3 Demonstrates how to reverse translate an amino acid sequence into DNA:  
 | 
| 
 | 
  1123 
 | 
| 
 | 
  1124 use Bio::Tools::CodonTable;
 | 
| 
 | 
  1125 use Bio::Seq;
 | 
| 
 | 
  1126 
 | 
| 
 | 
  1127 # print possible codon tables
 | 
| 
 | 
  1128   my $tables = Bio::Tools::CodonTable->tables;
 | 
| 
 | 
  1129   while ( (my $id, my $name) = each %{$tables} ) {
 | 
| 
 | 
  1130     print "$id = $name\n";
 | 
| 
 | 
  1131   }
 | 
| 
 | 
  1132  my $CodonTable   = Bio::Tools::CodonTable->new();
 | 
| 
 | 
  1133  
 | 
| 
 | 
  1134  my $ExampleSeq = Bio::PrimarySeq->new(-seq=>"WGxG", -alphabet => 'protein') or die "Cannot create sequence object\n";
 | 
| 
 | 
  1135   
 | 
| 
 | 
  1136 
 | 
| 
 | 
  1137 my $rvSeq = $CodonTable->reverse_translate_all($ExampleSeq); 
 | 
| 
 | 
  1138 print "D: '$rvSeq'\n";
 | 
| 
 | 
  1139 die "TEST OVER\n"; 
 | 
| 
 | 
  1140 
 | 
| 
 | 
  1141 =cut
 | 
| 
 | 
  1142 
 | 
| 
 | 
  1143 
 | 
| 
 | 
  1144 =head3 For processing the 'Alignment lines' section of the alginment table
 | 
| 
 | 
  1145 
 | 
| 
 | 
  1146 		#If we are ever interested; then enable the code below:
 | 
| 
 | 
  1147 #		print ": Alignment\n";
 | 
| 
 | 
  1148 #		$InfoPanel =~ s/^ +//;	$InfoPanel =~ s/ +$//;		#Clean off trailing spaces
 | 
| 
 | 
  1149 #		my ($Germclass, $PID, $PID_Counts, $Allele) 	=	split (/\s+/,$InfoPanel);	#Split on spaces
 | 
| 
 | 
  1150 ##Enable if you need to know what we just found: 
 | 
| 
 | 
  1151 #		#print "D: Fields are (Germclass, PID, PID_Counts, Allele) \t$Germclass, $PID, $PID_Counts, $Allele\n";
 | 
| 
 | 
  1152 #		#A reality check: we should have an Allele - or some text here.
 | 
| 
 | 
  1153 #		unless (defined $Allele && $Allele ne "")
 | 
| 
 | 
  1154 #			{	warn "Cannot get Allele for Line '$C_Line' - implies improper parsing: '",substr ($Lines[$C_Line],0,15),"...'\n";	}
 | 
| 
 | 
  1155 #		if (exists ($Alginments {$Germclass}{$Allele}))
 | 
| 
 | 
  1156 #			{	$Alginments {$Germclass}{$Allele}	=	$Alginments {$Germclass}{$Allele}.$CurrentAASequence;	}	#Carry on adding
 | 
| 
 | 
  1157 #		else	#more work needed as we need to 'pad' the sequence with fake gap characters)
 | 
| 
 | 
  1158 #			{
 | 
| 
 | 
  1159 ##Do we still need this padding?  I don't think so				
 | 
| 
 | 
  1160 #				
 | 
| 
 | 
  1161 #
 | 
| 
 | 
  1162 #			my $PaddingChars = ($ThisQueryStart-$Query_Start);
 | 
| 
 | 
  1163 #			print "D: New gene found: need to pad it with ($ThisQueryStart-$Query_Start) i.e. '$PaddingChars' characters\n";
 | 
| 
 | 
  1164 #			#To help testing, calculate this first:
 | 
| 
 | 
  1165 #			my $PaddingString = " "x $PaddingChars;
 | 
| 
 | 
  1166 #			$Alginments {$Germclass}{$Allele}	=	$CurrentAASequence;	
 | 
| 
 | 
  1167 #			}
 | 
| 
 | 
  1168 #		next
 | 
| 
 | 
  1169 
 | 
| 
 | 
  1170 =head3 Demonstration of Pattern match positions
 | 
| 
 | 
  1171 
 | 
| 
 | 
  1172 my $Text = "12345TTT   TTAAAAA";
 | 
| 
 | 
  1173 my $TestPat = "TTT\\s+TT";
 | 
| 
 | 
  1174 (my $Result)= $Text =~ m/$TestPat/;
 | 
| 
 | 
  1175 print "D: Two vars are: - = ",$-[0], " &  + =", $+[0]," for test pattern '$TestPat'\n";
 | 
| 
 | 
  1176 
 | 
| 
 | 
  1177 sub printCDR3 {
 | 
| 
 | 
  1178 
 | 
| 
 | 
  1179 =head3 Subroutine: printCDR3 ($CDR3_Start, $CDR3_End, "SUMMARY_TABLE", $AAQuerySequence, $DNAQuerySequence);
 | 
| 
 | 
  1180 
 | 
| 
 | 
  1181 ???? IS THIS FUNCTION IN USE ?????
 | 
| 
 | 
  1182 	
 | 
| 
 | 
  1183 Handles the printing of the output when passed information about the CDR3 region.
 | 
| 
 | 
  1184 
 | 
| 
 | 
  1185 
 | 
| 
 | 
  1186 The result is sent returned as a text string in this version hence use it like this if you want to send it to STDOUT:
 | 
| 
 | 
  1187 
 | 
| 
 | 
  1188  print printCDR3 ($CDR3_Start, $CDR3_End, "SUMMARY_TABLE", $AAQuerySequence, $DNAQuerySequence), "\n";
 | 
| 
 | 
  1189   
 | 
| 
 | 
  1190 #=cut 	
 | 
| 
 | 
  1191 
 | 
| 
 | 
  1192 #Despite the similarity in names, these are all local copies passed to us:
 | 
| 
 | 
  1193 
 | 
| 
 | 
  1194 my ($Start, $End, $Tag, $FullAAQuerySequence, $FullDNAQuerySequence) = @_;
 | 
| 
 | 
  1195 
 | 
| 
 | 
  1196 #For DNA:
 | 
| 
 | 
  1197 my ($CDR_DNA_Seq) = substr ($FullDNAQuerySequence, $Start, $Start+$End);
 | 
| 
 | 
  1198 my ($CDR_DNA_Length) = length ($CDR_DNA_Seq);
 | 
| 
 | 
  1199 
 | 
| 
 | 
  1200 #For AA:
 | 
| 
 | 
  1201 my ($CDR_AA_Seq) = substr ($FullAAQuerySequence, $Start, $Start+$End);
 | 
| 
 | 
  1202 my ($CDR_AA_Length) = length ($CDR_AA_Seq);
 | 
| 
 | 
  1203 
 | 
| 
 | 
  1204 my $ReturnString = join ("\t", $CDR_DNA_Seq, $CDR_DNA_Length, $CDR_AA_Seq, $CDR_AA_Length, $Tag); #Create here so we can inspect it / post process it if needed:
 | 
| 
 | 
  1205 print "D: SUB: printCDR3: As returned: '$ReturnString'\n";
 | 
| 
 | 
  1206 return ($ReturnString);
 | 
| 
 | 
  1207 
 | 
| 
 | 
  1208 }
 | 
| 
 | 
  1209 
 | 
| 
 | 
  1210 =cut 
 | 
| 
 | 
  1211 
 | 
| 
 | 
  1212 
 | 
| 
 | 
  1213 
 | 
| 
 | 
  1214 =head2 Change Log
 | 
| 
 | 
  1215 
 | 
| 
 | 
  1216 =head3 Version 1.2
 | 
| 
 | 
  1217 
 | 
| 
 | 
  1218  1) Fixed the 'Process recrod request' feature' [was failed increment in $Record]
 | 
| 
 | 
  1219  2) Deleted / Deactivated the function 'printCDR3' [wasn't in used; kept if useful for parts].  
 | 
| 
 | 
  1220  	This function is replaced by the more general printOUTPUTData()
 | 
| 
 | 
  1221  3) A tag for the CDR3 status is now output for every record / read.  
 | 
| 
 | 
  1222     Initially this is set to "NOT_FOUND" and changed if evidence for the CDR3 is found.   
 | 
| 
 | 
  1223 
 | 
| 
 | 
  1224 =head4 Version 1.3
 | 
| 
 | 
  1225 
 | 
| 
 | 
  1226  1) The tophit line was split on whitespace, however sometimes the VJFrame is something like “In-frame with stop codon”, 
 | 
| 
 | 
  1227   which means the line is also split on the spaces therein. It now splits on tabs only, and this seems to work properly.
 | 
| 
 | 
  1228   - found by Bas Horsman. 
 | 
| 
 | 
  1229 
 | 
| 
 | 
  1230 =head4 Version 1.3a
 | 
| 
 | 
  1231 
 | 
| 
 | 
  1232  1) "MOTIF_FOUND_IN_AA" reported correctly (was impossible previously due to addition error to the $MotifFound var (never could == 3)
 | 
| 
 | 
  1233  
 | 
| 
 | 
  1234 =cut 
 | 
| 
 | 
  1235 
 | 
| 
 | 
  1236 =head4 Version 1.4
 | 
| 
 | 
  1237 
 | 
| 
 | 
  1238  1) Now processes files using Mac/Unix/MS-DOS newline characters:
 | 
| 
 | 
  1239  
 | 
| 
 | 
  1240   $_ =~ s/\r\n/\n/g;		#In case line ends are MS-DOS
 | 
| 
 | 
  1241   $_ =~ s/\r/\n/g;		#In case line ends are Mac
 | 
| 
 | 
  1242   #The whole record - one per read - is now stored in $_
 | 
| 
 | 
  1243   my @Lines =split (/\R/,$_);	#Split on new lines 
 | 
| 
 | 
  1244 
 | 
| 
 | 
  1245 =head4 Version 1.4a
 | 
| 
 | 
  1246 
 | 
| 
 | 
  1247 1) Fixed the length of the CDR3 AA string being reported correctly:
 | 
| 
 | 
  1248 
 | 
| 
 | 
  1249  $OUTPUT_Data{"CDR3 Length"} = $CDR3_Length;  
 | 
| 
 | 
  1250  to: 
 | 
| 
 | 
  1251  $OUTPUT_Data{"CDR3 Length"} = $CDR3_Seq_AA_Length;
 | 
| 
 | 
  1252   |