I am fairly new to programming and am trying to solve this problem. I have a file like this:
CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam |
tg93 77 T C T T T T T |
tg93 79 C - C C C - - |
tg93 79 C G C C C C G C |
tg93 80 G A G G G G A A G |
tg93 81 A C A A A A C C C |
tg93 86 C A C C A A A A C |
tg93 105 A G A A A A A G A |
tg93 108 A G A A A A G A A |
tg93 114 T C T T T T T C T |
tg93 131 A C A A A A A A A |
tg93 136 G C C G C C G G G |
tg93 150 CTCTC - CTCTC - CTCTC CTCTC |
In this file, the header columns are:
CHROM - name
POS - position
REF - reference
ALT - alternate
10_sample.bam to 16_sample.bam - samples
I want to count how many times the letters in the REF and ALT columns occur among the samples. If either of them occurs fewer than two times, I need to delete that row. For example, in the first row, REF is 'T' and ALT is 'C'. Across the 7 samples there are 5 T's, 2 blanks and no C, so I need to delete this row. In the second row, REF is 'C' and ALT is '-'. Across the seven samples there are 3 C's, 2 '-'s and 2 blanks, so we keep this row because C and '-' each occur at least two times. Blanks are always ignored while counting.
The final file after filtering is:
#CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam |
tg93 79 C - C C C - - |
tg93 80 G A G G G G A A G |
tg93 81 A C A A A A C C C |
tg93 86 C A C C A A A A C |
tg93 108 A G A A A A G A A |
tg93 136 G C C G C C G G G |
I am able to read the columns into arrays and display them in my code, but I am not sure how to start the loops that read each base, count its occurrences and keep or drop the row. Can anyone tell me how I should proceed? It would also be helpful if you have any example code I can build on.
Thank you for the help!
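One possible starting point, as a minimal Python sketch of the counting rule described above. It rests on several assumptions that are not stated in the post: the file is tab-separated, blank calls are empty fields between tabs, the first line is the header, and the sample columns start at the fifth column. The names `keep_row`, `filter_table`, `MIN_COUNT` and `FIRST_SAMPLE_COL` are made up for illustration. The demo rows use the counts from the first two rows of the question (5 T's and 2 blanks; 3 C's, 2 '-'s and 2 blanks), but the positions of the blanks are invented.

```python
import csv
import io

MIN_COUNT = 2         # assumption: keep when REF and ALT each occur at least twice
FIRST_SAMPLE_COL = 4  # assumption: CHROM, POS, REF, ALT come before the samples


def keep_row(fields, min_count=MIN_COUNT):
    """Return True if REF and ALT each occur at least min_count times in the samples."""
    ref, alt = fields[2], fields[3]
    # Blank cells are dropped before counting.
    calls = [c.strip() for c in fields[FIRST_SAMPLE_COL:] if c.strip()]
    return calls.count(ref) >= min_count and calls.count(alt) >= min_count


def filter_table(in_handle, out_handle):
    """Copy the header line, then write only the rows that pass keep_row."""
    reader = csv.reader(in_handle, delimiter="\t")
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(next(reader))  # header line is copied unchanged
    for fields in reader:
        if fields and keep_row(fields):  # skip empty lines
            writer.writerow(fields)


if __name__ == "__main__":
    # Made-up demo input; blank calls are empty strings between tabs.
    demo = (
        "CHROM\tPOS\tREF\tALT\ts1\ts2\ts3\ts4\ts5\ts6\ts7\n"
        "tg93\t77\tT\tC\tT\tT\t\tT\tT\t\tT\n"
        "tg93\t79\tC\t-\tC\tC\t\tC\t-\t-\t\n"
    )
    out = io.StringIO()
    filter_table(io.StringIO(demo), out)
    print(out.getvalue(), end="")
```

For a real file, `filter_table` could be called with two handles from `open("in.txt", newline="")` and `open("out.txt", "w", newline="")` (hypothetical file names). If the file is delimited by something other than tabs, only the `delimiter` argument would need to change.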