Parse tab delimited file, check condition and delete row

09-18-2012

I am fairly new to programming and am trying to solve this problem. I have a file like this:

CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam

tg93 77 T C T T T T T

tg93 79 C - C C C - -

tg93 79 C G C C C C G C

tg93 80 G A G G G G A A G

tg93 81 A C A A A A C C C

tg93 86 C A C C A A A A C

tg93 105 A G A A A A A G A

tg93 108 A G A A A A G A A

tg93 114 T C T T T T T C T

tg93 131 A C A A A A A A A

tg93 136 G C C G C C G G G

tg93 150 CTCTC - CTCTC - CTCTC CTCTC

In the header of this file:

CHROM - name
POS - position
REF - reference
ALT - alternate
10_sample.bam to 16_sample.bam - samples

I want to count how many times the letters in the REF and ALT columns occur among the samples. If either of them occurs fewer than two times, I need to delete that row. For example, in the first row I have 'T' in REF and 'C' in ALT. In the 7 samples there are 5 T's, 2 blanks and no C, so I need to delete this row. In the second row, REF is 'C' and ALT is '-'. In the seven samples we have 3 C's, 2 '-'s and 2 blanks, so we keep this row, since both C and - occur at least two times. Blanks are always ignored while counting.

The final file after filtering is

#CHROM POS REF ALT 10_sample.bam 11_sample.bam 12_sample.bam 13_sample.bam 14_sample.bam 15_sample.bam 16_sample.bam

tg93 79 C - C C C - -

tg93 80 G A G G G G A A G

tg93 81 A C A A A A C C C

tg93 86 C A C C A A A A C

tg93 108 A G A A A A G A A

tg93 136 G C C G C C G G G

I am able to read the columns into arrays and display them in my code, but I am not sure how to set up the loops that read each base, count its occurrences, and decide whether to keep the row. Can anyone tell me how I should proceed? It would also be helpful if you have any example code I can build upon.

Thank you for the help !!

empyrean

09-18-2012

Code:

 
awk -f c.awk infile

where c.awk:

Code:

 
NR == 1 { print "#" $0; }
NR > 1 {
  l1c=l2c=0;
  for (i=5; i<=NF; i++) {
    if ($3 == $(i)) l1c++;
    if ($4 == $(i)) l2c++;
  }
  if (l1c>1 && l2c>1) print $0;
}

rdrtx1

09-18-2012

I am not too familiar with awk.
This works perfectly. I would love to understand this code. Can you explain it briefly?
Thank you so much for the code.

empyrean

09-18-2012

Code:

 
NR == 1 { print "#" $0; }         # print record number 1 preceded by "#"
NR > 1 {                          # for record number > 1
  l1c=l2c=0;                      # set counters
  for (i=5; i<=NF; i++) {         # for fields 5 and greater
    if ($3 == $(i)) l1c++;        # if field matches field 3 increment counter 1
    if ($4 == $(i)) l2c++;        # if field matches field 4 increment counter 2
  }
  if (l1c>1 && l2c>1) print $0;   # if counters are both > 1 print line;
}
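
For anyone who wants to try the logic above on a file that really contains empty cells, here is a minimal sketch. It is a variant, not the script as posted: it splits on tabs explicitly with -F'\t' so blank cells stay in their own columns (with awk's default splitting, runs of tabs collapse and the empty cells simply vanish; the counts come out the same, but the columns no longer line up). The file names sample.tsv and filtered.tsv, the min variable, and the positions of the blank cells in the sample rows are assumptions made up for illustration.

Code:

```shell
# Build a small tab-delimited sample. Blank calls are empty fields
# (two tabs in a row); where the blanks sit is assumed for illustration.
{
  printf 'CHROM\tPOS\tREF\tALT\t10_sample.bam\t11_sample.bam\t12_sample.bam\t13_sample.bam\t14_sample.bam\t15_sample.bam\t16_sample.bam\n'
  printf 'tg93\t77\tT\tC\tT\tT\t\tT\tT\t\tT\n'
  printf 'tg93\t79\tC\t-\tC\t\tC\tC\t-\t\t-\n'
  printf 'tg93\t80\tG\tA\tG\tG\tG\tG\tA\tA\tG\n'
  printf 'tg93\t131\tA\tC\tA\tA\tA\tA\tA\tA\tA\n'
} > sample.tsv

# Keep a row only if REF ($3) and ALT ($4) each occur at least "min"
# times among the sample columns (field 5 onward).
awk -F'\t' -v min=2 '
NR == 1 { print "#" $0; next }      # header: prefix with "#"
{
  ref = alt = 0                     # reset counters for this row
  for (i = 5; i <= NF; i++) {
    if ($i == "") continue          # blank cell: ignore
    if ($i == $3) ref++             # sample matches REF
    if ($i == $4) alt++             # sample matches ALT
  }
  if (ref >= min && alt >= min) print
}' sample.tsv > filtered.tsv

cat filtered.tsv
```

On this sample only the header and the rows at positions 79 and 80 are printed; rows 77 and 131 are dropped because C never appears among their samples. Changing min on the command line adjusts the threshold without editing the script.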

rdrtx1

09-18-2012

Cool, much clearer now. Easy, and very few lines of code. Thank you, rdrtx1.

empyrean
