Help with awk in counting characters based on a column

11-26-2012

Hello,
I am using awk on Ubuntu 12.04.

I have a file like the one below, with 2172 rows and 44707 columns. ABO and GPO are the names of my populations.

Code:

ABO_1  1  2
ABO_1  1  2
ABO_2  1  1 
ABO_2  1  2
GPO_1   1  1 
GPO_1  2  2
GPO_2   1  0 
GPO_2  2  0

For each population, I want to count the 1s and 2s in every column, ignoring any 0s, and print 0 when a column contains no 1s or no 2s. The output should look like this:

Code:

4 0 2 2 
1 3 1 1

Here "4 0" is the number of 1s and 2s in the second column for the first population, and "2 2" is the same pair for the second population. "1 3" is the number of 1s and 2s in the third column for the first population, and so on.

Thank you very much for your help.

Homa

11-26-2012

Try this:

Code:

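awk -F "[_ ]" '
function print_o() {
    # print the counts of 1s and 2s for fields 3 and 4, using 0 when a count is unset
    print X[1]?X[1]:0, X[2]?X[2]:0, Y[1]?Y[1]:0, Y[2]?Y[2]:0
    delete X[1]; delete Y[1]
    delete X[2]; delete Y[2]
}
$1 != s && NR > 1 { print_o() }    # population name changed: print and reset
{ X[$3]++; Y[$4]++; s = $1 }
END { print_o() }' file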

pamu

11-26-2012

Thank you. I tried it on the test file I posted above, but it is extremely slow and has not finished yet, so it would take even longer on my much larger real file. I have written the following code myself:

Code:

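{
    # count 1s and 2s separately for every column after the first
    for (i = 2; i <= NF; i++) {
        if ($i == "1") c_one[i]++
        else if ($i == "2") c_two[i]++
    }
}
END {
    # one output line per column: number of 1s, then number of 2s
    for (i = 2; i <= NF; i++)
        printf("%d  %d\n", c_one[i], c_two[i])
}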

However, this only works when the populations are in separate files, with ABO in one file and GPO in another. Maybe it can be modified to handle a single file containing all the populations.

---------- Post updated at 09:00 AM ---------- Previous update was at 08:53 AM ----------

Sorry, I made a mistake. It is not slow, but it gives me these numbers:

Code:

0 0 6 2

for the file above, which is not correct.

Homa
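
A possible explanation for the unexpected numbers, offered as an assumption because the exact whitespace in the real file is not known: with -F "[_ ]" every single space is a separator of its own, so the runs of two or three spaces in the data as originally posted create empty fields and shift the values out of $3 and $4. The sketch below compares the two splitting modes on one line of that data. The helper names fields_single_space and fields_default are made up for illustration.

```shell
# Hypothetical helpers (names are not from the thread): each reads lines on
# stdin and prints the field count followed by the two fields the counting
# script would look at.

# Separator from the suggested script: "_" or exactly ONE space per separator.
fields_single_space() {
    awk -F '[_ ]' '{ print NF ": [" $3 "] [" $4 "]" }'
}

# Default separator: any run of blanks or tabs counts as one separator.
fields_default() {
    awk '{ print NF ": [" $2 "] [" $3 "]" }'
}

# Usage, with a line copied from the first post (two spaces between values):
#   printf 'ABO_1  1  2\n' | fields_single_space    # prints: 6: [] [1]
#   printf 'ABO_1  1  2\n' | fields_default         # prints: 3: [1] [2]
```

With the single-character separator the line has six fields and $3 is empty, so the counters for 1 and 2 are never incremented for that field. The later test file uses single spaces throughout, which would explain why the same script gives sensible numbers there.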

11-26-2012

Quote:

Originally Posted by Homa

Sorry, I made a mistake. It is not slow, but it gives me these numbers:

Code:

0 0 6 2

for the file above, which is not correct.

Please check:

Code:

$ cat file
ABO_1 1 2
ABO_1 1 2
ABO_2 1 1
ABO_2 1 2
GPO_1 1 1
GPO_1 2 2
GPO_2 1 0
GPO_2 2 0
ABO_1 1 2
ABO_1 1 2
ABO_2 1 1
ABO_2 1 2

Code:

$ awk -F "[_ ]" 'function print_o(){
print X[1]?X[1]:0,X[2]?X[2]:0,Y[1]?Y[1]:0,Y[2]?Y[2]:0;
delete X[1];
delete Y[1];
delete X[2];
delete Y[2];
}
$1 != s && NR > 1{print_o()}
{X[$3]++;Y[$4]++;s=$1}END{print_o()}' file
4 0 1 3
2 2 1 1
4 0 1 3

Code:

$ awk -F "[_ ]" 'function print_o(){
print X[1,fn],X[2,fn],Y[1,fn],Y[2,fn];
fn=NR;
}
$1 != s && NR > 1{print_o()}
NR==1{fn=NR}
{X[$3,fn]++;Y[$4,fn]++;s=$1}END{print_o()}' OFS="\t" file
4               1       3
2       2       1       1
4               1       3

Code:

$ awk -F "[_ ]" 'function print_o(){
print X[1,fn]?X[1,fn]:0,X[2,fn]?X[2,fn]:0,Y[1,fn]?Y[1,fn]:0,Y[2,fn]?Y[2,fn]:0;
fn=NR;
}
$1 != s && NR > 1{print_o()}
NR==1{fn=NR}
{X[$3,fn]++;Y[$4,fn]++;s=$1}END{print_o()}'  file
4 0 1 3
2 2 1 1
4 0 1 3

Choose whichever option you prefer.

pamu

11-26-2012

That works, thanks a lot. I am sorry, but I still have a problem: my original file contains 47 populations. The script works well on the test file, but on my original file it prints 4 columns when it should print 47*2 columns. Sorry for the basic questions.

Homa

11-26-2012

Try something like this.
The loop starts at i=3 because fields 1 and 2 hold the population name and its number, which are skipped when counting occurrences of 1 and 2.

Code:

awk -F "[_ ]" 'function print_o(){
for(i=3;i<=NF;i++){
print X[1,fn,i]?X[1,fn,i]:0,X[2,fn,i]?X[2,fn,i]:0,Y[1,fn,i]?Y[1,fn,i]:0,Y[2,fn,i]?Y[2,fn,i]:0;
}
fn=NR;
}
$1 != s && NR > 1{print_o()}
NR==1{fn=NR}
{for(i=3;i<=NF;i++){X[$i,fn,i]++;Y[$i,fn,i]++;s=$1}}END{print_o()}'  file

pamu
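
In the script above, the print statement inside the loop emits four values per line, and X and Y receive identical increments, so each output line has four columns however many fields the input has. A minimal sketch of the same streaming idea extended to every column follows. It assumes that all rows of one population are adjacent in the file and that the population name is the part of field 1 before the first underscore. The function name count_per_population is hypothetical.

```shell
# Hypothetical helper (not from the thread): prints one line per population,
# holding a "ones twos" pair for every data column.
# Assumes rows belonging to the same population are adjacent.
count_per_population() {
    awk '
    function flush(   i) {
        # emit the pairs for the population just finished, then reset
        for (i = 2; i <= nf; i++)
            printf "%s%d %d", (i > 2 ? " " : ""), ones[i], twos[i]
        print ""
        split("", ones); split("", twos)    # portable way to empty an array
    }
    {
        pop = $1
        sub(/_.*/, "", pop)                 # ABO_1 -> ABO
        if (NR > 1 && pop != prev) flush()
        prev = pop
        nf = NF                             # remember the column count
        for (i = 2; i <= NF; i++) {
            if ($i == "1")      ones[i]++
            else if ($i == "2") twos[i]++   # 0s and anything else are ignored
        }
    }
    END { if (NR > 0) flush() }' "$@"
}
```

Because the counters are reset after each population, only two counters per column are held at a time. The layout is one row per population, which is the transpose of the layout asked for in the first post.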

11-26-2012

Unfortunately, it still gives me 4 columns. I will try to separate the populations into different files.

Homa
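
The layout requested in the first post, one row per data column and one pair of counts per population, can be sketched by keying the per-column counters from the single-population script on the population name as well. This is an illustration under assumptions, not a solution confirmed in the thread: the population name is taken to be the part of field 1 before the first underscore, populations are printed in order of first appearance, and the function name count_by_pop is hypothetical.

```shell
# Hypothetical helper (not from the thread): prints one line per data column,
# with a "ones twos" pair for each population in order of first appearance.
count_by_pop() {
    awk '
    {
        pop = $1
        sub(/_.*/, "", pop)                  # ABO_1 -> ABO
        if (!(pop in seen)) {                # remember first-seen order
            seen[pop] = 1
            order[++npop] = pop
        }
        # default field splitting tolerates runs of spaces and tabs
        for (i = 2; i <= NF; i++) {
            if ($i == "1")      ones[pop, i]++
            else if ($i == "2") twos[pop, i]++   # 0s are ignored
        }
        if (NF > maxnf) maxnf = NF
    }
    END {
        for (i = 2; i <= maxnf; i++) {
            for (p = 1; p <= npop; p++)
                printf "%s%d %d", (p > 1 ? " " : ""), ones[order[p], i], twos[order[p], i]
            print ""                         # %d prints 0 for unset counters
        }
    }' "$@"
}

# Usage: count_by_pop file
# For the eight-line sample in the first post this prints:
#   4 0 2 2
#   1 3 1 1
```

All counters stay in memory until the END block, so for 47 populations and 44706 data columns this holds roughly 4.2 million array entries. Whether that is acceptable depends on the awk implementation and the machine.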
