Help with awk in counting characters based on a column


 
# 1  
Old 11-26-2012

Hello,
I am using awk on Ubuntu 12.04.

I have a file like the one below, with 2172 rows and 44707 columns. ABO and GPO are the names of my populations.
Code:
ABO_1  1  2
ABO_1  1  2
ABO_2  1  1 
ABO_2  1  2
GPO_1   1  1 
GPO_1  2  2
GPO_2   1  0 
GPO_2  2  0

For each population, I want to count the 1s and 2s in every column, ignoring any 0s, and print 0 when a column contains no 1s or no 2s. The output should look like this:
Code:
4 0 2 2 
1 3 1 1

Here 4 0 is the number of 1s and 2s in the second column for the first population (ABO), and 2 2 is the same for the second population (GPO). 1 3 is the number of 1s and 2s in the third column for the first population, and so on.

Thank you very much for your help.

Last edited by Homa; 11-26-2012 at 09:40 AM.. Reason: Please use code tags for data and code samples
# 2  
Old 11-26-2012
Try

Code:
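awk -F "[_ ]" '
# Print the counts of 1s and 2s for the group just finished, then reset them.
function print_o() {
    print X[1]?X[1]:0, X[2]?X[2]:0, Y[1]?Y[1]:0, Y[2]?Y[2]:0
    delete X[1]; delete Y[1]
    delete X[2]; delete Y[2]
}
$1 != s && NR > 1 { print_o() }     # population name changed
{ X[$3]++; Y[$4]++; s = $1 }        # X counts field 3, Y counts field 4
END { print_o() }' file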

# 3  
Old 11-26-2012
Thank you, but I tried it on the test file I posted above and it is extremely slow. It has not finished yet, so it would take even longer on my real, much larger file. I have the following code of my own:
Code:
{
    for (i = 2; i <= NF; i++)
        if ($i == "1") c_one[i]++
        else if ($i == "2") c_two[i]++
}
END {
    for (i = 2; i <= NF; i++)
        printf("%d  %d\n", c_one[i], c_two[i])
}

But this only works when the populations are in separate files, that is, ABO in one file and GPO in another. Maybe it can be modified to handle a single file containing all the populations.
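
A minimal sketch of how the per-file script above might be adapted to a combined file. It assumes the population name is whatever precedes the first underscore in column 1 and that fields are separated by runs of blanks. The names cnt, seen, order, and maxnf are illustrative and not from the thread.

```shell
# Sample data from the first post, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
ABO_1  1  2
ABO_1  1  2
ABO_2  1  1 
ABO_2  1  2
GPO_1   1  1 
GPO_1  2  2
GPO_2   1  0 
GPO_2  2  0
EOF

result=$(awk '
{
    pop = $1
    sub(/_.*/, "", pop)               # "ABO_1" -> "ABO"
    if (!(pop in seen)) {             # remember populations in order of first appearance
        seen[pop] = 1
        order[++npop] = pop
    }
    if (NF > maxnf) maxnf = NF
    for (i = 2; i <= NF; i++)
        if ($i == "1" || $i == "2")   # 0 and any other value is ignored
            cnt[pop, i, $i]++
}
END {
    for (i = 2; i <= maxnf; i++) {    # one output line per data column
        for (p = 1; p <= npop; p++)   # one pair of counts per population
            printf "%s%d %d", (p > 1 ? " " : ""), cnt[order[p], i, 1], cnt[order[p], i, 2]
        print ""
    }
}' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Because the counts are keyed by population name, the rows do not have to be grouped by population. All counts are held in memory until the end, which for 47 populations and roughly 44,700 columns means a few million array entries.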

---------- Post updated at 09:00 AM ---------- Previous update was at 08:53 AM ----------

Sorry, I made a mistake. It is not slow, but it gives me these numbers:
Code:
0 0 6 2

for the file above, which is not correct.
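
One plausible cause, offered as an assumption and not confirmed in the thread, is the field separator. With -F "[_ ]" every single underscore or space is a separator, so a run of two spaces produces an empty field and the data no longer sits in $3 and $4. The sketch below compares that with awk's default splitting on a line copied from the first post.

```shell
line='ABO_1  1  2'   # two spaces between fields, as in the first post

# Regex separator: each "_" or " " splits on its own, so double spaces give empty fields.
regex_split=$(printf '%s\n' "$line" | awk -F '[_ ]' '{ print NF ": [" $3 "] [" $4 "]" }')

# Default separator: a run of blanks counts as one separator.
default_split=$(printf '%s\n' "$line" | awk '{ print NF ": [" $2 "] [" $3 "]" }')

printf '%s\n%s\n' "$regex_split" "$default_split"
# prints:
# 6: [] [1]
# 3: [1] [2]
```

If this is what happened, a separator such as "[_ ]+" or the default splitting combined with sub() on the first field would keep the data columns in predictable positions.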
# 4  
Old 11-26-2012
Quote:
Originally Posted by Homa

Sorry, I made a mistake. It is not slow, but it gives me these numbers:
Code:
0 0 6 2

for the file above, which is not correct.
Please check:

Code:
$ cat file
ABO_1 1 2
ABO_1 1 2
ABO_2 1 1
ABO_2 1 2
GPO_1 1 1
GPO_1 2 2
GPO_2 1 0
GPO_2 2 0
ABO_1 1 2
ABO_1 1 2
ABO_2 1 1
ABO_2 1 2

Code:
$ awk -F "[_ ]" 'function print_o(){
print X[1]?X[1]:0,X[2]?X[2]:0,Y[1]?Y[1]:0,Y[2]?Y[2]:0;
delete X[1];
delete Y[1];
delete X[2];
delete Y[2];
}
$1 != s && NR > 1{print_o()}
{X[$3]++;Y[$4]++;s=$1}END{print_o()}' file
4 0 1 3
2 2 1 1
4 0 1 3

Code:
$ awk -F "[_ ]" 'function print_o(){
print X[1,fn],X[2,fn],Y[1,fn],Y[2,fn];
fn=NR;
}
$1 != s && NR > 1{print_o()}
NR==1{fn=NR}
{X[$3,fn]++;Y[$4,fn]++;s=$1}END{print_o()}' OFS="\t" file
4               1       3
2       2       1       1
4               1       3

Code:
$ awk -F "[_ ]" 'function print_o(){
print X[1,fn]?X[1,fn]:0,X[2,fn]?X[2,fn]:0,Y[1,fn]?Y[1,fn]:0,Y[2,fn]?Y[2,fn]:0;
fn=NR;
}
$1 != s && NR > 1{print_o()}
NR==1{fn=NR}
{X[$3,fn]++;Y[$4,fn]++;s=$1}END{print_o()}'  file
4 0 1 3
2 2 1 1
4 0 1 3

Choose whichever option you want.

pamu
# 5  
Old 11-26-2012
That works, thanks a lot. I am sorry, but I still have a problem: my original file contains 47 populations. The script works well on the test file, but when I run it on my original file it gives me 4 columns when it should give me 47*2 columns. I am sorry for my basic questions.
# 6  
Old 11-26-2012
Quote:
Originally Posted by Homa
That works, thanks a lot. I am sorry, but I still have a problem: my original file contains 47 populations. The script works well on the test file, but when I run it on my original file it gives me 4 columns when it should give me 47*2 columns. I am sorry for my basic questions.
Try something like this.
Below, the loop starts at i=3 because the population name and the number after the underscore are skipped when counting occurrences of 1 and 2.

Code:
awk -F "[_ ]" 'function print_o(){
for(i=3;i<=NF;i++){
print X[1,fn,i]?X[1,fn,i]:0,X[2,fn,i]?X[2,fn,i]:0,Y[1,fn,i]?Y[1,fn,i]:0,Y[2,fn,i]?Y[2,fn,i]:0;
}
fn=NR;
}
$1 != s && NR > 1{print_o()}
NR==1{fn=NR}
{for(i=3;i<=NF;i++){X[$i,fn,i]++;Y[$i,fn,i]++;s=$1}}END{print_o()}'  file


Last edited by pamu; 11-26-2012 at 10:55 AM..
# 7  
Old 11-26-2012
Unfortunately, it still gives me 4 columns. I will try to separate the populations into different files.
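
A minimal sketch of that fallback, splitting a combined file into one file per population with awk. It assumes the population name is the text before the first underscore in column 1. The file names all.txt, ABO.txt, and GPO.txt and the variables dir, pop, and out are illustrative only.

```shell
workdir=$(mktemp -d)
cat > "$workdir/all.txt" <<'EOF'
ABO_1  1  2
ABO_1  1  2
ABO_2  1  1 
ABO_2  1  2
GPO_1   1  1 
GPO_1  2  2
GPO_2   1  0 
GPO_2  2  0
EOF

# Write each row to a file named after the text before the underscore.
awk -v dir="$workdir" '
{
    pop = $1
    sub(/_.*/, "", pop)          # "GPO_2" -> "GPO"
    out = dir "/" pop ".txt"
    print > out                  # the row is copied unchanged
}' "$workdir/all.txt"

# Count the rows that ended up in each per-population file.
abo_rows=$(awk 'END { print NR }' "$workdir/ABO.txt")
gpo_rows=$(awk 'END { print NR }' "$workdir/GPO.txt")
printf 'ABO: %s rows, GPO: %s rows\n' "$abo_rows" "$gpo_rows"
rm -rf "$workdir"
```

The per-population script from earlier in the thread could then be run on each file, and the results joined side by side with paste to get the 47*2 columns.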