counting using awk


 
# 1  
Old 05-27-2011
Hi,

I want to perform a task using a shell script. I am new to awk programming, and any help would be greatly appreciated.

I have the following 3 files (for example):

file1:

Code:
Name          count  Symbol
chr1_1_50     10     XXXX
chr3_101_150  30     YYYY

File2:

Code:
Name          Count  Symbol
chr_1_1_50    100    XXXX
chr3_101_150  57     YYYY

File3:

Code:
Name          Count  Symbol
chr1_1_50     120    XXXX
chr3_101_150  65     YYYY

Now I want to extract the rows with the top 10% of counts. For example, if rows 1, 3 and 5 are in the top 10% of file1, they need to be picked from file2 and file3 as well, irrespective of whether they are in the top 10% there. The same applies to the top 10% of file2 and of file3. All of these rows should be written to one file, and the remaining 90% to a different file.

Thanks,

Diya
# 2  
Old 05-27-2011
Hi and welcome to the forum.
Try to break your problem down into simpler sub-tasks. E.g., you want the top 10 counts, so it would make sense to sort your input files first:
Code:
sort -n -k2,2 -r file1

This does a numeric (-n), descending (-r) sort on the second field (-k2,2). To find the top 10, you then just need to look at the first 10 lines.
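Since the question asks for the top 10% rather than a fixed 10 lines, the cut-off could be computed from the row count instead. The following is only a minimal sketch under a few assumptions: the sample data is made up, the first line of the file is a header, and 10% is rounded up so that small files still yield at least one row. It uses plain POSIX sh so it does not depend on process substitution.

```shell
# Hypothetical sample data standing in for file1 (header plus 10 rows).
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'Name count Symbol\n' > file1
i=1
while [ "$i" -le 10 ]; do
  printf 'chr1_%d %d SYM%d\n' "$i" "$((i * 10))" "$i" >> file1
  i=$((i + 1))
done

# Number of data rows, excluding the header line.
rows=$(tail -n +2 file1 | wc -l)
# Ten percent of the rows, rounded up.
n=$(( (rows + 9) / 10 ))

# Sort the data rows by count, descending, and keep the first n.
tail -n +2 file1 | sort -k2,2nr | head -n "$n" > file1.top
cat file1.top
```

Rows that tie at the cut-off are kept or dropped arbitrarily here; whether ties should all be included is a decision the sketch does not make.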
So I'd approach this by feeding the sorted files into awk:
Code:
awk '#do the hard work' <(sort -nrk2,2 file1) <(sort -nrk2,2 file2)  <(sort -nrk2,2 file3)

Now to pull the maximum of the top 10 from each input, you could do something like:
Code:
awk 'FNR<=10{  #I only care about the first ten lines in each file
  if($2>cnt[$1])    #get the global max among the files
    cnt[$1]=$2
}
END{
  for(i in cnt)
    print i "  " cnt[i]
}' <(sort -nrk2,2 file1) <(sort -nrk2,2 file2)  <(sort -nrk2,2 file3) >output.txt

output.txt should now contain something like:
Code:
chr_1_1_50  100
chr3_101_150  65
chr1_1_50  120

(in random order, since 'for(i in cnt)' doesn't sort anything).

I don't quite understand what you mean by
Quote:
they need to be picked from file2 and file3
or what your desired output is. But if you take it one small step at a time, you'll eventually get there.
E.g., you could read the lines from output.txt and grep for each name in the input files to get the other values:
Code:
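while read -r name count ; do
  # -F: treat the name as a fixed string; -w: match whole words only,
  # so chr1_1_50 does not also match chr1_1_500
  grep -Fw -- "$name" file1 >> globalTop10inFile1.txt
done < output.txt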

etc.

Approaching the problem in this step-by-step fashion, it's much easier to debug -- you can verify the intermediate results easily.
Give it a shot and let us know how it goes!

mirni
# 3  
Old 05-29-2011
Hi Mirni,

Thanks for the response.

Probably I was not very clear with my question.

I have 3 input files each with 339962 entries and I need to get top 10% counts from these 3 files.

For instance, if rows 1, 3 and 5 are in the top 10% of file1, then those rows need to be picked from file2 and file3 as well, irrespective of whether they are in the top 10% there.

If rows 2, 3 and 4 are in the top 10% of file2, they need to be picked from file1 and file3.

If rows 3, 4 and 5 are in the top 10% of file3, they need to be picked from file1 and file2.

So the output file should contain rows 1, 2, 3, 4 and 5 from all three files.

Also, as I mentioned, the remaining 90% should be saved to a different file.

So I should have 2 outputs: one with the top 10% counts from the 3 files, and the other with the remaining 90%.

Thanks in advance,

Diya
# 4  
Old 05-30-2011
Code:
awk 'FNR<=10 {print $1 |"sort -u" }' <(sort -nrk2,2 file1) <(sort -nrk2,2 file2)  <(sort -nrk2,2 file3) > top.10.list

awk 'NR==FNR{a[$1];next} {print > ($1 in a?"top.10":"rest.90") }' top.10.list file1 file2 file3

After that, you will get two files:

top.10 - includes all top 10 instances.
rest.90 - the remaining instances.
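Note that FNR<=10 selects the first 10 lines of each sorted file, not 10% of them. A possible variant that takes a percentage instead is sketched below. It is an illustration only: the sample data and the list name top.list are made up, the files are assumed to have no header line, and 10% is rounded up. The output names top.10 and rest.90 are the ones used above.

```shell
# Work in a scratch directory with hypothetical data: three files of
# 10 rows each, no header, same names in every file.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
: > file1; : > file2; : > file3
i=1
while [ "$i" -le 10 ]; do
  printf 'r%d %d S%d\n' "$i" "$i" "$i" >> file1                  # highest count: r10
  printf 'r%d %d S%d\n' "$i" "$((11 - i))" "$i" >> file2         # highest count: r1
  printf 'r%d %d S%d\n' "$i" "$(( (i * 7) % 11 ))" "$i" >> file3 # highest count: r3
  i=$((i + 1))
done

# Collect the names that fall in the top 10% of any one file.
: > top.list
for f in file1 file2 file3; do
  rows=$(wc -l < "$f")
  n=$(( (rows + 9) / 10 ))            # 10% of the rows, rounded up
  sort -k2,2nr "$f" | head -n "$n" | awk '{print $1}' >> top.list
done
sort -u -o top.list top.list          # union of names, duplicates removed

# Split every file on membership in that list.
rm -f top.10 rest.90
awk 'NR == FNR { top[$1]; next }
     { print > (($1 in top) ? "top.10" : "rest.90") }' top.list file1 file2 file3

wc -l top.10 rest.90
```

With this sample, top.list holds r1, r10 and r3, so top.10 receives those three rows from each of the three files and rest.90 receives everything else.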
# 5  
Old 05-30-2011
Hi,

Thanks for the reply, but somehow the code does not work.

My sample files look like this (I list only file1 as an example). File2 and file3 have the same columns 1 and 3, but the counts in the 2nd column differ.


chr22_16256301_16256350 0 PATF
chr22_16256351_16256400 0 CFTY
chr22_16256401_16256450 0 ATGC
chr22_16256451_16256500 0 RBRH
chr22_16256501_16256550 0 THGC
chr22_16256551_16256600 0 OTHV

When I tried the code above, it listed all the rows in the top.10 output file and did not produce a rest.90 file at all.

Thanks,

Diya
# 6  
Old 05-30-2011
Code:
awk 'FNR<=10 {print $1 FS $3 |"sort -u" }' <(sort -nrk2,2 file1) <(sort -nrk2,2 file2)  <(sort -nrk2,2 file3) > top.10.list

# top.10.list has only two fields per line (name and symbol), so the key is built from $1 and $2 there
awk 'NR==FNR{a[$1 FS $2];next} {print > (($1 FS $3) in a?"top.10":"rest.90") }' top.10.list file1 file2 file3

# 7  
Old 05-30-2011
Thanks for the quick reply. Now it didn't print anything in top.10 and placed everything in rest.90.

Also, I need the rows from the three files printed side by side:

Chr1_1_50 50 ACXA Chr1_1_50 10 ACXA Chr_1_1_50 20 ACXA

and so on, for both the 90% and the 10% output.

Thanks,


Diya
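One possible way to get the side-by-side layout is to join the three files line by line with paste before splitting. This is a sketch only, and it rests on an assumption the thread does not confirm: that all three files list the same names in the same row order. The sample data and the list name top.list (one top name per line) are hypothetical.

```shell
# Scratch directory with hypothetical two-row files in identical row order.
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'a 50 X\nb 5 Y\n' > file1
printf 'a 10 X\nb 6 Y\n' > file2
printf 'a 20 X\nb 7 Y\n' > file3
printf 'a\n' > top.list   # hypothetical list of top names, one per line

# paste joins line N of each file into one line; awk then routes each
# joined line by looking up its first field in the list of top names.
rm -f top.10 rest.90
paste -d ' ' file1 file2 file3 |
awk 'NR == FNR { top[$1]; next }
     { print > (($1 in top) ? "top.10" : "rest.90") }' top.list -

cat top.10
```

If the row order differs between files, paste would pair unrelated rows, so the files would need to be sorted on the name column first.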