Speed : awk command to count the occurrences of fields from one file present in the other file


 
# 1  
Old 09-04-2014

Hi,

file1.txt
Code:
AAA
BBB
CCC
DDD

file2.txt
Code:
abc|AAA|AAAabcbcs|fnwufnq
bca|nwruqf|AAA|fwfwwefwef
fmimwe|BBB|fnqwufw|wufbqw
wcdbi|CCC|wefnwin|wfwwf
DDD|wabvfav|wqef|fwbwqfwfe

For each line of file1.txt, I need the number of times it occurs in file2.txt.
Required output:
Code:
AAA 2
BBB 1
CCC 1
DDD 1

These are sample files; I cannot post the original contents because the data is sensitive.
The original file1.txt has 11142 lines.
The original file2.txt has 602866 lines.

For this scenario I have written some awk commands, but they are very slow. Please help me improve their speed.

1)
Code:
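awk '{ if (NR==FNR) { arr1[FNR]=$0; count1++; next }   # first file (file2.txt): store each line
       count=0
       for (i=1; i<=count1; i++)                        # scan all stored lines for this file1.txt line
           if (arr1[i] ~ $0) count++                    # regex match: the file1.txt line is the pattern
       print $0, count
     }' file2.txt file1.txt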

On the real data this took 1.5 hours to run.

2)
Code:
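awk 'NR==FNR{ arr[count++]=$0; next }                    # file1.txt: store the patterns, indexed from 0
     { for(i=0;i<count;i++)
           if( index($0,arr[i])!=0 ) { arr_count[i]++ }  # substring test on each file2.txt line
     }
     END{ for(i=0;i<count;i++)
              print arr[i],arr_count[i]+0                # +0 prints 0 instead of an empty string
     }' file1.txt file2.txt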

This took around 1 hour to run.

Please advise how to improve the speed of the awk commands.


Moderator's Comments:
Please use code tags next time for your code and data. Thanks

Last edited by vbe; 09-04-2014 at 05:28 AM.. Reason: code tags
# 2  
Old 09-04-2014
If a line in file2.txt has two matching patterns, what do you want to do?
Let's say the second line in file2.txt is bca|BBB|AAA|fwfwwefwef

Last edited by SriniShoo; 09-04-2014 at 05:44 AM.. Reason: typo
# 3  
Old 09-04-2014
I would use a slightly different approach

Code:
awk 'NR==FNR{C[$1]=0; next}{for (i=1; i<=NF; i++) if ($i in C) C[$i]++} END{for(i in C) print i, C[i]}' file1 FS=\| file2
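For readers following along: the nested-loop commands in the first post compare every key with every line, which for the stated sizes is up to 11142 x 602866 tests (about 6.7 billion). The command above replaces the inner loop with one array lookup per field. Here is the same idea spread over several lines and run against the sample data from post #1. It is only a sketch: the scratch directory and the trailing `sort` (added to make the unordered `for ... in` output stable) are not part of the original command.

```shell
# Sample data from the first post, written to a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' AAA BBB CCC DDD > "$tmp/file1"
cat > "$tmp/file2" <<'EOF'
abc|AAA|AAAabcbcs|fnwufnq
bca|nwruqf|AAA|fwfwwefwef
fmimwe|BBB|fnqwufw|wufbqw
wcdbi|CCC|wefnwin|wfwwf
DDD|wabvfav|wqef|fwbwqfwfe
EOF

# Keys become array subscripts, so testing a field is a single lookup
# instead of a loop over every key.
out=$(awk '
    NR == FNR { C[$1] = 0; next }        # file1: register each key with count 0
    {
        for (i = 1; i <= NF; i++)        # file2: inspect every |-separated field
            if ($i in C) C[$i]++         # count exact field matches only
    }
    END { for (k in C) print k, C[k] }   # for-in order is unspecified
' "$tmp/file1" FS='|' "$tmp/file2" | sort)

printf '%s\n' "$out"
rm -rf "$tmp"
```

Note that `AAAabcbcs` is not counted for `AAA`, because fields are compared as a whole rather than searched for substrings.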

You could also do it the other way around. That will use a lot more memory, but the order of file1 will be preserved:
Code:
awk 'NR==FNR {for (i=1; i<=NF; i++) C[$i]++; next} $1 in C{print $1, C[$1]}' FS=\| file2 file1
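A possible middle ground, not taken from the thread: keep the counts keyed by file1 (so memory stays small) and remember the keys in a second array to print them in file1 order. The names `order` and `n` are made up for this sketch, and the shuffled key order in the sample file1 is an assumption chosen to make the ordering visible.

```shell
tmp=$(mktemp -d)
# Keys deliberately not in alphabetical order (illustrative only).
printf '%s\n' CCC AAA DDD BBB > "$tmp/file1"
cat > "$tmp/file2" <<'EOF'
abc|AAA|AAAabcbcs|fnwufnq
bca|nwruqf|AAA|fwfwwefwef
fmimwe|BBB|fnqwufw|wufbqw
wcdbi|CCC|wefnwin|wfwwf
DDD|wabvfav|wqef|fwbwqfwfe
EOF

out=$(awk '
    NR == FNR {                           # file1: small, read first
        if (!($1 in C)) {                 # ignore duplicate keys
            order[++n] = $1               # remember first-seen position
            C[$1] = 0
        }
        next
    }
    {                                     # file2: only counters for known keys grow
        for (i = 1; i <= NF; i++)
            if ($i in C) C[$i]++
    }
    END {                                 # print in file1 order, zero counts included
        for (j = 1; j <= n; j++) print order[j], C[order[j]]
    }
' "$tmp/file1" FS='|' "$tmp/file2")

printf '%s\n' "$out"
rm -rf "$tmp"
```

The trade-off is one extra array of the size of file1, which is far smaller than storing every field of file2.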

If possible, try mawk, since it is usually the fastest.

---
A very different approach:
Code:
tr -s '|' '\n' < file2 | grep -xFf file1 - | sort | uniq -c

or in bash / ksh93 you could also use
Code:
grep -xFf file1 <(tr -s '|' '\n' < file2) | sort | uniq -c
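One thing to be aware of with the pipelines above: `uniq -c` prints the count first, with leading padding, and keys that never occur are not listed at all. If the `KEY COUNT` layout from post #1 is needed, a small awk step at the end can swap the columns. This is a sketch on the sample data; the final awk stage is an addition, not part of the original pipeline.

```shell
tmp=$(mktemp -d)
printf '%s\n' AAA BBB CCC DDD > "$tmp/file1"
cat > "$tmp/file2" <<'EOF'
abc|AAA|AAAabcbcs|fnwufnq
bca|nwruqf|AAA|fwfwwefwef
fmimwe|BBB|fnqwufw|wufbqw
wcdbi|CCC|wefnwin|wfwwf
DDD|wabvfav|wqef|fwbwqfwfe
EOF

out=$(tr -s '|' '\n' < "$tmp/file2" |     # one field per line
      grep -xFf "$tmp/file1" |            # keep lines equal to a key (fixed strings)
      sort | uniq -c |                    # "   2 AAA" style output
      awk '{ c = $1                       # save the count
             sub(/^ *[0-9]+ /, "")        # strip padding, count and one space
             print $0, c }')              # key first, keys with spaces stay intact

printf '%s\n' "$out"
rm -rf "$tmp"
```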


Last edited by Scrutinizer; 09-05-2014 at 04:19 AM..
# 4  
Old 09-04-2014
@SriniShoo
That will not be the case.
It does not happen in the real data, so don't consider that scenario.
# 5  
Old 09-04-2014
Quote:
Originally Posted by mdkm
@SriniShoo
That will not be the case.
It does not happen in the real data, so don't consider that scenario.
Code:
awk 'NR == FNR {a[$1]; next} {for(x in a) {if($0 ~ "(^|[|])" x "([|]|$)") {a[x]++; next}}} END {for(x in a) print x, a[x]+0}' file1 file2

# 6  
Old 09-04-2014
Hello,

The following may help.

Code:
awk 'NR==FNR{a[$1];next} {for(i=1;i<=NF;i++){if($i in a){v[$i]++}}} END{for(l in v){print l OFS v[l]}}' file1 FS="|" file2

Output will be as follows.

Code:
AAA 2
BBB 1
CCC 1
DDD 1

Thanks,
R. Singh
# 7  
Old 09-05-2014
@Scrutinizer: what if the data in file1 looks like this?

Code:
AAA
 
BBB
CCC
123
A12
32B
Z H4
M H4


Last edited by mdkm; 09-05-2014 at 08:08 AM.. Reason: fixed misplaced code tags
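A note on this case: in the first command of post #3, file1 is read with the default field separator, so for a line such as `Z H4` the key `$1` is only `Z`. One possible adjustment, assuming each non-blank line of file1 should be matched as a whole field, is to use `$0` as the key and skip blank lines. The file2 lines below are invented for illustration, since the thread gives no matching data for these keys.

```shell
tmp=$(mktemp -d)
# file1 as posted above, including the line that holds only a space.
printf '%s\n' 'AAA' ' ' 'BBB' 'CCC' '123' 'A12' '32B' 'Z H4' 'M H4' > "$tmp/file1"
# Hypothetical file2 content, made up for this sketch.
cat > "$tmp/file2" <<'EOF'
abc|AAA|Z H4|fnwufnq
bca|123|AAA|M H4
x|Z H4|y|32B
EOF

out=$(awk '
    NR == FNR {
        if (NF) C[$0] = 0                # whole line is the key; skip blank lines
        next
    }
    {
        for (i = 1; i <= NF; i++)        # FS is | here, so "Z H4" is one field
            if ($i in C) C[$i]++
    }
    END { for (k in C) print k, C[k] }
' "$tmp/file1" FS='|' "$tmp/file2" | LC_ALL=C sort)

printf '%s\n' "$out"
rm -rf "$tmp"
```

Keys with leading or trailing spaces would still need trimming before they could match a field; that is not handled here.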