Counting words from one file in another file


 
# 1  
Old 01-27-2011

Hi All,

I have written a script for this, but it does not do the required job. My requirement is this:

1. I have two kinds of files, each with a different extension. One set is *.dat (6000 unique DAT files, all in one directory) and the other set is *.dic (6000 unique DIC files, in the same directory as the DAT files).

2. The files contain only words, one per line. For example, 1.dat contains something like this:
Code:
computer
red
apple
orange

1.dic looks like this:
Code:
computer
apple
red
blue

3. For every DAT file there is a corresponding DIC file: for 1.dat I have 1.dic, for 2.dat I have 2.dic, and so on up to 6000.dat and 6000.dic.

4. What I want to do is read every word from each DIC file, search for it in the corresponding DAT file, count the number of times the word appears in the .dat file, and write the result to a .cnt file with the same number. For example:
Suppose 1.dic contains 10 words. I read each word from 1.dic line by line and count how many times it appears in 1.dat. Then I write the results (i.e. the count values), one per line, to 1.cnt. Similarly, I read every word in 2.dic line by line, search for it in 2.dat, and write the count values to 2.cnt. My 2.cnt should look something like this:
Code:
2
3
1
3

i.e. the word on the first line of 2.dic appears 2 times in 2.dat. The same thing has to be done for all 6000 file pairs.

What I have done so far:
Code:
ls -1 *.dat | while read page
do
  ls -l *.dic | while read page1
    cat $page1 | while read var1
    do
      grep -c $var1 $page > $page.cnt
    done
  done
done


Last edited by Scott; 01-27-2011 at 01:03 PM.. Reason: Please use code tags
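
For reference, a minimal grep-based sketch of the per-pair counting described above. It assumes each word should match a whole line exactly (so "red" does not count "redder") and that the match is case-sensitive. The helper name count_pair is hypothetical and does not come from the thread.

```shell
# count_pair DIC DAT: for each line of DIC, print how many lines of DAT equal it.
count_pair() {
    while IFS= read -r word || [ -n "$word" ]; do
        # -c count matching lines, -x whole-line match, -F fixed string (no regex).
        # grep exits 1 when the count is 0; that is not an error here.
        grep -c -x -F -e "$word" "$2" || :
    done < "$1"
}

# Hypothetical driver: write N.cnt for every N.dic that has a matching N.dat.
for dic in *.dic; do
    base=${dic%.dic}
    [ -f "$base.dat" ] || continue
    count_pair "$dic" "$base.dat" > "$base.cnt"
done
```

This starts one grep per dictionary word, so it is slower than a single awk pass over each pair, but it keeps the output in .dic line order and prints 0 for words that never occur.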
# 2  
Old 01-27-2011
Untested:
Code:
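awk 'END { out() }
FNR == 1 {
  if (NR > 1) out()            # flush the counts for the previous .dic file
  fn = FILENAME; split("", w)  # reset the word table
  sub(/\.dic$/, "", fn); c = 0
  dn = fn ".dat"; on = fn ".cnt"
  }
{ w[$0]; p[++c] = $0 }         # remember each word and its line order
function out(   dat, i) {
  while ((getline dat < dn) > 0)
    if (dat in w) w[dat]++
  close(dn)                    # avoid running out of file descriptors
  for (i = 1; i <= c; i++)
    printf "%d\n", w[p[i]] > on
  close(on)
  }
  ' *.dic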


Last edited by radoulov; 01-27-2011 at 03:10 PM..
This User Gave Thanks to radoulov For This Post:
# 3  
Old 01-27-2011
Also untested, but pretty near :-)
Code:
for DAT in *.dat
do
CNT="`basename $DAT .dat`".cnt
DIC="`basename $DAT .dat`".dic
awk 'NR==FNR{dic[$1]++; next}dic[$1]{cnt[$1]++}END{ for(c in cnt) print cnt[c]; }' $DIC $DAT > $CNT
done

This User Gave Thanks to citaylor For This Post:
# 4  
Old 01-27-2011
citaylor's solution gets the counts but outputs them in the wrong order.

This fixes it (tested):

Code:
for DAT in *.dat
do
CNT=$(basename "$DAT" .dat).cnt
DIC=$(basename "$DAT" .dat).dic
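awk 'NR==FNR{dic[$1]=r++; next}             # .dic: remember each word and its line index
   $1 in dic{cnt[dic[$1]]++}                # .dat: count occurrences of dictionary words
   END{for(i=0;i<r;i++) print cnt[i]+0}' "$DIC" "$DAT" > "$CNT"   # +0 prints 0 for words never seen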
done


Last edited by Chubler_XL; 01-27-2011 at 06:15 PM.. Reason: fix formatting
This User Gave Thanks to Chubler_XL For This Post:
# 5  
Old 01-28-2011
Now, counting the number of files that contain words stored in another file

Hi All,

Thanks for your replies.

Using some of the code above, I have come up with a solution of my own to another problem that uses the same set of files.

What I want to do is read every word from the DIC files, search for it in ALL DAT files, find the number of DAT files that contain that word, and store the result in FIL files. This means a DAT file must be counted only once, even if the word appears in it several times. For example:
Suppose 1.dic contains 10 words. I read each word from 1.dic line by line and count how many DAT files contain it. Then I write the results (i.e. the count values), one per line, to 1.fil. Similarly, I read every word in 2.dic line by line, search for it in all DAT files, and write the count values to 2.fil. My 2.fil should look something like this:
Code:
20
32
1
3

i.e. the word on the first line of 2.dic appears in 20 of the DAT files (each DAT file is counted only once, even if it contains the word several times). The same thing has to be done for all 6000 DIC files. This is what I have so far:
Code:
for DAT in *.dat
do
for DIC in *.dic
do
while read word
CNT=$(basename "$DAT" .dat).fil
DIC=$(basename "$DAT" .dat).dic
grep -il "$word" | find . | wc -l $DIC $DAT > $FIL
done
done
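
For comparison, a minimal grep-based sketch of this second requirement. It assumes whole-line, case-sensitive matches and file names without newlines. The helper name files_containing is hypothetical and not from the thread.

```shell
# files_containing WORD FILE...: print how many of the FILEs have a line equal to WORD.
files_containing() {
    word=$1; shift
    # -l lists each matching file once, however often the word occurs in it.
    # grep exits 1 when nothing matches; the count is then simply 0.
    { grep -l -x -F -e "$word" "$@" || :; } | wc -l | tr -d ' '
}

# Hypothetical driver: write N.fil for every N.dic, one count per dictionary line.
for dic in *.dic; do
    [ -f "$dic" ] || continue
    while IFS= read -r w || [ -n "$w" ]; do
        files_containing "$w" *.dat
    done < "$dic" > "${dic%.dic}.fil"
done
```

With 6000 dictionaries this scans every DAT file once per word, so it is only practical for small tests. It is mainly useful as a reference to check a faster awk solution against.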

# 6  
Old 01-30-2011
How about something like this:

Code:
for DIC in *.dic
do
    FIL=$(basename "$DIC" .dic).fil
    ls *.dat | awk '
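       NR==FNR {dic[$1]=r++; next}          # .dic: remember each word and its line index
       { FILE=$0                            # stdin: one .dat file name per line
         while((getline < FILE) > 0)        # > 0 stops on EOF and on read errors
             if($1 in dic && !($1 in fnd)) {
                 fnd[$1]++;                 # count each word once per .dat file
                 cnt[dic[$1]]++;
             }
         close(FILE)                        # avoid running out of file descriptors
         delete fnd
       }
       END {for(i=0;i<r;i++) print cnt[i]+0}' "$DIC" - > "$FIL"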
done

This User Gave Thanks to Chubler_XL For This Post:
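
The loop above rereads every DAT file once per DIC file. A possible single-pass variant is sketched below: it first builds a table of how many DAT files contain each word, then looks up every dictionary line in that table. It assumes all .dat arguments come before the .dic arguments and that the whole word table fits in memory. The names doc_freq, seen and df are hypothetical.

```shell
# doc_freq DAT... DIC...: write N.fil next to every N.dic given on the command line.
# Usage (in the data directory): doc_freq *.dat *.dic
doc_freq() {
    awk '
    FNR == 1 {                       # first line of each input file
        if (out != "") close(out)
        split("", seen)              # forget words seen in the previous file
        is_dic = (FILENAME ~ /\.dic$/)
        if (is_dic) { out = FILENAME; sub(/\.dic$/, ".fil", out) }
    }
    !is_dic {                        # .dat pass: count each word once per file
        if (!($1 in seen)) { seen[$1]; df[$1]++ }
        next
    }
    { printf "%d\n", df[$1] > out }  # .dic pass: one count per dictionary line
    ' "$@"
}
```

Each file is read exactly once, so the work grows with the total size of the input rather than with the number of DIC files times the number of DAT files.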