matching group of words


 
# 1  
Old 09-13-2011

Hi,

I am stuck on a problem and would be thankful for your guidance and help.

I have two files. Each line is a group of words, with the first word as the group ID, e.g. 'gp1' in File1 and 'grp1' in File2.

Code:
 
<File1> 
gp1 : xyz xys3 syt2 ssx itt kty
gp2 : syt2 kgk iti op2 
gp3 : ppy yt5 itt sky utw yry


Code:
 
<File2>
grp1 : yt5 utw
grp2 : iti op2 kty
grp3 : kty xys3 syt2 
grp4 : utw ppy
grp5 : xyz iti yt5
grp6 : ssx xyz

I want to find out which (complete) groups of words in File2 occur within a group of words in File1. So each line of File2 is searched against each line of File1.

The output file should show which groups of File2 were found in which groups of File1.

Code:
 
<OutFile>
gp1 : grp3 grp6
gp2 : None
gp3 : grp1 grp4


Last edited by mira; 09-13-2011 at 06:40 PM..
# 2  
Old 09-13-2011
Think of the files as database tables. You want to do a join, where you match key columns of table a to table b, creating a new view, virtual table c. However, your files have multiple keys in each line, and a variable number of them. To be relational, you want to unstack the lines, so that each row holds one group ID and one word, and compare column 2 to column 2. A file 2 group is only satisfied when all N (2 or 3) of its words match. So, the word count for each group can be a third relation/table.

Often, the solution to such joins for smaller files is to load one file into content-addressable/associative arrays or vectors in perl, awk, ksh or bash, and then examine the other file field by field against the arrays.

For large data sets, one sort-merge finds all matches, and a second sort-merge checks whether there are enough matches to make a hit. This is a very old EDP practice, but very robust and efficient for large sets.
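
As a minimal sketch of the "unstacking" step described above, assuming the `id : word word ...` layout from the first post, awk can emit one `word id` pair per line, which is the shape a relational join expects. The function name `unstack` and the temporary sample file are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: turn "id : w1 w2 ..." lines into one "word id" pair per line.
unstack() {
    awk -F' : ' '{
        n = split($2, w, " ")          # split the word list on whitespace
        for (p = 1; p <= n; p++)
            print w[p], $1             # word first, so it can serve as the join key
    }' "$1"
}

# Sample data: the first two lines of File2 from the first post.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf '%s\n' 'grp1 : yt5 utw' 'grp2 : iti op2 kty' > "$tmp/file2"

unstack "$tmp/file2"
```

Each output line then holds a single key, so the two files can be compared word against word.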
# 3  
Old 09-14-2011
Thanks DGPickett for your suggestions. I tried something with perl, comparing 2 arrays in some way but it didn't work.

Please help with other suggestions.
# 4  
Old 09-14-2011
What did you try? In what way did it "not work"?
# 5  
Old 09-14-2011
Quote:
Originally Posted by mira
Thanks DGPickett for your suggestions. I tried something with perl, comparing 2 arrays in some way but it didn't work.

Please help with other suggestions.
One approach is to compare each item of file2 with every line of file1. A plain grep won't work unless each line of file2 appears in file1 exactly, and that isn't the case. Take the line "grp1 : yt5 utw" of file2: these items are found in "gp3 : ppy yt5 itt sky utw yry" of file1, but they are not side by side, so grep will fail. That is why it is better to pick each item in a line of file2 and see if it is contained in a line of file1. If all the items match, then that line of file2 is contained in that line of file1.
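
The item-by-item check can be sketched in plain sh as follows. This is only an illustration: the function name `contains_all` is made up, and it assumes words are separated by spaces (not tabs), as in the sample files.

```shell
#!/bin/sh
# contains_all LINE WORD...: succeed only if every WORD occurs as a whole
# word in LINE (hypothetical helper name).
contains_all() {
    line=" $1 "                        # pad so the first and last words are space-delimited
    shift
    for word in "$@"; do
        case "$line" in
            *" $word "*) ;;            # found as a whole word: keep checking
            *) return 1 ;;             # one word missing: not a match
        esac
    done
    return 0
}

# The example from this post: the words of grp1 are in the gp3 line but not adjacent.
if contains_all "ppy yt5 itt sky utw yry" yt5 utw; then
    echo "grp1 is contained in gp3"
fi
```

Padding the line with spaces keeps a short word such as `xys` from matching inside a longer one such as `xys3`.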
# 6  
Old 09-15-2011
File 1 should be put in arrays, and file 2 examined with it, or vice versa. How you structure the arrays depends on whether the words are unique in the file you store. This is because when you address an array with a value, it returns only one string. If there is uniqueness, that helps. One way to deal with non-unique words is to add a dimension to hold values 1 to N, and add a parallel vector to hold N.

File 2 not only has (unique on line) words but also the word count, which is needed as the minimum number of matched words to achieve a match, and which might go in a parallel array. However, you need to structure the array so the address is the question and the returned value is the answer, not vice versa, so it may not be unique that way.

The sort merge builds three 2-column files: file 1 key to each file 1 word, file 2 key to each file 2 word, and file 2 key to word count. Sort the file 1 and file 2 pairs by word and merge the sorted output to build a three-column file: file 2 key, file 1 key, word. Sort that by file 2 key and merge it with the third file, checking whether the number of matched words is right. In some respects, this is simpler than the array solution, where you need to deal with uniqueness and with which field is the key to which array. It is robust against all file sizes and against duplicates: sort can remove them, or the merge, knowing some files are not unique, can deal with them. You want to avoid the cartesian join problem, where the N records in one file for a key field match M records in the other file, for NxM output records. If this is the case, sort into flat files and use 'join' to do the walking.
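
The sort merge described above might look roughly like this with `sort`, `join`, `uniq` and `awk`. It is a sketch under assumptions, not a tested production script: the intermediate file names (`p1`, `p2`, `need`, `hits`, `found`) and the `pairs` helper are invented for illustration, and the sample data is copied from the first post.

```shell
#!/bin/sh
# Sketch of the sort-merge approach; all file and function names are illustrative.
export LC_ALL=C                        # sort and join must agree on collation
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# Sample data from the first post.
cat > "$tmp/file1" <<'EOF'
gp1 : xyz xys3 syt2 ssx itt kty
gp2 : syt2 kgk iti op2
gp3 : ppy yt5 itt sky utw yry
EOF
cat > "$tmp/file2" <<'EOF'
grp1 : yt5 utw
grp2 : iti op2 kty
grp3 : kty xys3 syt2
grp4 : utw ppy
grp5 : xyz iti yt5
grp6 : ssx xyz
EOF

# One "word key" pair per line, sorted by word, duplicates removed.
pairs() {
    awk -F' : ' '{ n = split($2, w, " "); for (p = 1; p <= n; p++) print w[p], $1 }' "$1" |
        sort -u
}
pairs "$tmp/file1" > "$tmp/p1"
pairs "$tmp/file2" > "$tmp/p2"

# Number of distinct words each file 2 key needs: "key2 count".
awk '{ c[$2]++ } END { for (k in c) print k, c[k] }' "$tmp/p2" | sort > "$tmp/need"

# Merge on the word, then count matched words per (key2, key1) pair.
join "$tmp/p2" "$tmp/p1" | awk '{ print $2, $3 }' | sort | uniq -c |
    awk '{ print $2, $3, $1 }' > "$tmp/hits"       # "key2 key1 matched", sorted by key2

# Keep the pairs whose matched count equals the needed count: "key1 key2".
join "$tmp/hits" "$tmp/need" | awk '$3 == $4 { print $2, $1 }' | sort > "$tmp/found"

# Report per file 1 key, in file 1 order, printing None when nothing matched.
result=$(awk -F' : ' -v found="$tmp/found" '
    BEGIN { while ((getline rec < found) > 0) { split(rec, f, " "); m[f[1]] = m[f[1]] " " f[2] } }
    { print $1 " :" (($1 in m) ? m[$1] : " None") }' "$tmp/file1")
printf '%s\n' "$result"
```

The `sort -u` in `pairs` removes repeated words within a group, so the matched count and the needed count are both counts of distinct words.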

Crude and lewd is many passes: for each line of file 2, for each line of file 1, for each word in the file 1 line, for each word of the file 2 line, if they match, set an indicator and break, etc. File 1 gets read once per line of file 2.

Last edited by DGPickett; 09-15-2011 at 10:50 AM..
# 7  
Old 09-19-2011
Quote:
Originally Posted by DGPickett
The sort merge builds three 2-column files: file 1 key to each file 1 word, file 2 key to each file 2 word, and file 2 key to word count. Sort the file 1 and file 2 pairs by word and merge the sorted output to build a three-column file: file 2 key, file 1 key, word. Sort that by file 2 key and merge it with the third file, checking whether the number of matched words is right. In some respects, this is simpler than the array solution, where you need to deal with uniqueness and with which field is the key to which array. It is robust against all file sizes and against duplicates: sort can remove them, or the merge, knowing some files are not unique, can deal with them. You want to avoid the cartesian join problem, where the N records in one file for a key field match M records in the other file, for NxM output records. If this is the case, sort into flat files and use 'join' to do the walking.
Not sure how this solution is going to pan out, because sorting file1 and file2 still won't line up the records for a match. Unless you provide a script to show how it works, I am unable to visualize it.

Here's my solution, which stores file1 and file2 in arrays and matches the items of each file2 record against the items of each file1 record, incrementing a counter. A file2 record matches when the number of matches equals the number of items in it.
Code:
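awk -F' : ' '
# file1: store the word list of each group, remembering input order.
NR == FNR { x[$1] = $2; xkey[++nx] = $1; next }
# file2: same, in separate arrays.
{ y[$1] = $2; ykey[++ny] = $1 }
END {
   for (k = 1; k <= nx; k++) {
      i = xkey[k]
      # Turn the file1 word list into a set, so lookups match whole words only.
      split("", w)
      m = split(x[i], b, " ")
      for (p = 1; p <= m; p++) w[b[p]] = 1
      out = ""
      for (q = 1; q <= ny; q++) {
         j = ykey[q]
         n = split(y[j], a, " ")
         s = 0
         for (p = 1; p <= n; p++)
            if (a[p] in w) s++
         # A file2 group is a hit only when every one of its words was found.
         if (n > 0 && s == n) out = out (out ? " " : "") j
      }
      print i " : " (out ? out : "None")
   }
}' file1 file2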

These 2 Users Gave Thanks to shamrock For This Post: