String matching using awk


 
# 1  
Old 08-30-2012
Hello,

I am working with the Google Ngram data set, which is hundreds of gigabytes in size. Before using it from Java, I want to filter it with a shell script.

Here is a sample line in the file:
HTML Code:
2.55  1.57        1992        10        20        30
The first two fields (2.55 and 1.57) are separated by a space, and the rest are separated by tabs. I need all the lines where:
a) The second tab-separated field (1992 in this case) is greater than 1990.
b) Both elements of the first tab-separated field (2.55 and 1.57 in this case) satisfy two conditions:
i) Each consists only of letters (no digits, no punctuation).
ii) Neither is present in a list of strings (say 'list').
Can anyone help?

Thanks,
Shekhar

---------- Post updated at 11:56 PM ---------- Previous update was at 11:51 PM ----------

I have 300 files, each containing tens of millions of such lines (total data size: more than 500 gigabytes), so I need an efficient method. That is the only reason I want to do this in the shell; otherwise I could easily have done it in Java.

---------- Post updated 08-30-12 at 12:06 AM ---------- Previous update was 08-29-12 at 11:56 PM ----------

Here is what I have so far.

For the second tab field > 1990:
Code:
cat InputFile | awk -F"\t" '{if ($2 > 1990) print $0}' > OutputFile

For the first tab field containing only letters and spaces:
Code:
cat InputFile | awk -F"\t" '{if ($1 == "[a-zA-Z ]+") print $0}' > OutputFile

But this is not working. How does pattern matching work in awk when it is used inside 'if' to match against a field?
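For reference, `==` compares the field with the literal string on the right, while `~` tests the field against a regular expression. A minimal sketch of the difference, using two made-up sample lines (the sample data and variable names are assumptions for illustration, not taken from the data set):

```shell
# Two tab-separated sample lines (hypothetical data)
sample=$(printf 'aa bb\t1991\n1e ff\t1992\n')

# == is a literal string comparison: no field equals the text "[a-zA-Z ]+"
literal=$(printf '%s\n' "$sample" | awk -F'\t' '$1 == "[a-zA-Z ]+"')

# ~ is a regex match; ^ and $ anchor the pattern to the whole field
regex=$(printf '%s\n' "$sample" | awk -F'\t' '$1 ~ /^[a-zA-Z ]+$/')

printf 'literal: [%s]\n' "$literal"
printf 'regex:   [%s]\n' "$regex"
```

The first command selects nothing, and the second selects only the line whose first tab field is made up of letters and spaces.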

---------- Post updated at 12:26 AM ---------- Previous update was at 12:06 AM ----------

I have gotten this far:

Code:
awk -F"\t" '{if ($1 ~ /^[a-zA-Z ]+$/ && $2 > 1990) print}' InputFile > OutputFile

The last thing remaining is to check that neither of the space-separated words in the first tab field is present in a list.

HTML Code:
aa bb       1991      10       15        20
I have a list of strings and want to check that it contains neither of the two fields 'aa' and 'bb'. I need to add this check to the code above.

Thanks.
# 2  
Old 08-30-2012
Hi,

I appreciate your attempt at this.

Assuming the file named list contains the list of strings:


Code:
$ cat list
aa
bb
cc


Code:
$ cat file
aa bb       1991      10       15        20
1e ff       1992      10       15        20
cc ff       1990      10       15        20
ee ff       1994      10       15        20


Output:
Code:
$ awk 'NR==FNR{a[$0]=1;next}($1 ~ /^[a-z]+$/ && $2 ~ /^[a-z]+$/ && (!a[$1]) && (!a[$2]) && $3>1990)' list file
ee ff       1994      10       15        20
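One possible refinement, offered as a sketch rather than a drop-in replacement: referencing `a[$1]` in a condition creates an empty array element for every word that is not already in the array, so on very large inputs the array can grow with the number of distinct words seen. The `in` operator tests membership without creating elements. The snippet below recreates the same `list` and `file` contents in a scratch directory (the `dir` and `result` names are assumptions) so that it runs on its own:

```shell
# Recreate the sample inputs from above in a scratch directory
dir=$(mktemp -d)
printf 'aa\nbb\ncc\n' > "$dir/list"
cat > "$dir/file" <<'EOF'
aa bb       1991      10       15        20
1e ff       1992      10       15        20
cc ff       1990      10       15        20
ee ff       1994      10       15        20
EOF

# First file: remember each listed word.
# Second file: print lines whose first two words are lowercase letters only,
# are not in the list, and whose third field (the year) is greater than 1990.
result=$(awk '
    NR == FNR { a[$0]; next }
    $1 ~ /^[a-z]+$/ && $2 ~ /^[a-z]+$/ && !($1 in a) && !($2 in a) && $3 > 1990
' "$dir/list" "$dir/file")

printf '%s\n' "$result"
rm -rf "$dir"
```

With the sample data this selects the same single line as the command above.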


Is this what you wanted?

Guru.
# 3  
Old 08-30-2012
What do you mean by "not in a list"? Which list are you talking about?
# 4  
Old 08-30-2012
@guruprasadpr

Awesome answer. I knew awk could solve this problem. To filter one file, Java took 27 minutes and the shell took 12 seconds :-)

I added a minimum-length filter on both of the first two fields, and my final answer is:


Code:
awk 'NR==FNR{a[$0]=1;next}($1 ~ /^[a-z]+$/ && $2 ~ /^[a-z]+$/ && (!a[$1]) && (!a[$2]) && $3>1990 && length($1) > 2 && length($2) > 2 )' list file

Thanks a lot.

---------- Post updated at 01:41 AM ---------- Previous update was at 01:24 AM ----------

I just ran the awk command on a bigger file, and it took 154 seconds to get the results.
Thanks to @guruprasadpr.
I am sure Java would have taken hours to do this.
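To run the same filter over all 300 files, one option is a plain shell loop that starts one awk process per input file. This is only a sketch: the `filter_one` function name, the `part*` file names and the `.filtered` suffix are assumptions for illustration, and the demo data is made up.

```shell
# filter_one LIST INPUT: print the lines of INPUT that pass the final filter above
filter_one() {
    awk 'NR==FNR{a[$0]=1;next}($1 ~ /^[a-z]+$/ && $2 ~ /^[a-z]+$/ && (!a[$1]) && (!a[$2]) && $3>1990 && length($1) > 2 && length($2) > 2 )' "$1" "$2"
}

# Demo on scratch data (hypothetical file names and contents)
dir=$(mktemp -d)
printf 'aaa\n' > "$dir/list"
printf 'aaa bbb\t1995\t1\nccc ddd\t1995\t1\n' > "$dir/part1"
printf 'eee fff\t1980\t1\nggg hhh\t2000\t1\n' > "$dir/part2"

# One output file per input file
for f in "$dir"/part*; do
    filter_one "$dir/list" "$f" > "$f.filtered"
done

combined=$(cat "$dir"/part*.filtered)
printf '%s\n' "$combined"
rm -rf "$dir"
```

Writing one output file per input keeps the results separate, so a failed or interrupted run only needs the missing files to be redone.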