String matching using awk
Posted by shekhar2010us in Shell Programming and Scripting, Thursday 30th of August 2012, 12:26 AM

Hello,

I am working with the Google n-gram data set, which is hundreds of GB in size. Before using it with Java, I want to filter it with a shell script.

Here is a sample line in the file:
HTML Code:
2.55  1.57        1992        10        20        30
The first two values (2.55 and 1.57) are separated by a space and together form the first tab-separated field; the remaining fields are separated by tabs. I need all the lines where:
a) The second tab-separated field (1992 in this case) is greater than 1990.
b) Both elements of the first tab-separated field (2.55 and 1.57 in this case) satisfy two conditions:
i) Each consists only of letters (no digits, no punctuation).
ii) Neither is present in an ArrayList of strings (say 'list').
Can anyone help?

Thanks,
Shekhar

---------- Post updated at 11:56 PM ---------- Previous update was at 11:51 PM ----------

I have 300 files, each containing tens of millions of such lines (total data size: more than 500 gigabytes), so I need an efficient method. That is the only reason I want to do this in the shell; otherwise I could easily have done it in Java.

---------- Post updated 08-30-12 at 12:06 AM ---------- Previous update was 08-29-12 at 11:56 PM ----------

Here is what I have so far.

For 2nd tab field > 1990:
Code:
cat InputFile | awk -F"\t" '{if ($2 > 1990) print $0}' > OutputFile

For 1st tab field containing only letters:
Code:
cat InputFile | awk -F"\t" '{if ($1 == "[a-zA-Z ]+") print $0}' > OutputFile

But this is not working. How does pattern matching work in awk when it is used inside an 'if' to match against a field?
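A minimal sketch of the difference, assuming a small hypothetical test file named sample.tsv: in awk, `==` compares the field with the literal text on the right, while `~` treats the right-hand side as a regular expression.

```shell
# Hypothetical sample file: the first tab field is "aa bb" on line 1
# and "2.55 1.57" on line 2.
printf 'aa bb\t1991\t10\t15\t20\n2.55 1.57\t1992\t10\t20\t30\n' > sample.tsv

# == is a plain string comparison: $1 would have to be exactly the text
# "[a-zA-Z ]+", so no line is printed.
awk -F'\t' '$1 == "[a-zA-Z ]+"' sample.tsv

# ~ is the regex match operator; ^ and $ anchor the pattern so the whole
# field must consist of letters and spaces. Only the "aa bb" line is printed.
awk -F'\t' '$1 ~ /^[a-zA-Z ]+$/' sample.tsv
```

Without the `^` and `$` anchors the pattern would match any field that merely contains a letter or a space somewhere.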

---------- Post updated at 12:26 AM ---------- Previous update was at 12:06 AM ----------

I have gotten this far:

Code:
awk -F"\t" '{if ($1 ~ /^[a-zA-Z ]+$/ && $2 > 1990) print}' InputFile > OutputFile

The last thing remaining is to check that neither of the two space-separated words in the first tab field is present in a list.

HTML Code:
aa bb       1991      10       15        20
I have a list of strings and want to check that it contains neither of the two words 'aa' and 'bb'. This check has to be added to the code above.
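One possible way to add the list check, as a sketch rather than a definitive answer. It assumes the list is stored one word per line in a file called list.txt (a hypothetical name), and that the first tab field is exactly two words separated by a single space, which is slightly stricter than the pattern above. The list is read first into an awk array, and each data line is then tested with the `in` operator.

```shell
# Hypothetical exclusion list, one word per line.
printf 'aa\nthe\n' > list.txt

# Small sample input in the same layout as the real data.
printf 'aa bb\t1991\t10\t15\t20\ncc dd\t1995\t1\t2\t3\ncc dd\t1985\t1\t2\t3\nc3 dd\t1995\t1\t2\t3\n' > InputFile

awk -F'\t' '
    # While reading the first file (list.txt), store each word as an array key.
    NR == FNR { skip[$0]; next }
    # For the data file: year after 1990 and first field made of two letters-only words.
    $2 > 1990 && $1 ~ /^[a-zA-Z]+ [a-zA-Z]+$/ {
        split($1, w, " ")                       # w[1] and w[2] are the two words
        if (!(w[1] in skip) && !(w[2] in skip)) # keep the line only if neither is listed
            print
    }
' list.txt InputFile > OutputFile
```

With this sample data only the `cc dd ... 1995` line should end up in OutputFile. The `NR == FNR` test assumes list.txt is not empty; with an empty list the first data file would be mistaken for the list. Each array lookup costs about the same regardless of list size, so every data file is read only once.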

Thanks.
 
