String matching using awk Post: 302693869

Posted by shekhar2010us in Shell Programming and Scripting on Thursday 30th of August 2012 12:26:00 AM

Hello,

I am working with the Google n-gram data set, which is hundreds of GB in size. Before using it with Java, I want to filter it with a shell script.

Here is a sample line in the file:

HTML Code:

2.55  1.57        1992        10        20        30

The first two values (2.55 and 1.57) are separated by a space and the rest are separated by tabs. I need all the lines where:
a) The second tab-separated field (1992 in this case) is greater than 1990.
b) Both elements of the first tab-separated field (2.55 and 1.57 in this case) satisfy two conditions:

i) Both consist only of letters (no digits, no punctuation).

ii) Neither is present in an ArrayList of strings (say 'list').

Can anyone help?

Thanks,
Shekhar

---------- Post updated at 11:56 PM ---------- Previous update was at 11:51 PM ----------

I have 300 files, each containing tens of millions of such lines (total data size: more than 500 gigabytes), so I need an efficient method to do this. That is the only reason I want to do this in the shell; otherwise I could easily have done it in Java.

---------- Post updated 08-30-12 at 12:06 AM ---------- Previous update was 08-29-12 at 11:56 PM ----------

Here is what I have so far.

For 2nd tab field > 1990:

Code:

cat InputFile | awk -F"\t" '{if ($2 > 1990) print $0}' > OutputFile

For the 1st tab field containing only letters:

Code:

cat InputFile | awk -F"\t" '{if ($1 == "[a-zA-Z ]+") print $0}' > OutputFile

But this is not working. How does pattern matching work in awk when it is used inside 'if' to match against a field?
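
For context on why the test above fails: in awk, `==` compares the field with the literal string on its right, so it is only true when the field is exactly the text `[a-zA-Z ]+`. The `~` operator is the one that treats its right-hand side as a regular expression. A minimal sketch of the difference, using made-up sample lines (the variable names are only for illustration):

```shell
# Two made-up, tab-separated sample lines: one alphabetic first field, one numeric.
sample=$(printf 'aa bb\t1991\t10\n2.55 1.57\t1992\t10\n')

# '==' is a literal string comparison, so no line is selected here.
literal=$(printf '%s\n' "$sample" | awk -F'\t' '$1 == "[a-zA-Z ]+"')

# '~' applies the regular expression; ^ and $ anchor it to the whole field.
regex=$(printf '%s\n' "$sample" | awk -F'\t' '$1 ~ /^[a-zA-Z ]+$/')

printf 'literal: [%s]\n' "$literal"
printf 'regex:   [%s]\n' "$regex"
```

Only the `~` form selects the `aa bb` line; the `==` form selects nothing.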

---------- Post updated at 12:26 AM ---------- Previous update was at 12:06 AM ----------

I have gotten this far:

Code:

awk -F"\t" '{if ($1 ~ /^[a-zA-Z ]+$/ && $2 > 1990) print}' InputFile > OutputFile

The last remaining step is to check that neither of the space-separated words in the first tab field is present in a list.

HTML Code:

aa bb       1991      10       15        20

I have a list of strings and want to check that it contains neither of the two words 'aa' and 'bb'. This check has to be added to the code above.

Thanks.
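
One possible way to add the list check is sketched below. It assumes the list is available as a plain text file with one word per line; the file name `list.txt`, the array name `skip`, and the sample data are all hypothetical and not part of the original data set. The list is read first into an awk array, and each data line is then tested against it.

```shell
# Hypothetical list of excluded words, one per line.
printf 'bb\nthe\n' > list.txt

# Made-up sample input: "<word1> <word2>" TAB year TAB counts...
printf 'aa bb\t1991\t10\t15\t20\n' >  InputFile
printf 'aa cc\t1991\t10\t15\t20\n' >> InputFile
printf 'aa cc\t1985\t10\t15\t20\n' >> InputFile
printf 'a1 cc\t1995\t10\t15\t20\n' >> InputFile

awk -F'\t' '
    # First file (list.txt): remember every listed word as an array key.
    # Note: this idiom assumes list.txt is not empty.
    NR == FNR { skip[$1]; next }

    # Remaining files: keep lines with year > 1990 whose first tab field
    # is exactly two alphabetic words, neither of which is in the list.
    $2 > 1990 && $1 ~ /^[a-zA-Z]+ [a-zA-Z]+$/ {
        split($1, w, " ")
        if (!(w[1] in skip) && !(w[2] in skip)) print
    }
' list.txt InputFile > OutputFile

cat OutputFile
```

With this sample data only the `aa cc` line from 1991 is written to OutputFile. The `in` test is an array key lookup rather than a scan of the list, so the per-line cost should not grow noticeably with the list length, and several data files can be named after `list.txt` in a single awk invocation.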
