I am working with the Google n-gram data set, which is hundreds of GB in size. Before using it with Java, I want to filter it with a shell script.
Here is a sample line in the file:
HTML Code:
2.55 1.57 1992 10 20 30
The first two values (2.55 and 1.57) are separated by a space and the rest are separated by tabs. I need all the lines where:
a) The second tab-separated field (1992 in this case) is greater than 1990
b) Both elements of the first tab-separated field (2.55 and 1.57 in this case) satisfy two conditions:
i) Each consists only of letters (no digits, no punctuation)
ii) Neither is present in an ArrayList of strings (say 'list').
Can anyone help?
Thanks,
Shekhar
---------- Post updated at 11:56 PM ---------- Previous update was at 11:51 PM ----------
I have 300 files, each containing tens of millions of such lines (total data size: more than 500 gigabytes), so I need an efficient method to do this. That's the only reason I want to do this in the shell; otherwise I could easily have done it in Java.
---------- Post updated 08-30-12 at 12:06 AM ---------- Previous update was 08-29-12 at 11:56 PM ----------
This is what I have so far.
For the 2nd tab field > 1990:
For the 1st tab field containing only letters:
But this is not working. How does pattern matching work in awk when it is used inside an 'if' to match against a field?
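On the pattern-matching question: in awk a field is tested against a regular expression with the `~` operator, and that test can sit inside an `if` like any other condition. The following is only a minimal sketch, not the command from this thread. The function name `filter_ngrams` is made up for illustration, and it assumes the first tab field is exactly two words separated by a single space.

```shell
# Hypothetical helper: keep lines whose 2nd tab field is greater than 1990
# and whose 1st tab field is two purely alphabetic words.
filter_ngrams() {
    awk -F'\t' '{
        # "$2 + 0" forces a numeric comparison; "~" matches a field against a regex.
        # The anchors ^ and $ make the pattern cover the whole field.
        if ($2 + 0 > 1990 && $1 ~ /^[A-Za-z]+ [A-Za-z]+$/)
            print
    }' "$@"
}

# Example run: of these three lines only the "aa bb" line from 1991 is printed.
printf '2.55 1.57\t1992\t10\t20\t30\naa bb\t1991\t10\t15\t20\naa bb\t1989\t1\t2\t3\n' | filter_ngrams
```

Without the `^` and `$` anchors the regex would match any field that merely contains two letters around a space, which is a common reason for this kind of test appearing not to work.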
---------- Post updated at 12:26 AM ---------- Previous update was at 12:06 AM ----------
I have gotten this far:
The last thing remaining is to check that neither of the two space-separated words in the first tab field is present in a list.
HTML Code:
aa bb 1991 10 15 20
I have a list of strings and want to check that the list does not contain either of the two words 'aa' and 'bb'. I have to add this check to the code above.
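One way to express the list check in awk is to load the list into an associative array and test membership with `in`. This is a sketch under assumptions, not the answer given in this thread: it assumes the list is stored in a plain text file with one word per line, and the names `filter_with_list`, `listfile` and `banned` are hypothetical.

```shell
# Hypothetical helper. Usage: filter_with_list LISTFILE [DATAFILE...]
filter_with_list() {
    listfile=$1
    shift
    awk -F'\t' -v listfile="$listfile" '
        BEGIN {
            # Load every listed word as an array key (assumes one word per line).
            while ((getline word < listfile) > 0)
                banned[word]
            close(listfile)
        }
        $2 + 0 > 1990 && $1 ~ /^[A-Za-z]+ [A-Za-z]+$/ {
            # Split the first tab field into its two words.
            split($1, w, " ")
            # Print only if neither word is a key of the array.
            if (!(w[1] in banned) && !(w[2] in banned))
                print
        }
    ' "$@"
}

# Example run with a temporary list containing "bb" and "the":
# "aa bb" is rejected because "bb" is listed, so only the "cc dd" line is printed.
words=$(mktemp)
printf 'bb\nthe\n' > "$words"
printf 'aa bb\t1991\t10\t15\t20\ncc dd\t1995\t1\t2\t3\n' | filter_with_list "$words"
rm -f "$words"
```

The array lookup is a hash lookup, so the cost per line should stay roughly constant even if the list is long.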
Awesome answer. I knew 'awk' could solve this problem. To filter one file, Java took 27 minutes and the shell took 12 seconds :-)
I added a filter on the length of both words in the first field, and my final answer is:
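A length check of this kind might look like the sketch below. It is not the final command from this post: it assumes the filter is a minimum length applied to each word, and the function name `filter_min_length` and the threshold passed to it are hypothetical.

```shell
# Hypothetical helper. Usage: filter_min_length MINLEN [DATAFILE...]
filter_min_length() {
    minlen=$1
    shift
    awk -F'\t' -v minlen="$minlen" '
        $2 + 0 > 1990 && $1 ~ /^[A-Za-z]+ [A-Za-z]+$/ {
            split($1, w, " ")
            # Assumed rule: both words must be at least minlen characters long.
            if (length(w[1]) >= minlen && length(w[2]) >= minlen)
                print
        }
    ' "$@"
}

# Example run with an assumed threshold of 3:
# "a bb" is too short, so only the "aaa bbb" line is printed.
printf 'a bb\t1991\t1\t2\t3\naaa bbb\t1992\t4\t5\t6\n' | filter_min_length 3
```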
Thanks a lot.
---------- Post updated at 01:41 AM ---------- Previous update was at 01:24 AM ----------
I just ran the awk command on a bigger file and it took 154 seconds to get the results.
Thanks to @guruprasadpr.
I am sure Java would have taken hours to do this.