I am working with the Google Ngram data set, which is hundreds of GB in size. Before using it with Java, I want to filter it with a shell script.
Here is a sample line in the file:
2.55 1.57	1992	10	20	30
The first two fields (2.55 and 1.57) are separated by a space and the rest are separated by tabs. I need all the lines where:
a) The second tab-separated field (1992 in this case) is greater than 1990
b) Both elements of the first tab-separated field (2.55 and 1.57 in this case) satisfy two conditions:
i) Both consist only of letters (no digits, no punctuation)
ii) Neither is present in an ArrayList of strings (say 'list').
Can anyone help?
Thanks,
Shekhar
---------- Post updated at 11:56 PM ---------- Previous update was at 11:51 PM ----------
I have 300 files, each containing tens of millions of such lines (total data size: more than 500 gigabytes), so I need an efficient method to do this. That is the only reason I wanted to do this in the shell; otherwise I could easily have done it in Java.
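For reference, the conditions above can be sketched in awk roughly as follows. This is only a minimal sketch under a few assumptions, and not necessarily the script used later in the thread: the function name filter_ngrams is made up, the 'list' is assumed to live in a plain text file with one word per line, and "alphabets" is taken to mean ASCII letters only.

```shell
# filter_ngrams LIST_FILE [NGRAM_FILE...]
# LIST_FILE is a hypothetical file holding one excluded word per line.
# With no NGRAM_FILE arguments, input is read from stdin.
filter_ngrams() {
    list_file=$1
    shift
    # LC_ALL=C keeps [A-Za-z] limited to ASCII letters (an assumption).
    LC_ALL=C awk -F'\t' -v list_file="$list_file" '
        BEGIN {
            # Load the excluded words once into an array used as a set.
            while ((getline word < list_file) > 0)
                excluded[word]
            close(list_file)
        }
        # Condition a): the year in the second tab field is after 1990.
        $2 + 0 > 1990 {
            # The first tab field must hold exactly two space-separated words.
            if (split($1, w, " ") != 2)
                next
            # Condition b-i): both words contain letters only.
            if (w[1] !~ /^[A-Za-z]+$/ || w[2] !~ /^[A-Za-z]+$/)
                next
            # Condition b-ii): neither word is in the excluded list.
            if ((w[1] in excluded) || (w[2] in excluded))
                next
            print
        }
    ' "$@"
}
```

It could be called as filter_ngrams list.txt file1 file2 > filtered.txt (file names here are placeholders). Each line is tested on its own, so the 300 files can be filtered independently of one another.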
---------- Post updated 08-30-12 at 12:06 AM ---------- Previous update was 08-29-12 at 11:56 PM ----------
---------- Post updated at 01:41 AM ---------- Previous update was at 01:24 AM ----------
I just ran the awk on a bigger file and it took 154 seconds to get the results.
Thanks to @guruprasadpr.
I am sure Java would have taken hours to do this.
In the awk below I am trying to get the average of the sum of $7 if the string in $4 matches in the line below it. The --- in the desired output is not needed; it is just there to illustrate the calculation. The awk executes and produces the current output. I am not sure why the middle line is skipped and the... (2 Replies)
Using the awk below I am able to combine all the matching dates in $1, but I cannot seem to remove the non-matching from the file. Thank you :).
file
20161109104500.0+0000,x,5631
20161109104500.0+0000,y,2
20161109104500.0+0000,z,2
20161109104500.0+0000,a,4117... (3 Replies)
Hello, I want to pick up rows from the input that meet these conditions:
1) col2 is two copies of col1 joined with "/"
2) col3 is a joined string with the same string on each side of the "/" symbol
3) col2 is not equal to col3
input:
TG TG/TG TG/TGG
C C/C C/CG
C C/G CA/CA
C C/C CA/CA
AG AG/AG... (3 Replies)
Here is what I want to achieve. Consider a file with the contents below. The file is large, about 60 MB.
cat dump.sql
INSERT INTO `table1` (`id`, `action`, `date`, `descrip`, `lastModified`) VALUES (1,'Change','2011-05-05 00:00:00','Account Updated','2012-02-10... (10 Replies)
Hi,
I want to know how to compare a string from a file with an input string.
I'm trying the following code:
file_no=`paste -s -d "||||\n" a.txt | cut -c 1`
#it will return collection number from file
echo "enter number"
read " curr_no"
if ; then
echo " current number already present"
fi
... (4 Replies)
I have a string like ab or abc, of whatever length. I want to know whether another string (for example, abcfghijkl, OR a<space> bcfghijkl ab<space> cfghijkl OR a<space>bcfghijkl OR ab<space> c<space> fghijkl) starts with ab or abc... spaces might exist in the longer string... If so, i... (4 Replies)
I have a requirement for a shell script in which I need to read the file name (i.e. ls -t | head -1) and match that file name against some delimited values that are in a separate file.
For example, I am reading the file name (ls -t | head -1); after that I need to read one more sequential file which... (2 Replies)
Hello,
I have a program where I have to get a character from the user, check it against the word I have, and then put the character into a blank at the same position it occupies in the word. (7 Replies)
Hi guys, I hope you can help me with my problem.
I have a text file that contains lines like this:
78 ANGELO -809.05
79 ANGELO2 -5,000.06
I need to find all occurrences of amounts that are negative and replace them with x's:
78 ANGELO xxxxxxx
79... (4 Replies)