Thanks for your inputs.
The suggested solution works for me.
The project involves searching a list of keywords in a master input file.
1. master_file.txt containing strings. Record Count : 20,000,000 String Records.
2. keywords.txt containing keywords. Record Count : 200,000 Unique Keywords.
I run a shell script that reads each keyword in turn and runs the above-mentioned command to search master_file.txt, appending the desired output.
The end result is being achieved as per our expectations.
However, I am concerned about the performance and response time of this utility.
I tried with 16,000 keywords and 20,000 master records, and the process took around 25 minutes.
I am looking to reduce this number and have considered the following:
1. Split the file into n parts, run the searches in parallel, and then collate the results?
2. Possible tweaks to the commands?
3. Is text mining in shell correct from a design and feasibility perspective?
You shouldn't run one command (= one process) per keyword; that's definitely too inefficient. On the other hand, the numbers you indicated might be too high for grep or awk. There are programs, applications, and databases out there designed for text mining; I'm sure they'd be more appropriate for your task.
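To illustrate the point about avoiding one process per keyword, here is a minimal sketch that hands the whole keyword list to a single grep. The sample data is made up for the demonstration, and it assumes the keywords are plain fixed strings matched anywhere in a line, which may not be what the original command did. Whether this scales to 200,000 keywords against 20,000,000 records depends on the grep implementation and the available memory, so treat it as something to measure rather than a guaranteed fix.

```shell
#!/bin/sh
# Hypothetical sample data standing in for the real files.
workdir=$(mktemp -d)
printf '%s\n' 'alpha beta' 'gamma delta' 'epsilon zeta' > "$workdir/master_file.txt"
printf '%s\n' 'beta' 'zeta' > "$workdir/keywords.txt"

# One grep process handles every keyword in a single pass:
#   -F  treat each keyword as a fixed string, not a regular expression
#   -f  read the list of keywords from a file, one per line
result=$(grep -F -f "$workdir/keywords.txt" "$workdir/master_file.txt")
printf '%s\n' "$result"

rm -rf "$workdir"
```

Matching here is by substring. GNU and BSD grep also accept -w if only whole-word matches are wanted.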
Running in parallel is certainly one option, and I am exploring that.
I certainly agree there are specific programs designed for this, but I wanted to invest some time in finding out what can be done at the ground level.
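The split, search in parallel, and collate idea could be sketched roughly as follows. The sample data, the chunk size of 2 lines, and the limit of 4 parallel jobs are all arbitrary choices for the demonstration, and the sketch assumes an xargs that supports -P (GNU and BSD versions do; it is not required by POSIX).

```shell
#!/bin/sh
# Hypothetical sample data standing in for the real files.
workdir=$(mktemp -d)
printf '%s\n' 'alpha beta' 'gamma delta' 'epsilon zeta' 'eta theta' > "$workdir/master_file.txt"
printf '%s\n' 'beta' 'theta' > "$workdir/keywords.txt"

# Split the master file into chunks of 2 lines each
# (a real run would use a much larger chunk size).
split -l 2 "$workdir/master_file.txt" "$workdir/chunk_"

# Search the chunks in parallel, at most 4 at a time.
# Each job writes to its own .out file so results never interleave;
# "|| true" keeps a chunk with no matches from counting as a failure.
printf '%s\n' "$workdir"/chunk_* |
    xargs -P 4 -I{} sh -c 'grep -F -f "$1" "$2" > "$2.out" || true' sh "$workdir/keywords.txt" {}

# Collate the per-chunk results in chunk order.
result=$(cat "$workdir"/chunk_*.out)
printf '%s\n' "$result"

rm -rf "$workdir"
```

Writing one output file per chunk and concatenating afterwards keeps the final result in the same order as the master file, at the cost of some temporary disk space.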
I find shell script plus awk/sed/grep a great way to prototype a concept, but it's worth knowing the limitations of the tools you are using. I'd suggest sticking with a smaller subset and finalizing your prototype while, in the background, you start researching text mining tools, relational databases, etc.
I am trying to change the number in bold to 2400
01,000300032,193631306,190619,0640,1,80,,2/
02,193631306,000300032,1,190618,0640,CAD,2/
I'm not sure if sed or awk is the answer. I was going to use sed and do a character count up to that point, but that column directly before 0640 might... (8 Replies)
Today I changed the forum mysql database to permit 2 letter searches:
ft_min_word_len=2
I rebuilt the mysql search indexes as well.
Then, I added a "quick search bar" at the top of each page.
I have tested this and two letter searches are working; but it's not perfect,... (1 Reply)
I have an xml file dumped from rrd file, that I want to "patch" so the xml file doesn't contain any blank hole in the resulting graph of the rrd file.
Here is the file.
<!-- 2015-10-12 14:00:00 WIB / 1444633200 --> <row><v> 4.0419731265e+07 </v><v> 4.5045912770e+06... (2 Replies)
Have Pipe Delimited File:
> BRYAN BAKER|4/4/2015|518 VIRGINIA AVE|TEST
> JOE BAXTER|3/30/2015|2233 MockingBird RD|ROW2
On the 3rd column, where the address is located, I want to add a space after every numeric value - basically doing a "s//&\ / ":
> BRYAN BAKER|4/4/2015|5 1 8 VIRGINIA AVE|TEST
> JOE... (5 Replies)
Can you search AWK array elements and return each index value for that element?
For example, an array named car would have index make and element engine. I want to return all makes with engine size 1.6.
The array would look like this:
BMW 1.6
BMW 2.0
BMW 2.5
AUDI 1.8
AUDI 1.6
... (11 Replies)
I need to be able to search for a string in the first column and, if that string exists, then replace the nth column with "-9.99".
AW12000012012 2.38 1.51 3.01 1.66 0.90 0.91 1.22 0.82 0.57 1.67 2.31 3.63 0.00
AW12000012013 1.52 0.90 1.20 1.34 1.21 0.67 ... (14 Replies)
Hi,
I am searching for an awk script that computes the mean values for the $2 column, depending on the values in the $1 column. It should also delete the unnecessary lines after computing...
An example (for some reason I can't use the code tag button):
cat list.txt
1 10
1 30
1 20... (2 Replies)
I face the challenge of filtering ~70 million sequence rows, and I want to use awk with these conditions:
1) longest string of each pattern in column 2, ignore any sub-string, as the index;
2) all the unique patterns after 1);
3) print the whole row;
input:
1 ABCDEFGHI longest_sequence1
2 ABCDEFGH... (12 Replies)
Hi,
I have the following text file:
8 T1mapping_flip02 ok 128 108 30 1 665000-000008-000001.dcm
9 T1mapping_flip05 ok 128 108 30 1 665000-000009-000001.dcm
10 T1mapping_flip10 ok 128 108 30 1 665000-000010-000001.dcm
11 T1mapping_flip15 ok 128 108 30... (2 Replies)
Hi,
I am having trouble converting a text file. I have been working on this for the whole day now, and I still couldn't make it work.
Here is how the text file looks:
_______________________________________________________
DEVICE STATUS INFORMATION FOR LOCATION 1:
OPER STATES: Disabled E:Enabled ... (5 Replies)