awk : search last index in specific column

11-13-2013

All,

Thanks for your input.
The suggested solution works for me:

Code:

awk -F\| -v X="an" 'match ($NF, ".*"X) {print $0, RLENGTH-length(X)+1}' OFS=\| file
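
For anyone following along, here is a small sketch of what this one-liner computes. Because .* is greedy, match() finds the longest prefix of the last field that ends in X, so RLENGTH-length(X)+1 is the start position of the last occurrence of X. The sample record is made up for illustration, and OFS is passed with -v here (an assumption, not the original form) so the example can read from a pipe.

```shell
# Hypothetical sample record; the last field is "canman".
line='banana|xanadu|canman'

# ".*an" matches "canman" in full (RLENGTH = 6), so 6 - 2 + 1 = 5,
# which is where the last "an" starts.
result=$(printf '%s\n' "$line" |
  awk -F'|' -v OFS='|' -v X="an" \
    'match($NF, ".*" X) { print $0, RLENGTH - length(X) + 1 }')

printf '%s\n' "$result"   # prints: banana|xanadu|canman|5
```

Note that X is used as a regular expression here, so a keyword containing characters such as . or * would need escaping.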

The project involves searching a master input file for a list of keywords.
1. master_file.txt contains the strings to search. Record count: 20,000,000.
2. keywords.txt contains the keywords. Record count: 200,000 unique keywords.

I run a shell script that reads each keyword in turn and runs the above command against master_file.txt, appending the matches to the output.

The results are as expected.
However, I am concerned about the performance and response time of this utility.
In a test with 16,000 keywords and 20,000 master records, the process took around 25 minutes.

I am looking to reduce this time and have considered the following:
1. Split the file into n parts, run the searches in parallel, and then collate the results?
2. Are there possible tweaks to the commands?
3. Is text mining in shell sound from a design and feasibility perspective?

Please share your thoughts.

tarun.trehan

11-13-2013

You shouldn't run one command (= one process) per keyword; that is definitely too inefficient. On the other hand, the numbers you indicated might be too high for grep or awk. There are programs, applications, and databases out there designed for text mining, and I'm sure they'd be more appropriate for your task.
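
To illustrate the "one process" point, here is a minimal sketch, not a tuned solution. It loads all keywords into an awk array and then scans the master file once. The sample data, the lastindex helper, and the extra keyword column in the output are assumptions made for this example. It uses index() so keywords are treated as literal strings rather than regular expressions, and the NR == FNR idiom assumes keywords.txt is not empty.

```shell
#!/bin/sh
# Hypothetical sample data standing in for keywords.txt and master_file.txt.
tmp=$(mktemp -d)
printf 'an\nxy\n' > "$tmp/keywords.txt"
printf 'a|b|canman\nc|d|proxy\ne|f|none\n' > "$tmp/master_file.txt"

# One awk process: load every keyword first, then scan the master file once.
result=$(awk -F'|' -v OFS='|' '
  # Return the start of the last literal occurrence of t in s, or 0 if absent.
  function lastindex(s, t,    pos, off, i) {
    pos = 0; off = 0
    while ((i = index(substr(s, off + 1), t)) > 0) {
      pos = off + i   # absolute position of this occurrence
      off = pos       # resume the search one character later
    }
    return pos
  }
  # First file (keywords.txt): remember each non-empty keyword.
  NR == FNR { if (length($0)) kw[++n] = $0; next }
  # Second file (master_file.txt): test every keyword against the last field.
  {
    for (k = 1; k <= n; k++) {
      p = lastindex($NF, kw[k])
      if (p) print $0, kw[k], p
    }
  }
' "$tmp/keywords.txt" "$tmp/master_file.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

This still performs keywords x records comparisons, so it only removes the cost of starting a process and re-reading the master file for every keyword. At 200,000 keywords against 20,000,000 records that may well remain too slow, which is where the dedicated tools come in.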

RudiC

11-18-2013

Thanks Rudi.

Running in parallel is certainly one option, and I am exploring that.
I certainly agree that there are programs designed for this, but I wanted to invest some time exploring it at the ground level first.
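
A rough sketch of the split / run in parallel / collate idea, assuming a single keyword and a tiny made-up master file. The chunk size, directory names, and sample records are all hypothetical.

```shell
#!/bin/sh
# Hypothetical sample data; chunk size and keyword are illustrative only.
tmp=$(mktemp -d)
mkdir "$tmp/chunks" "$tmp/out"
printf 'a|b|canman\nc|d|banana\ne|f|none\ng|h|an\n' > "$tmp/master_file.txt"

# Split the master file into pieces of 2 lines each (chunk.aa, chunk.ab, ...).
split -l 2 "$tmp/master_file.txt" "$tmp/chunks/chunk."

# Search every piece in its own background process.
for chunk in "$tmp"/chunks/chunk.*; do
  awk -F'|' -v OFS='|' -v X="an" \
    'match($NF, ".*" X) { print $0, RLENGTH - length(X) + 1 }' \
    "$chunk" > "$tmp/out/$(basename "$chunk")" &
done
wait   # block until all background searches have finished

# Collate: the chunk names sort in the original order.
result=$(cat "$tmp"/out/chunk.*)
printf '%s\n' "$result"
rm -rf "$tmp"
```

With real data the number of chunks would presumably be kept close to the number of CPU cores, since starting far more background jobs than cores is unlikely to help.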

tarun.trehan

11-18-2013

I find shell script plus awk/sed/grep a great way to prototype a concept, but it's worth knowing the limitations of the tools you are using. I'd suggest sticking with a smaller subset to finalize your prototype while, in the background, you start researching text mining tools, relational databases, etc.

Chubler_XL
