Search and replace multiple patterns in a particular column only - efficient script


 
# 1  
Old 11-26-2014

Hi Bigshots,

I have a pattern file with two columns. I have another data file. If column 1 in the pattern file appears as the 4th column in the data file, I need to replace it (4th column of data file) with column 2 of the pattern file. If the pattern is found in any other column, it should not be replaced.

Ex:
Pattern File:
Code:
opq,098
rst,765
xyz,321

Data File:
Code:
xyz,122,913,opq,876
rst,956,921,xyz,012
456,890,903,rst,467

Output File:
Code:
xyz,122,913,098,876
rst,956,921,321,012
456,890,903,765,467

Please note, first column of data file is not to be replaced.

I achieved this by looping over the pattern file and running awk line by line, but this takes a lot of time. I want a faster script; a shell script or Perl would do.

I saw corona688's reply in the thread titled "Replace column that matches specific pattern, with column data from another file".

However, it does not address the column-specific search-and-replace requirement.

Last edited by Scrutinizer; 11-26-2014 at 03:15 PM.. Reason: CODE tags
# 2  
Old 11-26-2014
Please use code tags as required by forum rules!

Show us your unsatisfactory attempt that you want enhanced.
# 3  
Old 11-26-2014
If you have sufficient memory to load the entire pattern file into an awk array I'd solve this problem like this:

Code:
awk -F, 'FNR==NR{p[$1]=$2; next}
$4 in p { $4=p[$4] }
1' OFS=, pattern_file data_file > output_file
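
As a quick check, the command above can be run against the sample data from post #1. This is only a sketch: the scratch directory and the `result` variable are illustrative assumptions, and the file names simply mirror the ones used in the command.

```shell
#!/bin/bash
# Recreate the sample files from post #1 in a scratch directory (assumed location).
workdir=$(mktemp -d)
printf '%s\n' 'opq,098' 'rst,765' 'xyz,321' > "$workdir/pattern_file"
printf '%s\n' 'xyz,122,913,opq,876' 'rst,956,921,xyz,012' '456,890,903,rst,467' > "$workdir/data_file"

# First file (FNR==NR): build the lookup table p[old]=new.
# Second file: replace field 4 only when it is an exact key in p.
# The trailing "1" prints every line, changed or not.
result=$(awk -F, 'FNR==NR{p[$1]=$2; next}
$4 in p { $4=p[$4] }
1' OFS=, "$workdir/pattern_file" "$workdir/data_file")

printf '%s\n' "$result"
```

With the sample data this prints the three lines of the expected output file, and the `xyz` and `rst` values in column 1 are left alone because only `$4` is ever tested or assigned.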

# 4  
Old 11-26-2014
Needs testing with real data sets...
Code:
#!/bin/bash

lines=0
while read f1
do
    lines=$(( $lines + 1 ))
    kee2=$(echo $f1 | cut -d, -f4)
    val2=$(grep -e "^$kee2" trydata2)
    if [[ $val2 ]]; then
        repl=${val2##*,}
        sed -i "$lines s/\b$kee2/$repl/g" trydata1
    fi  
done < trydata1
cat trydata1

# output
# ------
# xyz,122,913,098,876
# rst,956,921,321,012
# 456,890,903,765,467


Last edited by ongoto; 11-26-2014 at 08:24 PM.. Reason: ...\b$kee2
# 5  
Old 11-26-2014
@ongoto, this is likely to be just as slow as (or slower than) the OP's original script. It also has the potential to replace partial values and values in columns other than #4.

Consider input line:
Code:
000,5xyz1,000,xyz,000
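
To see the difference on such a line, a whole-line substitution can be compared with a test on field 4 alone. This is a small sketch, assuming the `xyz,321` mapping from post #1; the unanchored `sed` expression is a simplified stand-in for a line-wide replacement, and the variable names are illustrative.

```shell
#!/bin/bash
line='000,5xyz1,000,xyz,000'

# Whole-line substitution: every occurrence of "xyz" is rewritten,
# including the one embedded in field 2.
sed_out=$(printf '%s\n' "$line" | sed 's/xyz/321/g')

# Field comparison: only field 4 is touched, and only on an exact match.
awk_out=$(printf '%s\n' "$line" | awk -F, -v OFS=, '$4 == "xyz" { $4 = "321" } 1')

printf 'sed: %s\nawk: %s\n' "$sed_out" "$awk_out"
```

The sed result is `000,53211,000,321,000`, while the awk result is `000,5xyz1,000,321,000`.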


Last edited by Chubler_XL; 11-26-2014 at 06:30 PM..
# 6  
Old 11-26-2014
@ Chubler_XL
You are absolutely right. I was only using the provided data.

This might work for the situation you presented...
sed -i "$lines s/\b$kee2/$repl/g" trydata1
I'll do the edit.

But that still doesn't cure the 'other columns' bit, does it?

---------- Post updated at 07:15 PM ---------- Previous update was at 04:22 PM ----------

The theory here is to reduce disk reads somewhat.
Disk writes can't be helped unless one builds the file
in memory and writes it out all at once.
Bash speed just is what it is... and so is my skill set.

Your AWK example is 9 times faster on my machine! That's BIG!

Code:
#!/bin/bash

# Load both data files into memory
< trydata1 mapfile data1
< trydata2 mapfile data2
for f1 in ${data1[*]};
do
    lines=$(( $lines + 1 ))
    kee2=$(echo $f1 | cut -d, -f4)
    for val2 in ${data2[*]};
    do
        if [[ $val2 =~ ^$kee2 ]]; then
            repl=${val2##*,}
            sed -i "$lines s/\b$kee2/$repl/g" trydata1
        fi
    done
done
cat trydata1

# Using this data set with no column issues...
# ----------------------
# xyz,122,913,opq,876
# rst,956,921,xyz,012
# 456,890,903,rst,467
# 000,5xyz1,000,xyz,000
# 4567,5rst1,opq,xyz,000
# rst,5opq1,rst,02opq,000
# 000,5xyz1,xyz,opq,000

# real    0m0.028s
# user    0m0.008s
# sys    0m0.009s

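# On a machine running @ 2GHz
# If my math is right...
# 0.028 seconds to process 7 lines = 0.004 s per line
# time to process 30,000 lines ~ 0.004 x 30,000 = 120 s = 2 minutes
# (assuming linear scaling; each sed -i rewrites the whole file,
# so larger inputs will likely scale worse than that)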


Last edited by ongoto; 11-26-2014 at 11:37 PM..
# 7  
Old 12-01-2014
@ongoto, reason #1 for the inefficiency is the many rewrites of the output file with sed.
The following uses bash to write the output only once, while still using awk for a fast and precise lookup in the pattern file. (The awk call could be replaced by an inner while loop in bash, using the same technique as the outer loop.)
Code:
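while IFS=, read -r -a x
do
 # Look up field 4 (x[3]) in the pattern file; exact match on column 1
 lookup=$(awk -F, -v s="${x[3]}" '$1==s {print $2; exit}' pattern_file)
 [ -n "$lookup" ] && x[3]=$lookup
 # Rejoin the fields with commas (safe even if a field contains spaces)
 (IFS=,; echo "${x[*]}")
done < data_file > output_file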

Of course, awk is much faster than bash, and this task is a perfect fit for awk.
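
For comparison, the per-line awk call can be avoided altogether by loading the pattern file into a bash associative array once and writing the output once. This is only a sketch, not a method posted in the thread: it assumes bash 4 or later, and the scratch directory and the names `map`, `old`, `new`, and `key` are illustrative.

```shell
#!/bin/bash
# Requires bash 4 or later for associative arrays (declare -A).
workdir=$(mktemp -d)
printf '%s\n' 'opq,098' 'rst,765' 'xyz,321' > "$workdir/pattern_file"
printf '%s\n' 'xyz,122,913,opq,876' 'rst,956,921,xyz,012' '456,890,903,rst,467' > "$workdir/data_file"

# Load the pattern file into memory once: map[old]=new.
declare -A map
while IFS=, read -r old new
do
    [ -n "$old" ] && map[$old]=$new
done < "$workdir/pattern_file"

# Use a comma for both splitting (read -a) and joining ("${x[*]}"),
# then restore the previous IFS afterwards.
old_ifs=$IFS
IFS=,
while read -r -a x
do
    key=${x[3]}
    # Replace field 4 (index 3) only when it is an exact key in the map.
    if [ -n "$key" ] && [ -n "${map[$key]+set}" ]; then
        x[3]=${map[$key]}
    fi
    printf '%s\n' "${x[*]}"
done < "$workdir/data_file" > "$workdir/output_file"
IFS=$old_ifs

cat "$workdir/output_file"
```

Each file is read once and no external process is started per line, but this is still expected to be slower than awk on large inputs. One caveat: a data line ending in a comma may lose its trailing empty field when it is split and rejoined this way.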