Shell script for search and replace by field


 
# 1  
Old 09-30-2012
Shell script for search and replace by field

Hi,
I have an input file with the data below, and a rules file describing a search and replace to apply to each field of the input, based on an exact value or a pattern.
Could you please help me with a Unix script that reads the input file and the rules file and then creates the output and reject files based on the rules?
Input file:
Code:
field1  field2
rc11    rc12
rc$21  rc#22
XX31   yy32
rc41    r!42

rules file:
Code:
field   search   condition   replace
field1 XX         Equals      rc
field1  $          contains   
field2  #          contains
field2  !           contauns  c
field2 yy          equals      rc

Output
Code:
field1 field2
rc11   rc12
rc21   rc22
rc31  rc32
rc41   rc42

reject file:
Code:
field1 field2
rc$21 rc#22
XX31 yy32
rc41 r!42

Appreciate your help with this.

Thank you!

Last edited by Scrutinizer; 09-30-2012 at 04:48 PM.. Reason: code tags
# 2  
Old 09-30-2012
Please use code tags when you post examples of file contents and program fragments!

What are the field separators in your input files and in your rules file? What field separator do you want in the output and reject files?

Am I correct in assuming that the reject file is supposed to contain the original contents of every line that was changed by one or more rules in the rules file while producing the output file?

What is the distinction between the conditions "Equals" (or "equals") and "contains" (or "contauns")? For either condition, it looks like whenever the string in a rule's search field is found in the input field named by that rule, the matched text is replaced by the contents of the rule's replace field.

Am I correct in assuming that the "contauns" was a typo?

Are the contents of the condition column supposed to be case insensitive, or was the "Equals" also a typo?
# 3  
Old 10-05-2012
Don,
- The field separator is pipe (|). My apologies for not putting that in the initial post.
- Yes, the reject file is supposed to contain the original contents of every line that was changed by one or more rules in the rules file while producing the output file.
- My apologies for the typos. The condition will be either 'Contains' or 'Equals'; there are no other conditions.
- Yes, "contauns" was a typo.
- As mentioned above, it is safe to assume only the two conditions 'Contains' and 'Equals'. The example input given in my earlier post has typos, as you pointed out.
-The distinction between two conditions:
1. 'Equals' is used for exact string match. (example, if field1 value equals XX, then replace it with rc).
2. 'Contains' is used for pattern match (example, if field1 value contains $, replace it with <blank>, example, field1 value of rc$21 will become rc21 as '$' gets replaced with '' as field one contained '$'.
Appreciate all your help with any unix script solution to this.

---------- Post updated at 11:15 PM ---------- Previous update was at 11:02 PM ----------

Also I updated the files with field/line delimiters:
Input:
Code:
field1|field2|
rc11|rc12| 
rc$21|rc#21|
XX31|yy32|
rc41|r!42|

Rules:
Code:
field|search|condition|replace|
field1|XX|Equals|rc|
field1|$|contains||   
field2|#|contains||
field2|!|contains|c|
field2|yy|Equals|rc|

Output:
Code:
field1|field2|
rc11|rc12|
rc21|rc22|
rc31|rc32|
rc41|rc42|

Rejects:
Code:
field1|field2|
rc$21|rc#22|
XX31|yy32|
rc41|r!42|

Thank you!

Moderator's Comments:
Mod Comment Please use code tags next time for your code and data.

Last edited by radoulov; 10-05-2012 at 05:57 AM..
# 4  
Old 10-05-2012
OK. Let me try again. (And, PLEASE use code tags surrounding the contents of your input and output files.)

I do not see any difference in your sample output between condition Equals and condition Contains. If the string listed in the 2nd field of a line in your Rules file appears in the Input file field whose heading is named by the 1st field of that line, that string is replaced by the string in the 4th field of that line. You said:
Quote:
-The distinction between two conditions:
1. 'Equals' is used for exact string match. (example, if field1 value equals XX, then replace it with rc).
2. 'Contains' is used for pattern match (example, if field1 value contains $, replace it with <blank>, example, field1 value of rc$21 will become rc21 as '$' gets replaced with '' as field one contained '$'.
but when field1 is XX31 (which is not equal to XX), your desired output changed the XX to rc anyway??? And you say that 'Contains' is a "pattern match", but don't define what pattern matching rules are to be used. (Is it shell pattern matching, filename pattern matching, basic regular expression matching, extended regular expression matching, or something else?) In the possible solution below, I assume that anytime the string in the 2nd field in the Rules file is found in the specified field in the Input file it will be replaced by the string in the 4th field in the Rules file. This matches the behavior shown given your Input file, Rules file, and Output file even though it doesn't match your description. Since your examples do not show any difference in the expected output between Equals and Contains, the possible solution below ignores the 3rd field in the Rules file.

You say that only Equals and Contains appear in the 3rd field in the Rules file. But, your sample Rules file 3rd field is contains on three lines and is never Contains (with an upper case C). But, since the possible solution below ignores the 3rd field in the Rules file, it doesn't make any difference.

The following produces the Output file you specify when given the Input and Rules files you specified, except for two issues:
  1. you have a <space> character at the end of the line:
    Code:
    rc11|rc12|

    in Input, but there is no space at the end of the corresponding line in the Output file, and
  2. the line in your Input file:
    Code:
    rc$21|rc#21|

    is transformed into:
    Code:
    rc21|rc#21|

    and then into:
    Code:
    rc21|rc21|

    by the rules:
    Code:
    field1|$|contains||   
        and
    field2|#|contains||

    but your Output file shows:
    Code:
    rc21|rc22|

    instead.
There are corresponding differences in what this script produces in the Reject file compared to what you said should appear in the Reject file.
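
To make the two-step transformation in issue 2 concrete, here is a minimal sketch that hard-codes only those two rules (it is not the general script, and the variable name traced is just an illustration). It uses escaped literal patterns instead of the collating-symbol expressions the full script builds:

```shell
#!/bin/sh
# Sketch: apply the two "contains" rules, in order, to the sample line.
traced=$(printf '%s\n' 'rc$21|rc#21|' |
awk -F '|' 'BEGIN { OFS = "|" }
{
    sub(/\$/, "", $1)   # rule: field1|$|contains||
    sub(/#/, "", $2)    # rule: field2|#|contains||
    print
}')
printf '%s\n' "$traced"
```

This prints rc21|rc21| for the sample line, which is why the script cannot produce the rc21|rc22| shown in your Output file.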

Anyway, play around with the following to see how it works:
Code:
#!/bin/ksh
rejectfile="Reject"
awk -F "|" -v rejf="$rejectfile" 'BEGIN {OFS = "|"}
FNR==NR{
if(debug)printf("# rules record read: %s\n", $0)
        if(FNR == 1) next #skip Rules file header.
        ruleF[++rc] = $1
        ruleS[rc] = $2
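        # Wrap every search character in a collating symbol (e.g. $ becomes
        # [[.$.]]) so that ERE special characters are matched literally.
        cnt = gsub(/./, "[[.&.]]", ruleS[rc])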
        ruleC[rc] = $3
        ruleR[rc] = $4
        gsub(/\\/, "\\", ruleR[rc]);
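        # Escape each & in the replacement so sub() inserts it literally
        # instead of treating it as the matched text.
        gsub(/&/, "\\\\&", ruleR[rc])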
if(debug)printf("ruleF[%d]=%s, ruleS[%d]=%s (%d elements), ruleC[%d]=%s, ruleR[%d]=\"%s\"\n",
rc,ruleF[rc],rc,ruleS[rc],cnt,rc,ruleC[rc],rc,ruleR[rc])
        next
}
FNR==1{ # Process input file header
if(debug)printf("@ input header read:\n")
        for(i = 1; i <= NF; i++) {
                mF[$i] = i
if(debug)printf("@ mF[%s]=%d\n",$i,i)
        }
        fc = NF
        print
        print > rejf
        next
}
{       cc = 0 # of changes made to this line
        o0 = $0
if(debug)printf("@ input record read: %s\n", $0);
        for(i = 1; i <= rc; i++) {
if(debug)printf("@ f:%s(%d): s/%s/%s/\n", ruleF[i], mF[ruleF[i]], ruleS[i], ruleR[i])
                if((cnt = sub(ruleS[i], ruleR[i], $mF[ruleF[i]]))) {
                        cc += cnt
if(debug)printf("@ %s changed to \"%s\"\n", ruleF[i], $mF[ruleF[i]])
                }
        }
        if(cc) print o0 > rejf
        print
}' Rules Input > Output

Note that if you change the line:
Code:
awk -F "|" -v rejf="$rejectfile" 'BEGIN {OFS = "|"}

to:
Code:
awk -F "|" -v rejf="$rejectfile" 'BEGIN {OFS = "|"; debug = 1}

you'll get lots of debugging data in Output showing how it evaluates input lines, how it transforms search and replace patterns into extended regular expressions and replacement patterns, respectively, and which rules cause transformations of input fields.

If you want to make "Equals" behave as you described it (instead of as your expected Output file contents demonstrate), you just need to add ERE anchoring characters to the start and end of "ruleS[x]" after the gsub() call converts each character to be matched into its corresponding collating symbol matching expression (which is used to avoid having "special" characters in EREs being treated specially).
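
As an alternative to anchoring the ERE, the described Equals/Contains distinction can also be checked with plain string operations. The following is only a sketch of that swapped-in technique, not a change to the script above; rule_matches is a hypothetical helper name, and the three sample lines (value|condition|search) are made up from the thread's data. The condition is compared case-insensitively so that "Equals" and "equals" behave the same:

```shell
#!/bin/sh
# Sketch: decide whether a rule applies without building any regular expression.
matches=$(printf '%s\n' 'XX|Equals|XX' 'XX31|Equals|XX' 'rc$21|contains|$' |
awk -F '|' '
function rule_matches(value, search, cond) {
    cond = tolower(cond)
    if (cond == "equals")                     # whole field must be identical
        return ((value "") == (search ""))    # "" forces a string comparison
    if (cond == "contains")                   # search string appears somewhere
        return (index(value, search) > 0)
    return 0                                  # unknown condition: no match
}
{ print $1 ": " (rule_matches($1, $3, $2) ? "match" : "no match") }')
printf '%s\n' "$matches"
```

With this definition XX matches the Equals rule but XX31 does not, which is the behavior you described rather than the behavior your expected Output file shows.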
# 5  
Old 10-24-2012
Don,

Thank you very much for your valuable input and comments! I appreciate all your help!
I have a slight variation in the rules file, where there is a rule for a field that does not exist in the input, as well as in the 'Contains' requirement, as below:

Rules file:

Code:
field|search|condition|replace|
field1|XX|Equals|rc
field1|$|Contains|  
field2|#|Contains|
field2|!|Contains|c
field2|yy|Equals|rc
fieldx|{{|Equals|

Input file:
Code:
field1|field2|field3|
rc11|rc12|xxx|
rc$$$21|rc#21|yyy|
XX|yy|fff|
rc41|r!42||

I am getting error due to the extra Rule
Code:
fieldx|{{|Equals|

, for which fieldx does not exist in the Input file.
Error message: "awk: Field is not correct. The input line number is 2. "
Appreciate your help to handle this.

After I deleted the "fieldx|{{|Equals|" record from Rules file, I am getting below output:
Code:
field1|field2|field3|
rc11|rc12|xxx|
rc$$21|rc21|yyy|
rc|rc|fff|
rc41|rc42||

But, the expected output I need is :
Code:
rc11|rc12|xxx|
rc21|rc21|yyy|     ---> the field1 has '$$$' each '$' sign in this field to be replaced with ''.
rc|rc|fff|
rc41|rc42||

The Rule for "fieldx" can be reported into Reject file saying it did not exist in input.
Any help is highly appreciated in dealing with these 2 scenarios.

Thank you!
Chand.
# 6  
Old 10-25-2012
Chand,
I have rewritten the script to process multiple input files and allow fields specified in the rules file to be skipped if the field named in a rule does not appear as an input file's field header. (If this happens, a note will be included in the reject file as you requested stating that a rule is invalid.)
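
Since the attached script is not reproduced in-line, here is a rough sketch of how the check for a missing field could look. Everything in it is an assumption made for illustration: the temporary file names, the array names (borrowed from the earlier script), and the wording of the note are not taken from the attachment.

```shell
#!/bin/sh
# Sketch: report rules whose field name does not appear in the input header.
tmp=$(mktemp -d) || exit 1
printf '%s\n' 'field|search|condition|replace|' \
    'field1|XX|Equals|rc' 'fieldx|{{|Equals|' > "$tmp/Rules"
printf '%s\n' 'field1|field2|field3|' 'XX|yy|fff|' > "$tmp/Input"
notes=$(awk -F '|' '
FNR == NR {                          # first file: remember each rule field name
    if (FNR > 1) ruleF[++rc] = $1
    next
}
FNR == 1 {                           # second file: map header names to columns
    for (i = 1; i <= NF; i++)
        if ($i != "") mF[$i] = i
    for (i = 1; i <= rc; i++)
        if (!(ruleF[i] in mF))       # no such column: skip and report the rule
            printf("rule %d skipped: field \"%s\" not found in header\n", i, ruleF[i])
    exit
}' "$tmp/Rules" "$tmp/Input")
rm -rf "$tmp"
printf '%s\n' "$notes"
```

Testing membership with "in" before using mF[...] as a field number avoids evaluating $ on an empty index.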

Making the script replace every occurrence of a search string rather than just the first occurrence was done by just changing a call to sub() to be a call to gsub(). I have expanded the shell portion of the script to support several options and provide a built-in man page. The in-line comments explaining what the script does have also been expanded in hopes that you will be able to make further enhancements yourself.
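
The difference between the two calls can be seen in isolation with the field value from your example. This is just a small demonstration, with the variable names first and all chosen for illustration:

```shell
#!/bin/sh
# Sketch: sub() replaces only the first match, gsub() replaces every match.
first=$(printf '%s\n' 'rc$$$21' | awk '{ sub(/\$/, ""); print }')
all=$(printf '%s\n' 'rc$$$21' | awk '{ gsub(/\$/, ""); print }')
printf 'sub:  %s\ngsub: %s\n' "$first" "$all"
```

The sub() call leaves rc$$21 (what you saw) and the gsub() call leaves rc21 (what you expected).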

Note that the rule that you said should convert the input line rc$$$21|rc#21|yyy| to rc21|rc#21|yyy| won't do that, because that rule in your rules file has two spaces in the replace field. Therefore, applying that rule to every occurrence that matches will instead produce the output:
Code:
rc      21|rc#21|yyy|

Because the script is so large now, I have attached it rather than including it in-line here. The name of the script is substitute, but to upload it I had to use the name substitute.sh. It is written as a Korn shell script that calls awk. You should be able to use a Bourne shell or bash as well as a Korn shell if you just change the first line of the script from #!/bin/ksh to the path to your shell. (However, it won't work with csh or any of its variants.)
If you are on a Solaris system, use nawk or /usr/xpg4/bin/awk instead of awk.
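
If the same script has to run on several systems, one possible way to pick the awk automatically is sketched below. The variable name AWK is an assumption for illustration; the attached script does not necessarily do this.

```shell
#!/bin/sh
# Sketch: choose which awk to call based on the operating system name.
case $(uname -s) in
SunOS) AWK=/usr/xpg4/bin/awk ;;   # Solaris: nawk would also do
*)     AWK=awk ;;
esac
"$AWK" 'BEGIN { print "awk is working" }'
```

The script would then call "$AWK" wherever it currently calls awk.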

I hope this helps,
Don
# 7  
Old 10-29-2012
Don,
Thank you very much.
I made some changes, and it's working fine for English character data. But when the input file contains international language data, I get the error below:
Code:
FNR=2) fatal: Invalid collation character: /[[.".]][[.¿.]][[.".]]/

Appreciate your help regarding this.
Thanks
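
One possible way around this kind of error, assuming it comes from the [[.x.]] expressions built for each search character, is to avoid regular expressions entirely and replace the search string literally with index() and substr(). This is only a sketch: replace_all is a hypothetical helper, the two rules are hard-coded, and it has not been tried against the attached script or the international data.

```shell
#!/bin/sh
# Sketch: literal (non-regex) replace-all, so no collating symbols are needed.
fixed=$(printf '%s\n' 'rc$$$21|rc#21|yyy|' |
awk -F '|' '
function replace_all(s, find, repl,    out, p) {
    if (find == "") return s                 # nothing to search for
    out = ""
    while ((p = index(s, find)) > 0) {       # position of next literal match
        out = out substr(s, 1, p - 1) repl   # keep text before it, add repl
        s = substr(s, p + length(find))      # continue after the match
    }
    return out s
}
BEGIN { OFS = "|" }
{
    $1 = replace_all($1, "$", "")            # field1|$|Contains|
    $2 = replace_all($2, "#", "")            # field2|#|Contains|
    print
}')
printf '%s\n' "$fixed"
```

For the sample line this prints rc21|rc21|yyy|. Because nothing in the search or replace text is interpreted as a pattern, neither needs escaping.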
 