Shell script for search and replace by field | Unix Linux Forums | UNIX for Dummies Questions & Answers

  Go Back    


UNIX for Dummies Questions & Answers If you're not sure where to post a UNIX or Linux question, post it here. All UNIX and Linux newbies welcome !!

Shell script for search and replace by field

UNIX for Dummies Questions & Answers


Tags
search and replace, unox script

Closed Thread    
 
Thread Tools Search this Thread Display Modes
    #1  
Old 09-30-2012
chandrath chandrath is offline
Registered User
 
Join Date: Sep 2012
Last Activity: 17 November 2012, 5:43 PM EST
Posts: 7
Thanks: 1
Thanked 0 Times in 0 Posts
Shell script for search and replace by field

Hi,
I have an input file with below data and rules file to apply search and replace by each field in the input based on exact value or pattern.
Could you please help me with unix script to read input file and rules file and then create the output and reject files based on the rules file.
Input file:

Code:
field1  field2
rc11    rc12
rc$21  rc#22
XX31   yy32
rc41    r!42

rules file:

Code:
field   search   condition   replace
field1 XX         Equals      rc
field1  $          contains   
field2  #          contains
field2  !           contauns  c
field2 yy          equals      rc

Output

Code:
field1 field2
rc11   rc12
rc21   rc22
rc31  rc32
rc41   rc42

reject file:

Code:
field1 field2
rc$21 rc#22
XX31 yy32
rc41 r!42

Appreciate your help with this.

Thank you!

Last edited by Scrutinizer; 09-30-2012 at 03:48 PM.. Reason: code tags
Sponsored Links
    #2  
Old 09-30-2012
Don Cragun's Avatar
Don Cragun Don Cragun is offline Forum Staff  
Moderator
 
Join Date: Jul 2012
Last Activity: 16 April 2014, 2:59 PM EDT
Location: San Jose, CA, USA
Posts: 3,443
Thanks: 140
Thanked 1,188 Times in 1,007 Posts
Please use code tags when you post examples of file contents and program fragments!

What are the field separators in your input files and in your rules file? What field separator do you want in the output and reject files?

Am I correct in assuming that the reject file is supposed to contain the original contents of every line that was changed by one or more rules in the rules file while producing the output file?

What is the distinction between the conditions "Equals" (or "equals") and "contains" (or "contauns")? It looks like if the character(s) found in the search field in the rules file are found in the field specified by the field named in the rules file for either condition, the content of the replace field in the corresponding line in the rules file replaces what was matched by the search field.

Am I correct in assuming that the "contauns" was a typo?

Is the contents of the condition column supposed to be case insensitive or was the "Equals" also a typo?
Sponsored Links
    #3  
Old 10-05-2012
chandrath chandrath is offline
Registered User
 
Join Date: Sep 2012
Last Activity: 17 November 2012, 5:43 PM EST
Posts: 7
Thanks: 1
Thanked 0 Times in 0 Posts
Don,
- The field separator is pipe (|). My apologies for not putting that in the initial post.
- Yes, The reject file is supposed to contain the original contents of every line that was changed by one or more rules in the rules file while producing the output file.
- My apologies for typos, It will have 'Contains' or 'Equals' type condition no other conditions.
-Yes, "contauns" was a typo.
-As I mentioned earlier, it is safe to assume only two conditions 'Contains' or Equals'. The example input given in my earlier post has typos as you pointed out.
-The distinction between two conditions:
1. 'Equals' is used for exact string match. (example, if field1 value equals XX, then replace it with rc).
2. 'Contains' is used for pattern match (example, if field1 value contains $, replace it with <blank>, example, field1 value of rc$21 will become rc21 as '$' gets replaced with '' as field one contained '$'.
Appreciate all your help with any unix script solution to this.

---------- Post updated at 11:15 PM ---------- Previous update was at 11:02 PM ----------

Also I updated the files with field/line delimiters:
Input:

Code:
field1|field2|
rc11|rc12| 
rc$21|rc#21|
XX31|yy32|
rc41|r!42|

Rules:

Code:
field|search|condition|replace|
field1|XX|Equals|rc|
field1|$|contains||   
field2|#|contains||
field2|!|contains|c|
field2|yy|Equals|rc|

Output:

Code:
field1|field2|
rc11|rc12|
rc21|rc22|
rc31|rc32|
rc41|rc42|

Rejects:

Code:
field1|field2|
rc$21|rc#22|
XX31|yy32|
rc41|r!42|

Thank you!

Moderator's Comments:
Please use code tags next time for your code and data.

Last edited by radoulov; 10-05-2012 at 04:57 AM..
    #4  
Old 10-05-2012
Don Cragun's Avatar
Don Cragun Don Cragun is offline Forum Staff  
Moderator
 
Join Date: Jul 2012
Last Activity: 16 April 2014, 2:59 PM EDT
Location: San Jose, CA, USA
Posts: 3,443
Thanks: 140
Thanked 1,188 Times in 1,007 Posts
Quote:
Originally Posted by chandrath View Post
Don,
- The field separator is pipe (|). My apologies for not putting that in the initial post.
- Yes, The reject file is supposed to contain the original contents of every line that was changed by one or more rules in the rules file while producing the output file.
- My apologies for typos, It will have 'Contains' or 'Equals' type condition no other conditions.
-Yes, "contauns" was a typo.
-As I mentioned earlier, it is safe to assume only two conditions 'Contains' or Equals'. The example input given in my earlier post has typos as you pointed out.
-The distinction between two conditions:
1. 'Equals' is used for exact string match. (example, if field1 value equals XX, then replace it with rc).
2. 'Contains' is used for pattern match (example, if field1 value contains $, replace it with <blank>, example, field1 value of rc$21 will become rc21 as '$' gets replaced with '' as field one contained '$'.
Appreciate all your help with any unix script solution to this.

---------- Post updated at 11:15 PM ---------- Previous update was at 11:02 PM ----------

Also I updated the files with field/line delimiters:
Input:
field1|field2|
rc11|rc12|
rc$21|rc#21|
XX31|yy32|
rc41|r!42|

Rules:
field|search|condition|replace|
field1|XX|Equals|rc|
field1|$|contains||
field2|#|contains||
field2|!|contains|c|
field2|yy|Equals|rc|

Output:
field1|field2|
rc11|rc12|
rc21|rc22|
rc31|rc32|
rc41|rc42|

Rejects:
field1|field2|
rc$21|rc#22|
XX31|yy32|
rc41|r!42|

Thank you!
OK. Let me try again. (And, PLEASE use code tags surrounding the contents of your input and output files.)

I do not see any difference in your sample output between condition Equals and condition Contains . If the string listed in the 2nd field in the Rules file appears in the field in the Input file field with the heading named by the first column in your Rules file, that string is replaced by the string in the 4th field in your Rules file. You said:
Quote:
-The distinction between two conditions:
1. 'Equals' is used for exact string match. (example, if field1 value equals XX, then replace it with rc).
2. 'Contains' is used for pattern match (example, if field1 value contains $, replace it with <blank>, example, field1 value of rc$21 will become rc21 as '$' gets replaced with '' as field one contained '$'.
but when field1 is XX31 (which is not equal to XX ), your desired output changed the XX to rc anyway??? And you say that 'Contains' is a "pattern match", but don't define what pattern matching rules are to be used. (Is it shell pattern matching, filename pattern matching, basic regular expression matching, extended regular expression matching, or something else?) In the possible solution below, I assume that anytime the string in the 2nd field in the Rules file is found in the specified field in the Input file it will be replaced by the string in the 4th field in the Rules file. This matches the behavior shown given your Input file, Rules file, and Output file even though it doesn't match your description. Since your examples do not show any difference in the expected output between Equals and Contains , the possible solution below ignores the 3rd field in the Rules file.

You say that only Equals and Contains appear in the 3rd field in the Rules file. But, your sample Rules file 3rd field is contains on three lines and is never Contains (with an upper case C). But, since the possible solution below ignores the 3rd field in the Rules file, it doesn't make any difference.

The following produces the Output file you specify when given the Input and Rules files you specified, except for two issues:
  1. you have a <space> character at the end of the line:
    Code:
    rc11|rc12|

in Input , but there is no space at the end of the corresponding line in the Output file, and
  • the line in your Input file:
    Code:
    rc$21|rc#21|

    is transformed into:
    Code:
    rc21|rc#21|

    and then into:
    Code:
    rc21|rc21|

    by the rules:
    Code:
    field1|$|contains||   
        and
    field2|#|contains||

    but your Output file shows:
    Code:
    rc21|rc22|

    instead.
  • There are corresponding differences in what this script produces in the Reject file compared to what you said should appear in the Reject file.

    Anyway, play around with the following to see how it works:

    Code:
    #!/bin/ksh
    rejectfile="Reject"
    awk -F "|" -v rejf="$rejectfile" 'BEGIN {OFS = "|"}
    FNR==NR{
    if(debug)printf("# rules record read: %s\n", $0)
            if(FNR == 1) next #skip Rules file header.
            ruleF[++rc] = $1
            ruleS[rc] = $2
            cnt = gsub(/./, "[[.&.]]", ruleS[rc])
            ruleC[rc] = $3
            ruleR[rc] = $4
            gsub(/\\/, "\\", ruleR[rc]);
            gsub(/[.&.]/, "\\\\&", ruleR[rc])
    if(debug)printf("ruleF[%d]=%s, ruleS[%d]=%s (%d elements), ruleC[%d]=%s, ruleR[%d]=\"%s\"\n",
    rc,ruleF[rc],rc,ruleS[rc],cnt,rc,ruleC[rc],rc,ruleR[rc])
            next
    }
    FNR==1{ # Process input file header
    if(debug)printf("@ input header read:\n")
            for(i = 1; i <= NF; i++) {
                    mF[$i] = i
    if(debug)printf("@ mF[%s]=%d\n",$i,i)
            }
            fc = NF
            print
            print > rejf
            next
    }
    {       cc = 0 # of changes made to this line
            o0 = $0
    if(debug)printf("@ input record read: %s\n", $0);
            for(i = 1; i <= rc; i++) {
    if(debug)printf("@ f:%s(%d): s/%s/%s/\n", ruleF[i], mF[ruleF[i]], ruleS[i], ruleR[i])
                    if((cnt = sub(ruleS[i], ruleR[i], $mF[ruleF[i]]))) {
                            cc += cnt
    if(debug)printf("@ %s changed to \"%s\"\n", ruleF[i], $mF[ruleF[i]])
                    }
            }
            if(cc) print o0 > rejf
            print
    }' Rules Input > Output

    Note that if you change the line:

    Code:
    awk -F "|" -v rejf="$rejectfile" 'BEGIN {OFS = "|"}

    to:

    Code:
    awk -F "|" -v rejf="$rejectfile" 'BEGIN {OFS = "|"; debug = 1}

    you'll get lots of debugging data in Outfile showing how it evaluates input lines, how it transforms search and replace patterns into extended regular expressions and replacement patterns, respectively, and which rules cause transformations of input fields.

    If you want to make "Equals" behave as you described it (instead of as your expected Output file contents demonstrate, you just need to add ERE anchoring chracters to the start and end of "ruleS[x]" after the gsub() call converts each character to be matched into its corresponding collating symbol matching expression (which is used to avoid having "special" characters in EREs being treated specially).
    Sponsored Links
        #5  
    Old 10-23-2012
    chandrath chandrath is offline
    Registered User
     
    Join Date: Sep 2012
    Last Activity: 17 November 2012, 5:43 PM EST
    Posts: 7
    Thanks: 1
    Thanked 0 Times in 0 Posts
    Don,

    Thank you very much for your valuable inputs and comments!. Appreciate all your help!
    I had slight variation in the rules file when there is a rule for a field that do not exist in input, as well the 'Contains' requirement as below:

    Rules file:


    Code:
    field|search|condition|replace|
    field1|XX|Equals|rc
    field1|$|Contains|  
    field2|#|Contains|
    field2|!|Contains|c
    field2|yy|Equals|rc
    fieldx|{{|Equals|

    Input file:

    Code:
    field1|field2|field3|
    rc11|rc12|xxx|
    rc$$$21|rc#21|yyy|
    XX|yy|fff|
    rc41|r!42||

    I am getting error due to the extra Rule
    Code:
    fieldx|{{|Equals|

    , for which fieldx does not exist in Inpurt file.
    Error message: "awk: Field is not correct. The input line number is 2. "
    Appreciate your help to handle this.

    After I deleted the "fieldx|{{|Equals|" record from Rules file, I am getting below output:

    Code:
    field1|field2|field3|
    rc11|rc12|xxx|
    rc$$21|rc21|yyy|
    rc|rc|fff|
    rc41|rc42||

    But, the expected output I need is :

    Code:
    rc11|rc12|xxx|
    rc21|rc21|yyy|     ---> the field1 has '$$$' each '$' sign in this field to be replaced with ''.
    rc|rc|fff|
    rc41|rc42||

    The Rule for "fieldx" can be reported into Reject file saying it did not exist in input.
    Any help is highly appreciated in dealing with these 2 scenarios.

    Thank you!
    Chand.
    Sponsored Links
        #6  
    Old 10-25-2012
    Don Cragun's Avatar
    Don Cragun Don Cragun is offline Forum Staff  
    Moderator
     
    Join Date: Jul 2012
    Last Activity: 16 April 2014, 2:59 PM EDT
    Location: San Jose, CA, USA
    Posts: 3,443
    Thanks: 140
    Thanked 1,188 Times in 1,007 Posts
    Quote:
    Originally Posted by chandrath View Post
    Don,

    Thank you very much for your valuable inputs and comments!. Appreciate all your help!
    I had slight variation in the rules file when there is a rule for a field that do not exist in input, as well the 'Contains' requirement as below:

    Rules file:


    Code:
    field|search|condition|replace|
    field1|XX|Equals|rc
    field1|$|Contains|  
    field2|#|Contains|
    field2|!|Contains|c
    field2|yy|Equals|rc
    fieldx|{{|Equals|

    Input file:

    Code:
    field1|field2|field3|
    rc11|rc12|xxx|
    rc$$$21|rc#21|yyy|
    XX|yy|fff|
    rc41|r!42||

    I am getting error due to the extra Rule
    Code:
    fieldx|{{|Equals|

    , for which fieldx does not exist in Inpurt file.
    Error message: "awk: Field is not correct. The input line number is 2. "
    Appreciate your help to handle this.

    After I deleted the "fieldx|{{|Equals|" record from Rules file, I am getting below output:

    Code:
    field1|field2|field3|
    rc11|rc12|xxx|
    rc$$21|rc21|yyy|
    rc|rc|fff|
    rc41|rc42||

    But, the expected output I need is :

    Code:
    rc11|rc12|xxx|
    rc21|rc21|yyy|     ---> the field1 has '$$$' each '$' sign in this field to be replaced with ''.
    rc|rc|fff|
    rc41|rc42||

    The Rule for "fieldx" can be reported into Reject file saying it did not exist in input.
    Any help is highly appreciated in dealing with these 2 scenarios.

    Thank you!
    Chand.
    Chand,
    I have rewritten the script to process multiple input files and allow fields specified in the rules file to be skipped if the field named in a rule does not appear as an input file's field header. (If this happens, a note will be included in the reject file as you requested stating that a rule is invalid.)

    Making the script replace every occurrence of a search string rather than just the first occurrence was done by just changing a call to sub() to be a call to gsub(). I have expanded the shell portion of the script to support several options and provide a built-in man page. The in-line comments explaining what the script does have also been expanded in hopes that you will be able to make further enhancements yourself.

    Note that the rule that you said should convert the input line: rc$$$21|rc#21|yyy| to: rc21|rc#21|yyy| , it won't do that because the rule in you rules file has two spaces in the replace field. Therefore, the result of applying that rule to every occurrence that matches will instead produce the output:

    Code:
    rc      21|rc#21|yyy|

    Because the script is so large now, I have attached it rather than including it in-line here. The name of the script is substitute , but to upload it I had to use the name substitute.sh . It is written as a Korn shell script that calls awk. You should be able to use a Bourne shell, or bash, as well as a Korn shell if you just change the first line of the script from #!/bin/ksh to the path to your shell. (However, it won't work with csh or any of its variants.)
    If you are on a Solaris system, use nawk or /usr/xpg4/bin/awk instead of awk .

    I hope this helps,
    Don
    Attached Files
    File Type: sh substitute.sh (6.1 KB, 14 views)
    Sponsored Links
        #7  
    Old 10-29-2012
    chandrath chandrath is offline
    Registered User
     
    Join Date: Sep 2012
    Last Activity: 17 November 2012, 5:43 PM EST
    Posts: 7
    Thanks: 1
    Thanked 0 Times in 0 Posts
    Don,
    Thank you very much.
    I did some changes, and it's working fine for English char data. But, when I have international language data in the input file, getting into below error:

    Code:
    FNR=2) fatal: Invalid collation character: /[[.".]][[..]][[.".]]/

    Appreciate your help regarding this.
    Thanks
    Sponsored Links
    Closed Thread

    Thread Tools Search this Thread
    Search this Thread:

    Advanced Search
    Display Modes

    More UNIX and Linux Forum Topics You Might Find Helpful
    Thread Thread Starter Forum Replies Last Post
    Search and replace field? sabercats Shell Programming and Scripting 5 06-29-2012 03:01 AM
    Search duplicate field and replace one of them with new value ezhil01 Shell Programming and Scripting 3 04-26-2012 10:23 AM
    Perl - search and replace a particular field ramkrix Shell Programming and Scripting 1 04-27-2011 08:43 AM
    awk search and replace field bluejayek Shell Programming and Scripting 2 08-09-2010 02:53 PM
    search and replace dynamic data in a shell script csejl Shell Programming and Scripting 8 10-21-2003 10:33 PM



    All times are GMT -4. The time now is 06:17 PM.