Shell script for search and replace by field


 
# 8  
Old 10-29-2012
If you want to process REs for characters that are not in the portable character set defined by the C and POSIX standards, you have to be sure that the characters you're looking for exist in your current locale. The character '¿' is not in that portable character set. You probably want a Spanish or Portuguese locale if that character is present in your data.

You haven't said anything about what system you're using, but I probably can't help here. You need to talk to your system administrator to find out what locales have been installed on your system that support the languages and character sets you're using in your data. (Note that there is nothing wrong with the awk script; the problem is the environment in which it is being run.)

To find out what locale you're using, run the command: locale
If the setting of LC_COLLATE reported is something like LC_COLLATE="en_US.UTF-8", '¿' is probably not a collating element in your current locale.
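
For reference, the collation setting a utility ends up with is chosen from the environment in a fixed order: a non-empty LC_ALL wins, then a non-empty LC_COLLATE, then a non-empty LANG. The sketch below only mirrors that lookup order. The function name effective_collate is made up for illustration, and the final fallback to C is an assumption, since the real default is implementation-defined.

```shell
#!/bin/sh
# effective_collate is a hypothetical helper, not a standard utility.
# It mirrors the POSIX lookup order for the collation category:
# non-empty LC_ALL first, then non-empty LC_COLLATE, then non-empty LANG.
effective_collate() {
    # Falling back to "C" is an assumption made for this sketch; the
    # real default locale is implementation-defined.
    printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

# Show the value that applies in the current session.
effective_collate
```

If this prints en_US.UTF-8 while LC_COLLATE itself is not set, the value is coming from LANG (or from LC_ALL, if that is set).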

Note also that I am not offering to localize the error messages from this awk script for you to produce diagnostic messages in anything other than English.
# 9  
Old 10-30-2012
Thanks Don!

My system Locale :
Code:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Is there a way to override the LC_COLLATE locale setting in the script (or in my session only) so that it supports all international languages? My source data contains multiple languages.

Appreciate all your help.
Thanks,
# 10  
Old 10-30-2012
There is no locale that applies to local conventions around the world (because, by definition, local conventions such as the spelling of calendar months vary from location to location). I suppose it is theoretically possible to have an incomplete locale whose character type and collation tables contain all of the characters in UTF-8. Assuming that your system administrator had created and installed such an incomplete locale on your system as utf8.UTF-8, you could then invoke your script by using the command:
Code:
LANG=en_US.UTF-8 LC_COLLATE=utf8.UTF-8 LC_CTYPE=utf8.UTF-8 substitute [options] input_data_file...

instead of the command:
Code:
substitute [options] input_data_file...

Note that the setting of LC_COLLATE will affect everything in your script that compares characters, such as the sort utility (other than just testing whether two strings are the same or different). LC_CTYPE will affect everything in your script that tries to determine whether a character is in a certain class (such as uppercase letter, lowercase letter, alphanumeric, ...). So setting up an LC_CTYPE category for this locale that works correctly when used with other categories from other locales might not be a trivial task.
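
As a small illustration of a per-command override, the sketch below sorts the same four lines with LC_COLLATE forced to C for a single sort invocation. The C locale is used here only because every system has it; utf8.UTF-8 remains a hypothetical locale. LC_ALL is set to empty on the same command because a non-empty LC_ALL would take precedence over LC_COLLATE.

```shell
#!/bin/sh
# Remember the session's own setting so we can show it is untouched.
before=${LC_COLLATE-unset}

# The assignments in front of sort apply to that one command only.
# In the C locale, collation is by byte value, so uppercase ASCII
# letters sort before lowercase ones.
sorted=$(printf '%s\n' b A a B | LC_ALL= LC_COLLATE=C sort)
printf '%s\n' "$sorted"

after=${LC_COLLATE-unset}
```

The same prefix form works for any command, including a script such as substitute, and it leaves the settings of the calling shell alone.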

Good luck,
Don
# 11  
Old 11-01-2012
Don,

Thanks for your inputs.
One request for help on the 'Equals' rule: it should replace the value only when the source field value matches the rule's search value exactly. Sorry, the output example I gave earlier for Equals was wrong.

Input:
Code:
field1|field2|field3
rc11|rc12|rec1
rc$$$21|rc#21|rec2
x|y|rec3
xx11|yy11|rec$4
rc41|r!42|rec5

rules:
Code:
field|search|condition|replace
field1|x|Equals|rc
field1|$|Contains|
field2|#|Contains|
field2|!|Contains|c
field2|y|Equals|rc
fieldx|{{|Equals|

Output:
Code:
field1|field2|field3
rc11|rc12|rec1
rc21|rc21|rec2
rc|rc|rec3
xx11|yy11|rec$4        --> not altered due to equality rules.
rc41|rc42|rec5

Could you please suggest the changes needed so that an Equals rule replaces the whole field value, and only when the whole field matches?

Thank you!
# 12  
Old 11-02-2012
You are a little vague on what is supposed to happen. If what you mean is that a rule whose "condition" field value is "Equals" should only be applied when the "search" field value matches the entire contents of the field named by the "field" field value, then change the following line in the script I gave you:
Code:
  ruleC[rc] = $3  # Save condition (even though it is not used

to:
Code:
  if($3 == "Equals") ruleS[rc] = "^" ruleS[rc] "$"
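
To show what the anchoring does, here is a self-contained sketch that applies the sample rules to the sample input. It is a hypothetical stand-in written for this illustration, not the script posted earlier in this thread: the names apply_rules, literal, pat and repl are invented, and it assumes replacement text never contains & or a backslash.

```shell
#!/bin/sh
# Hypothetical stand-in, NOT the script posted earlier in this thread.
apply_rules() {   # usage: apply_rules rules_file data_file
    awk -F'|' -v OFS='|' '
    # Build an ERE that matches string s literally: each character goes
    # into its own bracket expression, except ^ and \ which are escaped.
    function literal(s,    i, c, out) {
        out = ""
        for (i = 1; i <= length(s); i++) {
            c = substr(s, i, 1)
            if (c == "^" || c == "\\") out = out "\\" c
            else out = out "[" c "]"
        }
        return out
    }
    NR == FNR {                      # first file: the rules
        if (FNR > 1) {               # skip the header line
            n++
            name[n] = $1
            pat[n] = literal($2)
            # Equals: anchor the pattern so it must match the whole field.
            if ($3 == "Equals") pat[n] = "^" pat[n] "$"
            repl[n] = $4             # assumed to contain no & or \
        }
        next
    }
    FNR == 1 {                       # data header: field name to column
        for (i = 1; i <= NF; i++) col[$i] = i
        print
        next
    }
    {
        for (r = 1; r <= n; r++)
            if (name[r] in col) gsub(pat[r], repl[r], $(col[name[r]]))
        print
    }' "$1" "$2"
}

dir=$(mktemp -d)
cat > "$dir/rules" <<'EOF'
field|search|condition|replace
field1|x|Equals|rc
field1|$|Contains|
field2|#|Contains|
field2|!|Contains|c
field2|y|Equals|rc
fieldx|{{|Equals|
EOF
cat > "$dir/data" <<'EOF'
field1|field2|field3
rc11|rc12|rec1
rc$$$21|rc#21|rec2
x|y|rec3
xx11|yy11|rec$4
rc41|r!42|rec5
EOF
result=$(apply_rules "$dir/rules" "$dir/data")
printf '%s\n' "$result"
rm -rf "$dir"
```

With the anchors in place, the x and y fields of the rec3 line become rc, while xx11 and yy11 are left alone because the pattern no longer matches part of a field.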

# 13  
Old 11-17-2012
Don,
Appreciate all your help!
Is there a way to improve performance of this script? Currently it's taking 1hr to process 200,000 source records with around 1000 records in rules file.

Thank you!
# 14  
Old 11-17-2012
There are lots of ways to improve the performance of this script. Most of them involve reducing the tremendous flexibility that this script provides.
  1. Have different rules files for different input files. Use field numbers instead of field names in the rules files (getting rid of a level of indirection in every rule processed on every input line). Don't have rules in your rules file for fields that don't exist in your input file (every rule has to be evaluated for every line, even if the rule can never match, since it may apply in a later input file). Use a single character to indicate Equals and Contains (less data to process and faster comparisons).
  2. Restrict the set of characters that can appear in (or be matched and replaced by) rules. You could then use regular characters instead of character class expressions, which might evaluate faster.
  3. Comment out or remove the debugging stuff from the script. (But only do this after you have finished changing the script forever and you are sure how it works.) Better would be to keep the debugging stuff in your source script, but strip out the debugging tests in a version you use for production runs.
  4. Get rid of the reject file and just log errors instead of logging all changes.
  5. Rewrite it in C (or some other language that is compiled rather than interpreted).
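
As a rough sketch of the first suggestion, the rules file can be specialized for one input file ahead of time: field names become field numbers, rules for fields that are not in the input are dropped, and the condition is cut down to a single character. The function name compact_rules and the compact output format are assumptions made for this illustration; the main script would have to be changed to read that format.

```shell
#!/bin/sh
# Hypothetical preprocessing step; the main script would have to be
# changed to read this compact rules format.
compact_rules() {   # usage: compact_rules rules_file data_file
    head -n 1 "$2" | awk -F'|' -v OFS='|' '
    NR == FNR {                      # stdin: header line of the data file
        for (i = 1; i <= NF; i++) col[$i] = i
        next
    }
    FNR == 1 { print; next }         # keep the rules header as is
    ($1 in col) {                    # rules for absent fields are dropped
        $1 = col[$1]                 # field name becomes field number
        $3 = substr($3, 1, 1)        # Equals becomes E, Contains becomes C
        print
    }' - "$1"
}

dir=$(mktemp -d)
cat > "$dir/rules" <<'EOF'
field|search|condition|replace
field1|x|Equals|rc
field1|$|Contains|
field2|#|Contains|
field2|!|Contains|c
field2|y|Equals|rc
fieldx|{{|Equals|
EOF
printf '%s\n' 'field1|field2|field3' > "$dir/data"
compact=$(compact_rules "$dir/rules" "$dir/data")
printf '%s\n' "$compact"
rm -rf "$dir"
```

Doing this once per input file moves the name lookup and the check for unknown fields out of the loop that runs for every rule on every input line.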
 