Replacing 12 columns of one file by second file based on mapping in third file


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Replacing 12 columns of one file by second file based on mapping in third file
# 1  
Old 07-29-2016
Replacing 12 columns of one file by second file based on mapping in third file

i have a real data prod file with 80+ fields containing 1k -2k records. i have to extract say 12 columns out of this which are sensitive fields along with one primary key say SEQ_ID (like DOB,account no, name, SEQ_ID, govtid etc) in a lookup file. i have to replace these sensitive fields in lookup file by mocked data contained in second file say mocked_Store.dat but the order of fields in second file may be different from first file. The mapping of which field number in first file to be replaced by which field no of second file is contained in a third file called mapping.dat. Once the replacement is done, those 12 sensitive fields need to be put back in original prod file. In the end i need, prod file having mocked data on few fields+ lookup file having original fields and mocked fields value. Please help.TIA
# 2  
Old 07-29-2016
How about some decent sample input data (sensitive info concealed), abbreviated if need be, the desired output, and a description of the logics/algorithms that connect the two?
# 3  
Old 07-29-2016
So this is to anonymise the production data I think. Is that right?

Do you need the overwritten details to be unique or could they be the same, or perhaps a simple sequential value, e.g. Name becomes Fnamaaaaaaaa Snamaaaaaaaa through to Fnamzzzzzzzz Snamzzzzzzzz

If you need them to be random then that could probably be done or if you need to generate the random details and be able to reuse them then if we can store them in a file, we might be able to use awk to read two files in and merge the output. It might even be possible with paste

What we will need is some good input data (sanitised of course) and with a fuller description of what/how you need to achieve.

If you could paste your input/output in CODE tags, it will make it easier to read and preserve multiple spaces for fixed width data.



Kind regards,
Robin
This User Gave Thanks to rbatte1 For This Post:
# 4  
Old 07-29-2016
reposting my question with more details as advised:
i have a real data prod file with 80+ fields. Need to replace say 12 columns out of this(based on field#) which are sensitive fields by mocked data contained in second file say mocked_Store.dat row by row. The mapping of which field number in first file to be replaced by which field no of second file is contained in a third file called mapping.dat. In the end i need, rea file having mocked data on given fields and a lookup file having original fields and corresponding mocked fields value. it also needs to have a primary key from prod file <say SEQ_ID, position given in mapping file>. Please help.TIA. Please see expected input/output through below example with sample data.

Input 1: sample prod file:
Code:
col a|SEQ_ID| first_name|last_name|full_name|DOB|col b| col c|Govt_id|col d
value a|100000|vijayendra|kumar|vijayendra kumar|10/101984|value b|value c |AOYUGH9282P|value d
value a1|100003|ravi|kumar|ravi kumar|01/01/1987|value b1|value c1|AOJKUYT0908P|value d1

Input 2 : mocked_store.dat:(containing mocked data)
Code:
DOB|full_name|Govt_id|first_name|last_name|
02/02/1981|Meena Kumari|ABCDEF1232F|Meena|Kumari
02/02/1982|Dhyan Chand|ABCD4567M|Dhyan|Chand

Input 3: mapping file:
Code:
Prod file Field number| mocked_store file Field number
3|4  #first_name
4|5 #last_name
5|2 #full_name
6|1#DOB
9|3#Govt_id
2|  #SEQ_ID not to be replaced

OUTPUT1: prod file <actual data replaced by mocked data>
Code:
col a|SEQ_ID| first_name|last_name|full_name|DOB|col b| col c|Govt_id|col d
value a|100000|Meena |kumari|Meena kumari|02/02/1981|value b|value c|ABCDEF1232F|value d
value a1|100003|Dhyan |chand|Dhyan Chand|02/02/1982|value b1|value c1|ABCD4567M|value d1

OUTPUT 2: lookup file < prod data+ mocked up data ,fields in order of mapping file>

Code:
first_name|first_name_mocked|last_name|last_name_mocked|full_name|full_name_mocked|DOB|DOB_mocked|Govt_id|Govt_id_mocked|SEQ_ID

vijayendra|Meena|kumar|Kumari|vijayendra kumar|Meena Kumari|10/10/1984|02/02/1981| AOYUGH9282P| ABCDEF1232F|100000

ravi|Dhyan|kumar|Chand|ravi kumar|Dhyan Chand|01/01/1987|02/02/1982|AOJKUYT0908P| ABCD4567M|100003


Please NOTE: The real data file does not have field headers. The output lookup file data fields can be in any order. currently i am thinking of having it in same order as mapping file.

---------- Post updated at 03:59 AM ---------- Previous update was at 03:49 AM ----------

Thankyou for suggesting edit in the post. yes this is to mask data such that data remains valid but not the real. so we are provided with mock stores which are created by us and is completely fake data. Please share the relevant awk commands to accomplish this. i have added sample data in my example to make clear expected input output.





Moderator's Comments:
Mod Comment Please use code tags as required by forum rules!

Last edited by rbatte1; 07-29-2016 at 06:23 AM.. Reason: RudiC Added code tags. rbatte1 wrapped NOPARSE tags round Input 2 line to prevent emoticon conversion
# 5  
Old 07-29-2016
You connect fields in the prod file with fields in the "mock" file. How are lines (resp. records) in those files connected? By line No.?
# 6  
Old 07-29-2016
Assuming the lines are connected by line No. Try

Code:
awk -F"|" '
FNR == 1        {NoF++
                 next
                }
NoF == 1        {sub (/ *#.*$/, _)
                 TR1[$2] = $1
                 COL[$1]
                 next
                }
NoF == 2        {for (i=1; i<=NF; i++) TR2[FNR,TR1[i]] = $i
                 next
                }

NoF == 3        {for (c in COL)         {printf "%s%s%s%s", $c, OFS, TR2[FNR,c], OFS > "lookupfile"
                                         if (TR2[FNR,c]) $c = TR2[FNR,c]
                                        }
                 printf RS > "lookupfile"
                }

1

' mapfile mockfile OFS="|" prodfile
value a|100000|Meena|Kumari|Meena Kumari|02/02/1981|value b|value c |ABCDEF1232F|value d
value a1|100003|Dhyan|Chand|Dhyan Chand|02/02/1982|value b1|value c1|ABCD4567M|value d1
cat lookupfile 
100000||vijayendra|Meena|kumar|Kumari|vijayendra kumar|Meena Kumari|10/101984|02/02/1981|AOYUGH9282P|ABCDEF1232F|
100003||ravi|Dhyan|kumar|Chand|ravi kumar|Dhyan Chand|01/01/1987|02/02/1982|AOJKUYT0908P|ABCD4567M|

If the lookupfile's structure doesn't suit you, additional measures must be taken.

Last edited by RudiC; 07-29-2016 at 08:22 AM.. Reason: Changed file names
This User Gave Thanks to RudiC For This Post:
# 7  
Old 07-29-2016
Thanks a lot RudiC for the detailed reply. yes we will do the replacement row wise line by line. I put your code in a script create_testdata.ksh and ran it. Below are my observations please:

Code:
 $ cat prodfile
value a|100000|vijayendra|kumar|vijayendra kumar|10/101984|value b|value c |AOYUGH9282P|value d
value a1|100003|ravi|kumar|ravi kumar|01/01/1987|value b1|value c1|AOJKUYT0908P|value d1
value a2|100005|nisha|verma|nisha verma|12/12/1987|value b2|value c2|AOJYGFT345F|value d2
  
 $ cat mockfile
DOB|full_name|Govt_id|first_name|last_name|
02/02/1981|Meena Kumari|ABCDEF1232F|Meena|Kumari|
02/02/1982|Dhyan Chand|ABCD4567M|Dhyan|Chand|
02/02/1983|John Abraham|ABCDEF234M|John|Abrahm|
  
 $ cat mapfile
prodfile field number|store file Field number
3|4  #first_name
4|5 #last_name
5|2 #full_name
6|1#DOB
9|3#Govt_id
2|  #SEQ_ID not to be replaced

output:
Code:
$ ./create_testdata.ksh
value a1|100003|Meena|Kumari|Meena Kumari|02/02/1981|value b1|value c1|ABCDEF1232F|value d1
value a2|100005|Dhyan|Chand|Dhyan Chand|02/02/1982|value b2|value c2|ABCD4567M|value d2
  
 $ cat lookupfile
kumar|Kumari|ravi kumar|Meena Kumari|01/01/1987|02/02/1981|AOJKUYT0908P|ABCDEF1232F|100003||ravi|Meena|
verma|Chand|nisha verma|Dhyan Chand|12/12/1987|02/02/1982|AOJYGFT345F|ABCD4567M|100005||nisha|Dhyan|

Mostly it ran perfect except few minors please:
  1. we are missing first line of prod file always
  2. the first line of mockfile is replacing second line of prodfile & so on not row wise< if we remove the header in mockfile the order becomes correct but the header is needed in this file>
  3. Lookup file fields are in same order as map file except the first field<first_name> which is coming in end in place of begining.
Thanks again for your kind patience . please elaborate on the code a little so I could understand it and expand it if needed.

Last edited by Don Cragun; 07-29-2016 at 04:31 PM.. Reason: Add CODE and ICODE tags and a list.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Find columns in a file based on header and print to new file

Hello, I have to fish out some specific columns from a file based on the header value. I have the list of columns I need in a different file. I thought I could read in the list of headers I need, # file with header names of required columns in required order headers_file=$2 # read contents... (11 Replies)
Discussion started by: LMHmedchem
11 Replies

2. Shell Programming and Scripting

Help with awk replacing identical columns based on another file

Hello, I am using Awk in UBUNTU 12.04. I have a file like following with three fields and 44706 rows. F1 A A F2 G G F3 A T I have another file like this: AL_1 F1 A A AL_2 F1 A T AL_3 F1 A A AL_1 F2 G G AL_2 F2 G A AL_3 F2 G G BO_1 F1 A A BO_2 F1 A T... (6 Replies)
Discussion started by: Homa
6 Replies

3. Shell Programming and Scripting

Search and replace with mapping from a mapper file in a target file

Hello, I have a special problem. I have a file in 8 bit and would like to convert the whole database to 16Bit unicode. The mapping file has the following structure: The mapper is provided as a zip file The target file to be converted contains data in English and 8 bit Urdu mapping, a... (4 Replies)
Discussion started by: gimley
4 Replies

4. Shell Programming and Scripting

Filtering first file columns based on second file column

Hi friends, I have one file like below. (.csv type) SNo,data1,data2 1,1,2 2,2,3 3,3,2 and another file like below. Exclude data1 where Exclude should be treated as column name in file2. I want the output shown below. SNo,data2 1,2 2,3 3,2 Where my data1 column got removed from... (2 Replies)
Discussion started by: ks_reddy
2 Replies

5. Shell Programming and Scripting

Finding/replacing strings in some files based on a file

Hi, We have a file (e.g. a .csv file, but could be any other format), with 2 columns: the old value and the new value. We need to modify all the files within the current directory (including subdirectories), so find and replace the contents found in the first column within the file, with the... (9 Replies)
Discussion started by: Talkabout
9 Replies

6. Shell Programming and Scripting

Replacing headers based on a second file

I have a file with thousands of sequences that looks like this: I need to replace the headers using a second file Thus, I will end up having the following file: I am looking for an AWK script that I can easily plug in my current pipeline. Any help will be greatly appreciated! (6 Replies)
Discussion started by: Xterra
6 Replies

7. UNIX for Dummies Questions & Answers

Script for replacing text in a file based on list

Hi All, I am fairly new to the world of Unix, and I am looking for a way to replace a line of text in a file with a delimited array of values. I have an aliases file that is currently in use on our mail server that we are migrating off of. Until the migration is complete, the server must stay... (8 Replies)
Discussion started by: phoenixjc
8 Replies

8. Shell Programming and Scripting

Replacing Character in a file based on element

Hi, I have file like below. Unix:/pclls/turc>cat tibc.property executeReceiver=Y executeSender=Y I want to replace executeSender=N in the file. My file should be like below. executeReceiver=Y executeSender=N I tried with the below command, its giving error. cat tibc.property |... (2 Replies)
Discussion started by: senthil_is
2 Replies

9. Shell Programming and Scripting

sorting file based on two or more columns

Hi gang. I'm using a unix/mac system and i'm trying to sort a file (more than 1,000,000 lines). chr1 100000965 100001001 - chr1 100002155 100002191 + chr1 100002165 100002201 + chr1 100002525 100002561 - chr1 10000364 ... (2 Replies)
Discussion started by: labrazil
2 Replies

10. Shell Programming and Scripting

Replacing columns into another file

Hi, I have input file. File1: Seqno Name 121 name1 122 name2 123 name3 124 name4 We will send the file1 to some other team. They will replace name column with place in file1 and send back to us as file2. file2: Seqno Place 121 place1 122 place2 124 place3 (2 Replies)
Discussion started by: manneni prakash
2 Replies
Login or Register to Ask a Question