Replacing 12 columns of one file by second file based on mapping in third file

07-29-2016

I have a production data file with 80+ fields and 1k-2k records. I need to extract about 12 sensitive columns (DOB, account no., name, govt ID, etc.), along with a primary key, say SEQ_ID, into a lookup file. The sensitive fields in the lookup file then have to be replaced by mocked data held in a second file, say mocked_Store.dat, whose field order may differ from that of the first file. Which field number of the first file is replaced by which field number of the second file is defined in a third file, mapping.dat. Once the replacement is done, the 12 sensitive fields need to be put back into the original prod file. In the end I need the prod file with mocked data in those fields, plus a lookup file holding both the original and the mocked values. Please help. TIA

megh12

07-29-2016

How about some decent sample input data (sensitive info concealed, abbreviated if need be), the desired output, and a description of the logic that connects the two?

RudiC

07-29-2016

So this is to anonymise the production data, I think. Is that right?

Do you need the overwritten details to be unique, or could they all be the same, or perhaps a simple sequential value, e.g. Name becomes Fnamaaaaaaaa Snamaaaaaaaa through to Fnamzzzzzzzz Snamzzzzzzzz?

If you need them to be random, that could probably be done. If you need to generate the random details and be able to reuse them, then, provided we can store them in a file, we might be able to use awk to read both files and merge the output. It might even be possible with paste.
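As a rough illustration of the sequential-value idea, here is a minimal sketch. It assumes a pipe-delimited file with the first and last name in fields 3 and 4 (as in the sample data later in this thread), and it uses a zero-padded record number instead of the letter sequence suggested above. The file names sample.dat and sample_masked.dat are made up for the example.

```shell
# Two sample records in a pipe-delimited layout (fields 3 and 4 are names).
printf '%s\n' \
  'value a|100000|vijayendra|kumar' \
  'value a1|100003|ravi|kumar' > sample.dat

# Overwrite both name fields with a placeholder built from the record number.
awk -F'|' -v OFS='|' '{
    $3 = sprintf("Fnam%08d", NR)
    $4 = sprintf("Snam%08d", NR)
    print
}' sample.dat > sample_masked.dat

cat sample_masked.dat
# value a|100000|Fnam00000001|Snam00000001
# value a1|100003|Fnam00000002|Snam00000002
```

Because the placeholder depends only on the record number, rerunning the command on the same input gives the same masked values.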

What we will need is some good input data (sanitised, of course) and a fuller description of what you need to achieve and how.

If you could paste your input/output in CODE tags, it will make it easier to read and preserve multiple spaces for fixed width data.

Kind regards,
Robin

rbatte1

07-29-2016

Reposting my question with more details, as advised:
I have a production data file with 80+ fields. I need to replace about 12 sensitive columns (identified by field number) with mocked data from a second file, say mocked_Store.dat, row by row. Which field number of the first file is replaced by which field number of the second file is defined in a third file, mapping.dat. In the end I need the prod file with mocked data in the given fields, and a lookup file holding the original fields and the corresponding mocked values. The lookup file also needs the primary key from the prod file (say SEQ_ID; its position is given in the mapping file). Please help. TIA. The expected input and output are shown in the example below, using sample data.

Input 1: sample prod file:

Code:

col a|SEQ_ID| first_name|last_name|full_name|DOB|col b| col c|Govt_id|col d
value a|100000|vijayendra|kumar|vijayendra kumar|10/101984|value b|value c |AOYUGH9282P|value d
value a1|100003|ravi|kumar|ravi kumar|01/01/1987|value b1|value c1|AOJKUYT0908P|value d1

Input 2 : mocked_store.dat:(containing mocked data)

Code:

DOB|full_name|Govt_id|first_name|last_name|
02/02/1981|Meena Kumari|ABCDEF1232F|Meena|Kumari
02/02/1982|Dhyan Chand|ABCD4567M|Dhyan|Chand

Input 3: mapping file:

Code:

Prod file Field number| mocked_store file Field number
3|4  #first_name
4|5 #last_name
5|2 #full_name
6|1#DOB
9|3#Govt_id
2|  #SEQ_ID not to be replaced

OUTPUT1: prod file <actual data replaced by mocked data>

Code:

col a|SEQ_ID| first_name|last_name|full_name|DOB|col b| col c|Govt_id|col d
value a|100000|Meena|Kumari|Meena Kumari|02/02/1981|value b|value c |ABCDEF1232F|value d
value a1|100003|Dhyan|Chand|Dhyan Chand|02/02/1982|value b1|value c1|ABCD4567M|value d1

OUTPUT 2: lookup file < prod data+ mocked up data ,fields in order of mapping file>

Code:

first_name|first_name_mocked|last_name|last_name_mocked|full_name|full_name_mocked|DOB|DOB_mocked|Govt_id|Govt_id_mocked|SEQ_ID
vijayendra|Meena|kumar|Kumari|vijayendra kumar|Meena Kumari|10/10/1984|02/02/1981|AOYUGH9282P|ABCDEF1232F|100000
ravi|Dhyan|kumar|Chand|ravi kumar|Dhyan Chand|01/01/1987|02/02/1982|AOJKUYT0908P|ABCD4567M|100003

Please NOTE: The real data file does not have field headers. The fields in the output lookup file can be in any order; currently I am thinking of keeping them in the same order as the mapping file.

---------- Post updated at 03:59 AM ---------- Previous update was at 03:49 AM ----------

Thank you for suggesting the edit to the post. Yes, this is to mask the data so that it remains valid but is no longer real. We are provided with mock stores, which we create ourselves and which contain completely fake data. Please share the relevant awk commands to accomplish this. I have added sample data to my example to make the expected input and output clear.

Moderator's Comments:

Please use code tags as required by forum rules!

Last edited by rbatte1; 07-29-2016 at 06:23 AM.. Reason: RudiC Added code tags. rbatte1 wrapped NOPARSE tags round Input 2 line to prevent emoticon conversion

megh12

07-29-2016

You connect fields in the prod file with fields in the "mock" file. How are the lines (records) in those files connected? By line number?

RudiC

07-29-2016

Assuming the lines are connected by line number, try:

Code:

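awk -F"|" '
# Bump the file counter on the first line of every input file,
# then skip that line (this assumes every file starts with a header).
FNR == 1        {NoF++
                 next
                }
# File 1 (mapfile): strip the trailing "#comment"; "_" is an unset
# variable, i.e. the empty string.
NoF == 1        {sub (/ *#.*$/, _)
                 TR1[$2] = $1           # mock field number -> prod field number
                 COL[$1]                # prod fields that go into the lookup file
                 next
                }
# File 2 (mockfile): store every value by line number and target prod field.
NoF == 2        {for (i=1; i<=NF; i++) TR2[FNR,TR1[i]] = $i
                 next
                }

# File 3 (prodfile): write original/mocked pairs to the lookup file and
# overwrite the prod field when a mocked value exists.
# Note: "for (c in COL)" visits the fields in an unspecified order.
NoF == 3        {for (c in COL)         {printf "%s%s%s%s", $c, OFS, TR2[FNR,c], OFS > "lookupfile"
                                         if (TR2[FNR,c]) $c = TR2[FNR,c]
                                        }
                 printf RS > "lookupfile"
                }

# Print the (possibly modified) prod record.
1

' mapfile mockfile OFS="|" prodfile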
value a|100000|Meena|Kumari|Meena Kumari|02/02/1981|value b|value c |ABCDEF1232F|value d
value a1|100003|Dhyan|Chand|Dhyan Chand|02/02/1982|value b1|value c1|ABCD4567M|value d1
cat lookupfile 
100000||vijayendra|Meena|kumar|Kumari|vijayendra kumar|Meena Kumari|10/101984|02/02/1981|AOYUGH9282P|ABCDEF1232F|
100003||ravi|Dhyan|kumar|Chand|ravi kumar|Dhyan Chand|01/01/1987|02/02/1982|AOJKUYT0908P|ABCD4567M|

If the lookupfile's structure doesn't suit you, additional measures must be taken.

Last edited by RudiC; 07-29-2016 at 08:22 AM.. Reason: Changed file names

RudiC

07-29-2016

Thanks a lot, RudiC, for the detailed reply. Yes, we will do the replacement row by row, line by line. I put your code in a script, create_testdata.ksh, and ran it. My observations are below:

Code:

 $ cat prodfile
value a|100000|vijayendra|kumar|vijayendra kumar|10/101984|value b|value c |AOYUGH9282P|value d
value a1|100003|ravi|kumar|ravi kumar|01/01/1987|value b1|value c1|AOJKUYT0908P|value d1
value a2|100005|nisha|verma|nisha verma|12/12/1987|value b2|value c2|AOJYGFT345F|value d2
  
 $ cat mockfile
DOB|full_name|Govt_id|first_name|last_name|
02/02/1981|Meena Kumari|ABCDEF1232F|Meena|Kumari|
02/02/1982|Dhyan Chand|ABCD4567M|Dhyan|Chand|
02/02/1983|John Abraham|ABCDEF234M|John|Abrahm|
  
 $ cat mapfile
prodfile field number|store file Field number
3|4  #first_name
4|5 #last_name
5|2 #full_name
6|1#DOB
9|3#Govt_id
2|  #SEQ_ID not to be replaced

output:

Code:

$ ./create_testdata.ksh
value a1|100003|Meena|Kumari|Meena Kumari|02/02/1981|value b1|value c1|ABCDEF1232F|value d1
value a2|100005|Dhyan|Chand|Dhyan Chand|02/02/1982|value b2|value c2|ABCD4567M|value d2
  
 $ cat lookupfile
kumar|Kumari|ravi kumar|Meena Kumari|01/01/1987|02/02/1981|AOJKUYT0908P|ABCDEF1232F|100003||ravi|Meena|
verma|Chand|nisha verma|Dhyan Chand|12/12/1987|02/02/1982|AOJYGFT345F|ABCD4567M|100005||nisha|Dhyan|

Mostly it ran perfectly, except for a few minor issues:

The first line of the prod file is always missing from the output.
The first line of mockfile replaces the second line of prodfile, and so on, rather than matching row for row. If we remove the header from mockfile the rows line up correctly, but the header is needed in that file.
The lookup file fields are in the same order as the map file, except that the first field (first_name) comes at the end instead of the beginning.

Thanks again for your patience. Please elaborate on the code a little so that I can understand it and extend it if needed.

Last edited by Don Cragun; 07-29-2016 at 04:31 PM.. Reason: Add CODE and ICODE tags and a list.

megh12
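The three observations above appear to share one cause: the posted script skips the first line of every input file, including the headerless prodfile, and it walks the mapped fields with `for (c in COL)`, whose order awk leaves unspecified. What follows is a sketch of one possible adjustment, not a reply from the thread. It assumes that mapfile and mockfile each have one header line, that prodfile has none, and that rows are paired by position. The output name prodfile.masked is made up for the example.

```shell
# Sample inputs taken from the thread; prodfile has no header line.
cat > mapfile <<'EOF'
prodfile field number|store file Field number
3|4  #first_name
4|5 #last_name
5|2 #full_name
6|1#DOB
9|3#Govt_id
2|  #SEQ_ID not to be replaced
EOF
cat > mockfile <<'EOF'
DOB|full_name|Govt_id|first_name|last_name|
02/02/1981|Meena Kumari|ABCDEF1232F|Meena|Kumari|
02/02/1982|Dhyan Chand|ABCD4567M|Dhyan|Chand|
EOF
cat > prodfile <<'EOF'
value a|100000|vijayendra|kumar|vijayendra kumar|10/101984|value b|value c |AOYUGH9282P|value d
value a1|100003|ravi|kumar|ravi kumar|01/01/1987|value b1|value c1|AOJKUYT0908P|value d1
EOF

awk -F'|' -v OFS='|' '
FNR == 1 { file++ }                  # a new input file starts here
file == 1 {                          # mapfile
    if (FNR == 1) next               # skip its header line
    sub(/ *#.*$/, "")                # drop the trailing "#comment"
    gsub(/[ \t]/, "")                # drop stray blanks
    if ($1 == "") next               # ignore empty lines
    order[++n] = $1                  # remember the mapping order
    map[$1] = $2                     # prod field -> mock field ("" = keep)
    next
}
file == 2 {                          # mockfile
    if (FNR > 1)                     # skip the header, number data rows from 1
        for (i = 1; i <= NF; i++) mock[FNR - 1, i] = $i
    next
}
{                                    # prodfile: every line is data
    line = ""
    for (k = 1; k <= n; k++) {
        p = order[k]; m = map[p]
        line = line (k > 1 ? OFS : "") $p            # original value
        if (m != "" && ((FNR, m) in mock)) {
            line = line OFS mock[FNR, m]             # mocked value
            $p = mock[FNR, m]                        # replace in the prod record
        }
    }
    print line > "lookupfile"
    print                            # masked prod record goes to stdout
}
' mapfile mockfile prodfile > prodfile.masked
```

With these inputs, the first line of prodfile.masked should be `value a|100000|Meena|Kumari|Meena Kumari|02/02/1981|value b|value c |ABCDEF1232F|value d`, and the first line of lookupfile should be `vijayendra|Meena|kumar|Kumari|vijayendra kumar|Meena Kumari|10/101984|02/02/1981|AOYUGH9282P|ABCDEF1232F|100000`. The `order` array keeps the lookup fields in mapfile order, so SEQ_ID lands last because it is the last mapping line. Prod rows beyond the end of mockfile are left unchanged.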
