Data imputation with scaling

01-20-2015

Hello masters, this is difficult to explain and may be complicated to implement. It looks beyond what I have taught myself (from this forum), so some help is greatly appreciated.

I have a base file

Code:

I have a non-base file

Code:

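I want to impute into the base file the entries of the non-base file that are absent from the base file. The imputed values must be scaled.

That is, when an entry is imputed into the base file, its value is scaled according to the range of its flanking values (the nearest entries before and after it that are present in both files).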

The rule is

Code:

  imputed_value = low_base + range_base * (diff_low_nonbase / range_nonbase)


The scaled imputed values are calculated as

  b12 = 10 + (15-10)*(175 - 170)/(191 - 170) = 11.19
  c12 = 10 + (15-10)*(180 - 170)/(191 - 170) = 12.38
  d12 = 10 + (15-10)*(190 - 170)/(191 - 170) = 14.76
  b23 = 15 + (20-15)*(567 - 191)/(1000 - 191) = 17.32
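
The rule and the four calculations above can be checked with a small awk sketch. The function names `check_rule` and `impute` and the argument names are made up for illustration only; the numbers are the ones used in the calculations.

```shell
# check_rule and impute() are hypothetical helpers, not part of the thread.
#   lo_b, hi_b : base values of the two flanking entries
#   lo_n, hi_n : non-base values of the same two entries
#   v          : non-base value of the entry to impute
check_rule() {
  awk '
    function impute(lo_b, hi_b, lo_n, hi_n, v) {
        return lo_b + (hi_b - lo_b) * (v - lo_n) / (hi_n - lo_n)
    }
    BEGIN {
        printf "b12 %.2f\n", impute(10, 15, 170, 191, 175)
        printf "c12 %.2f\n", impute(10, 15, 170, 191, 180)
        printf "d12 %.2f\n", impute(10, 15, 170, 191, 190)
        printf "b23 %.2f\n", impute(15, 20, 191, 1000, 567)
    }'
}
check_rule
```

The sketch assumes the two flanking non-base values differ; if they were equal, the division would fail.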

So the scaled imputed output looks like

Code:

Note that I have made up the names of the variables for ease of understanding; the real names do not follow a pattern like b23.

senhia83

01-20-2015

Here is an awk framework that lists the required values.

Code:

awk 'NR==FNR {s[$1]=$2; next} ($1 in s) {lo=hi; nlo=nhi; hi=$2; nhi=$1; for (i=1;i<=bc;i++) {print name[i],s[nlo],s[nhi],ns[i],hi,lo} bc=0; next} {ns[++bc]=$2; name[bc]=$1}' base non-base

Adding the formula is left as an exercise...
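
One possible way to do that exercise is sketched below. It keeps the idea of the framework (buffer the non-base lines between two names that exist in the base file, then flush them when the next such name is reached) and applies the scaling rule at the flush. The function name `impute_scaled`, the variable names, and the sample files are assumptions: the sample data is reconstructed from the worked calculations in the question, with a4 left out of the non-base file because its non-base value is not given.

```shell
# impute_scaled: hypothetical wrapper, usage: impute_scaled BASE_FILE NON_BASE_FILE
impute_scaled() {
  awk '
    # Pass 1 (base file): remember each base value and the order of the names.
    NR == FNR { base[$1] = $2; order[++n] = $1; next }

    # Pass 2, anchor line: the name also exists in the base file.
    $1 in base {
        lo_b = hi_b; lo_n = hi_n        # previous anchor becomes the low flank
        hi_b = base[$1]; hi_n = $2      # this anchor is the high flank
        # Flush the lines buffered since the previous anchor.
        # Lines before the first anchor have no low flank and are dropped.
        if (seen_anchor)
            for (i = 1; i <= bc; i++)
                printf "%s %.2f\n", name[i], lo_b + (hi_b - lo_b) * (val[i] - lo_n) / (hi_n - lo_n)
        bc = 0; seen_anchor = 1
        print $1, base[$1]
        printed[$1] = 1
        next
    }

    # Pass 2, any other line: buffer it until the next anchor is reached.
    { name[++bc] = $1; val[bc] = $2 }

    # Base entries that never appeared in the non-base file, in base order.
    END { for (i = 1; i <= n; i++) if (!(order[i] in printed)) print order[i], base[order[i]] }
  ' "$1" "$2"
}

# Sample data (assumed, see above).
tmp=$(mktemp -d)
cat > "$tmp/base" <<'EOF'
a1 10
a2 15
a3 20
a4 21
EOF
cat > "$tmp/non-base" <<'EOF'
a1 170
b12 175
c12 180
d12 190
a2 191
b23 567
a3 1000
EOF
impute_scaled "$tmp/base" "$tmp/non-base"
rm -r "$tmp"
```

This sketch assumes that consecutive anchors have different non-base values (otherwise the division fails), and it silently drops non-base lines that come after the last anchor, since they have no upper flank to scale against.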

MadeInGermany

01-21-2015

This might be even more cryptic than MadeInGermany's, and it certainly needs some polishing, but try:

Code:

awk     'NR==FNR        {s[$1]=$2; if (L) DX[$1]=$2-L; L=$2; next}
         ($1 in s)      {if (K) D2=$2-K; K=$2; D1=DX[$1]
                         if (D1) for (i=C; i< NR; i++) print NM[i], s[$1]+D1*(VL[i]-K)/D2
                         print $1, s[$1];
                         C=NR+1
                         delete s[$1]
                        }
                        {NM[NR]=$1; VL[NR]=$2
                        }
         END            {for (i in s) print i, s[i]}
        ' OFMT="%.2f" file1 file2
a1 10
b12 11.19
c12 12.38
d12 14.76
a2 15
b23 17.32
a3 20
a4 21

RudiC

01-23-2015

Hello senhia83,

Could you please try the following too and let me know if this helps.

Code:

awk 'FNR==1{f++} FNR==NR{R[++h]=$1;sub(/[[:alpha:]]/,Z,$1);X[$1]=$2;next} (f==2){sub(/[[:alpha:]]/,Z,$1);{if(length($1)==1){Y[$1]=$2}}} (f==3){S=$1;Q=$2;sub(/[[:alpha:]]/,Z,$1);{if(length($1)>1){split($1, A,"");if(Y[A[1]] != "" && Y[A[2]] != ""){D=X[A[1]]+(X[A[2]] - X[A[1]]) * ($2 - Y[A[1]])/(Y[A[2]]-Y[A[1]])};print S OFS D;} else {print S OFS X[$1];m++}}} END{for(u in X){if(u==m){m++;print R[m] OFS X[m]}}}' base_file non_base_file non_base_file

Output will be as follows.

Code:

a1 10
b12 11.1905
c12 12.381
d12 14.7619
a2 15
b23 17.3239
a3 20
a4 21

I haven't tried this code with many scenarios; please let me know if this helps.

EDIT: Adding a multi-line form of the same.

Code:

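awk 'FNR==1  {f++}    # f counts which of the three input files is being read
     # File 1 (base_file): keep the names in order; key the values by the name minus its letter
     FNR==NR {R[++h]=$1; sub(/[[:alpha:]]/,Z,$1); X[$1]=$2; next}
     # File 2 (non_base_file): store the non-base values of the single-digit names
     f==2    {sub(/[[:alpha:]]/,Z,$1); if(length($1)==1) Y[$1]=$2}
     # File 3 (non_base_file again): print single-digit names as they are, scale the rest
     f==3    {S=$1; sub(/[[:alpha:]]/,Z,$1)
              if(length($1)>1) {
                  split($1, A, "")    # e.g. "12" gives A[1]="1" and A[2]="2"
                  if(Y[A[1]] != "" && Y[A[2]] != "")
                      D=X[A[1]]+(X[A[2]]-X[A[1]])*($2-Y[A[1]])/(Y[A[2]]-Y[A[1]])
                  print S OFS D
              } else {
                  print S OFS X[$1]; m++
              }
             }
     # Remaining base entries; same order as the one-liner (increment m, then print).
     # Note: this depends on for-in visiting the keys in ascending order, which awk does not guarantee.
     END     {for(u in X) if(u==m) {m++; print R[m] OFS X[m]}}
    ' base_file non_base_file non_base_file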

EDIT: Tried with the following scenario and it seems to work fine.

Code:

cat non_base_file
a1 170
b12 175
c12 180
d12 190
a2  191
b23 567
a3  1000
a4  121
a5  675
a6  1100
f56 1200
 
 
cat base_file
a1 10
a2 15
a3 20
a4 21
a5 11
a6 17
a7 12
a8 123

Running the code gives the following.

Code:

./basic_non_basic1.ksh
a1 10
b12 11.1905
c12 12.381
d12 14.7619
a2 15
b23 17.3239
a3 20
a4 21
a5 11
a6 17
f56 18.4118
a7 12
a8 123

NOTE: basic_non_basic1.ksh is the script pasted above.

Thanks,
R. Singh

Last edited by RavinderSingh13; 01-23-2015 at 05:26 AM.. Reason: Added a non one liner form for solution

RavinderSingh13
