Data imputation with scaling


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Data imputation with scaling
# 1  
Old 01-20-2015
Data imputation with scaling

Hello masters, this is difficult to explain and maybe complicated to implement...looks beyond what I taught myself (from this forum), some help is greatly appreciated.


I have a base file

Code:
 a1 10
 a2 15
 a3 20
 a4 21

I have a non-base file

Code:
a1 170
b12 175
c12 180
d12 190
a2  191
b23 567
a3  1000

I want to impute into the base file, values from the non-base file absent in the base. Imputed values must be scaled.

So when imputed into the base file , its value is scaled according to its range of flanking values.

the rule is

Code:
  imputed_value = low_base + range_base x ( diff_low_nonbase /range_nonbase)


The scaled imputed values are calculated as

  b12 = 10 + (15-10)*(175 - 170)/(191 - 170) = 11.19
  c12 = 10 + (15-10)*(180 - 170)/(191 - 170)
  d12 = 10 + (15-10)*(190 - 170)/(191 - 170)
  b23 = 15 + (20-15)*(567- 191)/(1000 - 191)


So the scaled imputed output looks like

Code:
 a1 10
 b12 11.19
 c12 12.38
 d12 14.78
 a2 15 
 b23 17.32
 a3 20
 a4 21

Note that I have made up the names of the variables for ease of understanding, they do not follow pattern like b23.
# 2  
Old 01-20-2015
Here is an awk framework that lists the required values.
Code:
awk 'NR==FNR {s[$1]=$2; next} ($1 in s) {lo=hi; nlo=nhi; hi=$2; nhi=$1; for (i=1;i<=bc;i++) {print name[i],s[nlo],s[nhi],ns[i],hi,lo} bc=0; next} {ns[++bc]=$2; name[bc]=$1}' base non-base

Adding the formula is left as an exercise...
This User Gave Thanks to MadeInGermany For This Post:
# 3  
Old 01-21-2015
Might be even more cryptic than MadeInGermany's, and certainly need some polishing/elegance, but try:
Code:
awk     'NR==FNR        {s[$1]=$2; if (L) DX[$1]=$2-L; L=$2; next}
         ($1 in s)      {if (K) D2=$2-K; K=$2; D1=DX[$1]
                         if (D1) for (i=C; i< NR; i++) print NM[i], s[$1]+D1*(VL[i]-K)/D2
                         print $1, s[$1];
                         C=NR+1
                         delete s[$1]
                        }
                        {NM[NR]=$1; VL[NR]=$2
                        }
         END            {for (i in s) print i, s[i]}
        ' OFMT="%.2f" file1 file2
a1 10
b12 11.19
c12 12.38
d12 14.76
a2 15
b23 17.32
a3 20
a4 21

This User Gave Thanks to RudiC For This Post:
# 4  
Old 01-23-2015
Hello senhia83,

Could you please try following too and let me know if this helps.
Code:
awk 'FNR==1{f++} FNR==NR{R[++h]=$1;sub(/[[:alpha:]]/,Z,$1);X[$1]=$2;next} (f==2){sub(/[[:alpha:]]/,Z,$1);{if(length($1)==1){Y[$1]=$2}}} (f==3){S=$1;Q=$2;sub(/[[:alpha:]]/,Z,$1);{if(length($1)>1){split($1, A,"");if(Y[A[1]] != "" && Y[A[2]] != ""){D=X[A[1]]+(X[A[2]] - X[A[1]]) * ($2 - Y[A[1]])/(Y[A[2]]-Y[A[1]])};print S OFS D;} else {print S OFS X[$1];m++}}} END{for(u in X){if(u==m){m++;print R[m] OFS X[m]}}}' base_file non_base_file non_base_file

Output will be as follows.
Code:
a1 10
b12 11.1905
c12 12.381
d12 14.7619
a2 15
b23 17.3239
a3 20
a4 21

I haven't tried with many senarios this code, please let me know if this helps.

EDIT: Adding non one liner form for same.
Code:
awk 'FNR==1     {f++}
     FNR==NR    {R[++h]=$1;sub(/[[:alpha:]]/,Z,$1);X[$1]=$2;next}
    (f==2)      {sub(/[[:alpha:]]/,Z,$1);
                                                        {if(length($1)==1)
                                                                                {Y[$1]=$2}
                                                        }
                }
     (f==3)     {S=$1;Q=$2;sub(/[[:alpha:]]/,Z,$1);     {if(length($1)>1)
                                                                                {split($1, A,"");
                                                                                 if(Y[A[1]] != "" && Y[A[2]] != ""){
                                                                                                                        D=X[A[1]]+(X[A[2]] - X[A[1]]) * ($2 - Y[A[1]])/(Y[A[2]]-Y[A[1]])}
                                                                                                                        ;print S OFS D;
                                                                                                                   }
                                                                                 else                              {
                                                                                                                        print S OFS X[$1];m++;
                                                                                                                   }
                                                        }
                }
     END        {for(u in X){
                                if(u==m)                {
                                                         print R[m] OFS X[m]
                                                         m++;
                                                        }
                           }
               }
   ' base_file non_base_file non_base_file

EDIT: Tried with following senario and it seems to be working fine.
Code:
cat non_base_file
a1 170
b12 175
c12 180
d12 190
a2  191
b23 567
a3  1000
a4  121
a5  675
a6  1100
f56 1200
 
 
cat base_file
a1 10
a2 15
a3 20
a4 21
a5 11
a6 17
a7 12
a8 123

Running the code we will get as follows.
Code:
./basic_non_basic1.ksh
a1 10
b12 11.1905
c12 12.381
d12 14.7619
a2 15
b23 17.3239
a3 20
a4 21
a5 11
a6 17
f56 18.4118
a7 12
a8 123

NOTE: Where basic_non_basic1.ksh is the above pasted script.


Thanks,
R. Singh

Last edited by RavinderSingh13; 01-23-2015 at 05:26 AM.. Reason: Added a non one liner form for solution
This User Gave Thanks to RavinderSingh13 For This Post:
Login or Register to Ask a Question

Previous Thread | Next Thread

1 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Re-scaling values - perl

Hey folks I have a big tab delimited file with 3 columns looks like this: chr2L 552 0.85 chr2R 135 1.06 chr3L 820 2.89 chr3R 581 3.93 chr4 585 0.94 chrX 605 1.93 All I want to do is re-scaling the third column to be between 0-1. Which means that the highest valu in 3rd column will... (5 Replies)
Discussion started by: @man
5 Replies
Login or Register to Ask a Question