Inserting column data based on category assignment

06-27-2015

Please help with the following.

I have data in 4 columns: instrument, category, variable and value. Each instrument belongs to a particular category, and all instruments measure the same variables (var1 and var2 in this example). The last column is the value an instrument outputs for a variable.

I have added some blank rows for readability; there are no blank rows in the actual dataset.

In this example, instruments ab, bc, pt and ef belong to cat1; instruments cd, gh and pd belong to cat2.

Code:

ab cat1 var1 aa
bc cat1 var1 aa
pt cat1 var1 tt

cd cat2 var1 tt
gh cat2 var1 gg

ab cat1 var2 aa
ef cat1 var2 aa

pd cat2 var2 tt
gh cat2 var2 tt

As you can see, some rows are missing, for example:

Code:

(ef cat1 var1)
(cd cat2 var2)
...
...

I want to impute these rows if a single value accounts for more than 60% of the existing rows in the same (category, variable) combination.

For example, in this part of the data:
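
A rough sketch of how the missing rows could be listed in the first place. It assumes every instrument in a category is expected to report every variable seen anywhere in the file (that is my assumption), and `missing_rows` is just a name made up for illustration:

```shell
# missing_rows: print "instrument category variable" for every combination
# that never occurs in the input.
# Assumed input columns: instrument category variable value
missing_rows() {
  awk '
    {
      member[$2, $1] = 1      # instrument $1 belongs to category $2
      vars[$3] = 1            # every variable seen anywhere in the file
      seen[$1, $2, $3] = 1    # rows that actually exist
    }
    END {
      for (m in member) {
        split(m, p, SUBSEP)   # p[1] = category, p[2] = instrument
        for (v in vars)
          if (!((p[2], p[1], v) in seen))
            print p[2], p[1], v
      }
    }
  ' "$@" | sort
}

# Run it on the sample data from this post.
missing_rows <<'EOF'
ab cat1 var1 aa
bc cat1 var1 aa
pt cat1 var1 tt
cd cat2 var1 tt
gh cat2 var1 gg
ab cat1 var2 aa
ef cat1 var2 aa
pd cat2 var2 tt
gh cat2 var2 tt
EOF
```

This only lists candidates; whether a candidate gets a value still depends on the consensus rule.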

Code:

ab cat1 var1 aa
bc cat1 var1 aa
pt cat1 var1 tt

(cat1 var1) has the value aa 2 out of 3 times (about 67%). Since this is above the 60% cutoff, we can impute the value for the missing instrument (ef) in this category (cat1) and variable (var1) as aa.

Code:

ab cat1 var1 aa data
bc cat1 var1 aa data
pt cat1 var1 tt data
ef cat1 var1 aa imputed
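
The 2-out-of-3 arithmetic above could be computed per (category, variable) group roughly like this. It is only a sketch: `consensus_share` is a hypothetical helper name, and it reports the most frequent value together with its share of the group's rows, leaving the 60% comparison to the caller:

```shell
# consensus_share: print category, variable, most frequent value and its share.
# Assumed input columns: instrument category variable value
consensus_share() {
  awk '
    {
      key = $2 " " $3
      total[key]++                        # rows seen for this (category, variable)
      count[key, $4]++                    # rows carrying this particular value
      if (count[key, $4] > best[key]) {   # remember the most frequent value so far
        best[key] = count[key, $4]
        top[key]  = $4
      }
    }
    END {
      for (key in total)
        printf "%s %s %.2f\n", key, top[key], best[key] / total[key]
    }
  ' "$@" | sort
}

# The three (cat1 var1) rows from the example: prints "cat1 var1 aa 0.67"
printf '%s\n' 'ab cat1 var1 aa' 'bc cat1 var1 aa' 'pt cat1 var1 tt' | consensus_share
```

When two values tie, as in (cat2 var1), this reports whichever reached the top count first, but its share of 0.50 stays below the cutoff either way.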

This is my desired output; row order doesn't matter and blank rows are not needed.

Code:

ab cat1 var1 aa data
bc cat1 var1 aa data
pt cat1 var1 tt data
ef cat1 var1 aa imputed

ab cat1 var2 aa data
ef cat1 var2 aa data
bc cat1 var2 aa imputed
pt cat1 var2 aa imputed

cd cat2 var1 tt data
gh cat2 var1 gg data

pd cat2 var2 tt data
gh cat2 var2 tt data
cd cat2 var2 tt imputed

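Here is what I tried. I kept getting lost in the nested arrays, so this version uses flat composite keys instead:

Code:

awk 'NR == FNR {                      # pass 1: collect membership and counts
    member[$2, $1] = 1                # instrument $1 belongs to category $2
    vars[$3] = 1                      # every variable seen anywhere
    seen[$1, $2, $3] = 1              # rows that actually exist
    num[$2, $3, $4]++                 # occurrences of value $4 in (cat, var)
    len[$2, $3]++                     # number of rows in (cat, var)
    next
}
{ print $1, $2, $3, $4, "data" }      # pass 2: echo the original rows
END {
    for (k in num) {                  # find a consensus value per (cat, var)
        split(k, p, SUBSEP)           # p[1]=cat, p[2]=var, p[3]=value
        if (num[k] / len[p[1], p[2]] >= 0.60)
            consensus[p[1], p[2]] = p[3]
    }
    for (m in member) {               # fill in the missing instruments
        split(m, q, SUBSEP)           # q[1]=cat, q[2]=instrument
        for (v in vars)
            if (!((q[2], q[1], v) in seen) && ((q[1], v) in consensus))
                print q[2], q[1], v, consensus[q[1], v], "imputed"
    }
}' data data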

ritakadm
