Inserting column data based on category assignment


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Inserting column data based on category assignment
# 1  
Old 06-27-2015
Inserting column data based on category assignment

please help with the following.

I have 4 col data .. instrument , category, variable and value. the instruments belong to particular categories and they all measure some variables (var1 and var2 in this example), the last column is the value an instrument outputs for a variable.

I have used some blank rows for ease of understanding , there is no blank row in the actual dataset.


In this example instruments (ab,bc,pt and ef) belong to cat1 ; instruments (cd,gh and pt ) belong to cat2.

Code:
ab cat1 var1 aa
bc cat1 var1 aa
pt cat1 var1 tt

cd cat2 var1 tt
gh cat2 var1 gg

ab cat1 var2 aa
ef cat1 var2 aa

pd cat2 var2 tt
gh cat2 var2 tt

As you can see , there are some rows missing like

Code:
( ef cat1 var1 )
( cd cat2 var2)
...
...

I want to impute these rows , if there is a consensus value within the same (cat var ) combination above 60%.

For example, in the part of data

Code:
ab cat1 var1 aa
bc cat1 var1 aa
pt cat1 var1 tt

(cat1 var1) has a value of aa 2 out of 3 times (66%). Since this is greater than cutoff of 60% , we can impute the missing instrument (ef) value in this category (cat1) and variable(var1) as aa.

Code:
ab cat1 var1 aa data
bc cat1 var1 aa data
pt cat1 var1 tt data
ef cat1 var1 aa imputed


This is my desired output, row order doesn't matter and blank rows not needed.


Code:
ab cat1 var1 aa data
bc cat1 var1 aa data
pt cat1 var1 tt data
ef cat1 var1 aa imputed

ab cat1 var2 aa data 
ef cat1 var2 aa data
bc cat1 var2 aa imputed
pt cat1 var2 aa imputed

cd cat2 var1 tt data
gh cat2 var1 gg data

pd cat2 var2 tt data
gh cat2 var2 tt data
cd cat2 var2 tt imputed

what I tried but getting lost in arrays

Code:
awk   'NR==FNR{
f1[$2][$1]=$3;
cat[$1][$2]++;
ins[$1]=$2;
val[$1$2]=$4;
 next
}
{
 for(ins in f1[cat]){
 for(var in f1){
 if(f1[var][ins]){
 num[cat][var][f1[cat][ins]]++
 len[cat][var]++;
  }
  }
for(val in num[cat][var){
if(num[cat][var][val]/len[cat][var] >= .60){
 printf  ins,cat,var,val
}
}
}' data data

Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Transposing data based on 1st column

I do have a big tab delimited file of the following format aa 344 456 aa 34 67 bb 34 90 bb 23 100 bb 1 89 d 0 12 e 45 678 e 78 90 e 56 90 .... .... .... I would like to transpose the data based on the category on column one and get the output file in the following tab delimited... (8 Replies)
Discussion started by: Kanja
8 Replies

2. Shell Programming and Scripting

To append new data at the end of each line based on substring of last column

Hi guys, I need to append new data at the end of each line of the files. This new data is based on substring (3rd fields) of last column. Input file xxx.csv: U1234|1-5X|orange|1-5X|Act|1-5X|0.1 /sac/orange 12345 0 U5678|1-7X|grape|1-7X|Act|1-7X|0.1 /sac/grape 5678 0... (5 Replies)
Discussion started by: null7
5 Replies

3. Shell Programming and Scripting

Data filtering and category assigning

Please consider the following file, I have many groups which can be of 3 types, T1 (Serial_Number 1) T2 (Serial_Number 2) and T1*T2 (all other Serial_Number). I want to only consider groups that have both T1 and T2 present and their values are different from each other. In the example file,... (8 Replies)
Discussion started by: jianp83
8 Replies

4. Shell Programming and Scripting

Generate tabular data based on a column value from an existing data file

Hi, I have a data file with : 01/28/2012,1,1,98995 01/28/2012,1,2,7195 01/29/2012,1,1,98995 01/29/2012,1,2,7195 01/30/2012,1,1,98896 01/30/2012,1,2,7083 01/31/2012,1,1,98896 01/31/2012,1,2,7083 02/01/2012,1,1,98896 02/01/2012,1,2,7083 02/02/2012,1,1,98899 02/02/2012,1,2,7083 I... (1 Reply)
Discussion started by: himanish
1 Replies

5. Shell Programming and Scripting

Help with analysis data based on particular column content

Input file: Total_counts 1306726155 100% Number_of_count_true 855020282 Number_of_count_true_1 160014283 Number_of_count_true_2 44002825 Number_of_count_true_3 18098424 Number_of_count_true_4 24693745 Number_of_count_false 115421870 Number_of_count_true 51048447 Total_number_of_false ... (2 Replies)
Discussion started by: perl_beginner
2 Replies

6. UNIX for Dummies Questions & Answers

Grep bunch of gzip files to count based on category

Started using unix commands recently. I have 50 gzip files. I want to grep each of these files for a line count based particular category in column 3. How can I do that? For example Sr.No Date City Description Code Address 1 06/09 NY living here 0909 10st st nyc 2 ... (5 Replies)
Discussion started by: jinxx
5 Replies

7. Shell Programming and Scripting

Parsing out the first (top) data lines of each category

Hi All, I need some help in parsing out the first (top) data lines of each category (categories are based on the first column a, b, c, d, e.( see example file below) from a big file a dfg 3 6 8 9 a fgh 5 7 0 9 a gkl 5 2 4 7 a glo 7 0 1 5 b ghj 9 0 4 2 b mkl 7 8 0 5 b jkl 9 0 4 5 c jkl 2... (1 Reply)
Discussion started by: Lucky Ali
1 Replies

8. UNIX for Dummies Questions & Answers

How to sum rows in e.g. column 1 by a category in e.g. column 2

Hi, I've shown an example of what I would like to achieve below. In the example file, I would like to sum the values in column 2 for each distinct category in column 3 (presumably making an array?) and print the sum as well as the category name and length (note:length always corresponds with... (8 Replies)
Discussion started by: auburn
8 Replies

9. Shell Programming and Scripting

Remove duplicate line detail based on column one data

My input file: AVI.out <detail>named as the RRM .</detail> AVI.out <detail>Contains 1 RRM .</detail> AR0.out <detail>named as the tellurite-resistance.</detail> AWG.out <detail>Contains 2 HTH .</detail> ADV.out <detail>named as the DENR family.</detail> ADV.out ... (10 Replies)
Discussion started by: patrick87
10 Replies

10. Shell Programming and Scripting

Extract data based on match against one column data from a long list data

My input file: data_5 Ali 422 2.00E-45 102/253 140/253 24 data_3 Abu 202 60.00E-45 12/23 140/23 28 data_1 Ahmad 256 7.00E-45 120/235 140/235 22 data_4 Aman 365 8.00E-45 15/65 140/65 20 data_10 Jones 869 9.00E-45 65/253 140/253 18... (12 Replies)
Discussion started by: patrick87
12 Replies
Login or Register to Ask a Question