How to group similar data?


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting How to group similar data?
# 1  
Old 03-25-2015
How to group similar data?

Hello everybody:

I want rearrange about 5 million rows (with 300 columns) into groups.

Data looks like the following: where there were various experiments (column 2) conducted at different locations (column headers in top row column 4 onwards) in different years (column 1) using instruments (column 3). The numbers in the matrix ( row 2 onwards, column 4 onwards) indicate how many instances of experiments were successful.

What I want is to rearrange the rows

Input
Code:
			345	A346	347	348	349	350	351	352
2014	Exp1	IBM		24			45	22		
2014	Exp2	LEN	23		32	34				
2014	Exp3	LEN	2		34	34				
2014	Exp4	IBM		34			44	43		
2014	Exp5	IBM		2	45		51	45		
2014	Exp6	IBM			34		23	54		
2014	Exp7	IBM		23			23	24		
2014	Exp8	IBM			34		45	56		
2014	Exp9	LEN	24		45	45				
2014	Exp10	LEN	43		45	32				
2015	Exp11	IBM		34			55	33		34
2015	Exp12	IBM	1		33			4		5
2015	Exp13	IBM		43			55	34		43
2015	Exp14	IBM		45			32	43		4
2015	Exp15	IBM			23			4		5
2015	Exp16	IBM	32	34		43				
2015	Exp17	IBM	32	34		46				
2015	Exp18	LEN			32		54		67	
2015	Exp19	SCL		56	6		4	45		56
2015	Exp20	LEN			67		56		76	
2015	Exp21	LEN			45		56		65	
2015	Exp22	SCL		45				55		54
2015	Exp23	SCL		4				55		45

What I would like to have are the rows rearranged into groups such that

1) Within the same year
2) Using the same instrument

create groups such that,

each group has at least 3 locations in common, each of which has at least 20 successful experiments.


Requested Output

Code:
				345	A346	347	348	349	350	351	352
1	2014	Exp1	IBM		24			45	22		
1	2014	Exp4	IBM		34			44	43		
1	2014	Exp7	IBM		23			23	24		
2	2014	Exp2	LEN	23		32	34				
2	2014	Exp9	LEN	24		45	45				
2	2014	Exp10	LEN	43		45	32				
3	2014	Exp5	IBM		2	45		51	45		
3	2014	Exp6	IBM			34		23	54		
3	2014	Exp8	IBM			34		45	56		
4	2015	Exp11	IBM		34			55	33		34
4	2015	Exp13	IBM		43			55	34		43
4	2015	Exp14	IBM		45			32	43		4
5	2015	Exp16	IBM	32	34		43				
5	2015	Exp17	IBM	32	34		46				
6	2015	Exp18	LEN			32		54		67	
6	2015	Exp20	LEN			67		56		76	
6	2015	Exp21	LEN			45		56		65	
7	2015	Exp19	SCL		56	6		4	45		56
7	2015	Exp22	SCL		45				55		54
	2014	Exp3	LEN	2		34	34				
	2015	Exp12	IBM	1		33			4		5
	2015	Exp15	IBM			23			4		5
	2015	Exp23	SCL		4				55		45

The group number assignment is in the first column, it doesn't matter to me what number is assigned to which group as long as the members are assigned to the correct group. Also if experiments cant be assigned to groups, they should be left without a group number. Assignment should be with the group with maximum common locations, if equal then it can be assigned to multiple groups.


Please assist, I am learning some very basic awk, this seems beyond me..
I tried this but I dont think this will get anywhere near to what I require


Code:
awk ' NR>1{ for (i=4;i<=NF;i++) if ($i!="") arr1[$1,$2,$3]=$i ; next } 
	$1,$2,$3 in arr1 { 
		for (j=1;j<length(arr1);j++)) 
			{if (arr1[j] > 20)
			group++;
 END {
   for (j in n) {
      print group, arr1[j]
   }
}' input input


Last edited by sheetalk; 03-25-2015 at 07:02 PM..
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Need to format string to collate data of similar IP address.

I have a string which has IP and data of files for that ip which is printed using the below code: Sample string code below: str1="10.9.11.128\n-rwxr-xr-x user1 2019-12-29 17:53 /var/branch/custom/tg.xml 286030210\n10.9.12.129\n-rwxr-xr-x user1 2019-12-29 17:53 /app/branch/custom/tg.xml... (6 Replies)
Discussion started by: mohtashims
6 Replies

2. Shell Programming and Scripting

To group the text (rows) by similar columns-names in a file

As part of some report generation, I've written a script to fetch the values from DB. But, unluckily, for certain Time ranges(1-9.99,10-19.99 etc), I don't have data in DB. In such cases, I would like to write zero (0) instead of empty. The desired output will be exported to csv file. ... (1 Reply)
Discussion started by: kumar_karpuram
1 Replies

3. Shell Programming and Scripting

De-group data

Hi, some help is highly appreciated, I want to de-group my data for statistical analysis. I made up some sample data, there shouldnt be repeated lines in the output. My data is in excel but I can make it tab-delimited text. A B,C A B,D,E X Y X Y,ZExpected output A B A C A D A E X Y... (3 Replies)
Discussion started by: ritakadm
3 Replies

4. Shell Programming and Scripting

Kindly check:remove duplicates with similar data in front of it

Hi all, I have 2 files containing data like this: so if there is same entry repeated in the column like1,2,3,4 I have to check if there is different entries column like 2,4 but similar entries for duplicatein column 2 like1,3 the output shuld be like this for first file ... (5 Replies)
Discussion started by: manigrover
5 Replies

5. Solaris

Create file for group of data:

Hi folks, I have the following data.Any help is greatly appreciated. order File_name 7222245 7222245.pdf 7222245 7222245a.pdf 7222245 7222245b.pdf 7222245 7222245c.pdf 7222245 7222245d.pdf 7222250 ... (1 Reply)
Discussion started by: kumar444
1 Replies

6. Shell Programming and Scripting

How to pick a group of data using awk/ksh

Hi gurus, I have data coming in as shown below. And in each case, I need to pick the data in the last group as shown below. Data Set 1: DC | 18161621 LA | 15730880 NY | 16143237 DC | 18161621 LA | 17316397 NY | 17915905 DC | 18161621 LA | 17993534 NY | 18161621 DC | 18161621... (11 Replies)
Discussion started by: calredd
11 Replies

7. Shell Programming and Scripting

appending data from similar files

I am familiar with scripting, but I am trying to see if there is an easy way to append files from similar files into one file. For example, if there is file1_20121201, file1_20121202, file1_20121203, file2_20121201, file2_20121202, file2_20121203 I want to be able to combine all the data from... (3 Replies)
Discussion started by: mrbean1975
3 Replies

8. Shell Programming and Scripting

Sort Data by Group !

Hello, I have a file and i want to sort by third column and extract the three top lines of each group, it is determined by the second column (144, 89, 55, etc). Could you please help me with the appropiate awk shell script XLY-XLP 144 0.592772 XLY-XLE 144 0.798121 ... (3 Replies)
Discussion started by: csierra
3 Replies

9. Shell Programming and Scripting

Split, Search and Reformat by Data Group

Hi, I am writing just to share my appreciation for help I have received from this site in the past. In a previous post Split File by Data Group I received a lot of help with a troublesome awk script to reformat some complicated data blocks. What I learned really came in hand recently when I... (1 Reply)
Discussion started by: mkastin
1 Replies

10. Shell Programming and Scripting

Split file by data group

Hi all, I'm having a little trouble solving a file split I need to get done. I have the following data: 1. Light 1A. Light Soft texture: it's soft color: the color value is that of something light vital statistics: srm: 23 og: 1.035 sp: 1.065 comment: this is nice if you like... (8 Replies)
Discussion started by: mkastin
8 Replies
Login or Register to Ask a Question