Match child with parents and form matrix


 
# 1  
Old 02-01-2015

Thank you for letting me join this forum; it looks like there are lots of learning opportunities here.
I am a biologist and very new to Unix, so please excuse any incorrect terminology. I am using Cygwin on Windows, which can run perl, awk, sed, etc.

I have 2 files. The first, the sample sheet, tells which parent or group of children is in which sample. Parents are represented as p1, p2, p3 and the corresponding children groups are represented as p1/p2, p2/p3, etc.

Code:
index,line,sample
1,p1,s1
2,p2,s2
3,p1/p2,s3
4,p1/p2,s4
5,p1/p2,s5
6,p1/p2,s6
7,p1/p3,s7
8,p1/p3,s8
9,p1/p3,s9
10,p1/p3,s10
11,p2/p3,s11
12,p2/p3,s12
13,p2/p3,s13
14,p2/p3,s14
15,p3,s15

The second file contains the data: sample number, variable name and value. Parent values are always one of aa, tt, gg, cc (the same character repeated twice).

Code:
sample,var,value
s1,v1,aa
s1,v2,tt
s1,v3,aa
s1,v4,gg
s2,v1,tt
s2,v2,aa
s2,v3,aa
s2,v4,gg
s3,v1,at
s3,v3,aa
s3,v4,tt
s4,v1,tt
s4,v2,at
s4,v3,aa
s4,v4,gt
s5,v1,aa
s5,v2,tt
s5,v3,aa
s5,v4,gt
s6,v1,aa
s6,v2,aa
s6,v3,aa
s6,v4,tt
s7,v1,aa
s7,v2,aa
s7,v3,at
s7,v4,ag
s8,v1,aa
s8,v2,tt
s8,v3,at
s8,v4,ag
s9,v1,aa
s9,v2,at
s9,v3,tt
s9,v4,gg
s10,v1,aa
s10,v2,at
s10,v3,aa
s10,v4,ag
s11,v1,aa
s11,v2,aa
s11,v3,tt
s11,v4,gg
s12,v1,tt
s12,v2,tt
s12,v3,tt
s12,v4,ag
s13,v1,aa
s13,v2,at
s13,v3,aa
s13,v4,ag
s14,v1,at
s14,v2,aa
s14,v3,at
s14,v4,aa
s15,v1,aa
s15,v2,aa
s15,v3,tt
s15,v4,aa

I am only interested in variables for which the two parents of a pair do not match. If both parents have the same value, that variable is left out of the output. If one or both parents have no value for a variable, I do not want to consider that variable either.
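
The selection rule above could be sketched in awk roughly as follows. This is only a minimal sketch, not the solution posted later in this thread: the function name `informative_vars` and the idea of passing the two parent sample names as arguments are assumptions made for illustration. It reads records in the sample,var,value format and prints the variables for which both parents have a value and the values differ.

```shell
# Hypothetical helper: list the variables for which two parent samples
# both have a value and those values differ.
# Usage: informative_vars PARENT1_SAMPLE PARENT2_SAMPLE < datafile
informative_vars() {
    awk -F, -v p1="$1" -v p2="$2" '
        $1 == p1 { a[$2] = $3 }    # values of the first parent
        $1 == p2 { b[$2] = $3 }    # values of the second parent
        END {
            for (v in a)
                if ((v in b) && a[v] != b[v])
                    print v
        }
    ' | sort
}

# Example with the p1 (s1) and p2 (s2) records from the data file above
printf '%s\n' s1,v1,aa s1,v2,tt s1,v3,aa s1,v4,gg \
              s2,v1,tt s2,v2,aa s2,v3,aa s2,v4,gg |
    informative_vars s1 s2
```

For these records the sketch keeps v1 and v2, which are the two rows of the p1_p2 matrix requested below.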

What I need to do is create a new file for each set of children with the same parents, and replace each child value with a (if it matches the first parent), b (if it matches the second parent) or m (a mixture of both). If a child has no data for a variable, a hyphen (-) can be used.
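
The coding rule above can be written as a tiny decision chain. The following is a hedged sketch only; the function name `classify` and its argument order (first parent value, second parent value, child value) are assumptions for illustration and do not come from the thread. Any child value that matches neither parent is treated as a mixture.

```shell
# Hypothetical helper: code one child value against its two parent values.
# Usage: classify PARENT1_VALUE PARENT2_VALUE CHILD_VALUE
classify() {
    awk -v p1="$1" -v p2="$2" -v child="$3" 'BEGIN {
        if (child == "")      print "-"   # no data for the child
        else if (child == p1) print "a"   # matches the first parent
        else if (child == p2) print "b"   # matches the second parent
        else                  print "m"   # anything else: mixture of both
    }'
}

classify aa tt at    # prints m
classify aa tt tt    # prints b
classify aa tt ""    # prints -
```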

So my desired output is 3 files, all in matrix form.


Code:
file p1_p2

    s3  s4  s5  s6
v1  m   b   a   a
v2  -   m   a   b


file p1_p3

    s7  s8  s9  s10
v2  b   a   m   m
v3  m   m   b   a
v4  m   m   a   m


file p2_p3

    s11  s12  s13  s14
v1  b    a    b    m
v3  b    b    a    m
v4  a    m    m    b


I'm ready to answer any questions you may have. Please guide me to achieve this output.

Last edited by jalaj841; 02-01-2015 at 12:48 PM..
# 2  
Old 02-01-2015
Hi,

Here is an awk script file:
Code:
$ cat matrx.awk 
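BEGIN{FS=",";T="null";I=-1}	# T: current children group, I: last index used in E
# First file (sample sheet): remember which sample belongs to which line
FNR == NR {
	if ( $2 !~ /\// ) {
		P[$3]=$2	# parent: sample -> line (s1 -> p1)
		Q[$2]=$3	# parent: line -> sample (p1 -> s1)
	}
	else {
		C[$3]=$2	# child: sample -> parent pair (s3 -> p1/p2)
	}
	next
}
# Remaining input: parent records from the pipe, then the whole data file
{
	if ( P[$1] ) {
		V[P[$1]$1$2]=$3	# parent value, keyed by line, sample and variable
	}
	else {
		split(C[$1],A,"/")	# A[1], A[2]: the two parent lines of this child
		T == "null" ? T=C[$1] : 0	# first child record: start the first group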
		if ( V[P[Q[A[1]]]Q[A[1]]$2] != V[P[Q[A[2]]]Q[A[2]]$2] ) {
			if ( C[$1] == T ) {
				I++
			}
			else {
				J=0
				print T":"
				while (J <= I){
					split(E[J],li,":")
					L[li[1]li[2]]=li[3]
					if(V1[li[1]]!=1){
						V1[li[1]]=1
						V2[G++]=li[1]
					}
					if(M[li[2]]!=1){
						M[li[2]]=1
						B[D++]=li[2]
					}
					J++
				}
				for(Y=0;Y<D;Y++){
					K=K"\t"B[Y]
				}
				print K
				for(Z=0;Z<G;Z++){
					K=V2[Z]
					for(Y=0;Y<D;Y++){
						if(L[V2[Z]B[Y]]) {
							K=K"\t"L[V2[Z]B[Y]]
						}
						else{
							K=K"\t-"
						}
					}
					print K
				}
				I=0
				T=C[$1]
				K=""
				split("",L)
				split("",B)
				split("",V2)
				split("",V1)
				split("",M)
				G=D=0
			}
			V[P[Q[A[1]]]Q[A[1]]$2] == $3 ? X="a" : V[P[Q[A[2]]]Q[A[2]]$2] == $3 ? X="b" : X="m"
			E[I]=$2":"$1":"X
		}
	}
}
END{
	print T":"
	J=0
	while (J <= I){
		split(E[J],li,":")
		L[li[1]li[2]]=li[3]
		if(V1[li[1]]!=1){
			V1[li[1]]=1
			V2[G++]=li[1]
		}
		if(M[li[2]]!=1){
			M[li[2]]=1
			B[D++]=li[2]
		}
		J++
	}
	for(Y=0;Y<D;Y++){
		K=K"\t"B[Y]
	}
	print K
	for(Z=0;Z<G;Z++){
		K=V2[Z]
		for(Y=0;Y<D;Y++){
			if(L[V2[Z]B[Y]]) {
				K=K"\t"L[V2[Z]B[Y]]
			}
			else{
				K=K"\t-"
			}
		}
		print K
	}
}

For this code to work properly, it must be run like this:
Code:
$ awk -F,  'FNR == NR && !/\// {T[$3]=1;next};T[$1] {print}' file1 file2 | awk -f matrx.awk file1 - file2
p1/p2:
	s3	s4	s5	s6
v1	m	b	a	a
v2	-	m	a	b
p1/p3:
	s7	s8	s9	s10
v2	b	a	m	m
v3	m	m	b	a
v4	m	m	a	m
p2/p3:
	s11	s12	s13	s14
v1	b	a	b	m
v3	b	b	a	m
v4	a	m	m	b

The syntax file1 - file2 is correct: the hyphen (-) stands for standard input, that is, the output of the awk command before the pipe (|).
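
A small demonstration of the hyphen operand may help here. It is only an illustrative sketch: the function name `demo_stdin_operand` and the temporary file are assumptions, not part of the script above. awk reads its operands in order, and reads standard input at the position where the hyphen appears.

```shell
# Hypothetical demo: awk reads the file, then standard input ("-"),
# then the file again, in exactly the order given on the command line.
demo_stdin_operand() {
    tmp=$(mktemp)
    printf 'line from file\n' > "$tmp"
    # The pipe feeds standard input; "-" tells awk when to read it.
    printf 'line from pipe\n' | awk '{ print }' "$tmp" - "$tmp"
    rm -f "$tmp"
}

demo_stdin_operand
```

The piped line is printed between the two copies of the file line, which is how the parent records end up being read before the full data file in the command above.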
Regards.
# 3  
Old 02-02-2015
Wow, unbelievable. Thank you from the bottom of my heart; I never thought it would require so much time and effort.

I will do some testing and then confirm the results. Meanwhile, if you could also post a short description of the steps, it would help me, and others on this forum, to learn.

---------- Post updated at 12:06 PM ---------- Previous update was at 12:30 AM ----------

Hi disedorgue,

I found a couple of potential issues. Would you please consider these scenarios?

If I use the files

Code:
20,p5,s20
21,p5/p6,s21
22,p5/p6,s22
23,p6,s23

and

Code:
s20,v5,gg
s20,v6,tt
s21,v5,tt
s21,v6,gg
s22,v7,tt
s22,v5,gg
s23,v5,tt

I get the output
Code:
p5/p6:
        s21     s22
v5      b       a
v6      m       -

Since v6 is absent in p6, the output should contain only the v5 row:

Code:
        s21     s22
v5      b       a

I tested some more with the following set

Code:
1,p1,s1
16,p4,s16
17,p1/p4,s17
18,p1/p4,s18
19,p1/p4,s19

and

Code:
s1,v1,aa
s1,v2,tt
s1,v3,aa
s1,v4,gg
s16,v1,aa
s16,v2,aa
s16,v3,tt
s16,v4,aa
s17,v2,tt
s17,v3,tt
s17,v4,ag
s18,v2,at
s18,v3,aa
s19,v1,aa
s19,v2,aa
s19,v3,at
s19,v4,aa

It gives me

Code:
p1/p4:
        s17     s18     s19
v2      m       m       m
v3      m       m       m
v4      m       -       m


It should rather be

Code:
    s17 s18 s19
v2  a   m   b
v3  b   a   m
v4  m   -   b

# 4  
Old 02-02-2015
The first issue is expected, because the script does not check for this case. To correct it, replace:
Code:
if ( V[P[Q[A[1]]]Q[A[1]]$2] != V[P[Q[A[2]]]Q[A[2]]$2] )

with:
Code:
if ( V[P[Q[A[1]]]Q[A[1]]$2] && V[P[Q[A[2]]]Q[A[2]]$2] && V[P[Q[A[1]]]Q[A[1]]$2] != V[P[Q[A[2]]]Q[A[2]]$2] )
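
Why the two extra conditions help: in awk, an array element that was never assigned evaluates to the empty string, which counts as false. The sketch below illustrates this with the p5/p6 example from post #3. The function name `guard_demo` is hypothetical, and the keys only imitate the line, sample and variable concatenation used for V in the script.

```shell
# Hypothetical demo of the guard: a missing parent value is an empty,
# and therefore false, array element.
guard_demo() {
    awk 'BEGIN {
        V["p5s20v6"] = "tt"    # parent p5 (sample s20) has a value for v6
        # parent p6 (sample s23) has no v6 record, so V["p6s23v6"] is empty
        if (V["p5s20v6"] && V["p6s23v6"] && V["p5s20v6"] != V["p6s23v6"])
            print "keep v6"
        else
            print "skip v6"
    }'
}

guard_demo    # prints: skip v6
```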

The second issue is strange, because this case works fine on my machine. It returns:
Code:
$ awk -F,  'FNR == NR && !/\// {T[$3]=1;next};T[$1] {print}' matrx_4 matrx_5 | awk -f matrx.awk matrx_4 - matrx_5
p1/p4:
	s17	s18	s19
v2	a	m	b
v3	b	a	m
v4	m	-	b

Regards.
# 5  
Old 02-02-2015
Thank you, disedorgue. Does your code require any sorting of the first or second file?
I'm running this on a huge dataset and getting weird output for some pairs of parents.
# 6  
Old 02-02-2015
The first file is stored in memory (arrays P, Q and C in the code).
From the second file, only the parent sample lines are stored in memory (array V in the code), so these lines must be read first, before the rest of the second file.
If you run only the awk command before the pipe, you will see that it returns only the lines of the second file that belong to parents.

EDIT: the lines of each children pair in the second file must be consecutive.

Last edited by disedorgue; 02-02-2015 at 07:49 PM..
# 7  
Old 02-02-2015
Can I do this reordering with code? The data file has 800 million records, so it is impossible to do it manually. Sorry for so many questions.
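
One possible way to do the reordering in code is sketched below. It was not posted by anyone in this thread and has not been tried on a file of that size, so treat it as a starting point. The function name `group_by_line` is hypothetical. The sketch assumes that both files have a header line, as in the examples above, and that `sort` supports the stable `-s` option (GNU sort, as shipped with Cygwin, does). Each data record is prefixed with the line of its sample (p1, p1/p2, ...), the records are sorted on that prefix only, and the prefix is removed again, so all records of one children pair become consecutive.

```shell
# Hypothetical helper: reorder the data file so that all records of the
# same line (p1, p1/p2, ...) are consecutive.
# Usage: group_by_line SAMPLE_SHEET DATA_FILE
group_by_line() {
    awk -F, '
        FNR == NR    { line[$3] = $2; next }      # sample sheet: sample -> line
        FNR == 1     { next }                     # drop the data header (assumed present)
        ($1 in line) { print line[$1] "," $0 }    # prefix each record with its line
    ' "$1" "$2" |
        LC_ALL=C sort -s -t, -k1,1 |              # stable sort on the prefix only
        cut -d, -f2-                              # remove the prefix again
}

# Possible use with the pipeline from post #2 (file names as used there):
#   group_by_line file1 file2 > file2.grouped
#   awk -F, 'FNR == NR && !/\// {T[$3]=1;next};T[$1] {print}' file1 file2.grouped |
#       awk -f matrx.awk file1 - file2.grouped
```

The stable sort keeps the original order of records within each line, and sort works through temporary files when the input does not fit in memory. Records whose sample is not listed in the sample sheet are dropped by this sketch.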