Matrix parsing help !

01-03-2012

Registered User

12, 0

Join Date: Jan 2012

Last Activity: 4 January 2012, 10:40 AM EST

Posts: 12

Thanks Given: 2

Thanked 0 Times in 0 Posts

Thanks for taking the time to reply, I am very grateful.
It looks better now! But I need to drop the third line (D E): my original file is very big, so I would end up with repeated information in my output.

mchimich


01-03-2012

Registered User

2,977, 644

Join Date: Oct 2010

Last Activity: 14 September 2019, 1:15 PM EDT

Location: France

Posts: 2,977

Thanks Given: 88

Thanked 644 Times in 613 Posts

-- deleted --

Last edited by ctsgnb; 01-03-2012 at 10:45 AM..

ctsgnb


01-03-2012


This is your code:

Code:

awk 'NR>1&&$3>=80{A[$1]=$1;B[A[$1]]=(B[A[$1]]?B[A[$1]]:$1)" "$2}END{for(i in A) print B[A[i]]}' test.tttt

And this is the output:

Code:

A D E
B C
D E

--> D E is not a group of its own; it should be part of group 1 (A D E). I don't know if I'm being clear.
What I want to do afterwards is take every group's IDs and use BioPerl to look up the corresponding FASTA files in a database. So I need an output with just two lines (for this example).
Thanks

Moderator's Comments:

Please use code tags when posting data and code samples!

Last edited by vgersh99; 01-03-2012 at 10:53 AM..

mchimich


01-03-2012


You can get every distinct pair that has at least one identity of 80 or more with the following:

Code:

$ cat f2
ID1 ID2 Identity
A B 70
A C 50
A D 90
A E 80
B C 95
B D 66
B E 47
C D 35
C E 25
D E 98
A B 70
A C 50
A D 90
A E 40
$ awk 'NR>1&&$3>=80{i=$1" "$2;j=$2" "$1;t=i<j?i:j;C[t]}END{for(k in C) print k}' f2
A D
A E
B C
D E
$

NOTE: this code assumes that an A D association is the same as a D A association; the two IDs of a pair are simply printed in ascending order.
Consider the following example:

Code:

$ cat f3
A B 10
B A 80
C D 70
E D 90
D B 80
A D 10
D A 93
$ awk '$3>=80{i=$1" "$2;j=$2" "$1;t=i<j?i:j;C[t]}END{for(k in C) print k}' f3
A B
A D
B D
D E
$

---------- Post updated at 04:48 PM ---------- Previous update was at 04:24 PM ----------

You can also try the following code:

Code:

$ cat f2
ID1 ID2 Identity
A B 70
A C 50
A D 90
A E 80
B C 95
B D 66
B E 47
C D 35
C E 25
D E 98
A B 70
A C 50
A D 90
A E 40
$ awk 'NR>1&&$3>=80{x=$1" "$2;for(i in A) {if (A[i]~x) next};A[$1]=(A[$1]?A[$1]:$1)" "$2}END{for(i in A) print A[i]}' f2
A D E
B C

---------- Post updated at 05:00 PM ---------- Previous update was at 04:48 PM ----------

To prevent the same $2 from appearing more than once within a group, you can also try:

Code:

awk 'NR>1&&$3>=80{A[$1]=(A[$1]?A[$1]:$1)(A[$1]~$2?z:" "$2)}END{for(i in A) print A[i]}' yourfile

I'm not sure I understand what final result you expect.
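
If the groups are meant to be transitive (D E belongs with A D E because D and E are both linked to A), the task is a connected-components problem, and a union-find structure is one way to sketch it in awk. This is not code from the thread: the function name `group_ids`, its optional second argument, and the assumption of a header line followed by `ID1 ID2 Identity` columns are all illustrative.

```shell
# group_ids FILE [MIN]: print one line per connected group of IDs.
# Hypothetical helper; assumes a header line, whitespace-separated
# columns "ID1 ID2 Identity", and a default threshold of 80.
group_ids() {
    awk -v min="${2:-80}" '
    # find(x): return the root of the group containing x,
    # then point every ID on the way directly at that root.
    function find(x,    r, t) {
        r = x
        while (parent[r] != r) r = parent[r]
        while (parent[x] != r) { t = parent[x]; parent[x] = r; x = t }
        return r
    }
    NR > 1 && $3 + 0 >= min + 0 {
        a = $1 ""; b = $2 ""        # treat IDs strictly as strings
        # register each ID the first time it is seen
        if (!(a in parent)) { parent[a] = a; order[++n] = a }
        if (!(b in parent)) { parent[b] = b; order[++n] = b }
        # union: merge the two groups if they are still separate
        ra = find(a); rb = find(b)
        if (ra != rb) parent[rb] = ra
    }
    END {
        # collect members under their root, in first-seen order
        for (i = 1; i <= n; i++) {
            r = find(order[i])
            if (r in members) members[r] = members[r] " " order[i]
            else { roots[++m] = r; members[r] = order[i] }
        }
        for (i = 1; i <= m; i++) print members[roots[i]]
    }' "$1"
}
```

With the `f2` sample shown earlier, `group_ids f2` prints `A D E` and `B C`, and each ID appears in exactly one line. The arrays grow with the number of distinct IDs rather than the number of rows, so this should cope with large files as long as the set of IDs fits in memory.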

Last edited by ctsgnb; 01-03-2012 at 12:07 PM..

ctsgnb


01-03-2012


Thanks a lot, it works!! But when I use the code on the initial file I posted in my first message, it doesn't work ): I have never used awk before, so I must learn it. Is it possible to just replace the A in your code with the name of my first column? One more thing: can this code handle very big data, or is it only suited to this specific case?

mchimich


01-03-2012


Did you make sure you used the right threshold in your code (it depends on your input file)?
0.8 vs 80
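
One way to avoid a threshold mismatch is to pass the cutoff in with `-v` instead of hard-coding it. The sketch below is hypothetical (the name `pairs_above` and the header-line assumption are not from the thread) and reuses the pair-ordering idea from the earlier one-liner:

```shell
# pairs_above FILE MIN: print each unordered pair whose identity is >= MIN.
# Hypothetical helper; assumes a header line and columns "ID1 ID2 Identity".
pairs_above() {
    awk -v min="$2" 'NR > 1 && $3 + 0 >= min + 0 {
        # order the two IDs as strings so "D A" and "A D" share one key
        key = ($1 "" < $2 "") ? $1 " " $2 : $2 " " $1
        # print each pair only the first time it qualifies
        if (!(key in seen)) { seen[key]; print key }
    }' "$1"
}
```

`pairs_above yourfile 0.8` suits an identity column on a 0 to 1 scale, while `pairs_above yourfile 80` suits a percentage column.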

ctsgnb


01-03-2012


You are right, sir, I hadn't!! Unfortunately the output is still not right:

Code:

chromosome01_100293 chromosome01_168057 chromosome07_194379
chromosome01_29385 chromosome01_168057 chromosome07_194379 chromosome01_100293
chromosome08_116839 chromosome01_293853

---------- Post updated at 04:30 PM ---------- Previous update was at 04:26 PM ----------

For example, chromosome01_100293 appears in both line 1 and line 2.

Last edited by radoulov; 01-04-2012 at 05:18 AM..

mchimich

