Matrix parsing help !

01-03-2012

Which code did you use?

Could you please repost an example of the input file, as well as an example of the corresponding output file you expect?

Should we assume that a link between two chromosomes has no "order" (A-D could be treated the same as D-A)?
(Or should it be treated like a vector, so that A-D versus D-A does matter?)

Last edited by ctsgnb; 01-03-2012 at 01:04 PM..

ctsgnb

01-03-2012

OK, sir ctsgnb! Here is the input (just the beginning, because the original file contains more than 100,000 lines):

Code:

chromosome07_194379   chromosome01_168057       0.975
chromosome01_100293   chromosome01_168057       0.969
chromosome01_100293   chromosome07_194379       0.969
chromosome01_29385    chromosome01_168057       0.856
chromosome01_29385    chromosome07_194379       0.856
chromosome01_29385    chromosome01_100293       0.861
chromosome08_116839   chromosome01_168057       0.78
chromosome08_116839   chromosome01_100293       0.786
chromosome08_116839   chromosome01_293853       0.946

and the output file must look like this:

Code:

chromosome07_194379 chromosome01_168057 chromosome01_100293 chromosome01_29385 chromosome08_116839 chromosome01_293853

This is one group, even if the IDs in bold characters don't share more than 80% identity.
A very simple case is an A--B--C association: A and C don't share enough identity to be grouped together directly, but A, B and C still form one continuous group. I don't know if I'm being clear, ctsgnb.
Thanks again for your help.
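The A--B--C case can be sketched in Python with a small union-find structure. This is only an illustration under assumptions that are not confirmed in the thread: the cut-off is taken to be 0.80, links below it are skipped, and `find` and `group_links` are hypothetical names.

```python
THRESHOLD = 0.80  # assumed cut-off, taken from the "80%" mentioned in the thread


def find(parent, x):
    """Return the representative of x's group, shortening the path on the way."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def group_links(lines, threshold=THRESHOLD):
    """Group IDs joined, directly or through a chain, by links >= threshold."""
    parent = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        a, b, score = fields[0], fields[1], float(fields[2])
        ra, rb = find(parent, a), find(parent, b)  # also registers both IDs
        if score >= threshold and ra != rb:
            parent[rb] = ra  # merge the two groups
    groups = {}
    for name in parent:  # dicts keep insertion order
        groups.setdefault(find(parent, name), []).append(name)
    return list(groups.values())


# A-B and B-C pass the cut-off, A-C does not, yet all three end up together.
links = ["A B 0.95", "B C 0.90", "A C 0.60"]
for members in group_links(links):
    print(" ".join(members))  # prints: A B C
```

In this sketch an ID that only appears in below-threshold links ends up alone on its own line; whether that is wanted depends on the requirements.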

Last edited by vgersh99; 01-03-2012 at 01:41 PM.. Reason: fixed code tags

mchimich

01-03-2012

... so now the output must be only one line?

Could a D-A pair be considered the same as an A-D pair, or does the order matter?

Does the order in which the lines appear in the input file matter?

Or could we re-sort the lines?

Can you write us some pseudo-code, just to clarify what defines the "grouping" and the logic behind it?

Last edited by ctsgnb; 01-03-2012 at 01:45 PM..

ctsgnb

01-03-2012

I'm sorry, I think that sometimes I'm not very clear!
What I want is to group together chromosome sequences that are very close, based on their sequence identity. The number of output lines will depend on the number of groups that the code finds. Did you understand me or not?
In the last example the output must be one line.

mchimich

01-03-2012

Let's say that you have

Code:

  E-F
  | |
A-B-C-D
  |
  G

Assume that every horizontal link
X-Y
and every vertical link
X
|
Y
means that the two sequences match at 80% or more.

What should be considered a group?

---------- Post updated at 07:05 PM ---------- Previous update was at 07:02 PM ----------

How should that group be written: on one line?

A B C D E F G ?

Or on 4 lines:
B C F E
A B
B G
C D

Or in another way?

ctsgnb

01-03-2012

Exactly, A B C D E F G are one group, and the group should be written on one line.
Thanks, sir.
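Read this way, a group is a connected component of the link graph. Here is a rough Python sketch using a breadth-first walk over the A to G diagram; the edge list is read off that diagram, every link is assumed to be at or above 80%, and `connected_groups` is a hypothetical name.

```python
from collections import deque

# Links read off the diagram; each one is assumed to be >= 80%.
EDGES = [("A", "B"), ("B", "C"), ("C", "D"),
         ("E", "F"), ("E", "B"), ("F", "C"), ("B", "G")]


def connected_groups(edges):
    """Return one sorted list of IDs per connected group."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)  # links are treated as unordered
    seen, groups = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue  # already placed in an earlier group
        seen.add(start)
        queue, members = deque([start]), []
        while queue:
            node = queue.popleft()
            members.append(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
        groups.append(sorted(members))
    return groups


for members in connected_groups(EDGES):
    print(" ".join(members))  # prints: A B C D E F G
```

Members are sorted alphabetically here only to make the output predictable; the thread does not say which order is wanted within a line.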

mchimich

01-04-2012

1) Under which condition should the algorithm switch to building another group? (As soon as we meet an X-Y link that is below the threshold? Something else?)

2) Does the order matter inside a line?
(In other words: is it correct to assume that X-Y can be treated the same way as Y-X?)

3) Does the order matter between lines? (I think it does, in order to preserve the chaining of pairs... is that correct?)
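If the answer to question 2 turns out to be yes, each pair can be put into a fixed order before any grouping, so that X-Y and Y-X become the same key. A small Python sketch follows; the name `normalise`, the sample rows, and the choice to keep the highest score for a repeated pair are all illustrative assumptions.

```python
def normalise(a, b):
    """Return the pair in a fixed order so that (D, A) and (A, D) compare equal."""
    return (a, b) if a <= b else (b, a)


# Made-up rows: the first two describe the same link in opposite directions.
rows = [("A", "D", 0.90), ("D", "A", 0.90), ("E", "D", 0.85)]

unique_links = {}
for a, b, score in rows:
    key = normalise(a, b)
    # Keep the best score seen for a pair (an arbitrary choice for this sketch).
    unique_links[key] = max(score, unique_links.get(key, 0.0))

for (a, b), score in unique_links.items():
    print(a, b, score)
```

If the direction does matter, this step would have to be dropped and the pairs kept as they appear in the file.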

---------- Post updated at 10:09 AM ---------- Previous update was at 09:54 AM ----------

Let's start with a "kind of" pseudo-code:

Let's say we are going to build some groups:
G[1]
G[2]
...

Let's start with G[1] as the current group G[c] (so c=1).
While scanning your input file line by line:
  if G[c] is empty, then set G[c]=$1" "$2
  if G[c] is not empty, check the scanned line:
    if $1 is in G[c] and $2 is not: add $2 to that group
    if $2 is in G[c] and $1 is not: add $1 to that group
    if both are in it: ignore the line and process the next one (or should we consider it as breaking the sequence, so that we start a new group?)
    if neither is in it: start the next group: G[++c]=$1 FS $2

Is that algorithm correct?

If so, the following input:

A D 90
E D 90
C F 90
D C 90

would generate 2 groups:

A D E
C F D

and not

A D E C F

So before coding, you must think about what logic and what condition should apply for breaking the sequence and/or switching to a new group.

Thanks in advance for clarifying your requirements first.
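To make the consequence concrete, here is a Python sketch of the line-by-line rules, read as acting only on the most recently started group (which is what the A D E / C F D result implies). The function name `scan_groups` is hypothetical, and the third column is left out because every score in the example passes the threshold.

```python
def scan_groups(pairs):
    """Apply the line-by-line rules to the most recently started group only."""
    groups = []
    for a, b in pairs:
        if not groups:
            groups.append([a, b])  # first line starts the first group
            continue
        current = groups[-1]
        if a in current and b in current:
            continue  # both already present: ignore the line
        if a in current:
            current.append(b)
        elif b in current:
            current.append(a)
        else:
            groups.append([a, b])  # neither present: start the next group
    return groups


original = [("A", "D"), ("E", "D"), ("C", "F"), ("D", "C")]
reordered = [("A", "D"), ("D", "C"), ("E", "D"), ("C", "F")]

print(scan_groups(original))   # [['A', 'D', 'E'], ['C', 'F', 'D']]
print(scan_groups(reordered))  # [['A', 'D', 'C', 'E', 'F']]
```

The same four links give two groups or one depending only on line order, so if the grouping must not depend on how the file is sorted, a single pass of this kind is not enough and groups would also have to be merged when a later line connects them.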

ctsgnb
