Matrix parsing help !


 
# 15  
Old 01-03-2012
Which code did you use?

Could you please repost an example input file, together with an example of the output file you expect for it?

Should we assume that a link between 2 chromosomes has no "order" (so that A-D can be treated the same as D-A)?
(Or should it be treated like a vector, so that A-D vs D-A does matter?)

Last edited by ctsgnb; 01-03-2012 at 01:04 PM..
# 16  
Old 01-03-2012
OK sir ctsgnb! The input is (just the beginning, because the original file contains more than 100,000 lines!):
Code:
chromosome07_194379   chromosome01_168057       0.975
chromosome01_100293   chromosome01_168057       0.969
chromosome01_100293   chromosome07_194379       0.969
chromosome01_29385    chromosome01_168057       0.856
chromosome01_29385    chromosome07_194379       0.856
chromosome01_29385    chromosome01_100293       0.861
chromosome08_116839   chromosome01_168057       0.78
chromosome08_116839   chromosome01_100293       0.786
chromosome08_116839   chromosome01_293853       0.946

and the output file must look like this:

Code:
chromosome07_194379 chromosome01_168057 chromosome01_100293 chromosome01_29385 chromosome08_116839 chromosome01_293853

This is one group, even if the IDs in bold characters don't share more than 80% identity with each other.
A very simple case is an A--B--C association: A and C don't share enough identity to be grouped on their own, but because both are linked to B the three form one continuous group. I don't know if I'm clear, ctsgnb.
Thanks again for your help
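
The A--B--C case described above is what graph terminology calls a connected component: IDs end up together when a chain of sufficiently strong links joins them, even if the two ends are not linked directly. A minimal Python sketch using a union-find structure is shown here. The function name `group_ids`, the 0.8 default threshold (taken from the "80%" mentioned in the post), and the choice to leave out IDs that only ever appear in below-threshold pairs are assumptions made for illustration, not requirements confirmed in the thread.

```python
def group_ids(lines, threshold=0.8):
    """Group IDs joined, directly or through a chain, by scores >= threshold."""
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        a, b, score = fields[0], fields[1], float(fields[2])
        if score < threshold:
            continue  # assumption: weak pairs never create a link
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    # Collect members per root, in the order the IDs were first seen.
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), []).append(x)
    return list(groups.values())


pairs = ["A B 0.90", "B C 0.85", "A C 0.50"]
for group in group_ids(pairs):
    print(" ".join(group))  # prints: A B C
```

Here A and C score only 0.50 against each other, yet they land on the same line because each is linked to B.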

Last edited by vgersh99; 01-03-2012 at 01:41 PM.. Reason: fixed code tags
# 17  
Old 01-03-2012
... so now the output must be only 1 line?

Can a D-A sequence be considered the same as an A-D sequence, or does the order matter?

Does the order in which the lines appear in the input file matter?

Or can we re-sort them?

Can you write us some pseudo-code, just to clarify what defines the "grouping" and the logic behind it?

Last edited by ctsgnb; 01-03-2012 at 01:45 PM..
# 18  
Old 01-03-2012
I'm sorry, I think that sometimes I'm not very clear!
What I want is to group together chromosome sequences that are very close, based on their sequence identity. The number of output lines will depend on the number of groups that the code finds. Did you understand me or not?
In the last example the output must be one line.
# 19  
Old 01-03-2012
Let's say that you have

Code:
  E-F
  | |
A-B-C-D
  |
  G

Assume that every horizontal link (X-Y) and every vertical link (X above Y, joined by |) means that the two match at 80% or more.

What should be considered a group?

---------- Post updated at 07:05 PM ---------- Previous update was at 07:02 PM ----------

How should that group be written: on one line?

A B C D E F G

Or on 4 lines:
B C F E
A B
B G
C D

Or in some other way?
# 20  
Old 01-03-2012
Exactly: A B C D E F G are one group, and the group should be written on one line.
Thanks sir
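
The answer above amounts to: everything reachable from any node through the drawn links belongs to one group. As a rough check, the links of the diagram in post #19 can be walked with a breadth-first search. In this sketch the edge list is read off the diagram by hand (E sits above B, F above C), and the names `EDGES` and `component` are hypothetical, chosen only for this example.

```python
from collections import defaultdict, deque

# Links read off the diagram in post #19; each is assumed to be >= 80%.
EDGES = [
    ("A", "B"), ("B", "C"), ("C", "D"),
    ("B", "E"), ("E", "F"), ("F", "C"),
    ("B", "G"),
]


def component(start, edges):
    """Return, sorted, every node reachable from `start` over undirected links."""
    neighbours = defaultdict(set)
    for x, y in edges:
        neighbours[x].add(y)
        neighbours[y].add(x)  # X-Y is treated the same as Y-X

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return sorted(seen)


print(" ".join(component("A", EDGES)))  # prints: A B C D E F G
```

Starting from any other letter gives the same line, which is what "one group, written on one line" requires.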
# 21  
Old 01-04-2012
1) Under which condition should the algorithm start building another group? (As soon as we meet an X-Y link that is below the threshold? Something else?)

2) Does the order matter inside a line?
(In other words: is it correct to assume that X-Y can be treated the same way as Y-X?)

3) Does the order of the lines matter? (I think it does, in order to preserve the chaining of pairs... is that correct?)

---------- Post updated at 10:09 AM ---------- Previous update was at 09:54 AM ----------

Let's start a "kind of" pseudo-code:

Let's say we are going to build some groups:
G[1]
G[2]
...

Let's start with G[1].
While scanning your input file line by line:
  if G[1] is empty, then set G[1]=$1" "$2
  if G[1] is not empty, check the scanned line:
    if $1 is in G[1] and $2 is not: add $2 to that group
    if $2 is in G[1] and $1 is not: add $1 to that group
    if both are in it: ignore it and process the next line (or should we consider it a break in the sequence, so that we start a new group?)
    if neither is in it: build the next group: G[++c]=$1 FS $2

Is that algorithm correct?

If so, the following input:

A D 90
E D 90
C F 90
D C 90

would generate 2 groups:

A D E
C F D

And not

A D E C F

So before coding, you must think about what logic and what condition should apply for breaking the sequence and/or switching to a new group.

Thanks in advance for clarifying your requirements first.
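
To make the behaviour of the pseudo-code above concrete, here is a small Python transcription of it. It rests on one reading of the pseudo-code that is an assumption: each line is compared only against the most recently created group, never against older ones. The function name `greedy_groups` is hypothetical.

```python
def greedy_groups(lines):
    """Literal reading of the pseudo-code: only the current group is checked."""
    groups = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        a, b = fields[0], fields[1]

        if not groups:
            groups.append([a, b])  # first line seeds the first group
            continue

        current = groups[-1]  # assumption: "the group" means the latest one
        if a in current and b not in current:
            current.append(b)
        elif b in current and a not in current:
            current.append(a)
        elif a not in current and b not in current:
            groups.append([a, b])  # neither is known: start the next group
        # both already present: ignore the line

    return groups


sample = ["A D 90", "E D 90", "C F 90", "D C 90"]
for group in greedy_groups(sample):
    print(" ".join(group))
# prints:
# A D E
# C F D
```

The last line, `D C 90`, links the two groups, but this single pass never goes back to merge them. That is why the result is two lines rather than `A D E C F`, and why the rule for breaking or merging groups has to be settled before coding.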