Matrix to 3 col sorted

09-21-2017

Registered User

174, 45

Join Date: Oct 2014

Last Activity: 8 April 2019, 3:29 PM EDT

Posts: 174

Thanks Given: 78

Thanked 45 Times in 45 Posts

Matrix to 3 col sorted

Hello experts, I have matrices sorted by position, there are 400k rows, 3000 columns.

Code:

ID      CHR     POS     M1      M2      M3      M4      M5
ID1     1       1       4.6     2.6     2.1     3.5     4.2
ID2     1       100     3.6     2.9     3.2     2.6     2.5
ID3     1       1000    4.1     2.9     2.4     2.8     2.5
ID4     1       2000    4.2     2.4     2.8     3.7     2.8
ID5     1       3000    4.6     2.9     3.7     3.5     2.7
ID6     1       4000    5.2     3       3.2     4.2     3.7
ID7     1       5000    5.8     2.2     2.8     3.1     2.7

I`m looking for a sorted long-format output by position, grouped by markers (Ms)

Code:

M1 ID1__1__1 4.6
M1 ID2__1__100 3.6
M1 ID3__1__1000 4.1
M1 ID4__1__2000 4.2
M1 ID5__1__3000 4.6
M1 ID6__1__4000 5.2
M1 ID7__1__5000 5.8
M2 ID1__1__1 2.6
M2 ID2__1__100 2.9
M2 ID3__1__1000 2.9
M2 ID4__1__2000 2.4
M2 ID5__1__3000 2.9
M2 ID6__1__4000 3
M2 ID7__1__5000 2.2
M3 ID1__1__1 2.1
M3 ID2__1__100 3.2
M3 ID3__1__1000 2.4
M3 ID4__1__2000 2.8
M3 ID5__1__3000 3.7
M3 ID6__1__4000 3.2
M3 ID7__1__5000 2.8
M4 ID1__1__1 3.5
M4 ID2__1__100 2.6
M4 ID3__1__1000 2.8
M4 ID4__1__2000 3.7
M4 ID5__1__3000 3.5
M4 ID6__1__4000 4.2
M4 ID7__1__5000 3.1

I have two working methods

1. This step requires a final sorting which for 400000x3000 datapoints will be very row or run out of memory.

Code:

awk -F"\t" 'NR==1{split($0,lines);}NR>1{for(i=4;i<length(lines);i++) print lines[i],$1"__"$2"__"$3,$i}' tmp | sort -k1,1

2. This step is very very slow as it requires to take the entire matrix into memory before making splits.

Code:

awk -F"\t" '{a[NR]=$0}END{split(a[1],lines); for(i=4;i<=length(lines);i++) { for(j=2;j<=length(a);j++) { split(a[j],haps); print lines[i],haps[1]"__"haps[2]"__"haps[3],haps[i]}}}'  tmp

Is there a more efficient way to get the output I need? Thanks in advance.

senhia83

View Public Profile for senhia83

Find all posts by senhia83

09-21-2017

Read Only

1,278, 486

Join Date: Sep 2012

Last Activity: 27 February 2020, 8:59 PM EST

Location: Houston, Texas, USA

Posts: 1,278

Thanks Given: 0

Thanked 486 Times in 451 Posts

Code:

rm -rf output
mkdir output
awk '
NR==1 { for (i=1; i<=NF; i++) a[i]=$i ;}
NR > 1 {
   for (i=4; i<=NF; i++) print "touch output/\"" a[i], $1 "__" $2 "__" $3, $i "\"";
}
' infile | sh

ls -1 output > outfile

This User Gave Thanks to rdrtx1 For This Post:

rdrtx1

View Public Profile for rdrtx1

Find all posts by rdrtx1

09-21-2017

Registered User

5,091, 1,931

Join Date: May 2012

Last Activity: 15 July 2020, 4:46 AM EDT

Location: Simplicity

Posts: 5,091

Thanks Given: 565

Thanked 1,931 Times in 1,668 Posts

Interesting idea! Let the filesystem do a part of the sorting, and let ls sort the other parts.
More fine-tuned so ls sorts less:

Code:

outdir=./output
mkdir -p $outdir
awk '
NR==1 {
  for (i=4; i<=NF; i++) { line[i]=$i; printf "mkdir -p \"%s\"\n",$i }
  next
}
{
  for (i=4; i<=NF; i++) printf ">\"%s/%s__%s__%s %s\"\n",line[i],$1,$2,$3,$i
}
' infile > tmpfile
(
cd $outdir
sh tmpfile
for i in *
do
  ls -1 "$i" | sed "s#^#${i} #"
done
) | tee outfile

Last edited by MadeInGermany; 09-21-2017 at 02:37 PM..

This User Gave Thanks to MadeInGermany For This Post:

MadeInGermany

View Public Profile for MadeInGermany

Find all posts by MadeInGermany

09-21-2017

Registered User

15,129, 5,008

Join Date: Jul 2012

Last Activity: 4 May 2020, 4:31 PM EDT

Location: Aachen, Germany

Posts: 15,129

Thanks Given: 735

Thanked 5,008 Times in 4,483 Posts

Me too, I found this idea at least creative, but 1,2E9 directory entries, albeit in 3000 directories, might be somewhat, hmm, demanding on the file system...
On the other hand, I thought of awk distributing the columns into 3000 files, one column each, and then concatening them, but opening and appending 3000 files 400k times also would not be too fast. Right now, I'm lost...

RudiC

View Public Profile for RudiC

Find all posts by RudiC

09-22-2017

Registered User

2,977, 644

Join Date: Oct 2010

Last Activity: 14 September 2019, 1:15 PM EDT

Location: France

Posts: 2,977

Thanks Given: 88

Thanked 644 Times in 613 Posts

Whatever strategy you choose, according to your expectation, you will get an output file that is near 5 times bigger than your input file.

input file :
rows = 400k
columns = 3000
Total amount of datas = 400k x 3000 = 1,2 billions

output file :
rows = (400k -1) x ( 3000 -3 )
"-1" because of the header row
"-3" because the 3 first contains the datas you want
columns = 5 (i take into account the datas, not the formatting and "__" stuff)
Total amount of datas = 5 x (400k -1) x ( 3000 -3 ) = 5,99...billions

Indeed, since you will be repeating the datas of the first 3 column for each and every subsequent column. (so (3 + 2 ) x (3000 - 3) ) rather than having it once for all of them ( "+ 2" is because you even want to add the header and the values of the subsequent column).

So just because of your prerequisits and expectations, for sure you will have to write more data, and thus you will need the corresponding number of I/O writing, independently of the strategy you choose.

If you have a huge amount of data to write, then there is and uncompressible amount of time to do it.
Of course I/O operations are quicker to do in RAM or SSD than on a standard hard drive, but still ...

It would be cheaper to save the transposed matrix and request on it
So you will have it twice : in line and in columns ... of course it will cost you 1,2 billions of datas more.
But 1,2 billions x 2 = 2,4 billions is still more than twice smaller than 5,9 billions of datas !

Anyway, processing such an amount of data using files is inappropriate : this is what Database have been designed for...
My 2 cents ...

PS// The reality might even be worse since i didn't even take into account the size of the datas, just the number of them ...

Last edited by ctsgnb; 09-22-2017 at 12:18 PM..

This User Gave Thanks to ctsgnb For This Post:

ctsgnb

View Public Profile for ctsgnb

Find all posts by ctsgnb

09-24-2017

Registered User

503, 195

Join Date: Sep 2013

Last Activity: 22 January 2021, 1:52 PM EST

Location: France

Posts: 503

Thanks Given: 43

Thanked 195 Times in 176 Posts

Hi senhia83,
maybe a little bit faster than your second solution:

Code:

awk 'NR==1 {C=NF;Y=$0;Z=$4;next}{X[NR]=$0;print Z,$1"__"$2"__"$3,$4}END{for(i=5;i<C;++i){$0=Y;Z=$i;j=1;while(++j<=NR){$0=X[j];print Z,$1"__"$2"__"$3,$i}}}'

or:

Code:

awk 'NR==1 {C=NF;Y=$0;Z=$4;next}
{X[NR]=$0;print Z,$1"__"$2"__"$3,$4}
END{
 for(i=5;i<C;++i){
   $0=Y;Z=$i;j=1;
   while(++j<=NR){
     $0=X[j];print Z,$1"__"$2"__"$3,$i
   }
 }
}' file

Regards.

This User Gave Thanks to disedorgue For This Post:

disedorgue

View Public Profile for disedorgue

Find all posts by disedorgue

09-27-2017

Registered User

174, 45

Join Date: Oct 2014

Last Activity: 8 April 2019, 3:29 PM EDT

Posts: 174

Thanks Given: 78

Thanked 45 Times in 45 Posts

Thanks ! the filesystem ended up crashing with the files approach ..but a very innovative idea nonetheless !

senhia83

View Public Profile for senhia83

Find all posts by senhia83

Shell Programming and Scripting

Matrix to 3 col sorted

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

Modifying col values based on another col

Discussion started by: newbie83

2. Shell Programming and Scripting

Printing from col x to end of line, except last col

Discussion started by: LMHmedchem

3. Shell Programming and Scripting

awk? adjacency matrix to adjacency list / correlation matrix to list

Discussion started by: stonemonkey

4. UNIX for Advanced & Expert Users

Print line based on highest value of col (B) and repetion of values in col (A)

Discussion started by: imahmoud

5. Ubuntu

How to convert full data matrix to linearised left data matrix?

Discussion started by: evoll

6. Shell Programming and Scripting

how to add new col in a file

Discussion started by: ken002

7. Shell Programming and Scripting

i can't cut the third col

Discussion started by: maxim42

8. Shell Programming and Scripting

diagonal matrix to square matrix

Discussion started by: yifangt

9. Ubuntu

Match col 1 of File 1 with col 1 File 2 and create a 3rd file

Discussion started by: sogi

10. UNIX for Advanced & Expert Users

Help On col command

Discussion started by: rahulrathod