Sort by values in the 1st row, leaving the first four columns untouched

10-31-2017


Dear all, I would be thankful if you could help with a sort command.

My data looks like this (tab-separated; 2317 columns; about 200000 rows):

Code:

a    b    c    d    V10    V2    V8    V4    V7 
xx    z    y    1000    1    2    0    2    0
tr    v    m    1001    0    0    1    2    2
rg    s    n    1003    1    1    2    0    0

I need to sort my data so that the first four columns remain untouched and the rest of the columns are ordered by their values in the first row. The result should look like:

Code:

a    b    c    d    V2    V4    V7     V8    V10
xx    z    y    1000    2    2    0    0    1
tr    v    m    1001    0    2    2    1    0
rg    s    n    1003    1    0    0    2    1

Thank you very much for your help!

kush

10-31-2017


Looks to me like you mean 'sort by vertical column': you move each column based on the content of the first row, columns 5 - 9 (V2, V4, ...).

Correct?

jim mcnamara


11-01-2017


Are we correct in assuming that each heading on the 1st line for the last 2313 fields is the single letter V followed by a unique non-negative integer?

What output do you get from running the following three commands:

Code:

uname -a

getconf LINE_MAX

awk -F'\t' '
lNF != NF	{print "NF=" (lNF = NF), "NR=" NR}
length() > lm	{lm = length()}
END		{print "Max length=" lm}
' file

where file is the name of the file that contains your data.

Note that, by definition, a text file contains no lines that contain more than LINE_MAX (which is 2048 on most systems) bytes in a line (including the <newline> terminator) and most of the UNIX text processing utilities (like awk, sed, and sort) are only defined to work on text files. If the file containing your data has 2317 fields and LINE_MAX is 2048 on your system, the file containing your data is not a text file. Some versions of these utilities work even if the input files have line lengths longer than those required by the standards; other versions of these utilities will give you an error if they encounter long lines; and other versions will silently ignore some data if they encounter long lines. Hopefully, the awk script above will give us an indication of how your implementation of awk will behave. (We hope that it will just print two lines of output on standard output and not print any diagnostics.)
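A quick way to see whether a file stays within the limit described above is to compare its longest line with LINE_MAX. This is only a sketch: the helper name check_line_max, its optional second argument, and the sample files are made up for illustration, and it assumes an awk that can itself read the long lines.

```shell
#!/bin/sh
# check_line_max FILE [LIMIT] (hypothetical helper, not from the thread):
# succeed only if no line of FILE, counted in bytes and including its
# trailing <newline>, is longer than LIMIT (default: getconf LINE_MAX).
check_line_max() {
    limit=${2:-$(getconf LINE_MAX 2>/dev/null || echo 2048)}
    # LC_ALL=C makes length() count bytes rather than characters.
    LC_ALL=C awk -v limit="$limit" '
        length($0) + 1 > max { max = length($0) + 1; where = NR }
        END {
            printf "longest line: %d bytes (line %d), limit: %d\n", max, where, limit
            exit (max > limit ? 1 : 0)
        }' "$1"
}

# Made-up sample files: one with short lines, one with a single 5000-character line.
tmp=$(mktemp -d)
printf 'a\tb\nc\td\n' > "$tmp/short.txt"
awk 'BEGIN { while (i++ < 5000) printf "x"; print "" }' > "$tmp/long.txt"

check_line_max "$tmp/short.txt" 2048 && echo "short.txt: fits"
check_line_max "$tmp/long.txt" 2048  || echo "long.txt: has lines longer than the limit"
```

A file that fails this check may still be handled fine by a particular awk or sort; the check only tells you that the standards no longer promise it.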

Don Cragun


11-01-2017


Dear all,
thank you for the quick attempts to solve the sorting!

jim mcnamara: yes, this sorting literally means moving columns based on values in the first row.

Don Cragun: yes, the values in the heading line (on which I have to sort) each consist of the letter V followed by a unique non-negative number.
Your code gives me:

Code:

getconf LINE_MAX 
2048


NF=2317 NR=1
NF=2313 NR=2
NF=362 NR=16134
Max length=16236

Looks like it is not a trivial thing. Maybe I will have to try doing it in R.

But thank you once more!

kush

11-01-2017


Mmmm, your awk clearly is able to process longer lines than 2048, since max length is 16236.

It seems to me the difference between line 1 and line 2 is perhaps explained by the first four fields in the header? That the first field in line 2 corresponds to the 5th field in the header line?

What is strange is the sudden drop in the number of fields to 362 from line 16134 onwards.

It seems to me that not all of the lines contain the same number of TAB-separated fields?
What is happening on line 16134?
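One way to follow up on these questions is to tabulate how many lines have each field count and to look at a suspicious line with its tabs made visible. A minimal sketch; the function names and the tiny sample file are hypothetical.

```shell
#!/bin/sh
# Count how many lines have each number of tab-separated fields.
# Output: "<field count> <number of lines>", ordered by field count.
field_count_histogram() {
    awk -F'\t' '{ n[NF]++ } END { for (k in n) print k, n[k] }' "$1" | sort -n
}

# Print one line by number, showing every tab as the visible marker <TAB>.
show_line() {
    awk -v want="$2" 'NR == want { gsub(/\t/, "<TAB>"); print; exit }' "$1"
}

# Tiny made-up sample: line 3 is missing a field.
tmp=$(mktemp -d)
printf 'a\tb\tc\n1\t2\t3\n4\t5\n' > "$tmp/sample.tsv"

field_count_histogram "$tmp/sample.tsv"
show_line "$tmp/sample.tsv" 3
```

On the real data, something like show_line file 16134 (and the line before it) should show whether that line is truncated, uses a different separator, or is a legitimate record with fewer fields.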

Scrutinizer


11-01-2017


Apart from solving the line-length problems above, here is something to start with if the problem doesn't hit system limits:

Code:

awk -F"\t" '
NR == 1         {printf "%s", substr ($0, 1, index($0, $5)-1)
                 for (i=5; i<=NF; i++) ORG[$i] = i
                 OFS = "\n"
                 sub ($1 FS $2 FS $3 FS $4 FS, "")
                 $1 = $1
                 CNT = 5
                 OFS = FS
                 while (1 == ("echo \"" $0 "\" | sort -k1.2n") | getline X)     {COL[CNT++] = X
                                                                                 HD = HD DL X
                                                                                 DL = FS
                                                                                }
                 print HD
                 next
                }
                {for (i=1; i<= 4; i++) printf "%s%c", $i, FS
                 for (i=5; i<=NF; i++) printf "%s%c", $(ORG[COL[i]]), i==NF?ORS:OFS
                }
' file
a	b	c	d	V2	V4	V7 	V8	V10
xx	z	y	1000	2	2	0	0	1
tr	v	m	1001	0	2	2	1	0
rg	s	n	1003	1	0	0	2	1

Most of the processing for the first line is for sorting the columns; my awk doesn't have a built-in sorting function, unfortunately.
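If calling an external sort from inside awk is not wanted, the column order can also be worked out in awk itself with a small insertion sort on the header. This is a sketch under assumptions, not a replacement for the script above: the function name reorder_columns is made up, it assumes every heading after the kept columns is one letter followed by a number, and it assumes every data line has as many fields as the header (which the diagnostic output earlier in the thread suggests is not true for the real file).

```shell
#!/bin/sh
# reorder_columns FILE [KEEP] (hypothetical helper): leave the first KEEP
# columns (default 4, must be at least 1) alone and order the remaining
# columns by the number that follows the leading letter in the header line.
reorder_columns() {
    awk -F'\t' -v keep="${2:-4}" '
    BEGIN { OFS = "\t" }
    NR == 1 {
        n = 0
        for (i = keep + 1; i <= NF; i++) {
            key = substr($i, 2) + 0            # "V10" -> 10
            # Insertion sort: val[] holds the keys in ascending order,
            # pos[] the original column number belonging to each key.
            for (j = n; j >= 1 && val[j] > key; j--) {
                val[j + 1] = val[j]; pos[j + 1] = pos[j]
            }
            val[j + 1] = key; pos[j + 1] = i
            n++
        }
    }
    {   # Applies to the header line as well as to the data lines.
        line = $1
        for (i = 2; i <= keep; i++) line = line OFS $i
        for (j = 1; j <= n; j++)    line = line OFS $(pos[j])
        print line
    }' "$1"
}

# The sample from the first post.
tmp=$(mktemp -d)
{
    printf 'a\tb\tc\td\tV10\tV2\tV8\tV4\tV7\n'
    printf 'xx\tz\ty\t1000\t1\t2\t0\t2\t0\n'
    printf 'tr\tv\tm\t1001\t0\t0\t1\t2\t2\n'
    printf 'rg\ts\tn\t1003\t1\t1\t2\t0\t0\n'
} > "$tmp/data1"

reorder_columns "$tmp/data1"
```

The file is read once and only the header is sorted, so memory use does not grow with the number of rows.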

RudiC

11-01-2017


Hi.

This demonstration code:

Code:

#!/usr/bin/env bash

# @(#) s1       Demonstrate separate, transpose, sort, transpose, combine file matrix.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C cut transpose.pl sort paste pass-fail

FILE=${1-data1}
N=${FILE//[A-Za-z]/}
E=expected-output$N

pl " Input data file $FILE:"
head $FILE

pl " Expected output:"
head $E

pl " Prepare input, split and save first 4 columns, remainder:"
cut -f1-4 $FILE > first-four
cut -f5- $FILE |
tee remainder

pl " Results, transpose, sort:"
transpose.pl remainder |
tee t2 |
sort -k1.2,1n |
tee t3

pl " Results, re-transpose, paste:"
transpose.pl t3 > sorted-remainder
paste first-four sorted-remainder | tee f1

pl " Verify results if possible:"
C=$HOME/bin/pass-fail
[ -f $C ] && $C f1 "$E" || ( pe; pe " Results cannot be verified." ) >&2

exit

produces:

Code:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.9 (jessie) 
bash GNU bash 4.3.30
cut (GNU coreutils) 8.23
transpose.pl - ( local: RepRev 1.1, ~/bin/transpose.pl, 2017-01-29 )
sort (GNU coreutils) 8.23
paste (GNU coreutils) 8.23
pass-fail (local) 1.10

-----
 Input data file data1:
a       b       c       d       V10     V2      V8      V4      V7
xx      z       y       1000    1       2       0       2       0
tr      v       m       1001    0       0       1       2       2
rg      s       n       1003    1       1       2       0       0

-----
 Expected output:
a       b       c       d       V2      V4      V7      V8      V10
xx      z       y       1000    2       2       0       0       1
tr      v       m       1001    0       2       2       1       0
rg      s       n       1003    1       0       0       2       1

-----
 Prepare input, split and save first 4 columns, remainder:
V10     V2      V8      V4      V7
1       2       0       2       0
0       0       1       2       2
1       1       2       0       0

-----
 Results, transpose, sort:
V2      2       0       1
V4      2       2       0
V7      0       2       0
V8      0       1       2
V10     1       0       1

-----
 Results, re-transpose, paste:
a       b       c       d       V2      V4      V7      V8      V10
xx      z       y       1000    2       2       0       0       1
tr      v       m       1001    0       2       2       1       0
rg      s       n       1003    1       0       0       2       1

-----
 Verify results if possible:

-----
 Comparison of 4 created lines with 4 lines of desired results:
 Succeeded -- files (computed) f1 and (standard) expected-output1 have same content.

For some other tools that can transpose, see:

Code:

Transpose
        1) rs, reshape a data array

        2) transpose.pl
           http://www1.cuni.cz/~obo/textutils/

        3) transpose, sourceforge c
           https://sourceforge.net/projects/transpose/

        4) pspp
           'FLIP' transposes rows and columns in the active dataset.

        5) datamash
           transpose   transpose rows, columns of the input file

        *) awk, perl, python, c, R, sc, and so on:
           http://stackoverflow.com/questions/1729824/an-efficient-way-to-transpose-a-file-in-bash
           http://stackoverflow.com/questions/25331830/how-do-i-efficiently-transpose-a-matrix-in-r
           https://www.unix.com/unix-for-beginners-questions-and-answers/270683-transpose-large-data-unix.html et al

I also tried the solution with item 3 above, and it worked.

I tried an alternate file with fewer columns, and it seemed to work.
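For readers without transpose.pl, the same split, transpose, sort, re-transpose, paste sequence can be sketched with a small awk transposer standing in for it. The function names here are hypothetical, and the awk transposer keeps the whole table in memory, so for a file of the size described in this thread one of the dedicated tools listed above is probably the better choice.

```shell
#!/bin/sh
# transpose_tsv FILE (hypothetical stand-in for transpose.pl): swap rows
# and columns of a tab-separated table. Holds the whole table in memory.
transpose_tsv() {
    awk -F'\t' '
    { for (i = 1; i <= NF; i++) cell[NR, i] = $i; if (NF > nf) nf = NF }
    END {
        for (i = 1; i <= nf; i++) {
            row = cell[1, i]
            for (r = 2; r <= NR; r++) row = row "\t" cell[r, i]
            print row
        }
    }' "$1"
}

# sort_columns_by_header FILE: keep columns 1-4 in place and order the rest
# by the number after the first character of each heading.
sort_columns_by_header() {
    work=$(mktemp -d) || return 1
    cut -f1-4 "$1" > "$work/first-four"
    cut -f5-  "$1" > "$work/remainder"
    # After transposing, each former column is a line that starts with its heading.
    transpose_tsv "$work/remainder" |
        sort -t "$(printf '\t')" -k1.2,1n > "$work/sorted"
    transpose_tsv "$work/sorted" | paste "$work/first-four" -
    rm -rf "$work"
}

# The sample from the first post.
tmp=$(mktemp -d)
{
    printf 'a\tb\tc\td\tV10\tV2\tV8\tV4\tV7\n'
    printf 'xx\tz\ty\t1000\t1\t2\t0\t2\t0\n'
    printf 'tr\tv\tm\t1001\t0\t0\t1\t2\t2\n'
    printf 'rg\ts\tn\t1003\t1\t1\t2\t0\t0\n'
} > "$tmp/data1"

sort_columns_by_header "$tmp/data1"
```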

Some details on transpose.pl and transpose:

Code:

transpose.pl    Swap rows and columns in the given tab-delimited table (MR). (what)
Path    : ~/bin/transpose.pl
Version : - ( local: RepRev 1.1, ~/bin/transpose.pl, 2017-01-29 )
Length  : 28 lines
Type    : Perl script, ASCII text executable
Shebang : #!/usr/bin/perl
Home    : http://www1.cuni.cz/~obo/textutils/ (doc)

transpose       Reshapes delimited text data (help)
Path    : ~/executable/transpose
Version : - ( local: ~/executable/transpose, 2017-01-29 )
Type    : ELF64-bitLSBexecutable,x86-64,version1(SYSV ...)
Home    : https://sourceforge.net/projects/transpose/ (doc)

Best wishes ... cheers, drl

drl
