How to print median values of matrix -awk?


# 1  
I use the script below to print the mean of each column per group (the column sum divided by the number of rows). How could I extend it to print medians instead? Thanks.

Code:
name	s1	s2	s3	s4
g1	2	8	6	5
g1	5	7	9	9
g1	6	7	8	9
g2	8	8	8	8
g2	7	7	7	7
g2	10	10	10	10
g3	3	12	1	24
g3	5	5	24	48
g3	12	3	12	12
g3	2	3	3	3



Desired output:
Code:
name	s1	s2	s3	s4
g1	5	7	8	9
g2	7	7	7	7
g3	4	4	7.5	18


Script (mean):

Code:
NR==1 {
    print
    next
}
    # print average of each column per group (value of $1)
    #  then, reset column sums and number of lines
function print_sum() {
    printf "%s", prev
    # needs GNU awk, for length of array
    for (i=2; i < length(sum) + 2; i++) {
            printf "%s%s", FS, sum[i]/nlines
            sum[i] = 0
    }
    printf ORS
    nlines = 0
}
    # print average when $1 changes, but not the first time
    # also, on end of script
NR>2 && prev!=$1 { print_sum() }
END              { print_sum() }
    # for every line with the same $1, sum column values, increment number of lines
{
    prev=$1
    nlines++
    for (i=2; i <= NF; i++) {
            sum[i]+=$i
    }
}
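
One way the mean script could be extended to medians is sketched here. It assumes that the rows of each group are contiguous and that the input is tab-separated, as in the sample. Instead of a running sum it keeps every value of the current group, then sorts each column when the group ends. The names `group_median`, `median`, `flush_group` and the file name `data1.tsv` are placeholders for this illustration, not part of the original script.

```shell
# Sample input from the question, tab-separated (file name is hypothetical).
printf 'name\ts1\ts2\ts3\ts4\n' > data1.tsv
printf 'g1\t2\t8\t6\t5\ng1\t5\t7\t9\t9\ng1\t6\t7\t8\t9\n' >> data1.tsv
printf 'g2\t8\t8\t8\t8\ng2\t7\t7\t7\t7\ng2\t10\t10\t10\t10\n' >> data1.tsv
printf 'g3\t3\t12\t1\t24\ng3\t5\t5\t24\t48\ng3\t12\t3\t12\t12\ng3\t2\t3\t3\t3\n' >> data1.tsv

group_median() {
    awk -F'\t' -v OFS='\t' '
    # Median of column c over the n rows stored in val[1..n, c].
    function median(c, n,    i, j, t, a) {
        for (i = 1; i <= n; i++) a[i] = val[i, c] + 0
        # insertion sort: fine for small groups, portable to any POSIX awk
        for (i = 2; i <= n; i++) {
            t = a[i]
            for (j = i - 1; j >= 1 && a[j] > t; j--) a[j + 1] = a[j]
            a[j + 1] = t
        }
        return (n % 2) ? a[(n + 1) / 2] : (a[n / 2] + a[n / 2 + 1]) / 2
    }
    # Print one output row for the group just finished, then reset the counter.
    function flush_group(    c) {
        printf "%s", prev
        for (c = 2; c <= ncol; c++) printf "%s%s", OFS, median(c, nlines)
        printf "%s", ORS
        nlines = 0
    }
    NR == 1 { print; next }
    NR > 2 && $1 != prev { flush_group() }
    {
        prev = $1; ncol = NF; nlines++
        for (c = 2; c <= NF; c++) val[nlines, c] = $c
    }
    END { if (nlines) flush_group() }
    ' "$1"
}

group_median data1.tsv
```

On the sample data this prints 5 7 8 9 for g1, 8 8 8 8 for g2 and 4 4 7.5 18 for g3. No external `sort` or temporary file is involved, at the cost of holding one group in memory at a time.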

# 2  
Hi.

The datamash utility makes the median and other statistical calculations fairly easy. Aside from the scaffolding code, the operative line is the datamash call:
Code:
#!/usr/bin/env bash

# @(#) s1       Demonstrate statistical calculations, median, datamash.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C datamash

FILE=${1-data1}
E=expected-output.txt

pl " Input data file $FILE:"
cat $FILE

pl " Expected output:"
cat $E

pl " Results (adjusted for visual with code align):"
datamash -H -g1 median 2 median 3 median 4 median 5 < $FILE |
align | 
tee f1

pl " Verify results if possible:"
C=$HOME/bin/pass-fail
[ -f $C ] && $C || ( pe; pe " Results cannot be verified." ) >&2

pl " Some details for datamash:"
dixf datamash

exit $?

producing:
Code:
$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.7 (jessie) 
bash GNU bash 4.3.30
datamash (GNU datamash) 1.0.6

-----
 Input data file data1:
name    s1      s2      s3      s4
g1      2       8       6       5
g1      5       7       9       9
g1      6       7       8       9
g2      8       8       8       8
g2      7       7       7       7
g2      10      10      10      10
g3      3       12      1       24
g3      5       5       24      48
g3      12      3       12      12
g3      2       3       3       3

-----
 Expected output:
name    s1      s2      s3      s4
g1      5       7       8       9
g2      7       7       7       7
g3      4       4       7.5     18

-----
 Results:
GroupBy(name)   median(s1)      median(s2)      median(s3)      median(s4)
g1              5               7               8               9
g2              8               8               8               8
g3              4               4               7.5             18

-----
 Verify results if possible:

-----
 Comparison of 4 created lines with 4 lines of desired results:
f1 expected-output.txt differ: char 1, line 1
 Failed -- files f1 and expected-output.txt not identical -- detailed comparison follows.
1c1
< name  s1      s2      s3      s4
---
> GroupBy(name) median(s1)      median(s2)      median(s3)      median(s4)
3c3
< g2    7       7       7       7
---
> g2            8               8               8               8

 Results cannot be verified.

-----
 Some details for datamash:
datamash        command-line calculations (man)
Path    : /usr/bin/datamash
Version : 1.0.6
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Help    : probably available with -h,--help
Repo    : Debian 8.7 (jessie) 
Home    : https://savannah.gnu.org/projects/datamash/ (pm)

The results disagree with the expected output on the header line and on group g2. I would tend to trust datamash, but you can redo the calculations to verify your answer. I ran it again after sorting the file, and again after interchanging the lines for g2, and got the same result both times.
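
The g2 disagreement can be checked by hand. For an odd number of values the median is the middle value of the sorted list, and for an even number it is the mean of the two middle values. A minimal sketch using only `sort` and `awk` follows. The helper name `median_of` is made up for this illustration.

```shell
# median_of: read one number per line on stdin, print the median.
median_of() {
    sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit
            if (NR % 2) print v[(NR + 1) / 2]
            else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}

# s1 values of group g2 from the sample data: sorted they are 7 8 10
printf '%s\n' 8 7 10 | median_of      # prints 8

# s1 values of group g3: sorted they are 2 3 5 12, so (3 + 5) / 2
printf '%s\n' 3 5 12 2 | median_of    # prints 4
```

By this definition the 8 that datamash reports for g2 is the median of 8, 7, 10. The 7 in the expected output is the smallest of the three values.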

Best wishes ... cheers, drl
# 3  
Try this for medians:
Code:
awk -F"\t" '
NR == 1
NR > 1          {for (i=2; i<=NF; i++) print $1, i, $i | "sort -k1,2 -k3bn > TMP"
                }

function PRMED()        {printf TFS "%s", MEDIAN
                         TFS = OFS
                         PRV2 = $2
                         CNT = 0
                        }

END             {while (1 == getline < "TMP")   {if ($1 != PRV1)        {PRMED()
                                                                         printf TRS "%s", $1
                                                                         TRS = ORS
                                                                         PRV1 = $1
                                                                        }
                                                 if ($2 != PRV2)        {PRMED()
                                                                        }
                                                 M[++CNT] = $3
                                                 CH       = int (CNT / 2)
                                                 MEDIAN   = CNT%2?M[CH+1]:(M[CH]+M[CH+1])/2
                                                }
                }
END             {PRMED()
                 printf ORS
                }
' OFS="\t" file
name 	s1	s2	s3	s4
g1	5	7	8	9
g2	8	8	8	8
g3	4	4	7.5	18

This User Gave Thanks to RudiC For This Post:
# 4  
I am afraid there seems to be a small bug somewhere. Sometimes I get different outputs from the same input, and sometimes just the header.
# 5  
I'm afraid I can't help without sample data that reproduces the error. Different outputs from identical input are highly improbable, by the way ...
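
A possible explanation for the intermittent results, offered as an assumption and not as a diagnosis confirmed in the thread: the script in post #3 pipes rows to `sort ... > TMP` and then reads TMP in the END block without closing the pipe first. `sort` cannot write anything until its input is closed, so the END block may find TMP empty (only the header is printed) or find the contents left by an earlier run. The sketch below keeps the same sort-then-read idea but calls `close()` before reading. The function name `group_median_sorted`, the `mktemp` temporary file and the file name `data1.tsv` are choices made for this example, and group names are assumed to contain no blanks.

```shell
# Sample input from the question, tab-separated (file name is hypothetical).
printf 'name\ts1\ts2\ts3\ts4\n' > data1.tsv
printf 'g1\t2\t8\t6\t5\ng1\t5\t7\t9\t9\ng1\t6\t7\t8\t9\n' >> data1.tsv
printf 'g2\t8\t8\t8\t8\ng2\t7\t7\t7\t7\ng2\t10\t10\t10\t10\n' >> data1.tsv
printf 'g3\t3\t12\t1\t24\ng3\t5\t5\t24\t48\ng3\t12\t3\t12\t12\ng3\t2\t3\t3\t3\n' >> data1.tsv

group_median_sorted() {
    tmp=$(mktemp) || return 1
    awk -F'\t' -v OFS='\t' -v tmp="$tmp" '
    # Emit the median of the values collected so far for one (group, column).
    function emit() {
        h = int(n / 2)
        printf "%s%s", OFS, (n % 2) ? m[h + 1] : (m[h] + m[h + 1]) / 2
        n = 0
    }
    # Sort by group name, then column index, then value (both numeric).
    BEGIN   { cmd = "sort -k1,1 -k2,2n -k3,3n > \"" tmp "\"" }
    NR == 1 { print; next }
            { for (i = 2; i <= NF; i++) print $1, i, $i | cmd }
    END {
        close(cmd)     # wait until sort has finished writing the temp file
        while ((getline < tmp) > 0) {
            if (n && ($1 != grp || $2 != col)) emit()
            if (!seen || $1 != grp) { printf "%s%s", (seen ? ORS : ""), $1; seen = 1 }
            grp = $1; col = $2
            m[++n] = $3
        }
        if (n) { emit(); printf "%s", ORS }
    }' "$1"
    rm -f "$tmp"
}

group_median_sorted data1.tsv
```

Because the temporary file is read only after `close(cmd)` returns, repeated runs on the same input should produce the same table. Note that the groups come out in sorted order, which matches the input order for this sample but not necessarily for other data.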
# 6  
No worries, and thank you for the help. I figured this out in R, which turned out to be easier.
Code:
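library(dplyr)
# read the whitespace-separated input; the first line holds the column names
a <- read.table("input", header=TRUE)
# median of every remaining column within each group
b <- a %>%
  group_by(name) %>%
  summarise_each(funs(median(., na.rm=TRUE)))
# write tab-separated output without row names or quoting
write.table(b, file="output", sep="\t", row.names=FALSE, quote=FALSE)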
