How to print median values of matrix -awk?

06-22-2017

I use the script below to print the per-group mean of each column. How could I extend it to print per-group medians instead? Thanks.

input

Code:

name	s1	s2	s3	s4
g1	2	8	6	5
g1	5	7	9	9
g1	6	7	8	9
g2	8	8	8	8
g2	7	7	7	7
g2	10	10	10	10
g3	3	12	1	24
g3	5	5	24	48
g3	12	3	12	12
g3	2	3	3	3

output

Code:

name	s1	s2	s3	s4
g1	5	7	8	9
g2	7	7	7	7
g3	4	4	7.5	18

scripts - mean

Code:

NR==1 {
    print
    next
}
    # print the average of each column for the current group,
    #  then reset the column sums and the number of lines
function print_sum() {
    printf "%s", prev
    # needs GNU awk, for length of array
    for (i=2; i < length(sum) + 2; i++) {
            printf "%s%s", FS, sum[i]/nlines
            sum[i] = 0
    }
    printf "%s", ORS
    nlines = 0
}
    # print average when $1 changes, but not the first time
    # also, on end of script
NR>2 && prev!=$1 { print_sum() }
END              { print_sum() }
    # for every line with the same $1, sum column values, increment number of lines
{
    prev=$1;
    nlines++
    for (i=2; i <= NF; i++) {
            sum[i]+=$i
    }
}

quincyjones

06-22-2017

Hi.

The datamash utility makes the median and other statistical calculations fairly easy. Aside from the scaffolding code, the operative line is the datamash call:

Code:

#!/usr/bin/env bash

# @(#) s1       Demonstrate statistical calculations, median, datamash.

# Utility functions: print-as-echo, print-line-with-visual-space, debug.
# export PATH="/usr/local/bin:/usr/bin:/bin"
LC_ALL=C ; LANG=C ; export LC_ALL LANG
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }
em() { pe "$*" >&2 ; }
db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
db() { : ; }
C=$HOME/bin/context && [ -f $C ] && $C datamash

FILE=${1-data1}
E=expected-output.txt

pl " Input data file $FILE:"
cat $FILE

pl " Expected output:"
cat $E

pl " Results (adjusted for visual with code align):"
datamash -H -g1 median 2 median 3 median 4 median 5 < $FILE |
align | 
tee f1

pl " Verify results if possible:"
C=$HOME/bin/pass-fail
[ -f $C ] && $C || ( pe; pe " Results cannot be verified." ) >&2

pl " Some details for datamash:"
dixf datamash

exit $?

producing:

Code:

$ ./s1

Environment: LC_ALL = C, LANG = C
(Versions displayed with local utility "version")
OS, ker|rel, machine: Linux, 3.16.0-4-amd64, x86_64
Distribution        : Debian 8.7 (jessie) 
bash GNU bash 4.3.30
datamash (GNU datamash) 1.0.6

-----
 Input data file data1:
name    s1      s2      s3      s4
g1      2       8       6       5
g1      5       7       9       9
g1      6       7       8       9
g2      8       8       8       8
g2      7       7       7       7
g2      10      10      10      10
g3      3       12      1       24
g3      5       5       24      48
g3      12      3       12      12
g3      2       3       3       3

-----
 Expected output:
name    s1      s2      s3      s4
g1      5       7       8       9
g2      7       7       7       7
g3      4       4       7.5     18

-----
 Results:
GroupBy(name)   median(s1)      median(s2)      median(s3)      median(s4)
g1              5               7               8               9
g2              8               8               8               8
g3              4               4               7.5             18

-----
 Verify results if possible:

-----
 Comparison of 4 created lines with 4 lines of desired results:
f1 expected-output.txt differ: char 1, line 1
 Failed -- files f1 and expected-output.txt not identical -- detailed comparison follows.
1c1
< name  s1      s2      s3      s4
---
> GroupBy(name) median(s1)      median(s2)      median(s3)      median(s4)
3c3
< g2    7       7       7       7
---
> g2            8               8               8               8

 Results cannot be verified.

-----
 Some details for datamash:
datamash        command-line calculations (man)
Path    : /usr/bin/datamash
Version : 1.0.6
Type    : ELF 64-bit LSB executable, x86-64, version 1 (SYSV ...)
Help    : probably available with -h,--help
Repo    : Debian 8.7 (jessie) 
Home    : https://savannah.gnu.org/projects/datamash/ (pm)

There is a disagreement about the headers and group g2. I would tend to trust datamash, but you can do the calculations again to verify your answer. I tried it again sorting the file, as well as interchanging lines for g2 and got the same result.
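
The g2 value can be checked by hand: the s1 entries for g2 are 8, 7 and 10, which sort to 7, 8, 10, so the middle value is 8, in agreement with datamash. A minimal sketch of the same check using only sort and awk follows; the helper name median_of is made up for this illustration.

```shell
# median_of: hypothetical helper; prints the median of its numeric arguments
median_of() {
    printf '%s\n' "$@" | sort -n | awk '
        { v[NR] = $1 }                       # values arrive already sorted
        END {
            if (NR == 0) exit 1
            h = int(NR / 2)
            # odd count: middle value; even count: mean of the two middle values
            print (NR % 2 ? v[h + 1] : (v[h] + v[h + 1]) / 2)
        }'
}

median_of 8 7 10      # g2, column s1
median_of 1 24 12 3   # g3, column s3
```

The second call exercises the even-count case, where the two middle values are averaged.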

Best wishes ... cheers, drl

drl

06-23-2017

Try this for medians:

Code:

awk -F"\t" '
BEGIN           {SORT = "sort -k1,2 -k3bn > TMP"
                }
NR == 1
NR > 1          {for (i=2; i<=NF; i++) print $1, i, $i | SORT
                }

function PRMED()        {printf TFS "%s", MEDIAN
                         TFS = OFS
                         PRV2 = $2
                         CNT = 0
                        }

END             {close(SORT)                            # let sort finish writing TMP before it is read
                 while ((getline < "TMP") > 0)  {if ($1 != PRV1)        {PRMED()
                                                                         printf TRS "%s", $1
                                                                         TRS = ORS
                                                                         PRV1 = $1
                                                                        }
                                                 if ($2 != PRV2)        {PRMED()
                                                                        }
                                                 M[++CNT] = $3
                                                 CH       = int (CNT / 2)
                                                 MEDIAN   = CNT%2?M[CH+1]:(M[CH]+M[CH+1])/2
                                                }
                }
END             {PRMED()
                 printf ORS
                }
' OFS="\t" file
name	s1	s2	s3	s4
g1	5	7	8	9
g2	8	8	8	8
g3	4	4	7.5	18
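
For comparison, here is a rough single-process sketch that keeps each group's values in memory instead of going through sort and a TMP file. It assumes tab-separated input with a header line and rows already grouped by column 1, as in the sample data. The names group_median, median and emit are invented for this illustration, and the small insertion sort is only meant for groups of modest size.

```shell
# group_median: hypothetical name; prints per-group column medians of a
# tab-separated file whose first line is a header and whose rows are
# already grouped by column 1 (as in the sample data).
group_median() {
    awk -F'\t' -v OFS='\t' '
    # median of v[col,1..n]; sorts that slice in place with an insertion sort
    function median(col, n,    i, j, t, h) {
        for (i = 2; i <= n; i++) {
            t = v[col, i]
            for (j = i - 1; j >= 1 && v[col, j] + 0 > t + 0; j--) {
                v[col, j + 1] = v[col, j]
            }
            v[col, j + 1] = t
        }
        h = int(n / 2)
        return (n % 2 ? v[col, h + 1] : (v[col, h] + v[col, h + 1]) / 2)
    }
    # print the finished group, then reset the row counter
    function emit(    c, line) {
        if (cnt == 0) return
        line = prev
        for (c = 2; c <= nf; c++) {
            line = line OFS median(c, cnt)
        }
        print line
        cnt = 0
    }
    NR == 1    { print; next }               # pass the header through unchanged
    $1 != prev { emit() }                    # group changed: print its medians
    {
        prev = $1; nf = NF; cnt++
        for (k = 2; k <= NF; k++) v[k, cnt] = $k
    }
    END        { emit() }
    ' "$@"
}
```

Called as `group_median file`, this should print the header followed by one line of medians per group. Because nothing is written to disk, there is no dependence on when an external sort finishes.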

RudiC

06-23-2017

I'm afraid there seems to be a small bug somewhere. Sometimes I get different outputs from the same input, and sometimes just the header.
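
One possible explanation, not confirmed in the thread: intermittent or header-only output is what one would expect when awk reads back a file that a piped command such as sort is still writing. sort can only emit its output once its input pipe is closed, so the pipe has to be closed with close() before the file is read. The sketch below shows that idiom in isolation; the function name and the scratch file are illustrative only.

```shell
# sorted_via_pipe: hypothetical demo; sorts its stdin numerically through an
# awk output pipe and reads the result back inside the same awk process.
sorted_via_pipe() {
    tmp=$(mktemp) || return 1
    awk -v tmp="$tmp" '
        BEGIN { cmd = "sort -n > \"" tmp "\"" }   # command string, reused for close()
              { print $1 | cmd }                  # feed every value to sort
        END   {
            close(cmd)                            # wait for sort to finish writing
            while ((getline line < tmp) > 0) {    # the file is complete at this point
                print line
            }
            close(tmp)
        }'
    rm -f "$tmp"
}

printf '10\n7\n8\n' | sorted_via_pipe
```

The string passed to close() has to match the command string used in the pipe exactly, which is why it is kept in a variable here.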

quincyjones

06-23-2017

I'm afraid I can't help without sample data that reproduces the errors. Different outputs from identical input are highly improbable, btw ...

RudiC

06-23-2017

No worries, and thank you for the help. I figured out an easier way to do this in R.

Code:

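library(dplyr)
a <- read.table("input", header=TRUE)
# summarise_each() and funs() are deprecated in newer dplyr releases; across() is the newer form
b <- a %>%
  group_by(name) %>%
  summarise_each(funs(median(., na.rm=TRUE)))
write.table(b, file="output", sep="\t")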

quincyjones
