Median and max of duplicate rows

07-31-2013

Hi all,

Please help me with this. I want to extract the rows whose column 1 value repeats at least 4 times in a file, then summarize each such group by its max, mean, median and min. The file is sorted by column 1, so all the repeated rows appear together.

If the number of elements n is odd, the median is the middle one, e.g. the 4th element among 7 sorted numbers: element number (n+1)/2.
If the number of elements is even, it is the average of the middle two, e.g. the average of the 4th and 5th elements for a set of 8 sorted numbers: the average of elements n/2 and n/2 + 1.
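The median rule above can be sketched in awk as follows. This is only an illustration: the function name `median_of` is made up here, and it assumes one numeric value per input line.

```shell
# median_of: hypothetical helper, reads one number per line on stdin
median_of() {
  sort -n | awk '
    { v[NR] = $1 }                 # v[1..NR] holds the values in ascending order
    END {
      if (NR == 0) exit            # no input, nothing to print
      if (NR % 2)
        print v[(NR + 1) / 2]                      # odd count: middle element
      else
        print (v[NR / 2] + v[NR / 2 + 1]) / 2      # even count: mean of the two middle elements
    }'
}

printf '%s\n' 3 1 2 | median_of        # odd count
printf '%s\n' 100 1 3 2 | median_of    # even count
```

For the R2 group in the sample below (1, 2, 3, 100) this gives 2.5.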

Code:

Input

R1 1
R1 2
R1 3
R2 1
R2 2
R2 3
R2 100
R3 5


output

R2 100 26.5 2.5 1

I figured out that the uniq -d option will give me the duplicate lines, but how do I require at least 4 repeats?
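One caveat: uniq compares whole lines by default, so rows that share column 1 but differ in column 2 are not duplicates as far as uniq -d is concerned. A possible way to list the keys that repeat at least 4 times is to count column 1 on its own with uniq -c. The sketch below assumes the file is already grouped by column 1, as stated above; the function name `keys_with_min_repeats` and the sample file are made up for illustration.

```shell
# Hypothetical sample file with the same rows as the input shown above
sample=$(mktemp)
cat > "$sample" <<'EOF'
R1 1
R1 2
R1 3
R2 1
R2 2
R2 3
R2 100
R3 5
EOF

# usage: keys_with_min_repeats FILE MIN
keys_with_min_repeats() {
  awk '{ print $1 }' "$1" |                  # keep only the key column
    uniq -c |                                # count consecutive identical keys
    awk -v min="$2" '$1 >= min { print $2 }' # keep keys seen at least MIN times
}

keys_with_min_repeats "$sample" 4            # prints R2
```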

I also tried to compute the mean and median. I am getting errors, but this is what I am trying to get to work:

Code:

sort -n file | awk ' { a[i++]=$2;  N[$1]++}
    END { for (key in i) {
                        avg = sum[key] / N[key];}
x=int((i+1)/2); 
if (x < (i+1)/2)
 print (a[x-1]+a[x])/2 " " avg; 
else print a[x-1] " " avg; }'

ritakadm

07-31-2013

Is this what you are after?

Code:

sort file -k1,1 -k2,2n | awk '
{nbr[$1]++; a[$1]= a[$1] ? a[$1]"@"$2 : $2; sum[$1]+=$2}

END {
    for (key in a) {
        split(a[key], b, "@")
        len = length(b)
        for (i=1;i<=len;i++) {
            avg = sum[key] / nbr[key];
            if (nbr[key]%2) {
                median = b[(nbr[key]+1)/2]
            } else {
                median = (b[(nbr[key]/2)+1] + b[nbr[key]/2])/2
            }
        }
        printf "%s %s %s %s %s\n", key, b[len], avg, median, b[1]
    }
}
'

Last edited by ripat; 07-31-2013 at 08:32 AM.. Reason: typo and tabs expanded

ripat

07-31-2013

This works well for all rows, but how do I print only the rows that repeat at least 4 times?
I tried the following modification, but it prints gibberish:

Code:

sort file -k1,1 -k2,2n | awk ' {nbr[$1]++; a[$1]= a[$1] ? a[$1]"@"$2 : $2; sum[$1]+=$2}  END {     for (key in a) {         split(a[key], b, "@")         len = length(b)         for (i=1;i<=len;i++) {             avg = sum[key] / nbr[key];             if (nbr[key]%2) {                 median = b[(nbr[key]+1)/2]             } else {                 median = (b[(nbr[key]/2)+1] + b[nbr[key]/2])/2             }         }
        if (len >3) {  
        printf "%s %s %s %s %s\n", key, b[len], avg, median, b[1]
         }
 } } '

Also, my original files are quite large, e.g. 500 MB each. Is there a way to speed this up? Right now it takes forever to run.

---------- Post updated at 11:54 AM ---------- Previous update was at 11:10 AM ----------

Update: this seems to run fine, but if anything can be done to speed it up, please let me know.

Code:

sort testmed.txt -k1,1 -k2,2n | awk '
{nbr[$1]++; a[$1]= a[$1] ? a[$1]"@"$2 : $2; sum[$1]+=$2}

END {
    for (key in a) {
        split(a[key], b, "@")
        len = length(b)
        for (i=1;i<=len;i++) {
            avg = sum[key] / nbr[key];
            if (nbr[key]%2) {
                median = b[(nbr[key]+1)/2]
            } else {
                median = (b[(nbr[key]/2)+1] + b[nbr[key]/2])/2
            }
        }
        if ( len > 3)
        {
        printf "%s %s %s %s %s\n", key, b[len], avg, median, b[1]
        }
    }
}
'

ritakadm

07-31-2013

Move the condition on the length higher up in the code, and take the length from nbr[key] instead of calling length(b) after split(). Expect only a marginal speed increase.

Code:

sort f -k1,1 -k2,2n | awk '
{nbr[$1]++; a[$1]= a[$1] ? a[$1]"@"$2 : $2; sum[$1]+=$2}

END {
  for (key in a) {
    len = nbr[key]
    if ( len > 3 ) {
      split(a[key], b, "@")
      for (i=1;i<=len;i++) {
        avg = sum[key] / nbr[key];
        if (nbr[key]%2) {
          median = b[(nbr[key]+1)/2]
        } else {
          median = (b[(nbr[key]/2)+1] + b[nbr[key]/2])/2
        }
      }
      printf "%s %s %s %s %s\n", key, b[len], avg, median, b[1]
    }
  }
}
'
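Another option for large files, offered only as an untested-for-speed sketch: because sort puts all rows of a key next to each other, awk can summarize each group as soon as the key changes instead of keeping every key in memory until END. This has not been benchmarked against the version above. The names `summarize_groups` and `emit` are made up for illustration, and the minimum repeat count defaults to 4.

```shell
# usage: summarize_groups FILE [MIN]
# prints: key max mean median min, for keys repeated at least MIN times
summarize_groups() {
  sort -k1,1 -k2,2n "$1" | awk -v min="${2:-4}" '
    # Print the summary of the group collected so far, then reset the counters.
    function emit(   med) {
      if (n >= min) {
        med = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
        printf "%s %s %s %s %s\n", key, v[n], sum / n, med, v[1]
      }
      n = 0; sum = 0
    }
    NR > 1 && ($1 "") != (key "") { emit() }   # key changed: previous group is complete
    { key = $1; v[++n] = $2; sum += $2 }       # v[1..n] is ascending thanks to sort
    END { if (NR) emit() }                     # do not forget the last group
  '
}

# Hypothetical sample file with the rows from the first post
sample=$(mktemp)
printf '%s\n' 'R1 1' 'R1 2' 'R1 3' 'R2 1' 'R2 2' 'R2 3' 'R2 100' 'R3 5' > "$sample"
summarize_groups "$sample"                     # prints: R2 100 26.5 2.5 1
```

Only the values of the current key are held at any time, which should lower memory use. The sort itself still has to process the whole file.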

ripat

07-31-2013

Thanks ripat! But there seems to be an error in calculating the median. It should be 0.00056 but it shows 134.79100. Also, the min should be 0.

Code:

cat testiso_GRMZM2G074386

GRMZM2G074386 0.00000
GRMZM2G074386 0.00000
GRMZM2G074386 0.00000
GRMZM2G074386 0.00056
GRMZM2G074386 2.63247
GRMZM2G074386 112.58600
GRMZM2G074386 134.79100

 awk '
> {nbr[$1]++; a[$1]= a[$1] ? a[$1]"@"$2 : $2; sum[$1]+=$2}
>
> END {
>     for (key in a) {
>         split(a[key], b, "@")
>         len = length(b)
>         for (i=1;i<=len;i++) {
>             avg = sum[key] / nbr[key];
>             if (nbr[key]%2) {
>                 median = b[(nbr[key]+1)/2]
>             } else {
>                 median = (b[(nbr[key]/2)+1] + b[nbr[key]/2])/2
>             }
>         }
>         if ( len > 3)
>         {
>         printf "%s %s %s %s %s\n", key, b[len], avg, median, b[1]
>         }
>     }
> }
> ' testiso_GRMZM2G074386
GRMZM2G074386 134.79100 35.7157 134.79100 0.00056

ritakadm

07-31-2013

OK, I see where the problem is. The ternary condition did not expect zero values: awk evaluates a zero value as false, so the list was started over each time.
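A small sketch of that behaviour, assuming a field that looks like the number zero (the function name `ternary_demo` is made up for illustration): the bare test treats the stored value as false, while the explicit comparison against the empty string does not.

```shell
# ternary_demo: hypothetical helper, reads one value per line on stdin
ternary_demo() {
  awk '{
    x = $1                                        # x keeps the numeric look of the field
    print "bare test:", (x ? "true" : "false")
    print "explicit test:", (x != "" ? "true" : "false")
  }'
}

echo "0.00000" | ternary_demo
# bare test: false
# explicit test: true
```

With the test data above, this is why the first three 0.00000 values were dropped and the list only started growing at 0.00056.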

Try this:

Code:

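sort -k1,1 -k2,2n file | awk '
# Collect the values of each key in an "@"-separated list. Test against ""
# so that a zero value is not mistaken for an empty list.
{nbr[$1]++; a[$1]= (a[$1]!="") ? a[$1]"@"$2 : $2; sum[$1]+=$2} # NEW

END {
  for (key in a) {
    len = nbr[key]
    if ( len > 3 ) {                 # keep keys repeated at least 4 times
      split(a[key], b, "@")          # b[1..len] is in ascending order
      avg = sum[key] / len
      if (len%2) {
        median = b[(len+1)/2]
      } else {
        median = (b[(len/2)+1] + b[len/2])/2
      }
      printf "%s %s %s %s %s\n", key, b[len], avg, median, b[1]
    }
  }
}
'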

Last edited by ripat; 07-31-2013 at 04:40 PM..

ripat
