Median and max of duplicate rows Post: 302839129

Shell Programming and Scripting: post 302839129 by ritakadm on Wednesday 31st of July 2013 12:54:13 PM

This works well for all rows, but how do I print only the rows whose key repeats at least 4 times?
I tried the following modification, but it prints gibberish.

Code:

sort file -k1,1 -k2,2n | awk '
{nbr[$1]++; a[$1]= a[$1] ? a[$1]"@"$2 : $2; sum[$1]+=$2}

END {
    for (key in a) {
        split(a[key], b, "@")
        len = length(b)
        for (i=1;i<=len;i++) {
            avg = sum[key] / nbr[key];
            if (nbr[key]%2) {
                median = b[(nbr[key]+1)/2]
            } else {
                median = (b[(nbr[key]/2)+1] + b[nbr[key]/2])/2
            }
        }
        if (len > 3) {
            printf "%s %s %s %s %s\n", key, b[len], avg, median, b[1]
        }
    }
}
'

Also, my original files are quite large (e.g. 500 MB each). Is there a way to speed this up? Right now it takes forever to run.

---------- Post updated at 11:54 AM ---------- Previous update was at 11:10 AM ----------

Update: this seems to run fine, but if anything can be done to speed it up, please let me know.

Code:

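sort testmed.txt -k1,1 -k2,2n | awk '
# Collect the sorted values of each key in an "@"-separated string.
# Testing nbr[$1] instead of a[$1] keeps a leading value of 0 from being dropped.
{nbr[$1]++; a[$1] = (nbr[$1] > 1) ? a[$1]"@"$2 : $2; sum[$1]+=$2}

END {
    for (key in a) {
        # split() returns the number of values stored for this key
        len = split(a[key], b, "@")
        # print only keys that occur at least 4 times
        if (len > 3) {
            avg = sum[key] / len
            if (len % 2) {
                median = b[(len+1)/2]
            } else {
                median = (b[(len/2)+1] + b[len/2])/2
            }
            # columns: key, max, mean, median, min
            printf "%s %s %s %s %s\n", key, b[len], avg, median, b[1]
        }
    }
}
'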
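
On the speed question, one possible approach (a sketch, not a benchmarked solution): because sort already delivers the rows grouped by key and ordered by value, awk can handle one key at a time as the lines stream past, rather than building an "@"-joined string per key and splitting it again in END. The function name group_stats and its optional second argument (minimum row count, default 4) are hypothetical names chosen for this example. It assumes whitespace-separated input with the key in column 1 and a numeric value in column 2. LC_ALL=C is set for sort only, which usually makes the sort itself faster.

```shell
#!/bin/sh
# Print "key max mean median min" for every key with at least min rows.
# $1 = input file, $2 = minimum number of rows per key (assumed default: 4)
group_stats() {
    LC_ALL=C sort -k1,1 -k2,2n "$1" |
    awk -v min="${2:-4}" '
    # Emit the statistics of the group collected so far in v[1..n].
    function emit(   median) {
        if (n >= min) {
            if (n % 2) median = v[(n + 1) / 2]
            else       median = (v[n / 2] + v[n / 2 + 1]) / 2
            # values arrive sorted ascending, so v[n] is max and v[1] is min
            printf "%s %s %s %s %s\n", key, v[n], sum / n, median, v[1]
        }
    }
    # A new key starts: report the previous group and reset the counters.
    NR > 1 && ($1 "") != key { emit(); n = 0; sum = 0 }
    # Appending "" forces keys to be compared as strings, not numbers.
    { key = $1 ""; v[++n] = $2; sum += $2 }
    END { if (NR > 0) emit() }
    '
}
```

Called as `group_stats testmed.txt`, this keeps only the current key's values in memory. Unlike the `for (key in a)` loop, whose order is unspecified, the output comes out sorted by key.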

ritakadm