Median calculator based on id match

03-19-2014

Registered User

83, 1

Join Date: Oct 2011

Last Activity: 1 June 2014, 10:33 PM EDT

Posts: 83

Thanks Given: 20

Thanked 1 Time in 1 Post

Median calculator based on id match

I am trying to calculate the median of a column of numbers if they match an ID type on a different column. The input file has 3 columns. The column that has the ID is column 1 and the column with the values I'd like to find the median for is column 3. The file does not need to be sorted.

What I have so far is this:

Code:

awk '
NR==1 {
        print "id", "col2", "col3", "median"
        next
  } 
FNR==NR {
        A[$1]++
        id=$1
        ($1 in A){
        B[i++]=$3;} END {print B[int(i/2)];}}' file

the input looks like this :

Code:

id   col2   col3 
dog   0       0        
dog   1       4          
dog   2       3          
cat    0       2          
cat    1       4

the output should look like this:

Code:

id   col2   col3  median
dog   0       0          3      
dog   1       4          3
dog   2       3          3
cat    0       2          3
cat    1       4          3

Moderator's Comments:

Please use CODE tags (not QUOTE tags) to mark input and output samples as well as code samples.

Last edited by Don Cragun; 03-19-2014 at 05:12 PM.. Reason: Change QUOTE tags to CODE tags.

verse123

View Public Profile for verse123

Find all posts by verse123

03-19-2014

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

That's the thing, you'd usually take the median by taking the middle value from a sorted list. An unsorted one is meaningless. Do you have GNU awk? It can sort.

Code:

printf "%s\n" 1 5 8 7 2 35 15 | awk '{A[NR]=$1 }; END { asort(A); for(X=1; X<=NR; X++) print A[X] ; printf("\n"); print A[int(NR/2)+1]; }'
1
2
5
7
8
15
35

7

$

Corona688

View Public Profile for Corona688

Visit Corona688's homepage!

Find all posts by Corona688

03-19-2014

Registered User

83, 1

Join Date: Oct 2011

Last Activity: 1 June 2014, 10:33 PM EDT

Posts: 83

Thanks Given: 20

Thanked 1 Time in 1 Post

Hi Corona, I have sorted by a different column and would like to find the median as a function of another sort. That's why I don't need to necessarily sort col 3.

verse123

View Public Profile for verse123

Find all posts by verse123

03-20-2014

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

What do you mean by "I have sorted by a different column..."? It looks like the input has been sorted in reverse order by field1 with increasing order by field2 as a secondary key. If that is the case, you could greatly simplify this process by sorting the file using:

Code:

sort -k1,1r -k3,3n

which would eliminate the need to sort on field 3 for each different value in field 1.

If that isn't possible, can you at least tell us if all of the entries in your input file will always have all of the entries with the same value in field 1 on adjacent lines?

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

03-20-2014

Registered User

83, 1

Join Date: Oct 2011

Last Activity: 1 June 2014, 10:33 PM EDT

Posts: 83

Thanks Given: 20

Thanked 1 Time in 1 Post

Precisely Don. The file has already been sorted by the columns you mentioned. Sorting isn't really the issue here though. The issue is finding the median for all strings in col1 that share an ID, writing that answer for all lines that have that ID, and performing the same thing for a new string. I notice now the output example I gave is not very clear. It should look like this

Code:

id   col2   col3  median
dog   0       0          3      
dog   1       4          3
dog   2       3          3
cat    0       2          4
cat    1       6          4

verse123

View Public Profile for verse123

Find all posts by verse123

03-20-2014

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

You could try something like:

Code:

#!/bin/ksh
IAm=${0##*/}	# Save basename of the current script.
tempf="$IAm.$$"	# Set temp filename to be used by sort.

awk -v debug="$debug" -v tempf="$tempf" '
# Usage:	median()
# DESCRIPTION:	Compute the median for the 1st lc values in f3[] and print the
#		lc entries in the arrays f1[], f2[], and f3[] and the computed
#		median.  Set id to compare to following input lines, and clear
#		lc.
function median(	high, i, low, m, x) {
	# Save the 1st field on the current line to compare against subsequent
	# lines.
	id = $1

	# If lc is zero, we do not have any data to process.  (This should
	# only happen when we are looking at the 1st line after the header
	# line.)
	if(lc == 0) return

	# Close the file produced by sorting the field 3 values for the
	# previous id.
	close(sortcmd)

	# Compute the input lines from the sorted list needed to determine the
	# median value in the array f3[].
	if(lc % 2) {
		# For odd numbers of lines, we just need the middle line.
		low = high = (lc + 1) / 2
	} else {# For even number of lines, we need the average of the middle
		# two lines.
		low = lc / 2
		high = low + 1
	}
	if(debug) printf("low=%d, high=%d, ", low, high)

	# Compute the median...
	m = 0
	for(i = 1; i <= high; i++) {
		# Read the 1st high lines from the sorted file.
		getline x < tempf
		if(i == low || i == high) {
			# Add values from the middle line or lines.
			m += x
			if(debug) printf("median before averaging=%d, ", m)
		}
	}

	# Close the sorted input file.
	close(tempf)

	# Compute the median.
	if(low != high) m /= 2
	if(debug) print "median=" m

	# Print the original data (reformatted) and add the median to each line.
	for(i = 1; i <= lc; i++) {
		if(debug) printf("lc=%d, i=%d, ", lc, i)
		printf("%s\t%s\t%s\t%s\n", f1[i], f2[i], f3[i], m)
	}

	# Clear the arrays.
	lc = 0
}
NR == 1 {
	# We have the 1st input line.  Print the header.
	printf("%s\t%s\t%s\tmedian\n", $1, $2, $3)

	# Initialize the command to sort field 3 values.
	sortcmd = "sort -n -o \"" tempf "\""
	next
}
{	# We are not on the header line.  If the id field has changed since the
	# previous line, compute the median and print the accumulated data and
	# median for the previous id.
	if(id != $1) median()

	# Save data from the current line to print when we find the next id.
	f1[++lc] = $1
	f2[lc] = $2
	f3[lc] = $3

	# Sort the values from the 3rd field for the current id.
	print $3 | sortcmd
}
END {	# Compute the median and print the accumulated data and median for the
	# last id.
	median()
}' file

# Save the exit code from awk.
ec=$?

# Remove the temp file used to sort the 3rd field for each id.
rm -f "$tempf"

# Exit with awk's exit code.
exit $ec

This was written and tested using the Korn shell on Mac OS X. If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk. If you have gawk, you could use its built-in sorting capabilities to sort a copy of the f3[] array instead of using the external sort command to sort that field. If your input was sorted with the primary key being the 1st field (in increasing or decreasing order) and with the secondary key being the 3rd field (sorted as an increasing or decreasing numeric field) you wouldn't need to sort the field 3 data at all. You could just change the line:

Code:

	# Compute the median...

to:

Code:

	# Compute the median…
	m = (f3[low] + f3[high]) / 2

and remove all of the code shown in red above. (Of course the header could not be sorted unless the sort key used and the data in the file sorted the header to be the 1st line of the input file.) This is why the way you sort your input matters!

If you save this script in a file named add_median and have the following data in a file named file:

Code:

id   col2   col3 
fly 0 10
fly 1 10
fly 2 0
fly 3 3
fly 4 0
emu 0 10
emu 1 10
emu 2 0
emu 3 3
emu 4 0
emu 5 0
dog   0       0        
dog   1       4          
dog   2       3          
cat    0       2          
cat    1       4
bee    0       6          
bee    1       3

and invoke the script as:

Code:

./add_median

you'll get the output:

Code:

id	col2	col3	median
fly	0	10	3
fly	1	10	3
fly	2	0	3
fly	3	3	3
fly	4	0	3
emu	0	10	1.5
emu	1	10	1.5
emu	2	0	1.5
emu	3	3	1.5
emu	4	0	1.5
emu	5	0	1.5
dog	0	0	3
dog	1	4	3
dog	2	3	3
cat	0	2	3
cat	1	4	3
bee	0	6	4.5
bee	1	3	4.5

If you want to see the debugging output so you can more easily following what the script is doing, invoke the script with:

Code:

debug=1 ./add_median

and you'll see the output:

Code:

id	col2	col3	median
low=3, high=3, median before averaging=3, median=3
lc=5, i=1, fly	0	10	3
lc=5, i=2, fly	1	10	3
lc=5, i=3, fly	2	0	3
lc=5, i=4, fly	3	3	3
lc=5, i=5, fly	4	0	3
low=3, high=4, median before averaging=0, median before averaging=3, median=1.5
lc=6, i=1, emu	0	10	1.5
lc=6, i=2, emu	1	10	1.5
lc=6, i=3, emu	2	0	1.5
lc=6, i=4, emu	3	3	1.5
lc=6, i=5, emu	4	0	1.5
lc=6, i=6, emu	5	0	1.5
low=2, high=2, median before averaging=3, median=3
lc=3, i=1, dog	0	0	3
lc=3, i=2, dog	1	4	3
lc=3, i=3, dog	2	3	3
low=1, high=2, median before averaging=2, median before averaging=6, median=3
lc=2, i=1, cat	0	2	3
lc=2, i=2, cat	1	4	3
low=1, high=2, median before averaging=3, median before averaging=9, median=4.5
lc=2, i=1, bee	0	6	4.5
lc=2, i=2, bee	1	3	4.5

Last edited by Don Cragun; 03-20-2014 at 06:08 AM.. Reason: Mark more code that should be deleted if field 3 is sorted in the input as a secondary sort key.

This User Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

03-20-2014

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

Quote:

Originally Posted by verse123

Hi Corona, I have sorted by a different column and would like to find the median as a function of another sort. That's why I don't need to necessarily sort col 3.

I don't see anything in your input or output that looks remotely like a median. I don't understand.

Corona688

View Public Profile for Corona688

Visit Corona688's homepage!

Find all posts by Corona688

UNIX for Dummies Questions & Answers

Median calculator based on id match

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Data match 2 files based on first 2 columns matching only and join if match

Discussion started by: axis88

2. Shell Programming and Scripting

awk to update file based on partial match in field1 and exact match in field2

Discussion started by: cmccabe

3. Shell Programming and Scripting

Merge lines based on match

Discussion started by: jamie_123

4. Shell Programming and Scripting

Match based on criteria to file

Discussion started by: cmccabe

5. Shell Programming and Scripting

New files based off match or no match

Discussion started by: cmccabe

6. UNIX for Advanced & Expert Users

Match and print based on columns

Discussion started by: rossi

7. Shell Programming and Scripting

Match files based on either of the two columns awk

Discussion started by: smitra

8. UNIX for Dummies Questions & Answers

Display/Cut the characters based on match

Discussion started by: nsuresh316

9. Shell Programming and Scripting

Awk based script to find the median of all individual columns in a data file

Discussion started by: ks_reddy

10. Shell Programming and Scripting

Split a record based on particular match

Discussion started by: mksuneel