Median calculator based on id match


 
Thread Tools Search this Thread
Top Forums UNIX for Dummies Questions & Answers Median calculator based on id match
# 1  
Old 03-19-2014
Median calculator based on id match

I am trying to calculate the median of a column of numbers if they match an ID type on a different column. The input file has 3 columns. The column that has the ID is column 1 and the column with the values I'd like to find the median for is column 3. The file does not need to be sorted.

What I have so far is this:

Code:
awk '
NR==1 {
        print "id", "col2", "col3", "median"
        next
  } 
FNR==NR {
        A[$1]++
        id=$1
        ($1 in A){
        B[i++]=$3;} END {print B[int(i/2)];}}' file

the input looks like this :

Code:
id   col2   col3 
dog   0       0        
dog   1       4          
dog   2       3          
cat    0       2          
cat    1       4

the output should look like this:

Code:
id   col2   col3  median
dog   0       0          3      
dog   1       4          3
dog   2       3          3
cat    0       2          3
cat    1       4          3

Moderator's Comments:
Mod Comment Please use CODE tags (not QUOTE tags) to mark input and output samples as well as code samples.

Last edited by Don Cragun; 03-19-2014 at 05:12 PM.. Reason: Change QUOTE tags to CODE tags.
# 2  
Old 03-19-2014
That's the thing, you'd usually take the median by taking the middle value from a sorted list. An unsorted one is meaningless. Do you have GNU awk? It can sort.

Code:
printf "%s\n" 1 5 8 7 2 35 15 | awk '{A[NR]=$1 }; END { asort(A); for(X=1; X<=NR; X++) print A[X] ; printf("\n"); print A[int(NR/2)+1]; }'
1
2
5
7
8
15
35

7

$

# 3  
Old 03-19-2014
Hi Corona, I have sorted by a different column and would like to find the median as a function of another sort. That's why I don't need to necessarily sort col 3.
# 4  
Old 03-20-2014
What do you mean by "I have sorted by a different column..."? It looks like the input has been sorted in reverse order by field1 with increasing order by field2 as a secondary key. If that is the case, you could greatly simplify this process by sorting the file using:
Code:
sort -k1,1r -k3,3n

which would eliminate the need to sort on field 3 for each different value in field 1.

If that isn't possible, can you at least tell us if all of the entries in your input file will always have all of the entries with the same value in field 1 on adjacent lines?
# 5  
Old 03-20-2014
Precisely Don. The file has already been sorted by the columns you mentioned. Sorting isn't really the issue here though. The issue is finding the median for all strings in col1 that share an ID, writing that answer for all lines that have that ID, and performing the same thing for a new string. I notice now the output example I gave is not very clear. It should look like this


Code:
id   col2   col3  median
dog   0       0          3      
dog   1       4          3
dog   2       3          3
cat    0       2          4
cat    1       6          4

# 6  
Old 03-20-2014
You could try something like:
Code:
#!/bin/ksh
IAm=${0##*/}	# Save basename of the current script.
tempf="$IAm.$$"	# Set temp filename to be used by sort.

awk -v debug="$debug" -v tempf="$tempf" '
# Usage:	median()
# DESCRIPTION:	Compute the median for the 1st lc values in f3[] and print the
#		lc entries in the arrays f1[], f2[], and f3[] and the computed
#		median.  Set id to compare to following input lines, and clear
#		lc.
function median(	high, i, low, m, x) {
	# Save the 1st field on the current line to compare against subsequent
	# lines.
	id = $1

	# If lc is zero, we do not have any data to process.  (This should
	# only happen when we are looking at the 1st line after the header
	# line.)
	if(lc == 0) return

	# Close the file produced by sorting the field 3 values for the
	# previous id.
	close(sortcmd)

	# Compute the input lines from the sorted list needed to determine the
	# median value in the array f3[].
	if(lc % 2) {
		# For odd numbers of lines, we just need the middle line.
		low = high = (lc + 1) / 2
	} else {# For even number of lines, we need the average of the middle
		# two lines.
		low = lc / 2
		high = low + 1
	}
	if(debug) printf("low=%d, high=%d, ", low, high)

	# Compute the median...
	m = 0
	for(i = 1; i <= high; i++) {
		# Read the 1st high lines from the sorted file.
		getline x < tempf
		if(i == low || i == high) {
			# Add values from the middle line or lines.
			m += x
			if(debug) printf("median before averaging=%d, ", m)
		}
	}

	# Close the sorted input file.
	close(tempf)

	# Compute the median.
	if(low != high) m /= 2
	if(debug) print "median=" m

	# Print the original data (reformatted) and add the median to each line.
	for(i = 1; i <= lc; i++) {
		if(debug) printf("lc=%d, i=%d, ", lc, i)
		printf("%s\t%s\t%s\t%s\n", f1[i], f2[i], f3[i], m)
	}

	# Clear the arrays.
	lc = 0
}
NR == 1 {
	# We have the 1st input line.  Print the header.
	printf("%s\t%s\t%s\tmedian\n", $1, $2, $3)

	# Initialize the command to sort field 3 values.
	sortcmd = "sort -n -o \"" tempf "\""
	next
}
{	# We are not on the header line.  If the id field has changed since the
	# previous line, compute the median and print the accumulated data and
	# median for the previous id.
	if(id != $1) median()

	# Save data from the current line to print when we find the next id.
	f1[++lc] = $1
	f2[lc] = $2
	f3[lc] = $3

	# Sort the values from the 3rd field for the current id.
	print $3 | sortcmd
}
END {	# Compute the median and print the accumulated data and median for the
	# last id.
	median()
}' file

# Save the exit code from awk.
ec=$?

# Remove the temp file used to sort the 3rd field for each id.
rm -f "$tempf"

# Exit with awk's exit code.
exit $ec

This was written and tested using the Korn shell on Mac OS X. If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk, /usr/xpg6/bin/awk, or nawk. If you have gawk, you could use its built-in sorting capabilities to sort a copy of the f3[] array instead of using the external sort command to sort that field. If your input was sorted with the primary key being the 1st field (in increasing or decreasing order) and with the secondary key being the 3rd field (sorted as an increasing or decreasing numeric field) you wouldn't need to sort the field 3 data at all. You could just change the line:
Code:
	# Compute the median...

to:
Code:
	# Compute the median…
	m = (f3[low] + f3[high]) / 2

and remove all of the code shown in red above. (Of course the header could not be sorted unless the sort key used and the data in the file sorted the header to be the 1st line of the input file.) This is why the way you sort your input matters!

If you save this script in a file named add_median and have the following data in a file named file:
Code:
id   col2   col3 
fly 0 10
fly 1 10
fly 2 0
fly 3 3
fly 4 0
emu 0 10
emu 1 10
emu 2 0
emu 3 3
emu 4 0
emu 5 0
dog   0       0        
dog   1       4          
dog   2       3          
cat    0       2          
cat    1       4
bee    0       6          
bee    1       3

and invoke the script as:
Code:
./add_median

you'll get the output:
Code:
id	col2	col3	median
fly	0	10	3
fly	1	10	3
fly	2	0	3
fly	3	3	3
fly	4	0	3
emu	0	10	1.5
emu	1	10	1.5
emu	2	0	1.5
emu	3	3	1.5
emu	4	0	1.5
emu	5	0	1.5
dog	0	0	3
dog	1	4	3
dog	2	3	3
cat	0	2	3
cat	1	4	3
bee	0	6	4.5
bee	1	3	4.5

If you want to see the debugging output so you can more easily following what the script is doing, invoke the script with:
Code:
debug=1 ./add_median

and you'll see the output:
Code:
id	col2	col3	median
low=3, high=3, median before averaging=3, median=3
lc=5, i=1, fly	0	10	3
lc=5, i=2, fly	1	10	3
lc=5, i=3, fly	2	0	3
lc=5, i=4, fly	3	3	3
lc=5, i=5, fly	4	0	3
low=3, high=4, median before averaging=0, median before averaging=3, median=1.5
lc=6, i=1, emu	0	10	1.5
lc=6, i=2, emu	1	10	1.5
lc=6, i=3, emu	2	0	1.5
lc=6, i=4, emu	3	3	1.5
lc=6, i=5, emu	4	0	1.5
lc=6, i=6, emu	5	0	1.5
low=2, high=2, median before averaging=3, median=3
lc=3, i=1, dog	0	0	3
lc=3, i=2, dog	1	4	3
lc=3, i=3, dog	2	3	3
low=1, high=2, median before averaging=2, median before averaging=6, median=3
lc=2, i=1, cat	0	2	3
lc=2, i=2, cat	1	4	3
low=1, high=2, median before averaging=3, median before averaging=9, median=4.5
lc=2, i=1, bee	0	6	4.5
lc=2, i=2, bee	1	3	4.5


Last edited by Don Cragun; 03-20-2014 at 06:08 AM.. Reason: Mark more code that should be deleted if field 3 is sorted in the input as a secondary sort key.
This User Gave Thanks to Don Cragun For This Post:
# 7  
Old 03-20-2014
Quote:
Originally Posted by verse123
Hi Corona, I have sorted by a different column and would like to find the median as a function of another sort. That's why I don't need to necessarily sort col 3.
I don't see anything in your input or output that looks remotely like a median. I don't understand.
 
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Beginners Questions & Answers

Data match 2 files based on first 2 columns matching only and join if match

Hi, i have 2 files , the data i need to match is in masterfile and i need to pull out column 3 from master if column 1 and 2 match and output entire row to new file I have tried with join and awk and i keep getting blank outputs or same file is there an easier way than what i am... (4 Replies)
Discussion started by: axis88
4 Replies

2. Shell Programming and Scripting

awk to update file based on partial match in field1 and exact match in field2

I am trying to create a cronjob that will run on startup that will look at a list.txt file to see if there is a later version of a database using database.txt as the source. The matching lines are written to output. $1 in database.txt will be in list.txt as a partial match. $2 of database.txt... (2 Replies)
Discussion started by: cmccabe
2 Replies

3. Shell Programming and Scripting

Merge lines based on match

I am trying to merge two lines to one based on some matching condition. The file is as follows: Matches filter: 'request ', timestamp, <HTTPFlow request=<GET: Matches filter: 'request ', timestamp, <HTTPFlow request=<GET: Matches filter: 'request ', timestamp, <HTTPFlow ... (8 Replies)
Discussion started by: jamie_123
8 Replies

4. Shell Programming and Scripting

Match based on criteria to file

Trying to match $1 of target.txt to $5 of file.txt. If there is a match then in an output.txt file $1,$1 (row underneath),$6,$4,$7 from file.txt are printed on the same line as $1 of target.txt. The input is from excel and the output should be tab-deliminated. Thank you :). target.txt... (2 Replies)
Discussion started by: cmccabe
2 Replies

5. Shell Programming and Scripting

New files based off match or no match

Trying to match $2 in original_targets with $2 of new_targets . If the two numbers match exactly then a match.txt file is outputted using the information in the new_targets in the beginning 4 fields $1, $2, $3, $4 and value of $4 in the original_targets . If there is "No Match" then a no... (2 Replies)
Discussion started by: cmccabe
2 Replies

6. UNIX for Advanced & Expert Users

Match and print based on columns

HI, I have 2 different questions in this thread. Consider 2 files as input (input file have different line count ) File 1 1 1 625 56 1 12 657 34 1 9 25 45 1 2 20 54 67 3 25 35 27 4 45 73 36 5 125 56 45 File2 1 1 878 76 1 9 83 67 2 20 73 78 4 47 22 17 3 25 67 99 (4 Replies)
Discussion started by: rossi
4 Replies

7. Shell Programming and Scripting

Match files based on either of the two columns awk

Dear Shell experts, I have 2 files with structure: File 1: ID and count head test_GI_count1.txt 1000094 2 10039307 1 10039641 1 10047177 11 10047359 1 1008555 2 10120302 1 10120672 13 10121776 1 10121865 32 And 2nd file: head Protein_gi_GeneID_symbol.txt protein_gi GeneID... (11 Replies)
Discussion started by: smitra
11 Replies

8. UNIX for Dummies Questions & Answers

Display/Cut the characters based on match

I have input file like this update tablename set column1='ABC',column2='BBC' where columnx=1 and columny=100 and columnz='10000001' update tablename set column1='ABC',column2='BBC',column3='CBC' where columnx=1 and columny=100 and columnz='10000002' update tablename set column1='ABC' where... (1 Reply)
Discussion started by: nsuresh316
1 Replies

9. Shell Programming and Scripting

Awk based script to find the median of all individual columns in a data file

Hi All, I have some data like below. Step1,Param1,Param2,Param3 1,2,3,4 2,3,4,5 2,4,5,6 3,0,1,2 3,0,0,0 3,2,1,3 ........ so on Where I need to find the median(arithmetic) of each column from Param1...to..Param3 for each set of Step1 values. (Sort each specific column, if the... (5 Replies)
Discussion started by: ks_reddy
5 Replies

10. Shell Programming and Scripting

Split a record based on particular match

Hi , I have a requirement to split the record based on particular match using UNIX. Case1: Input Record : 10.44.48.63;"Personals/Dating;sports";1441 Output Records : 10.44.48.63;Personals/Dating;1441;Original 10.44.48.63;sports;1441;Dummy Case2: Input Record : ... (5 Replies)
Discussion started by: mksuneel
5 Replies
Login or Register to Ask a Question