Ranking data points from multiple files

06-27-2016

Registered User

152, 3

Join Date: Aug 2011

Last Activity: 5 January 2020, 12:29 PM EST

Posts: 152

Thanks Given: 48

Thanked 3 Times in 3 Posts

It prints the tied values as the same rank. Once the data is different the rank will increment by x number of ties.

Below is an example of the tied values having the same rank, but then skipping to 102 once the tie is broken.

Code:

 46.8542 -121.7292    -1.59    99
 46.8542 -121.7292    -1.56    100
 46.8542 -121.7292    -1.56    100
 46.8542 -121.7292    -1.52    102

Does that make sense?

ncwxpanther

View Public Profile for ncwxpanther

Find all posts by ncwxpanther

06-28-2016

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

If I understand what you're trying to do (and I am not sure that I do), the following seems to do what you want and should be MUCH faster than what you are currently doing. It calls sort once (to sort all of the data in your input files) and awk once (to process each different group of lines found in the sorted input with the same values in the first two fields; and, for each group, to rank the lines found based on the third field; with ranking at both tails of the ranges as requested in post #18 in this VERY long thread).

This script assumes that your input fields are separated by tabs (as shown in your sample data in post #22) and not by spaces as shown in many of your other examples. And, based on that, it uses a tab to separate the rank in the output from the input fields.

This script was written and tested using a Korn shell (instead of bash) but will also work with any shell that uses Bourne shell syntax (including bash and many other). Since you're working with a few hundred files, this code assumes that you will run this script while located in the directory that contains the data files (instead of a parent directory) so that you can reference files with shorter names (such as *05.pnt instead of test/*05.pnt) because you can name more files on the command line without running into ARG_MAX limits if you use shorter names for each of the file pathnames. And this code assumes that you will provide a list of the files to be processed as command line arguments instead of hardcoding the names into the script. If the list of files you need to process does exceed ARG_MAX limits even with the suggested ways to shorten your argument list, we can use a for loop to feed data to sort using xargs and cat, but on most systems in use today, I wouldn't expect that to be necessary.

The script is:

Code:

#!/bin/ksh
sort -n -k1,1 -k2,2 -k3,3 "$@" | awk '
# Set input and output field separators to a tab character.
BEGIN {	FS = OFS = "\t"
}

# Function to print a group of elements that all have identical values in the
# first and second input fields.
function print_group() {
	# Check to see if we have data to process...
	if(cnt) {
		# Look for the 1st change in values after the mid-point for this
		# set group.
		for(i = int((cnt + 1) / 2) + 1; i <= cnt; i++)
			if(d[i] != d[i - 1])
				break
		# For each set of duplicate values after the midpoint, reset the
		# rank for those points to the end of the set instead of the
		# start of the set.
		while(i < cnt) {
			if(c[i] > 1)
				for(j = i; j <= i + c[i] - 1; j++)
					r[j] += c[i] - 1
			i += c[i]
		}
		# Print the data and rank for each element of the set.
		for(i = 1; i <= cnt; i++)
			print d[i], r[i]
	}
	# Reset variables for next group.
	cnt = con = 0
}

# Look for a change in the first two input fields...
$1 != l1 || $2 != l2 || NR == 1 {
	# We have found a change in values.  Print the results from the previous
	# group, if there was one.
	print_group()

	# Note first two field values so we notice the next change.
	l1 = $1
	l2 = $2

	# Clear the remembered 3rd field value to prevent contamination from the
	# previous group.
	l3 = ""
}

# Gather data for this group...
{	# Save the data for this line.
	d[++cnt] = $0

	# Calculate the rank for this line.  (At this point, we do not know what
	# the midpoint will be for this group, so all of these are saved with
	# the rank being the lowest rank for the set of lines with identical
	# third field values.  The group_print() function wll make adjustments
	# for sets of ranks after the midpoint in the group.)
	if($3 != l3 || cnt == 1) {
		# A change in field 3 values has been found.  Save the value and
		# rank for this set.
		l3 = $3
		lr = cnt
		# Clear the count of the consecutive number of lines with the
		# same value.
		con = 0
	} 

	# Set the rank for this line.
	r[cnt] = lr

	# Set number of consecutive lines that have this third field value.
	for(i = cnt - con++; i <= cnt; i++)
		c[i] = con
}

# We have found EOF.
END {	# Print the data for the last group.
	print_group()
}'

With a named file named post20 containing data extrapolated from your post #20:

Code:

and assuming that you saved the above script in a file named rank_files and made it executable, then the command:

Code:

rank_files post20

would produce the output:

Code:

And, with the data you supplied in post #27 split out into 118 different input files with names used there, the command:

Code:

rank_files {1895..2016}05.pnt

produces the output:

Code:

46.8542	-121.7292	-6.08	1
46.8542	-121.7292	-5.99	2
46.8542	-121.7292	-5.66	3
46.8542	-121.7292	-5.61	4
46.8542	-121.7292	-5.49	5
46.8542	-121.7292	-5.48	6
46.8542	-121.7292	-5.42	7
46.8542	-121.7292	-5.33	8
46.8542	-121.7292	-5.33	8
46.8542	-121.7292	-5.29	10
46.8542	-121.7292	-5.28	11
46.8542	-121.7292	-5.15	12
46.8542	-121.7292	-5.1	13
46.8542	-121.7292	-5.09	14
46.8542	-121.7292	-4.93	15
46.8542	-121.7292	-4.74	16
46.8542	-121.7292	-4.73	17
46.8542	-121.7292	-4.62	18
46.8542	-121.7292	-4.58	19
46.8542	-121.7292	-4.56	20
46.8542	-121.7292	-4.55	21
46.8542	-121.7292	-4.53	22
46.8542	-121.7292	-4.51	23
46.8542	-121.7292	-4.47	24
46.8542	-121.7292	-4.41	25
46.8542	-121.7292	-4.32	26
46.8542	-121.7292	-4.3	27
46.8542	-121.7292	-4.26	28
46.8542	-121.7292	-4.16	29
46.8542	-121.7292	-4.14	30
46.8542	-121.7292	-4.1	31
46.8542	-121.7292	-4.09	32
46.8542	-121.7292	-4.09	32
46.8542	-121.7292	-4	34
46.8542	-121.7292	-3.99	35
46.8542	-121.7292	-3.94	36
46.8542	-121.7292	-3.88	37
46.8542	-121.7292	-3.87	38
46.8542	-121.7292	-3.83	39
46.8542	-121.7292	-3.77	40
46.8542	-121.7292	-3.76	41
46.8542	-121.7292	-3.72	42
46.8542	-121.7292	-3.62	43
46.8542	-121.7292	-3.61	44
46.8542	-121.7292	-3.49	45
46.8542	-121.7292	-3.49	45
46.8542	-121.7292	-3.46	47
46.8542	-121.7292	-3.43	48
46.8542	-121.7292	-3.4	49
46.8542	-121.7292	-3.37	50
46.8542	-121.7292	-3.34	51
46.8542	-121.7292	-3.32	52
46.8542	-121.7292	-3.31	53
46.8542	-121.7292	-3.28	54
46.8542	-121.7292	-3.27	55
46.8542	-121.7292	-3.27	55
46.8542	-121.7292	-3.23	57
46.8542	-121.7292	-3.21	58
46.8542	-121.7292	-3.2	59
46.8542	-121.7292	-3.17	60
46.8542	-121.7292	-3.17	60
46.8542	-121.7292	-3.12	62
46.8542	-121.7292	-3.11	63
46.8542	-121.7292	-3.08	64
46.8542	-121.7292	-3.06	65
46.8542	-121.7292	-3.05	66
46.8542	-121.7292	-3.04	67
46.8542	-121.7292	-3.02	68
46.8542	-121.7292	-3.01	69
46.8542	-121.7292	-2.98	70
46.8542	-121.7292	-2.93	71
46.8542	-121.7292	-2.84	72
46.8542	-121.7292	-2.8	73
46.8542	-121.7292	-2.77	74
46.8542	-121.7292	-2.76	75
46.8542	-121.7292	-2.75	76
46.8542	-121.7292	-2.7	77
46.8542	-121.7292	-2.67	79
46.8542	-121.7292	-2.67	79
46.8542	-121.7292	-2.62	80
46.8542	-121.7292	-2.48	81
46.8542	-121.7292	-2.47	82
46.8542	-121.7292	-2.46	83
46.8542	-121.7292	-2.39	84
46.8542	-121.7292	-2.31	85
46.8542	-121.7292	-2.29	86
46.8542	-121.7292	-2.22	87
46.8542	-121.7292	-2.18	88
46.8542	-121.7292	-2.15	89
46.8542	-121.7292	-2.14	90
46.8542	-121.7292	-2.08	91
46.8542	-121.7292	-1.8	92
46.8542	-121.7292	-1.75	93
46.8542	-121.7292	-1.68	94
46.8542	-121.7292	-1.62	96
46.8542	-121.7292	-1.62	96
46.8542	-121.7292	-1.61	98
46.8542	-121.7292	-1.61	98
46.8542	-121.7292	-1.59	99
46.8542	-121.7292	-1.56	101
46.8542	-121.7292	-1.56	101
46.8542	-121.7292	-1.52	102
46.8542	-121.7292	-1.51	103
46.8542	-121.7292	-1.43	104
46.8542	-121.7292	-1.39	106
46.8542	-121.7292	-1.39	106
46.8542	-121.7292	-1.3	107
46.8542	-121.7292	-1.26	108
46.8542	-121.7292	-1.09	109
46.8542	-121.7292	-1.08	110
46.8542	-121.7292	-1.02	112
46.8542	-121.7292	-1.02	112
46.8542	-121.7292	-0.93	113
46.8542	-121.7292	-0.9	114
46.8542	-121.7292	-0.77	115
46.8542	-121.7292	-0.68	116
46.8542	-121.7292	-0.61	117
46.8542	-121.7292	-0.44	118
46.8542	-121.7292	-0.3	119
46.8542	-121.7292	-0.14	120
46.8542	-121.7292	0.71	121
46.8542	-121.7292	1.05	122

which I think does what you want even though the snippet of output you provided in post #27 did not provide the requested adjustments to ranks in the last half of the data for lines with field 1 set to 46.8542 and field 2 set to -121.7292.

It wasn't clear to me why you talked about choosing one file to gather field 1 and 2 values. The above code gathers field 1 and 2 values from all input files specified and produces ranked output for each different set of values in fields 1 and 2. (And, in case some sets of field 1 and 2 values have fewer entries than other values (i.e., do not appear in all input files), the midpoint is recalculated for each group of lines.

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

06-29-2016

Registered User

152, 3

Join Date: Aug 2011

Last Activity: 5 January 2020, 12:29 PM EST

Posts: 152

Thanks Given: 48

Thanked 3 Times in 3 Posts

Quote:

It wasn't clear to me why you talked about choosing one file to gather field 1 and 2 values.

I need to base the ranks off of a particular input file. In this case it would be 201605.pnt
So I figured I would gather fields 1 and 2 from that file; search through the other 100+ files for the same set of values in fields 1 and 2; rank the value in field 3 of 201605.pnt based on those other 100+ files.

The result would be a single value of field 1 and 2 with a (value) and rank. Value is optional and if its there would need to be taken from the primary input - 201605.pnt. In other words the expected output would be 201605.pnt with a rank added as a 4th field. The rank would be based on that value when compared to the other 100+ files with like fields 1 and 2.

I ran your script as

Code:

./grid-ranking.sh {1895..2016}05.pnt

and got output (rather quickly) that has the ranks as "1" and multiple listing of the same fields 1 and 2.

Code:

 25.8125  -80.9375    24.87     1
 25.8125  -80.9375    24.88     1
 25.8125  -80.9375    24.90     1
 25.8125  -80.9375    25.00     1
 25.8125  -80.9375    25.00     1
 25.8125  -80.9375    25.01     1
 25.8125  -80.9375    25.03     1
 25.8125  -80.9375    25.07     1

ncwxpanther

View Public Profile for ncwxpanther

Find all posts by ncwxpanther

06-29-2016

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

Quote:

Originally Posted by ncwxpanther

It prints the tied values as the same rank. Once the data is different the rank will increment by x number of ties.

Ah.

Thank you. This is why having clear output related to your input is so indispensable.

Corona688

View Public Profile for Corona688

Visit Corona688's homepage!

Find all posts by Corona688

06-29-2016

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

Quote:

Originally Posted by Don Cragun

Code:

sort -n -k1,1 -k2,2 -k3,3 "$@"

No matter what I do, I cannot get sort to do anything sensible here. It does not give sort priority to columns 1 and 2, it just ignores the first two -k and obeys the third.

Corona688

View Public Profile for Corona688

Visit Corona688's homepage!

Find all posts by Corona688

06-29-2016

Registered User

23,310, 4,623

Join Date: Aug 2005

Last Activity: 7 July 2020, 11:47 AM EDT

Location: Saskatchewan

Posts: 23,310

Thanks Given: 1,331

Thanked 4,623 Times in 4,217 Posts

My latest try:

Code:

REF="test/190005.pnt"

sort  -t' ' -k3,3 -n test/{1900..2016}05.pnt |
        awk 'NR==FNR {  A[$1,$2]=$3 ; next }

        {
                if($3 == LAST[$1,$2])   TIE[$1,$2]++;
                else                    TIE[$1,$2]=0;
                C[$1,$2]++
                LAST[$1,$2]=$3
        }

        A[$1,$2] == $3 && !(($1,$2,$3) in P) {
                print $1, $2, $3, C[$1,$2]-TIE[$1,$2]
                P[$1,$2,$3]
        }
' $REF -

Corona688

View Public Profile for Corona688

Visit Corona688's homepage!

Find all posts by Corona688

06-29-2016

Registered User

12,315, 4,560

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

Quote:

Originally Posted by ncwxpanther

Thanks for taking a look at this Don.

Based on your assumptions I have moved all 100+ input files over to my parent directory. I can now reference them as {1895..2016}05.pnt

To clarify your statement at the bottom of your post

I need to base the ranks off of a particular input file. In this case it would be 201605.pnt
So I figured I would gather fields 1 and 2 from that file; search through the other 100+ files for the same set of values in fields 1 and 2; rank the value in field 3 of 201605.pnt based on those other 100+ files.

The result would be a single value of field 1 and 2 with a (value) and rank. Value is optional and if its there would need to be taken from the primary input - 201605.pnt. In other words the expected output would be 201605.pnt with a rank added as a 4th field. The rank would be based on that value when compared to the other 100+ files with like fields 1 and 2.

I ran your script as

Code:

./grid-ranking.sh {1895..2016}05.pnt

and got output (rather quickly) that has the ranks as "1" and multiple listing of the same fields 1 and 2.

Code:

 25.8125  -80.9375    24.87     1
 25.8125  -80.9375    24.88     1
 25.8125  -80.9375    24.90     1
 25.8125  -80.9375    25.00     1
 25.8125  -80.9375    25.00     1
 25.8125  -80.9375    25.01     1
 25.8125  -80.9375    25.03     1
 25.8125  -80.9375    25.07     1

Hi ncwxpanther,
Instead of moving all of your data files to a parent directory, I would just have executed the script in the child directory where the files were located. But, either way should work.

Your data format seems to change every time you post something. As stated in my last post, my code was designed to work with the sample input you provided in post #27 under the title: "Entire Input for a single value (sorted by data)". Note that in that sample data there are no leading spaces, and the field separator between fields is a single tab character; not sequences of one or more spaces.

In the data shown above, however, there is a leading space character and the field separators are sequences of two, four, or six space characters. If there were no tab characters in your input this time, my script would have only seen one input field; not three. Please make the following changes to the script I suggested:
Change lines 2-4 from:

Code:

sort -n -k1,1 -k2,2 -k3,3 "$@" | awk '
# Set input and output field separators to <tab>.
BEGIN { FS = OFS = "\t"

to:

Code:

sort -bn -k1,1 -k2,2 -k3,3 "$@" | awk '
# Set output field separator to <tab>.
BEGIN {	OFS = "\t"

and change line 51 from:

Code:

	d[++cnt] = $0

to:

Code:

	d[++cnt] = $1 OFS $2 OFS $3

and try again. This will normalize the output using a single tab as the output field separator no matter how many spaces and tabs appeared before the 1st field or between other fields in your input files. It does, however, still assume that your real field 1 and 2 input data is numeric (which might or might not work with some of your early sample data with uppercase alphabetic values in the 1st two fields). If the 1st two fields are alphanumeric instead of numeric, you could change line 2 in the script I suggested above to:

Code:

sort -b -k1,1 -k2,2 -k3,3n "$@" | awk '

As I said before, this code provides individual rankings for each different pair of field 1 and field 2 values in a single output file. If each of your 122 sample input files contains five different pairs of field 1 and 2 values and the same values appears in all of your input files, you will get five rank-ordered lists in the output sorted by the field 1 and 2 values with each list containing 122 entries. If 5 of your 122 input files contain an additional line with another pair of field 1 and 2 values, there will be another list in the output with only 5 entries. If you really need to limit the output to only contain field 1 and field 2 value combinations that appear in a specific file, I can add code to my script to make that happen. But, of course, I still find your continually changing descriptions of your desired output confusing and I may have completely misinterpreted what you are trying to do.

Quote:

Originally Posted by Corona688

No matter what I do, I cannot get sort to do anything sensible here. It does not give sort priority to columns 1 and 2, it just ignores the first two -k and obeys the third.

Hi Corona688,
With the data I was using field 1 never had any leading spaces and the field separator was always a single tab character, so:

Code:

sort -n -k1,1 -k2,2 -k3,3 file...

worked with the data I was using. But, with the above sort command, leading spaces and tabs are part of the data being sorted. To ignore leading blanks, we also need the -b option to be specified before any of the -k sort key options.

Hopefully,

Code:

sort -bn -k1,1 -k2,2 -k3,3 file...

or:

Code:

sort -b -k1,1 -k2,2 -k3,3n file...

will work better for you (depending on whether the 1st two fields are numeric or alphanumeric, respectively).

Note also that the standards say that if the -t char option is not specified, sort uses strings of one or more adjacent blanks (i.e., <space>s and <tab>s) as a field separator. But, if you include a -t option on the command line, each occurrence of char shall treated as a field separator. So, if an input line has a leading space and two spaces between the 2nd and 3rd "fields" (as recognized by awk with the default FS), sort would see field 1 as the empty string before the 1st space, field 2 as the 1st non-empty string, and field 3 would be the empty string between the next two spaces.

Are we having fun yet?

This User Gave Thanks to Don Cragun For This Post:

Don Cragun

View Public Profile for Don Cragun

Find all posts by Don Cragun

Shell Programming and Scripting

Ranking data points from multiple files

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

In PErl script: need to read the data one file and generate multiple files based on the data

Discussion started by: Sanjeev G

2. UNIX for Dummies Questions & Answers

Stack data from multiple files into one, with variable column files

Discussion started by: wamshi

3. Shell Programming and Scripting

Grabbing data between 2 points in text file

Discussion started by: Mikey

4. UNIX for Dummies Questions & Answers

Finding data value that contains x% of points

Discussion started by: ida1215

5. UNIX for Dummies Questions & Answers

Using AWK: Extract data from multiple files and output to multiple new files

Discussion started by: Liverpaul09

6. Programming

GNUPLOT- how to change the style of data points

Discussion started by: natasha

7. UNIX for Dummies Questions & Answers

How to get data only inside polygon created by points which is part of whole data from file?

Discussion started by: reva

8. Shell Programming and Scripting

Group search (multiple data points) in Linux

Discussion started by: Lucky Ali

9. Shell Programming and Scripting

recoding data points using SED??

Discussion started by: doobedoo

10. Shell Programming and Scripting

to extarct data points

Discussion started by: cdfd123