Ranking data points from multiple files

08-02-2016

Registered User

Join Date: Jul 2012

Last Activity: 22 November 2019, 4:29 PM EST

Location: San Jose, CA, USA

Posts: 12,315

Thanks Given: 952

Thanked 4,560 Times in 3,818 Posts

And when I run the script I suggested with those three files, I get the output:

Code:

24.5625	-81.8125	39.16	1
24.5625	-81.8125	74.28	2
24.5625	-81.7708	40.81	1
24.5625	-81.7708	72.68	2
24.5625	-81.7292	46.73	1
24.5625	-81.7292	66.90	2
24.5625	-81.6875	52.67	1
24.5625	-81.6875	61.92	2
24.6042	-81.6458	57.16	1
24.6042	-81.6458	62.22	2
24.6458	-81.5625	60.11	1
24.6458	-81.5625	66.18	2
24.6458	-81.4792	62.80	1
24.6458	-81.4792	68.19	2
24.6875	-81.5625	62.20	1
24.6875	-81.5625	67.32	2
24.6875	-81.3958	68.01	1
24.6875	-81.3958	71.72	2
24.7292	-81.3958	69.86	1
24.7292	-81.3958	73.26	2
24.7708	-80.9375	85.71	1
24.7708	-80.9375	90.29	2
25.1458	-81.1042	116.34	1
25.1458	-81.1042	159.11	2
25.1458	-81.0625	117.04	1
25.1458	-81.0625	161.78	2
25.1458	-81.0208	119.01	1
25.1458	-81.0208	162.54	2
25.1458	-80.9792	118.53	1
25.1458	-80.9792	163.41	2
25.1458	-80.9375	118.07	1
25.1458	-80.9375	169.29	2
25.1458	-80.7708	142.98	1
25.1458	-80.7708	150.50	2
25.1458	-80.7292	145.82	1
25.1458	-80.7292	149.23	2
25.1458	-80.4375	122.51	1
25.1458	-80.4375	171.91	2
25.1458	-80.3958	120.30	1
25.1458	-80.3958	172.67	2
25.1875	-81.1042	122.42	1
25.1875	-81.1042	168.44	2
25.1875	-81.0625	125.46	1
25.1875	-81.0625	170.80	2
25.1875	-81.0208	125.53	1
25.1875	-81.0208	173.15	2
25.1875	-80.9792	125.67	1
25.1875	-80.9792	176.74	2
25.1875	-80.9375	127.46	1
25.1875	-80.9375	176.95	2
25.1875	-80.8958	130.94	1
25.1875	-80.8958	176.87	2

Which lines in this output have a point1, point2 pair that is not present in the file 201606.pnt?

If I add one more line to 201506.pnt containing:

Code:

1 2 3

which is a point1, point2 pair that is not present in 201606.pnt, and rerun the script, I get exactly the same output, showing that the line I added was ignored because that point pair is not present in 201606.pnt.
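
The filtering described above can be tried in isolation. This is only a minimal sketch, not the script under discussion: the helper name keep_base_pairs, the scratch file names, and the sample values are all made up for illustration.

```shell
#!/bin/sh
# keep_base_pairs BASE OTHER...
# Hypothetical helper: print the lines of the OTHER files whose first two
# fields form a pair that also appears in BASE.
keep_base_pairs() {
    awk '
    # First file operand: remember each point1, point2 pair.
    FNR == NR { seen[$1, $2]; next }
    # Remaining files: print only lines whose pair was remembered.
    ($1, $2) in seen
    ' "$@"
}

# Demonstration with made-up scratch files.
tmp=$(mktemp -d) || exit 1
printf '%s\n' '24.5625 -81.8125 39.16' > "$tmp/base.pnt"
printf '%s\n' '24.5625 -81.8125 74.28' '1 2 3' > "$tmp/other.pnt"
keep_base_pairs "$tmp/base.pnt" "$tmp/other.pnt"
rm -rf "$tmp"
```

The demonstration prints only the 74.28 line; the 1 2 3 line is dropped because its pair never appears in the base file. Note that FNR == NR identifies the first file only when that file is not empty.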

Please explain what the script I suggested is doing incorrectly with the new data you just showed us in post #42.

Don Cragun

08-02-2016

Registered User

Join Date: Aug 2011

Last Activity: 5 January 2020, 12:29 PM EST

Posts: 152

Thanks Given: 48

Thanked 3 Times in 3 Posts

I am expecting a single line of output for each point in 201606.pnt, showing the value from that file and its rank.

For instance (please ignore the field separation issues below):

Code:

24.5625	-81.8125	39.16  1
24.5625	-81.7708	40.81  1
24.5625	-81.7292	46.73	  1
24.5625	-81.6875	52.67  1
24.6042	-81.6458	62.22	  2
24.6458	-81.5625	66.18	  2
24.6458	-81.4792	68.19	  2
24.6875	-81.5625	67.32  2 
24.6875	-81.3958	71.72  2
24.7292	-81.3958	73.26	 2
24.7708	-80.9375	90.29	 2
25.1458	-81.1042	116.34  1
25.1458	-81.0625	117.04  1
25.1458	-81.0208	119.01  1
25.1458	-80.9792	118.53  1
25.1458	-80.9375	118.07  1
25.1458	-80.7708	142.98  1
25.1458	-80.7292	149.23  2
25.1458	-80.4375	171.91  2
25.1458	-80.3958	172.67  2
25.1875	-81.1042	122.42  1
25.1875	-81.0625	125.46  1
25.1875	-81.0208	125.53  1
25.1875	-80.9792	125.67  1
25.1875	-80.9375	127.46  1
25.1875	-80.8958	130.94  1

ncwxpanther

08-02-2016

Ah! Now I think I understand what you want.

I'm not going to have much time to work on this today, but I should have something that will work in the next couple of days. (I think the changes are minor, but it is going to need some testing.)

Don Cragun

08-02-2016

The following seems to do what I think you want...

Code:

#!/bin/ksh
# Final component of script name.
IAm=${0##*/}

# Absolute pathname of control file.
CF='/some/dir/control.status'

# Absolute pathname of directory containing the *.pnt files to be processed.
DataDir='/some/same_or_other/directory'

if ! cd "$DataDir"
then	exit 1
fi
if ! read BaseYear BaseMonth < "$CF"
then	exit 2
fi
BaseFile="$BaseYear$BaseMonth.pnt"
if [ ! -r "$BaseFile" ]
then	printf "%s: Can't read base file (%s).\n" "$IAm" "$DataDir/$BaseFile" >&2
	exit 3
fi

sort -bn -k1,1 -k2,2 -k3,3 *"$BaseMonth.pnt" | awk '
# Set output field separator to <tab>.
BEGIN {	OFS = "\t"
}

# Function to print a group of elements that all have identical values in the
# first and second input fields.
function print_group() {
	# Check to see if we have data to process...
	if(cnt) {
		# Look for the 1st change in values after the mid-point for
		# this group.
		for(i = int((cnt + 1) / 2) + 1; i <= cnt; i++)
			if(d[i] != d[i - 1])
				break
		# For each set of duplicate values after the midpoint, reset
		# the rank for those points to the end of the set instead of
		# the start of the set.
		while(i < cnt) {
			if(c[i] > 1)
				for(j = i; j <= i + c[i] - 1; j++)
					r[j] += c[i] - 1
			i += c[i]
		}
		# Print the data and rank for each element of the set in the
		# base file.
		for(i = 1; i <= cnt; i++)
			if(d[i] in P) {
				print d[i], r[i]
				delete P[d[i]]
			}
	}
	# Reset variables for next group.
	cnt = con = 0
}

# Gather points to process from 1st input file...
FNR == NR {
	# Gather data from the base file (given as first file operand)...
	# Gather list of point pairs to be processed.
	L[$1 OFS $2]

	# Gather list of points and value triples to be printed.
	P[$1 OFS $2 OFS $3]
	next
}

# Skip points not found in the 1st input file...
!(($1 OFS $2) in L) {
	next
}

# Look for a change in the first two input fields...
$1 != l1 || $2 != l2 || NR == 1 {
	# We have found a change in values.  Print the results from the
	# previous group, if there was one.
	print_group()

	# Note first two field values so we notice the next change.
	l1 = $1
	l2 = $2

	# Clear the remembered 3rd field value to prevent contamination from
	# the previous group.
	l3 = ""
}

# Gather data for this group...
{	# Save the data for this line.
	d[++cnt] = $1 OFS $2 OFS $3

	# Calculate the rank for this line.  (At this point, we do not know
	# what the midpoint will be for this group, so all of these are saved
	# with the rank being the lowest rank for the set of lines with
	# identical third field values.  The print_group() function will make
	# adjustments for sets of ranks after the midpoint in the group.)
	if($3 != l3 || cnt == 1) {
		# A change in field 3 values has been found.  Save the value
		# and rank for this set.
		l3 = $3
		lr = cnt
		# Clear the count of the consecutive number of lines with the
		# same value.
		con = 0
	} 

	# Set the rank for this line.
	r[cnt] = lr

	# Set number of consecutive lines that have this third field value.
	for(i = cnt - con++; i <= cnt; i++)
		c[i] = con
}

# We have found EOF.
END {	# Print the data for the last group.
	print_group()
}' "$BaseFile" -

This code is written assuming that it is possible for more than one entry for a pair of points to appear in a single *.pnt file. If only one entry for a given pair of points can appear in a *.pnt file, you can make this script run a little bit faster by changing the following line in the print_group() function from:

Code:

				delete P[d[i]]

to:

Code:

				break

With the sample inputs provided in post #42, it produces the output:

Code:

24.5625	-81.8125	39.16	1
24.5625	-81.7708	40.81	1
24.5625	-81.7292	46.73	1
24.5625	-81.6875	52.67	1
24.6042	-81.6458	62.22	2
24.6458	-81.5625	66.18	2
24.6458	-81.4792	68.19	2
24.6875	-81.5625	67.32	2
24.6875	-81.3958	71.72	2
24.7292	-81.3958	73.26	2
24.7708	-80.9375	90.29	2
25.1458	-81.1042	116.34	1
25.1458	-81.0625	117.04	1
25.1458	-81.0208	119.01	1
25.1458	-80.9792	118.53	1
25.1458	-80.9375	118.07	1
25.1458	-80.7708	142.98	1
25.1458	-80.7292	149.23	2
25.1458	-80.4375	171.91	2
25.1458	-80.3958	172.67	2
25.1875	-81.1042	122.42	1
25.1875	-81.0625	125.46	1
25.1875	-81.0208	125.53	1
25.1875	-80.9792	125.67	1
25.1875	-80.9375	127.46	1
25.1875	-80.8958	130.94	1
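
The tie handling described in the script's comments (ties keep the rank of the start of their set, except that sets starting after the midpoint take the rank of the end of their set) can be illustrated on a single group of values. This is a reduced sketch under stated assumptions, not a replacement for the script above: rank_values is a hypothetical name, it handles only one point1, point2 group, and it compares values numerically, whereas the script compares the saved text.

```shell
#!/bin/sh
# rank_values: hypothetical helper that reads one numeric value per line (the
# values for a single point1, point2 pair) and prints "value rank" lines.
rank_values() {
    sort -n | awk '
    { v[++n] = $1 }
    END {
        mid = int((n + 1) / 2)    # same midpoint expression as the script
        i = 1
        while (i <= n) {
            # Find the last member of the run of equal values starting at i.
            j = i
            while (j < n && v[j + 1] == v[i])
                j++
            # Runs that start after the midpoint take the rank of their
            # last member; all other runs take the rank of their first.
            rank = (i > mid) ? j : i
            for (k = i; k <= j; k++)
                print v[k], rank
            i = j + 1
        }
    }'
}

# Made-up values: six entries, so the midpoint is 3.
printf '%s\n' 40 10 20 30 20 40 | rank_values
```

With these made-up values, the two 20s start at position 2 (not after the midpoint) and both get rank 2, while the two 40s start at position 5 and both get rank 6.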

Don Cragun

10-03-2016

Just wanted to follow up by saying this code is working well.

Within the routine, is there a way to print out, in a separate file, the values that are tied with another value? Doing this as a post-processing step is very time consuming, as I am grepping for the exact values in 100+ files that each contain 400k lines.

For instance, if a point's values are the same as another point's values, then output that point into another file.

Current Output

Code:

24.5625	-81.8125	39.16	1
24.5625	-81.7708	40.81	1

Data File

Code:

24.5625	-81.8125	39.16
24.5625	-81.7708	40.50

Expected Ties Output

Code:

24.5625	-81.8125	39.16 1

ncwxpanther

10-04-2016

After we have seemingly met all of your requirements at least three times in this thread, you have now come back in post #47 with another set of ambiguous new requirements.

What have you tried to do to meet your new requirements?

What are you trying to do? Show us sample inputs and the exact corresponding sample output that you want to produce, along with a CLEAR English description of the logic used to produce those outputs from the input files you are processing.

Show us how you have attempted to modify the code that you have been given to solve earlier versions of your problem, and show us where you got stuck trying to produce the output you now want. We are not here to act as your unpaid programming staff, but we are happy to help you fix your code if you make an honest effort on your own. If you aren't interested in learning how to write your own code, hire someone to do the work for you.

Don Cragun
