Shell Programming and Scripting: Ranking data points from multiple files. Post 302976374 by Don Cragun on Tuesday 28th of June 2016 08:03:18 PM
If I understand what you're trying to do (and I am not sure that I do), the following seems to do what you want and should be MUCH faster than what you are currently doing. It calls sort once (to sort all of the data in your input files) and awk once. The awk script processes each group of sorted lines that have the same values in the first two fields and, for each group, ranks the lines based on the third field, with ranking applied at both tails of the range as requested in post #18 in this VERY long thread.

This script assumes that your input fields are separated by tabs (as shown in your sample data in post #22) and not by spaces as shown in many of your other examples. And, based on that, it uses a tab to separate the rank in the output from the input fields.
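If you aren't sure whether your files really are tab separated, here is a quick way to check (a sketch only; it builds a throwaway sample file because I don't have your data, so substitute one of your real file names for /tmp/sample.pnt):

```shell
# Create a throwaway sample file for illustration; substitute one of
# your real data files (e.g. 189505.pnt) for /tmp/sample.pnt.
printf '46.8542\t-121.7292\t-6.08\n' > /tmp/sample.pnt

# Count the tab-separated fields on the first line; printing 3 means
# the tab-separation assumption holds for that file.
awk -F'\t' 'NR == 1 { print NF; exit }' /tmp/sample.pnt
```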

This script was written and tested using a Korn shell (instead of bash), but it will also work with any shell that uses Bourne shell syntax (including bash and many others). Since you're working with a few hundred files, this code assumes that you will run the script while located in the directory that contains the data files (instead of a parent directory), so that you can reference the files with shorter names (such as *05.pnt instead of test/*05.pnt); shorter pathnames let you name more files on the command line before running into ARG_MAX limits. This code also assumes that you will provide the list of files to be processed as command line arguments instead of hardcoding the names into the script. If the list of files you need to process does exceed ARG_MAX limits even with the suggested ways to shorten your argument list, we can use a for loop to feed data to sort using xargs and cat; but on most systems in use today, I wouldn't expect that to be necessary.
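For reference, that xargs/cat fallback would look something like the following sketch. The files and directory here are made up for illustration; the point is just that printf is a shell built-in (so the for loop is not subject to ARG_MAX), and xargs splits the name list into chunks small enough for cat:

```shell
# Made-up demo data; with real data these would be your *05.pnt files.
mkdir -p /tmp/rank_demo
printf '5\t1\t200\n5\t1\t100\n' > /tmp/rank_demo/a.pnt
printf '5\t1\t300\n' > /tmp/rank_demo/b.pnt
cd /tmp/rank_demo

# Emit one file name per line (no ARG_MAX issue here), let xargs run
# cat in suitably sized batches, and feed the combined data to sort.
for f in *.pnt
do	printf '%s\n' "$f"
done | xargs cat | sort -n -k1,1 -k2,2 -k3,3
```

The sorted stream would then be piped into the awk script exactly as before; only the front end of the pipeline changes.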

The script is:
Code:
#!/bin/ksh
sort -n -k1,1 -k2,2 -k3,3 "$@" | awk '
# Set input and output field separators to a tab character.
BEGIN {	FS = OFS = "\t"
}

# Function to print a group of elements that all have identical values in the
# first and second input fields.
function print_group() {
	# Check to see if we have data to process...
	if(cnt) {
		# Look for the 1st change in values after the mid-point for
		# this group.
		for(i = int((cnt + 1) / 2) + 1; i <= cnt; i++)
			if(d[i] != d[i - 1])
				break
		# For each set of duplicate values after the midpoint, reset the
		# rank for those points to the end of the set instead of the
		# start of the set.
		while(i < cnt) {
			if(c[i] > 1)
				for(j = i; j <= i + c[i] - 1; j++)
					r[j] += c[i] - 1
			i += c[i]
		}
		# Print the data and rank for each element of the set.
		for(i = 1; i <= cnt; i++)
			print d[i], r[i]
	}
	# Reset variables for next group.
	cnt = con = 0
}

# Look for a change in the first two input fields...
$1 != l1 || $2 != l2 || NR == 1 {
	# We have found a change in values.  Print the results from the previous
	# group, if there was one.
	print_group()

	# Note first two field values so we notice the next change.
	l1 = $1
	l2 = $2

	# Clear the remembered 3rd field value to prevent contamination from the
	# previous group.
	l3 = ""
}

# Gather data for this group...
{	# Save the data for this line.
	d[++cnt] = $0

	# Calculate the rank for this line.  (At this point, we do not know what
	# the midpoint will be for this group, so all of these are saved with
	# the rank being the lowest rank for the set of lines with identical
	# third field values.  The print_group() function will make adjustments
	# for sets of ranks after the midpoint in the group.)
	if($3 != l3 || cnt == 1) {
		# A change in field 3 values has been found.  Save the value and
		# rank for this set.
		l3 = $3
		lr = cnt
		# Clear the count of the consecutive number of lines with the
		# same value.
		con = 0
	} 

	# Set the rank for this line.
	r[cnt] = lr

	# Set number of consecutive lines that have this third field value.
	for(i = cnt - con++; i <= cnt; i++)
		c[i] = con
}

# We have found EOF.
END {	# Print the data for the last group.
	print_group()
}'

With a file named post20 containing data extrapolated from your post #20:
Code:
5	1	100
5	1	200
5	1	200
5	1	300
5	1	400
5	1	500
5	1	600
5	1	600
5	1	700
5	1	800

and assuming that you saved the above script in a file named rank_files and made it executable, then the command:
Code:
rank_files post20

would produce the output:
Code:
5	1	100	1
5	1	200	2
5	1	200	2
5	1	300	4
5	1	400	5
5	1	500	6
5	1	600	8
5	1	600	8
5	1	700	9
5	1	800	10

And, with the data you supplied in post #27 split out into 122 different input files with the names used there, the command:
Code:
rank_files {1895..2016}05.pnt

produces the output:
Code:
46.8542	-121.7292	-6.08	1
46.8542	-121.7292	-5.99	2
46.8542	-121.7292	-5.66	3
46.8542	-121.7292	-5.61	4
46.8542	-121.7292	-5.49	5
46.8542	-121.7292	-5.48	6
46.8542	-121.7292	-5.42	7
46.8542	-121.7292	-5.33	8
46.8542	-121.7292	-5.33	8
46.8542	-121.7292	-5.29	10
46.8542	-121.7292	-5.28	11
46.8542	-121.7292	-5.15	12
46.8542	-121.7292	-5.1	13
46.8542	-121.7292	-5.09	14
46.8542	-121.7292	-4.93	15
46.8542	-121.7292	-4.74	16
46.8542	-121.7292	-4.73	17
46.8542	-121.7292	-4.62	18
46.8542	-121.7292	-4.58	19
46.8542	-121.7292	-4.56	20
46.8542	-121.7292	-4.55	21
46.8542	-121.7292	-4.53	22
46.8542	-121.7292	-4.51	23
46.8542	-121.7292	-4.47	24
46.8542	-121.7292	-4.41	25
46.8542	-121.7292	-4.32	26
46.8542	-121.7292	-4.3	27
46.8542	-121.7292	-4.26	28
46.8542	-121.7292	-4.16	29
46.8542	-121.7292	-4.14	30
46.8542	-121.7292	-4.1	31
46.8542	-121.7292	-4.09	32
46.8542	-121.7292	-4.09	32
46.8542	-121.7292	-4	34
46.8542	-121.7292	-3.99	35
46.8542	-121.7292	-3.94	36
46.8542	-121.7292	-3.88	37
46.8542	-121.7292	-3.87	38
46.8542	-121.7292	-3.83	39
46.8542	-121.7292	-3.77	40
46.8542	-121.7292	-3.76	41
46.8542	-121.7292	-3.72	42
46.8542	-121.7292	-3.62	43
46.8542	-121.7292	-3.61	44
46.8542	-121.7292	-3.49	45
46.8542	-121.7292	-3.49	45
46.8542	-121.7292	-3.46	47
46.8542	-121.7292	-3.43	48
46.8542	-121.7292	-3.4	49
46.8542	-121.7292	-3.37	50
46.8542	-121.7292	-3.34	51
46.8542	-121.7292	-3.32	52
46.8542	-121.7292	-3.31	53
46.8542	-121.7292	-3.28	54
46.8542	-121.7292	-3.27	55
46.8542	-121.7292	-3.27	55
46.8542	-121.7292	-3.23	57
46.8542	-121.7292	-3.21	58
46.8542	-121.7292	-3.2	59
46.8542	-121.7292	-3.17	60
46.8542	-121.7292	-3.17	60
46.8542	-121.7292	-3.12	62
46.8542	-121.7292	-3.11	63
46.8542	-121.7292	-3.08	64
46.8542	-121.7292	-3.06	65
46.8542	-121.7292	-3.05	66
46.8542	-121.7292	-3.04	67
46.8542	-121.7292	-3.02	68
46.8542	-121.7292	-3.01	69
46.8542	-121.7292	-2.98	70
46.8542	-121.7292	-2.93	71
46.8542	-121.7292	-2.84	72
46.8542	-121.7292	-2.8	73
46.8542	-121.7292	-2.77	74
46.8542	-121.7292	-2.76	75
46.8542	-121.7292	-2.75	76
46.8542	-121.7292	-2.7	77
46.8542	-121.7292	-2.67	79
46.8542	-121.7292	-2.67	79
46.8542	-121.7292	-2.62	80
46.8542	-121.7292	-2.48	81
46.8542	-121.7292	-2.47	82
46.8542	-121.7292	-2.46	83
46.8542	-121.7292	-2.39	84
46.8542	-121.7292	-2.31	85
46.8542	-121.7292	-2.29	86
46.8542	-121.7292	-2.22	87
46.8542	-121.7292	-2.18	88
46.8542	-121.7292	-2.15	89
46.8542	-121.7292	-2.14	90
46.8542	-121.7292	-2.08	91
46.8542	-121.7292	-1.8	92
46.8542	-121.7292	-1.75	93
46.8542	-121.7292	-1.68	94
46.8542	-121.7292	-1.62	96
46.8542	-121.7292	-1.62	96
46.8542	-121.7292	-1.61	98
46.8542	-121.7292	-1.61	98
46.8542	-121.7292	-1.59	99
46.8542	-121.7292	-1.56	101
46.8542	-121.7292	-1.56	101
46.8542	-121.7292	-1.52	102
46.8542	-121.7292	-1.51	103
46.8542	-121.7292	-1.43	104
46.8542	-121.7292	-1.39	106
46.8542	-121.7292	-1.39	106
46.8542	-121.7292	-1.3	107
46.8542	-121.7292	-1.26	108
46.8542	-121.7292	-1.09	109
46.8542	-121.7292	-1.08	110
46.8542	-121.7292	-1.02	112
46.8542	-121.7292	-1.02	112
46.8542	-121.7292	-0.93	113
46.8542	-121.7292	-0.9	114
46.8542	-121.7292	-0.77	115
46.8542	-121.7292	-0.68	116
46.8542	-121.7292	-0.61	117
46.8542	-121.7292	-0.44	118
46.8542	-121.7292	-0.3	119
46.8542	-121.7292	-0.14	120
46.8542	-121.7292	0.71	121
46.8542	-121.7292	1.05	122

which I think does what you want, even though the snippet of output you provided in post #27 did not include the requested adjustments to ranks in the last half of the data for the lines with field 1 set to 46.8542 and field 2 set to -121.7292.

It wasn't clear to me why you talked about choosing one file to gather field 1 and 2 values. The above code gathers field 1 and 2 values from all of the input files specified and produces ranked output for each different set of values in fields 1 and 2. (And, in case some sets of field 1 and 2 values have fewer entries than others (i.e., do not appear in all input files), the midpoint is recalculated for each group of lines.)
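For what it's worth, the midpoint used for each group is int((cnt + 1) / 2), where cnt is the number of lines in that group. This little awk demonstration (not part of the script) shows where it falls for a few group sizes:

```shell
# Print the midpoint index the script would compute for group sizes
# 9 through 12; smaller groups simply get a smaller midpoint.
awk 'BEGIN {
	for (cnt = 9; cnt <= 12; cnt++)
		print cnt, int((cnt + 1) / 2)
}'
# Produces: 9 5, 10 5, 11 6, 12 6 (one pair per line).
```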

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.
 
