How to filter out almost duplicate X Y (Easting Northing) coordinates?

# 1  
Old 09-17-2010

I have a large ascii file of 3+ million records.
The data type is easting northing elevation coordinates.
The file format is 3 fields, a numeric value for:
easting northing elevation

When I view this data graphically in AutoCAD or ESRI ArcMap, I can see that there are many points that are not exact duplicates but are very, very close to each other in x and y 2D space. So a lot of the data points are redundant.

Is there an awk script that I could use on the ascii file that could remove records that are close to each other within a user defined x/easting and y/northing tolerance?

I already have the command for removing duplicate records ( !x{$1$2}++ ) but I can't figure out how to get near duplicates removed.

Thanks in advance for your help.

Kenny.
# 2  
Old 09-18-2010
I am assuming your x/y data looks like normal long/lat: xx.xxxxxx yy.yyyyyy
Further, let's define the tolerance by rounding: rounding to 5 decimal places (10^-5) treats points that agree to that precision as the same point. Change the 5 to suit. By the way, your code has a syntax error: {...}++ should be [...]++

Code:
awk '!arr[sprintf("%0.5f|%0.5f",$1,$2)]++' inputfile > outputfile
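To see what the rounding key does, here is a minimal sketch on three made-up lat/long records (the sample values and the file names latlong_sample.txt and latlong_filtered.txt are only for illustration). The first two records agree to 5 decimal places, so they share a key and only the first is kept. Note that rounding is not a true tolerance: two points that sit on either side of a rounding boundary keep separate keys however close they are.

```shell
# Hypothetical lat/long sample: records 1 and 2 agree to 5 decimal places.
cat > latlong_sample.txt <<'EOF'
-105.123451 39.567891 1600.0
-105.123452 39.567892 1600.1
-105.123500 39.567891 1601.0
EOF

# Keep the first record for each rounded x|y key; later records with the
# same key are dropped.
awk '!seen[sprintf("%0.5f|%0.5f", $1, $2)]++' latlong_sample.txt > latlong_filtered.txt

cat latlong_filtered.txt
```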

# 3  
Old 09-19-2010
Paste your sample records here, and tell us which records count as "near duplicates".
# 4  
Old 09-20-2010
Thank you.

Jim,

{$1$2} was a typo.

rdcwayx,

My files are not Latitude Longitude Degrees, but State Plane Feet.
They typically contain 3 million records.

A file looks like, PointNumber Easting Northing Elevation, as follows:

PointNumber_0000001 1000000.123456 1000000.123456 10000.123456
PointNumber_0000010 1000001.234567 1000002.234567 10345.234567
PointNumber_0000100 1000010.345678 1000020.456789 10030.987654
PointNumber_0001000 1000050.345678 1000050.456789 10030.987654
PointNumber_0010000 1000123.123456 1000456.123456 10789.123456
PointNumber_0100000 1000123.123456 1000456.123456 10789.123456
PointNumber_1000000 1000000.123456 1000000.123456 10000.123456
PointNumber_2000000 1000011.345678 1000021.456789 10030.987654
PointNumber_3000000 1000051.000678 1000049.999000 10030.987654

Where, relative to fields 2 and 3:
PointNumber_1000000 is an "exact duplicate" of PointNumber_0000001
PointNumber_0100000 is an "exact duplicate" of PointNumber_0010000
Where, relative to fields 2 and 3, and within a user defined range of + or - 2.0:
PointNumber_2000000 is a "near duplicate" of PointNumber_0000100
PointNumber_3000000 is a "near duplicate" of PointNumber_0001000

So a point/record is a "near duplicate" when its easting and northing are both within a user-defined range of another record's. For example, if I use a range of 2.75 feet, then a record whose easting and northing are each within 2.75 feet of any other record is considered a "near duplicate" and deleted.

If possible, it would be great if I could get two files from the input file:
1. An output file with the near duplicates removed.
2. An output file with the near duplicates that were removed.

Thank you again,
Kenny.

---------- Post updated at 01:38 PM ---------- Previous update was at 09:25 AM ----------

Jim,

When I use your code on the sample data set in my previous post, it prints the whole file.

Kenny.
# 5  
Old 09-20-2010
Because I thought you were using lat longs (our GIS system does output data that way) not state plane coords. My bad.

---------- Post updated at 15:50 ---------- Previous update was at 15:04 ----------

Try this - the two files created are named: uniques, duplicates
Code:
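#!/bin/ksh
# Start with empty output files.
> uniques
> duplicates

awk '{ # Snap easting ($2) and northing ($3) to a 2.75 ft grid.
       x=int( ($2/2.75) + 1.0000001 ) * 2.75
       y=int( ($3/2.75) + 1.0000001 ) * 2.75
       tmp=sprintf("%f|%f", x , y)
       print $0, tmp               # show each record with its grid key
       if(tmp in arr) {            # key already seen: treat as near duplicate
         print $0 >> "duplicates"
       }
       else {                      # first record with this key
         print $0 >> "uniques"
         arr[tmp]++
       }
     } ' inputfile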

# 6  
Old 09-21-2010
Jim,

Thank you for your efforts in this matter.

I should add that my OS is Microsoft Windows XP.

Not sure if this will have any effect on your syntax.

I am actually using mawk.

I typically run mawk from inside a DOS batch.

Will any of this affect the behavior of your code?

Note.... that I am a complete novice (aka hacker).

Can you explain what your code is doing?

Thanks,
Kenny.
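On the Windows question: cmd.exe does not treat single quotes as quoting characters, so a one-liner wrapped in '...' tends to break inside a DOS batch file. One way around this, sketched here, is to keep the awk program in its own file and run it with -f. The file name neardup.awk and the variable name range are assumptions made for this example. The comments walk through what the grid approach from post # 5 is doing. The heredocs below are only a convenient way to show the file contents in a Unix shell; on Windows the .awk file would be created in a text editor.

```shell
# Write the awk program to its own file so no shell quoting is involved.
cat > neardup.awk <<'EOF'
# neardup.awk (hypothetical name): grid-based near-duplicate filter.
# Input fields: PointNumber Easting Northing Elevation
# "range" is the grid cell size, passed in with -v range=...
{
    # int() truncates, so every point inside the same range-by-range
    # grid cell ends up with the same x,y label.
    x = int($2 / range + 1.0000001) * range
    y = int($3 / range + 1.0000001) * range
    key = sprintf("%.2f|%.2f", x, y)

    if (key in seen) {
        print $0 > "duplicates"   # cell already occupied: near duplicate
    } else {
        print $0 > "uniques"      # first record seen in this cell
        seen[key] = 1
    }
}
EOF

# The sample records from post # 4.
cat > sample.txt <<'EOF'
PointNumber_0000001 1000000.123456 1000000.123456 10000.123456
PointNumber_0000010 1000001.234567 1000002.234567 10345.234567
PointNumber_0000100 1000010.345678 1000020.456789 10030.987654
PointNumber_0001000 1000050.345678 1000050.456789 10030.987654
PointNumber_0010000 1000123.123456 1000456.123456 10789.123456
PointNumber_0100000 1000123.123456 1000456.123456 10789.123456
PointNumber_1000000 1000000.123456 1000000.123456 10000.123456
PointNumber_2000000 1000011.345678 1000021.456789 10030.987654
PointNumber_3000000 1000051.000678 1000049.999000 10030.987654
EOF

awk -v range=2.75 -f neardup.awk sample.txt
```

In a batch file the last line would become something like mawk -v range=2.75 -f neardup.awk inputfile, with no quotes needed. Each output file is created when the first record is written to it, so an output file left over from an earlier run is worth deleting first.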
# 7  
Old 09-21-2010
Jim's code is so perfect.

A small change makes the range value easy to adjust.

Code:
awk -v range=2.75 '{ x=int( ($2/range) + 1.0000001 ) * range
       y=int( ($3/range) + 1.0000001 ) * range
        tmp=sprintf("%.2f|%.2f", x , y)
        print $0, tmp
        if(tmp in arr) {
          print $0 >> "duplicates"
        }
        else {
          print $0 >> "uniques"
          arr[tmp]++
        }
      } ' infile

To kenneth.mcbride,

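In the output, the value in the last column (the two numbers separated by |) is the secret: it is the grid key each record is snapped to, and records that share a key are treated as duplicates. It simplifies the whole process.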

Code:
PointNumber_0000001 1000000.123456 1000000.123456 10000.123456 1000001.75|1000001.75
PointNumber_0000010 1000001.234567 1000002.234567 10345.234567 1000001.75|1000004.50
PointNumber_0000100 1000010.345678 1000020.456789 10030.987654 1000012.75|1000021.00
PointNumber_0001000 1000050.345678 1000050.456789 10030.987654 1000051.25|1000051.25
PointNumber_0010000 1000123.123456 1000456.123456 10789.123456 1000125.50|1000458.25
PointNumber_0100000 1000123.123456 1000456.123456 10789.123456 1000125.50|1000458.25
PointNumber_1000000 1000000.123456 1000000.123456 10000.123456 1000001.75|1000001.75
PointNumber_2000000 1000011.345678 1000021.456789 10030.987654 1000012.75|1000023.75
PointNumber_3000000 1000051.000678 1000049.999000 10030.987654 1000051.25|1000051.25

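It seems Jim's code gives the wrong output when two near duplicates fall in different grid cells. PointNumber_2000000 is a near duplicate of PointNumber_0000100, but their keys differ (1000012.75|1000023.75 vs 1000012.75|1000021.00), so it is not caught. Exact duplicates always share a key, so those are caught.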
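In the output above, the records that slip through the grid are near duplicates that straddle a cell boundary, such as PointNumber_2000000 and PointNumber_0000100. One possible way to handle that, sketched below and not taken from the thread, is to compare each record against the kept records in its own cell and the eight neighbouring cells. The sketch makes several assumptions: "near" means easting and northing each differ by at most tol, a record is only compared against records already kept (first seen wins), and the names neardup_box.awk, tol, keptx and kepty are made up for the example. Because the cell size equals tol, two points in the same cell are always near each other, so each cell holds at most one kept point.

```shell
cat > neardup_box.awk <<'EOF'
# neardup_box.awk (hypothetical name): tolerance check using neighbouring cells.
# Input fields: PointNumber Easting Northing Elevation; tol is set with -v.
function floor(v,   i) { i = int(v); return (i > v) ? i - 1 : i }
function absdiff(a, b) { return (a > b) ? a - b : b - a }
{
    x = $2 + 0; y = $3 + 0
    cx = floor(x / tol); cy = floor(y / tol)   # grid cell containing the point

    # Look for an already-kept point within tol in the 3x3 block of cells.
    near = 0
    for (i = cx - 1; i <= cx + 1 && !near; i++)
        for (j = cy - 1; j <= cy + 1 && !near; j++) {
            k = i "|" j
            if ((k in keptx) && absdiff(x, keptx[k]) <= tol && absdiff(y, kepty[k]) <= tol)
                near = 1
        }

    if (near) {
        print $0 > "duplicates"
    } else {
        print $0 > "uniques"
        k = cx "|" cy
        keptx[k] = x; kepty[k] = y   # remember the kept point for this cell
    }
}
EOF

# The sample records from post # 4.
cat > sample.txt <<'EOF'
PointNumber_0000001 1000000.123456 1000000.123456 10000.123456
PointNumber_0000010 1000001.234567 1000002.234567 10345.234567
PointNumber_0000100 1000010.345678 1000020.456789 10030.987654
PointNumber_0001000 1000050.345678 1000050.456789 10030.987654
PointNumber_0010000 1000123.123456 1000456.123456 10789.123456
PointNumber_0100000 1000123.123456 1000456.123456 10789.123456
PointNumber_1000000 1000000.123456 1000000.123456 10000.123456
PointNumber_2000000 1000011.345678 1000021.456789 10030.987654
PointNumber_3000000 1000051.000678 1000049.999000 10030.987654
EOF

awk -v tol=2.0 -f neardup_box.awk sample.txt
```

With tol=2.0 this puts the four records Kenny listed (PointNumber_0100000, PointNumber_1000000, PointNumber_2000000 and PointNumber_3000000) in duplicates and the other five in uniques. The result depends on input order, since the first record seen in a neighbourhood is the one that is kept.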
 