How to filter out almost duplicate X Y (Easting Northing) coordinates?


09-17-2010


I have a large ascii file of 3+ million records.
The data type is easting northing elevation coordinates.
The file format is 3 fields, a numeric value for:
easting northing elevation

When I view this data graphically in AutoCAD or ESRI ArcMap, I can see that there are many points that are very close to each other in the x and y 2D space: not duplicated, but very, very close. So a lot of the data points are redundant.

Is there an awk script that I could use on the ascii file that could remove records that are close to each other within a user defined x/easting and y/northing tolerance?

I already have the command for removing duplicate records ( !x{$1$2}++ ) but I can't figure out how to get near duplicates removed.

Thanks in advance for your help.

Kenny.

kenneth.mcbride


09-18-2010


I am assuming your x/y data looks like normal long/lat: xx.xxxxxx yy.yyyyyy
Further, let's define the tolerance by rounding: rounding to 5 decimal places (10^-5) gives you the separation, so change the 5 to suit. By the way, your code has a syntax error: {...}++ should be [...]++

Code:

awk '!arr[sprintf("%0.5f|%0.5f",$1,$2)]++' inputfile > outputfile
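To make "change the 5 to suit" concrete, here is a minimal sketch of the same rounding idea with the number of decimals passed in as a variable. The function name round_dedupe and the variable d are made up for illustration; like the one-liner above, it assumes the two coordinates are in fields 1 and 2.

```shell
# round_dedupe DECIMALS: hypothetical wrapper around the rounding one-liner.
# Reads records on stdin and keeps the first record for each pair of
# rounded values in fields 1 and 2 (assumed to hold the coordinates).
round_dedupe() {
    awk -v d="${1:-5}" '
    BEGIN { fmt = "%." d "f|%." d "f" }   # builds "%.5f|%.5f" when d is 5
    !seen[sprintf(fmt, $1, $2)]++'
}
```

It could be called as `round_dedupe 5 < inputfile > outputfile`. Note that rounding only groups values that round to the same number, so two points closer than the tolerance can still end up in different groups when they sit on opposite sides of a rounding boundary.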

jim mcnamara


09-19-2010


Paste your sample records here, and tell us which records count as "near duplicates".

rdcwayx


09-20-2010


Thank you.

Jim,

{$1$2} was a typo.

rdcwayx,

My files are not Latitude Longitude Degrees, but State Plane Feet.
They typically contain 3 million records.

A file looks like, PointNumber Easting Northing Elevation, as follows:

PointNumber_0000001 1000000.123456 1000000.123456 10000.123456
PointNumber_0000010 1000001.234567 1000002.234567 10345.234567
PointNumber_0000100 1000010.345678 1000020.456789 10030.987654
PointNumber_0001000 1000050.345678 1000050.456789 10030.987654
PointNumber_0010000 1000123.123456 1000456.123456 10789.123456
PointNumber_0100000 1000123.123456 1000456.123456 10789.123456
PointNumber_1000000 1000000.123456 1000000.123456 10000.123456
PointNumber_2000000 1000011.345678 1000021.456789 10030.987654
PointNumber_3000000 1000051.000678 1000049.999000 10030.987654

Where, relative to fields 2 and 3:
PointNumber_1000000 is an "exact duplicate" of PointNumber_0000001
PointNumber_0100000 is an "exact duplicate" of PointNumber_0010000
Where, relative to fields 2 and 3, and within a user defined range of + or - 2.0:
PointNumber_2000000 is a "near duplicate" of PointNumber_0000100
PointNumber_3000000 is a "near duplicate" of PointNumber_0001000

So a point/record is a "near duplicate" when the easting and northing are within a user defined range. So if I use a value of 2.75 feet for the range, then a record whose easting and northing are both within 2.75 feet of any other record is considered a "near duplicate" and deleted.

If possible, it would be great if I could get two files from the input file:
1. An output file with the near duplicates removed.
2. An output file with the near duplicates that were removed.

Thank you again,
Kenny.

---------- Post updated at 01:38 PM ---------- Previous update was at 09:25 AM ----------

Jim,

When I use your code on the sample data set in my previous post, it prints the whole file.

Kenny.

kenneth.mcbride


09-20-2010


Because I thought you were using lat longs (our GIS system does output data that way) not state plane coords. My bad.

---------- Post updated at 15:50 ---------- Previous update was at 15:04 ----------

Try this - the two files created are named: uniques, duplicates

Code:

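#!/bin/ksh
# Start with empty output files.
> uniques
> duplicates

awk '{ # Snap easting ($2) and northing ($3) to a 2.75 ft grid.
       x=int( ($2/2.75) + 1.0000001 ) * 2.75
       y=int( ($3/2.75) + 1.0000001 ) * 2.75
       tmp=sprintf("%f|%f", x , y)   # grid cell used as the lookup key
       print $0, tmp                 # show each record with its key
       if(tmp in arr) {
         print $0 >> "duplicates"    # this cell was already seen
       }
       else {
         print $0 >> "uniques"       # first record in this cell
         arr[tmp]++
       }
     } ' inputfile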
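To see what the int(...) expression does to a single value, here is a small sketch that prints the cell number and the snapped grid value for each number it reads. The function name snap_demo is invented for this illustration; the formula is the one from the script above.

```shell
# snap_demo RANGE: illustrative helper, not part of the script above.
# For each value on stdin, print the grid cell it falls in and the
# grid value that the script would use in its key.
snap_demo() {
    awk -v range="${1:-2.75}" '{
        cell = int(($1 / range) + 1.0000001)   # cell index, shifted up by one
        printf "%s -> cell %d -> %.2f\n", $1, cell, cell * range
    }'
}

# Two eastings from the sample data that are less than one foot apart.
printf '%s\n' 1000050.345678 1000051.000678 | snap_demo 2.75
```

Both values land in cell 363655 and are snapped to 1000051.25, so records with those eastings (and northings in a shared cell) get the same key and the second one is written to duplicates.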

jim mcnamara


09-21-2010


Jim,

Thank you for your efforts in this matter.

I should add that my OS is Microsoft Windows XP.

Not sure if this will have any effect on your syntax.

I am actually using mawk.

I typically run mawk from inside a DOS batch.

Will any of this affect the behavior of your code?

Note.... that I am a complete novice (aka hacker).

Can you explain what your code is doing?

Thanks,
Kenny.

kenneth.mcbride


09-21-2010


Jim's code is perfect.

Here is a small change that makes the range value easy to adjust.

Code:

awk -v range=2.75 '{ x=int( ($2/range) + 1.0000001 ) * range
       y=int( ($3/range) + 1.0000001 ) * range
        tmp=sprintf("%.2f|%.2f", x , y)
        print $0, tmp
        if(tmp in arr) {
          print $0 >> "duplicates"
        }
        else {
          print $0 >> "uniques"
          arr[tmp]++
        }
      } ' infile

To kenneth.mcbride,

Look at the value in the last column of the output (the two numbers separated by |): that is the whole trick. Records that share the same value fall in the same grid cell, which simplifies the whole process.

Code:

PointNumber_0000001 1000000.123456 1000000.123456 10000.123456 1000001.75|1000001.75
PointNumber_0000010 1000001.234567 1000002.234567 10345.234567 1000001.75|1000004.50
PointNumber_0000100 1000010.345678 1000020.456789 10030.987654 1000012.75|1000021.00
PointNumber_0001000 1000050.345678 1000050.456789 10030.987654 1000051.25|1000051.25
PointNumber_0010000 1000123.123456 1000456.123456 10789.123456 1000125.50|1000458.25
PointNumber_0100000 1000123.123456 1000456.123456 10789.123456 1000125.50|1000458.25
PointNumber_1000000 1000000.123456 1000000.123456 10000.123456 1000001.75|1000001.75
PointNumber_2000000 1000011.345678 1000021.456789 10030.987654 1000012.75|1000023.75
PointNumber_3000000 1000051.000678 1000049.999000 10030.987654 1000051.25|1000051.25

It seems that if two records are "exact duplicates", Jim's code will give the wrong output.
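In the output above, PointNumber_0000100 and PointNumber_2000000 get different keys (1000012.75|1000021.00 and 1000012.75|1000023.75) even though they are about one foot apart, because they sit on opposite sides of a grid line. One possible way around that, which is a different technique from the code posted in this thread, is to keep the grid but also compare each record against the kept point in the eight neighbouring cells. The sketch below assumes non-negative coordinates in fields 2 and 3, treats a record as a near duplicate when both easting and northing are within the tolerance of an already kept record, and uses the made-up names near_dups, kx and ky.

```shell
# near_dups TOL INPUT: hypothetical helper, not from the posts above.
# Writes kept records to "uniques" and dropped records to "duplicates".
near_dups() {
    : > uniques
    : > duplicates
    awk -v tol="$1" '
    function abs(v) { return v < 0 ? -v : v }
    {
        # Grid cell of this point. Cells are tol wide, so a kept point
        # within tol must be in this cell or one of its 8 neighbours.
        # Assumes non-negative coordinates (int() truncates toward zero).
        cx = int($2 / tol); cy = int($3 / tol)
        dup = 0
        for (i = cx - 1; i <= cx + 1 && !dup; i++)
            for (j = cy - 1; j <= cy + 1 && !dup; j++)
                if ((i, j) in kx && abs(kx[i, j] - $2) <= tol && abs(ky[i, j] - $3) <= tol)
                    dup = 1
        if (dup) {
            print > "duplicates"
        } else {
            print > "uniques"
            # Two kept points can never share a cell, so one slot is enough.
            kx[cx, cy] = $2; ky[cx, cy] = $3
        }
    }' "$2"
}
```

It would be run as `near_dups 2.0 inputfile`. The result depends on record order, since the first record seen in a neighbourhood is the one that is kept, and the two arrays hold one entry per kept record, so memory use grows with the size of the uniques file.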

rdcwayx
