How to remove duplicate records without sort

02-29-2008

Registered User

Join Date: Jan 2007

Last Activity: 9 January 2017, 4:40 AM EST

Location: Варна, България / Milano, Italia

Posts: 5,690

Thanks Given: 184

Thanked 630 Times in 587 Posts

Quote:

Originally Posted by girish.batra

Yes, I agree with you that uniq expects sorted input.

So check the original post.

radoulov

06-05-2008

Registered User

Join Date: Jun 2008

Last Activity: 4 October 2012, 3:31 PM EDT

Location: Redmond, WA

Posts: 32

Thanks Given: 0

Thanked 0 Times in 0 Posts

awk '!x[$2,$3]++' FS="," file

This one-liner has been a true workhorse.

I have used it extensively.
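
For anyone landing here later, a minimal sketch of what that one-liner does, wrapped in a shell function that reads standard input (the function name `dedup_exact` and the sample records are made up for illustration). The array counts how many times each field 2/field 3 pair has been seen, and a line is printed only on first sight of its pair, so the input order is preserved and no sort is needed. Note that the keys are compared as strings, so 2.10 and 2.1 would count as different values.

```shell
# x[$2,$3]++ is 0 (false) the first time a key is seen, so !x[$2,$3]++
# is true exactly once per key; the default action prints the line.
dedup_exact() {
  awk -F, '!x[$2,$3]++'
}

# Hypothetical sample: records 1 and 3 share fields 2 and 3.
printf '%s\n' \
  '1,2.1,3.1,1.1' \
  '2,2.2,3.2,2.2' \
  '3,2.1,3.1,9.9' | dedup_exact
```

With this sample only records 1 and 2 are printed; record 3 is dropped because its field 2/field 3 pair was already seen.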

Does anyone know if it can be made to work within a range of values?
I suspect it would no longer be a one-liner.

For example:

My file contents are all numbers (a mixture of integers and floating point), where field one is a unique point number, field two is an X or Easting coordinate, field three is a Y or Northing coordinate, and field four is an elevation:

1,2.1,3.1,1.1
2,2.2,3.2,2.2
3,2.3,3.3,3.3
4,3.4,3.4,4.4
5,3.5,3.5,5.5
6,3.6,3.6,6.6
7,4.7,4.7,7.1
8,4.8,4.8,8.8
9,4.9,4.9,9.9

I would like to process the file via fields two and three and have the result be:

1,2.1,3.1,1.1
3,2.3,3.3,3.3
4,3.4,3.4,4.4
6,3.6,3.6,6.6
7,4.7,4.7,7.1
9,4.9,4.9,9.9

So I think what I am asking is that fields 2 and 3 be considered duplicates if they are in the range "$2-0.1 to $2+0.1" and "$3-0.1 to $3+0.1".
That's the best way I have of describing it.
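
One thing to watch with a range test like this: the values in the sample sit exactly on the boundary (2.2 is 0.1 away from 2.1), and binary floating point cannot represent 0.1 exactly, so a literal `<= 0.1` comparison can go either way. The sketch below shows the pitfall and one possible workaround, a small slack added to the tolerance. The helper name `within` and the slack value `1e-9` are assumptions for illustration, not something from this thread.

```shell
# Naive boundary test: prints 0 (false) with IEEE double arithmetic,
# because 2.2 - 2.1 evaluates to slightly more than 0.1.
awk 'BEGIN { print (2.2 - 2.1 <= 0.1) }'

# within TOL A B: exit status 0 when |A - B| <= TOL, allowing a tiny slack.
within() {
  awk -v tol="$1" -v a="$2" -v b="$3" 'BEGIN {
    eps = 1e-9                 # assumed slack, well below the 0.1 data resolution
    d = a - b
    if (d < 0) d = -d
    exit !(d <= tol + eps)     # exit status 0 means "within range"
  }'
}

within 0.1 2.2 2.1 && echo "2.2 and 2.1 are within 0.1"
within 0.1 2.3 2.1 || echo "2.3 and 2.1 are not"
```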

Thanks in advance,
Kenny.

kenneth.mcbride

06-05-2008

If the input is ordered like the sample you posted,
and if I understand correctly, you could use something like this
(use nawk or /usr/xpg4/bin/awk on Solaris):

Code:

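awk -F, '
# Drop a record when fields 2 and 3 are both <= previous kept value + 0.1.
# As written, each parenthesised test has no lower bound, so this relies on
# the input being in ascending order.
(x-0.1 >= $2 || $2 <= x+0.1) && (y-0.1 >= $3 || $3 <= y+0.1) { next }
{ x = $2; y = $3 }   # remember the coordinates of the record being kept
1' input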

radoulov

06-05-2008

Thank you radoulov.

Your code works well when the data is in a sorted list.

For example, using your code with the range values -1.1 and +1.1 on this file:

1-a,1000001.1,500001.1,101.1
1-b,1000001.1,500001.1,101.2
2-a,1000002.2,500002.2,102.2
2-b,1000002.2,500002.2,102.3
3-a,1000003.3,500003.3,103.3
3-b,1000003.3,500003.3,103.4
4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6
6-a,1000006.6,500006.6,106.6
6-b,1000006.6,500006.6,106.7
7-a,1000007.7,500007.7,107.7
7-b,1000007.7,500007.7,107.8
8-a,1000008.8,500008.8,108.8
8-b,1000008.8,500008.8,108.9
9-a,1000009.9,500009.9,109.9
9-b,1000009.9,500009.9,110.0
10-a,1000010.0,500010.0,110.0
10-b,1000010.0,500010.0,110.1

I get the following result:

1-a,1000001.1,500001.1,101.1
3-a,1000003.3,500003.3,103.3
5-a,1000005.5,500005.5,105.5
7-a,1000007.7,500007.7,107.7
9-a,1000009.9,500009.9,109.9

This is the expected result.

However, if I jumble up the records like this:

10-a,1000010.0,500010.0,110.0
2-b,1000002.2,500002.2,102.3
9-b,1000009.9,500009.9,110.0
10-b,1000010.0,500010.0,110.1
9-a,1000009.9,500009.9,109.9
8-a,1000008.8,500008.8,108.8
3-b,1000003.3,500003.3,103.4
8-b,1000008.8,500008.8,108.9
7-a,1000007.7,500007.7,107.7
7-b,1000007.7,500007.7,107.8
6-b,1000006.6,500006.6,106.7
3-a,1000003.3,500003.3,103.3
6-a,1000006.6,500006.6,106.6
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6
4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5
2-a,1000002.2,500002.2,102.2
1-a,1000001.1,500001.1,101.1
1-b,1000001.1,500001.1,101.2

I only get the first line as output:

10-a,1000010.0,500010.0,110.0

Is there code that will work on an unsorted list?

My data sets are almost always listed in a random order.

Thank you again,
Kenny.

kenneth.mcbride

06-06-2008

Do you want to preserve the order?
Otherwise you could sort the input first:

Code:

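sort -t, -k2,2n -k3,3n inputfile |
  awk -F, '
  # skip a record whose X and Y both lie within 1.1 of the previously kept record
  NR > 1 && $2 >= x-1.1 && $2 <= x+1.1 && $3 >= y-1.1 && $3 <= y+1.1 { next }
  { x = $2; y = $3 }   # remember the record being kept
  1'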

radoulov

06-06-2008

Yes, I would like to preserve the order.

Is there code to process an unsorted list?

I would assume that the programming logic would then have to be:

1. Keep the first record.
2. Compare all remaining records to it, testing for duplicates in fields $2 and $3 [within the user-defined range].
3. Move down one record and repeat until you reach the last record.
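
The three steps above can be sketched in awk roughly as follows: keep a list of the records already printed and compare every new record against all of them, so the input order is preserved and no sort is needed. This is only a sketch under assumptions: the function name `dedup_within`, the arrays `kx`/`ky` and the `1e-9` slack (there so that values sitting exactly on the boundary, like 2.1 and 2.2 with a range of 0.1, are treated as duplicates despite floating-point rounding) are mine, not from the thread. Each record is compared with every kept record, so the running time grows quadratically with the number of distinct points.

```shell
# usage: dedup_within RANGE < file
dedup_within() {
  awk -F, -v tol="$1" '
    function abs(v) { return v < 0 ? -v : v }
    BEGIN { eps = 1e-9 }     # assumed slack for boundary values
    {
      # compare with every record kept so far
      for (i = 1; i <= n; i++)
        if (abs($2 - kx[i]) <= tol + eps && abs($3 - ky[i]) <= tol + eps)
          next               # both fields in range of a kept record: drop it
      n++; kx[n] = $2; ky[n] = $3   # otherwise remember it and print it
      print
    }'
}

# the nine-record sample from earlier in the thread
dedup_within 0.1 <<'EOF'
1,2.1,3.1,1.1
2,2.2,3.2,2.2
3,2.3,3.3,3.3
4,3.4,3.4,4.4
5,3.5,3.5,5.5
6,3.6,3.6,6.6
7,4.7,4.7,7.1
8,4.8,4.8,8.8
9,4.9,4.9,9.9
EOF
```

On that sample it prints records 1, 3, 4, 6, 7 and 9, which is the result asked for earlier. The outcome depends on the order of the input, since the first record of each group is the one that survives.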

kenneth.mcbride

06-07-2008

(previous post removed)

Could you please post the desired output from this input?

Code:

10-a,1000010.0,500010.0,110.0
2-b,1000002.2,500002.2,102.3
9-b,1000009.9,500009.9,110.0
10-b,1000010.0,500010.0,110.1
9-a,1000009.9,500009.9,109.9
8-a,1000008.8,500008.8,108.8
3-b,1000003.3,500003.3,103.4
8-b,1000008.8,500008.8,108.9
7-a,1000007.7,500007.7,107.7
7-b,1000007.7,500007.7,107.8
6-b,1000006.6,500006.6,106.7
3-a,1000003.3,500003.3,103.3
6-a,1000006.6,500006.6,106.6
5-a,1000005.5,500005.5,105.5
5-b,1000005.5,500005.5,105.6
4-a,1000004.4,500004.4,104.4
4-b,1000004.4,500004.4,104.5
2-a,1000002.2,500002.2,102.2
1-a,1000001.1,500001.1,101.1
1-b,1000001.1,500001.1,101.2

Last edited by radoulov; 06-08-2008 at 06:11 PM.

radoulov
