Filter uniq field values (non-substring)

05-08-2014

Moderator

8,825, 1,112

Join Date: Feb 2005

Last Activity: 23 August 2021, 11:26 AM EDT

Location: Foxborough, MA

Posts: 8,825

Thanks Given: 579

Thanked 1,112 Times in 1,003 Posts

Hmm... I'm having trouble visualizing this. Could you provide a sample, please?
It works for the OP's sample... I'm just trying to see the border conditions.

vgersh99

05-08-2014

Registered User

3,231, 978

Join Date: Dec 2009

Last Activity: 11 June 2014, 8:40 PM EDT

Posts: 3,231

Thanks Given: 179

Thanked 978 Times in 791 Posts

Quote:

Originally Posted by vgersh99

Hmm... I'm having trouble visualizing this. Could you provide a sample, please?
It works for the OP's sample... I'm just trying to see the border conditions.

After re-reading the OP's previous two posts, I'm not certain if I correctly understood the task. If I am mistaken, apologies for the noise.

What I have in mind is the following scenario:

Code:

1 aaaa 10 aaaa
2 bbbb 20 bbbb
3 aaa 30 bbb

At the time that I wrote my previous post, I was under the impression that line 3 should be excluded because its $2 is a substring of $2 on line 1 and its $4 is a substring of $4 on line 2.

However, your interpretation may well be correct. Perhaps the substring matches must be constrained to the same line. I am no longer confident in my assumption. Both of the fields of the lines excluded from the sample data in post #10 match the same preceding line, but nothing in that data set precludes independent column matching.
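To make the two readings concrete, here is a minimal sketch that can be run against the three lines above. The function names same_line and any_line are made up for illustration, and both only handle the case where a later line matches earlier ones (not the reverse), so treat them as an illustration of the difference rather than as a finished solution.

```shell
# Reading 1: a line is dropped only when ONE earlier kept line
# contains both its $2 and its $4.
same_line() {
    awk '{
        for (i = 1; i <= c; i++)
            if (index(a[i], $2) && index(b[i], $4))
                next
        a[++c] = $2; b[c] = $4
        print
    }' "$@"
}

# Reading 2: the columns are matched independently, so $2 and $4
# may be contained in two different earlier kept lines.
any_line() {
    awk '{
        hit2 = hit4 = 0
        for (i = 1; i <= c; i++) {
            if (index(a[i], $2)) hit2 = 1
            if (index(b[i], $4)) hit4 = 1
        }
        if (hit2 && hit4) next
        a[++c] = $2; b[c] = $4
        print
    }' "$@"
}
```

On the sample, same_line keeps all three lines, while any_line drops line 3.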

Perhaps I read too much into it.

Regards,
Alister

alister

05-08-2014

Registered User

564, 13

Join Date: Sep 2009

Last Activity: 26 May 2021, 8:59 AM EDT

Location: Saskatchewan, Canada

Posts: 564

Thanks Given: 376

Thanked 13 Times in 12 Posts

There should not be cross-column comparison. Line 3 (3 aaa 30 bbb) is unique because its field 2 and field 4 are not both substrings of the corresponding fields of any one previous line.

Code:

1 aaaa 10 aaaa
3 aaa  30 bbb

The only problem is that the script is slow on my 76,000 lines: it is still running after ~1 hr on Linux 2.6.32-431.5.1.el6.x86_64. Thank you very much, both of you!

Last edited by yifangt; 05-09-2014 at 12:32 PM..

yifangt

05-08-2014

see if this makes it faster - getting away from the associative array and split-ing:

Code:

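awk '{
      for(i=1;i<=c;i++) {
        # an earlier line already contains both fields: skip this line
        if (index(a[i],$2) && index(b[i],$4))
           next
        # this line contains both fields of an earlier line: drop its keys
        if (index($2, a[i]) && index($4,b[i])) {
           delete a[i]
           delete b[i]
        }
      }
      # remember the fields and the whole line
      a[++c]=$2
      b[c]=$4
      all[c]=$0
   }
END {
   for (i=1; i in all;i++) print all[i]
}' myFile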

Probably there's a better way to handle deleted array elements, one that doesn't leave 'holes' to be iterated over again and again, but first let's see if this change makes any difference.
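One possible way to avoid the holes, as a rough sketch only: instead of delete-ing, slide the surviving entries down so that slots 1..c stay contiguous. The function name filter_compact is hypothetical, the sketch assumes fields 2 and 4 are never empty, and it scans twice per line, so it is not necessarily faster.

```shell
filter_compact() {
    awk '{
        # Pass 1: skip the line if an already-kept line contains both fields.
        for (i = 1; i <= c; i++)
            if (index(a[i], $2) && index(b[i], $4))
                next
        # Pass 2: keep only the entries this line does not contain,
        # copying them down so that no gaps are left behind.
        n = 0
        for (i = 1; i <= c; i++) {
            if (index($2, a[i]) && index($4, b[i]))
                continue
            ++n
            a[n] = a[i]; b[n] = b[i]; all[n] = all[i]
        }
        c = n
        a[++c] = $2; b[c] = $4; all[c] = $0
    }
    END {
        # Slots above c may hold stale copies, so stop at c.
        for (i = 1; i <= c; i++) print all[i]
    }' "$@"
}
```

Called as filter_compact myFile, this keeps the surviving lines in their original relative order, and the all array is compacted together with a and b.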

vgersh99

05-08-2014

Quote:

Originally Posted by vgersh99

see if this makes it faster - getting away from the associative array and split-ing:

Code:

awk '{
      for(i=1;i<=c;i++) {
...

If you add an array of keys, then at the cost of the added memory it's possible to completely bypass the scan of a and b for any $2/$4 pair that has already been seen:

Code:

awk '{
    if (($2 SUBSEP $4) in k) next
    for(i=1;i<=c;i++) {
...
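Spelled out against the simpler substring-only loop, the idea might look something like the sketch below. The function name filter_with_keys and the point at which k gets populated are assumptions made for the illustration, since the snippet above elides that part.

```shell
filter_with_keys() {
    awk '{
        key = $2 SUBSEP $4
        # Exact pair seen before: it is already kept or already covered,
        # so the scan can be skipped.
        if (key in k) next
        k[key] = 1
        for (i = 1; i <= c; i++)
            if (index(a[i], $2) && index(b[i], $4))
                next
        a[++c] = $2; b[c] = $4
        print
    }' "$@"
}
```

The lookup only helps with exact repeats of a pair. A line that is merely a substring match of an earlier one still has to go through the loop.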

Regards,
Alister

alister

05-08-2014

actually, it should be as simple as this for the sample file provided:

Code:

awk  '{
      for(i=1;i<=c;i++)
        if (index(a[i],$2) && index(b[i],$4))
           next
      a[++c]=$2
      b[c]=$4
      all[c]=$0
   }
END {
   for (i=1; i in all;i++) print all[i]
}' myFile

If you had a better, more representative data sample, maybe we could tweak the above - it works for the provided sample.

vgersh99

05-08-2014

Quote:

Originally Posted by vgersh99

Code:

awk '{
...
      all[c]=$0
   }
END {
   for (i=1; i in all;i++) print all[i]
}' myFile

That looks wrong to me. Every line that is not skipped is added to all, but none of its members are ever removed (even when the matching members of a and b are deleted), so lines that have been superseded are still printed.
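A minimal way to address that, sketched under the assumption that fields 2 and 4 are never empty (the function name filter_pruned is made up for the example): delete the entry from all as well, and have the END loop run up to c while testing membership, since a loop that stops at the first missing index would end early once there are gaps.

```shell
filter_pruned() {
    awk '{
        for (i = 1; i <= c; i++) {
            if (!(i in all)) continue      # slot was pruned earlier
            if (index(a[i], $2) && index(b[i], $4))
                next
            if (index($2, a[i]) && index($4, b[i])) {
                delete a[i]; delete b[i]; delete all[i]
            }
        }
        a[++c] = $2; b[c] = $4; all[c] = $0
    }
    END {
        # Walk every slot and print only the ones that survived.
        for (i = 1; i <= c; i++)
            if (i in all) print all[i]
    }' "$@"
}
```

This still iterates over the pruned slots on every line, so it fixes the output without helping the run time.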

Regards,
Alister

alister
