Filter uniq field values (non-substring)


 
# 22  
Old 05-08-2014
Quote:
Originally Posted by alister
If you add an array of keys, for the added memory, it's possible to completely bypass a scan of a and b:
Code:
awk '{
    if (($2 SUBSEP $4) in k) next
    for(i=1;i<=c;i++) {
...

Regards,
Alister
this only helps when both fields, concatenated, are an exact match of a previously seen pair - some saving, but it's hard to gauge the savings unless we can guesstimate the ratio of exact-match duplicates to the total # of records - the lookup for every record/line might not be worth it...
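For what it's worth, a complete sketch of that shortcut might look like this (purely illustrative - the three-line sample file is made up, not the real data):

```shell
# Sketch of alister's key-lookup shortcut: a verbatim repeat of a
# ($2, $4) pair skips the cache scan entirely.
cat > sample.txt <<'EOF'
1 abcd idx01 ijklm
2 abcd idx02 ijklm
3 abc idx03 klm
EOF

awk '{
    key = $2 SUBSEP $4
    if (key in k) next                # exact repeat of a seen pair: no scan
    k[key] = 1
    for (i = 1; i <= c; i++) {
        if (!(i in a)) continue
        if (index(a[i], $2) && index(b[i], $4)) next
        if (index($2, a[i]) && index($4, b[i])) {
            delete a[i]; delete b[i]; delete all[i]
        }
    }
    a[++c] = $2; b[c] = $4; all[c] = $0
}
END { for (i = 1; i <= c; i++) if (i in all) print all[i] }' sample.txt
```

Here line 2 is dropped by the key lookup alone (no scan at all), line 3 is dropped by the substring scan, and only line 1 survives.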

---------- Post updated at 06:59 PM ---------- Previous update was at 06:57 PM ----------

Quote:
Originally Posted by alister
That looks wrong to me. Every line is added to all, but none of its members are ever removed (even when members of a or b are deleted).

Regards,
Alister
see simplified version - with no deletes - just next-ing...
# 23  
Old 05-08-2014
Nevermind me. I didn't register the "next".

Regards,
Alister

---------- Post updated at 07:35 PM ---------- Previous update was at 06:59 PM ----------

Quote:
Originally Posted by vgersh99
see simplified version - with no deletes - just next-ing...
You are correct in correcting me; not every line is added. However, if, like the original problem, substrings can precede their superstrings, then your suggestion is inadequate.

Consider:
Code:
1 abcd    idx01    ijklm
2 abc    idx03    klm
3 abcd    idx05    jkl
4 cdef    idx06    ijklm
5 efgh    idx07    abcd
6 efg    idx09    abc
7 efx    idx11    abcd
8 fgh    idx12    bcd
9 fefx  blah  zabcdz

If, like the original data sample, substrings can precede their superstrings, line 7 should be excluded because both its $2 and $4 are substrings of line 9's. Your code won't catch that.

Again, I could be mistaken. yifangt has not been strictly comprehensive in describing the problem.

I hope my nitpicking isn't getting on your last nerve.

Regards,
Alister
# 24  
Old 05-09-2014
no-no, this is all good - no ill feelings here!
I see your point - and yes, we solved this case earlier - I need to revert to my previous code:
Code:
awk '{
      for(i=1;i<=c;i++) {
        if (!(i in a)) continue                 # slot was deleted earlier
        if (index(a[i],$2) && index(b[i],$4))   # new pair covered by a cached line
           next
        if (index($2,a[i]) && index($4,b[i])) { # cached pair covered by new line
           delete a[i]
           delete b[i]
           delete all[i]
        }
      }

      a[++c]=$2
      b[c]=$4
      all[c]=$0
   }
END {
   for (i=1; i<=c;i++) if (i in all) print all[i]
}' myFile

It's a bit ugly. If you have an improvement idea, I'd be grateful to see it as well.
Also, I cannot think of a non-convoluted way to avoid a full scan of the already cached entries for every new record/line (in order to improve performance)...
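As a quick sanity check (a hypothetical run, reusing the nine-line sample from post #23), the code above keeps lines 1, 4, 5 and 9, and correctly drops line 7 once line 9 arrives:

```shell
# Run the delete-based version against alister's nine-line sample.
cat > myFile <<'EOF'
1 abcd idx01 ijklm
2 abc idx03 klm
3 abcd idx05 jkl
4 cdef idx06 ijklm
5 efgh idx07 abcd
6 efg idx09 abc
7 efx idx11 abcd
8 fgh idx12 bcd
9 fefx blah zabcdz
EOF

awk '{
      for(i=1;i<=c;i++) {
        if (!(i in a)) continue
        if (index(a[i],$2) && index(b[i],$4)) next
        if (index($2,a[i]) && index($4,b[i])) {
           delete a[i]; delete b[i]; delete all[i]
        }
      }
      a[++c]=$2; b[c]=$4; all[c]=$0
   }
END { for (i=1; i<=c; i++) if (i in all) print all[i] }' myFile
# prints lines 1, 4, 5 and 9
```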

Thanks for staying on this thread!

Last edited by vgersh99; 05-09-2014 at 10:54 AM..
# 25  
Old 05-13-2014
Quote:
Originally Posted by vgersh99
[..]
If you have an improvement idea, I'd be grateful to see it as well.
[..]
Not really, but it could perhaps be reduced to something like this:
Code:
awk '
  {
    for(i in all) {
      if (index(a[i],$2) && index(b[i],$4)) next
      if (index($2,a[i]) && index($4,b[i])) delete all[i]
    }
    all[++c]=$0
    a[c]=$2
    b[c]=$4
  }

  END {
    for (i=1; i<=c; i++) if (i in all) print all[i]
  }
' file

This example does not delete the a and b array elements, so it is less memory-efficient, but the code is a bit simpler...
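For example (a hypothetical run, reusing the nine-line sample from post #23), this reduced version produces the same result as the longer one, still dropping line 7 once line 9 arrives:

```shell
# Run the reduced version against the same nine-line sample.
cat > file <<'EOF'
1 abcd idx01 ijklm
2 abc idx03 klm
3 abcd idx05 jkl
4 cdef idx06 ijklm
5 efgh idx07 abcd
6 efg idx09 abc
7 efx idx11 abcd
8 fgh idx12 bcd
9 fefx blah zabcdz
EOF

awk '
  {
    for(i in all) {
      if (index(a[i],$2) && index(b[i],$4)) next
      if (index($2,a[i]) && index($4,b[i])) delete all[i]
    }
    all[++c]=$0
    a[c]=$2
    b[c]=$4
  }
  END {
    for (i=1; i<=c; i++) if (i in all) print all[i]
  }
' file
# prints lines 1, 4, 5 and 9 - identical to the delete-based variant
```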
# 26  
Old 05-13-2014
I did not forget this discussion; the remaining challenge is speed. The script by vgersh99 took ~45 hours to finish the 76,235-row file. Probably awk is not the best choice for processing big files. Thank you all anyway!

Last edited by yifangt; 05-13-2014 at 03:09 PM..
# 27  
Old 05-13-2014
Quote:
Originally Posted by yifangt
I did not forget this discussion; the remaining challenge is speed. The script by vgersh99 took ~45 hours to finish the 76,235-row file. Probably awk is not the best choice for processing big files. Thank you all anyway!
I don't think it's a matter of the 'tool of choice', but rather of an algorithm for dealing with the cached/stored matches so as to minimize the 'full table scan' for each newly read row/line...
I couldn't think of a non-convoluted way of minimizing that full table scan...
Maybe others would have better ideas.....
# 28  
Old 05-13-2014
What awk were you using? There is sometimes quite a bit to be gained by choosing the right version of awk. If you used gawk, that would be the slowest; BSD awk should be faster, and mawk can really be surprisingly fast at times, sometimes several times faster. It is not an optimal algorithmic solution, but it might be worth exploring...
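A rough harness for that comparison might look like this (entirely hypothetical: the synthetic generator, file name and row count are made up, and only the implementations actually installed get timed):

```shell
# Generate a synthetic 4-field file, then time Scrutinizer's reduced
# script under each awk implementation found on the PATH.
N=2000          # raise toward 76235 to approximate the real workload
awk -v n="$N" 'BEGIN {
    for (i = 1; i <= n; i++)
        printf "%d s%d idx%d t%d\n", i, i % 500, i, i % 300
}' > bench.txt

for A in gawk mawk nawk awk; do
    command -v "$A" >/dev/null 2>&1 || continue
    t0=$(date +%s)
    "$A" '{
        for (i in all) {
            if (index(a[i],$2) && index(b[i],$4)) next
            if (index($2,a[i]) && index($4,b[i])) delete all[i]
        }
        all[++c]=$0; a[c]=$2; b[c]=$4
    } END { for (i=1;i<=c;i++) if (i in all) print all[i] }' bench.txt > /dev/null
    t1=$(date +%s)
    printf '%s: %ds\n' "$A" "$((t1 - t0))"
done
```

On a file this small the differences will be lost in the noise; the gap between implementations only shows up as N grows toward the real row count.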


--
Is the order of output important by the way?

Last edited by Scrutinizer; 05-13-2014 at 03:31 PM..