Filter uniq field values (non-substring)


 
# 8  
Old 05-07-2014
sorry, mah bad:
Code:
 awk '{for(i in a) {if (i~$2) next;if ($2 ~ i) delete a[i]};a[$2]=$0}END {for (i in a) print a[i]}' myFile

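In plain words: the script keeps an array of surviving lines keyed by $2; each new line is skipped if its $2 already appears inside a stored key, and any stored line whose key appears inside the new $2 is deleted. The same one-liner spread out with comments (no change in behavior):
Code:
awk '
{
    for (i in a) {               # compare the new $2 with every stored key
        if (i ~ $2) next         # $2 occurs inside an existing key: skip the current line
        if ($2 ~ i) delete a[i]  # an existing key occurs inside $2: drop that stored line
    }
    a[$2] = $0                   # remember the current line, keyed by $2
}
END { for (i in a) print a[i] }  # print the survivors (in arbitrary order)
' myFile
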
# 9  
Old 05-08-2014
Since the strings tested aren't regular expressions, using the regular expression operator is, at best, unnecessarily expensive. At worst, if the strings are allowed to contain regular expression metacharacters, it can lead to an erroneous result.

I suggest using index() instead. For non-trivial data sets, it will also speed things up dramatically.
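
The difference shows up as soon as a field happens to contain a metacharacter. A quick illustration (the value "a+b" is made up purely for demonstration):
Code:
$ awk 'BEGIN { s = "a+b"; print ("aaab" ~ s), index("aaab", s) }'
1 0

The ~ operator treats "a+b" as a regular expression and reports a match; index() correctly reports that the literal string is not present.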

Testing a near-worst case scenario. The file contains 1501 lines and only the last line contains a string which is a substring of another. Note that while gawk is used, testing with mawk and nawk showed similar improvements:
Code:
$ yes | awk 'NR>1500 {exit} {print NR, NR+1000} END {print NR, 25}' > 1501_1.txt
$ tail -n5 1501_1.txt 
1497 2497
1498 2498
1499 2499
1500 2500
1501 25
$ time gawk '{for(i in a) {if (i~$2) next;if ($2 ~ i) delete a[i]};a[$2]=$0}END {for (i in a) print a[i]}' 1501_1.txt | tail -n5
638 1638
269 1269
228 1228
639 1639
229 1229

real	0m10.462s
user	0m10.149s
sys	0m0.276s
$ time gawk '{for(i in a) {if (index(i,$2)) next;if (index($2, i)) delete a[i]};a[$2]=$0}END {for (i in a) print a[i]}' 1501_1.txt | tail -n5
638 1638
269 1269
228 1228
639 1639
229 1229

real	0m0.895s
user	0m0.892s
sys	0m0.004s

Regards,
Alister
# 10  
Old 05-08-2014
Thanks!
I get it now. Actually, I have a hundred thousand lines.

I just thought of another scenario: is it possible to do this with two columns? I.e., substrings are compared within the same column only, and a line should be skipped only when both col2 and col4 are substrings at the same time.
Code:
infile:
1 abcd    idx01    ijklm
2 abc    idx03    klm
3 abcd    idx05    jkl
4 cdef    idx06    ijklm
5 efgh    idx07    abcd
6 efg    idx09    abc
7 efx    idx11    abcd
8 fgh    idx12    bcd

Code:
output:
1 abcd    idx01    ijklm
4 cdef    idx06    ijklm
5 efgh    idx07    abcd
7 efx    idx11    abcd

I tried using two arrays to loop:
Code:
gawk '{for(i in a) {if (index(i,$2)) next; for(j in b) {if (index(j,$4)) next; if (index($2, i) && index($4, j)) delete a[i]}; a[$2]=$0; b[$4]=$4}} END {for (i in a) print a[i]}' a[$2]=$0} END {for (i in a) print a[i]}'  infile

But there was no output. The second loop seems to be the problem; any suggestions, please? Thanks a lot!
# 11  
Old 05-08-2014
Quote:
Originally Posted by yifangt
I get it now. Actually, I have a hundred thousand lines.
That's the type of information that should always be mentioned in the initial post. Please keep that in mind going forward.

Quote:
Originally Posted by yifangt
I just thought of another scenario: is it possible to do this with two columns? I.e., substrings are compared within the same column only, and a line should be skipped only when both col2 and col4 are substrings at the same time.
Code:
infile:
1 abcd    idx01    ijklm
2 abc    idx03    klm
3 abcd    idx05    jkl
4 cdef    idx06    ijklm
5 efgh    idx07    abcd
6 efg    idx09    abc
7 efx    idx11    abcd
8 fgh    idx12    bcd

Code:
output:
1 abcd    idx01    ijklm
4 cdef    idx06    ijklm
5 efgh    idx07    abcd
7 efx    idx11    abcd

I tried using two arrays to loop:
Code:
gawk '{for(i in a) {if (index(i,$2)) next; for(j in b) {if (index(j,$4)) next; if (index($2, i) && index($4, j)) delete a[i]}; a[$2]=$0; b[$4]=$4}} END {for (i in a) print a[i]}' a[$2]=$0} END {for (i in a) print a[i]}'  infile

But there was no output. The second loop seems to be the problem; any suggestions, please? Thanks a lot!
I haven't given it much thought, but upon cursory examination, your logic is definitely very flawed. If i is in a, you jump to the next line. That's wrong. If I understood the task, before you can skip a line, both i must be in a and j must be in b.

Once included, if later lines prove that an earlier line's $2 and $4 are substrings, then that earlier line must be excluded. Note that an earlier line's $2 and $4 may be disqualified by different subsequent lines, so you must track that as well.

Was there some copy-paste malfunction in your post? There are two END sections, nearly identical, which doesn't make sense (multiple END pattern-action pairs are allowed, but in this case I don't see the point of them).

One simple, if not optimal, way to solve the problem is to handle each column individually and then join the results.
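
For instance, a rough sketch of that idea (the keep2.txt and keep4.txt names are just placeholders): run the index() filter once on $2, once on $4, and then keep only the lines whose first field survives both passes:
Code:
$ awk '{for(i in a) {if (index(i,$2)) next;if (index($2, i)) delete a[i]};a[$2]=$0}END {for (i in a) print a[i]}' infile > keep2.txt    # survivors of the $2-only filter
$ awk '{for(i in a) {if (index(i,$4)) next;if (index($4, i)) delete a[i]};a[$4]=$0}END {for (i in a) print a[i]}' infile > keep4.txt    # survivors of the $4-only filter
$ awk 'NR==FNR {keep[$1]; next} $1 in keep' keep2.txt keep4.txt                                                                         # lines present in both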

Regards,
Alister
# 12  
Old 05-08-2014
You are right: "If i is in a, you jump to the next line. That's wrong. ... before you can skip a line, both i must be in a and j must be in b. ... Was there some copy-paste malfunction in your post?" Sorry for that!

This part is not what I want:
"One simple, if not optimal, way to solve the problem is to handle each column individually and then join the results."
My logic is: only when i in a is a substring of $2 and j in b is a substring of $4 at the same time should that line be skipped. Deleting a[i] skips the whole line, since a[$2]=$0. If $2 and $4 are handled separately and the results joined later, some lines are deleted that should not be! For example:
Code:
1 abcd    idx01    ijklm 
2 abc    idx03    klm 
3 abcd    idx05    jkl
...
9 abcd    idx05   abcd

Line 9 should not be deleted even though its $2 is identical to that of line 1, because its $4 is not a substring of $4 on line 1! There is no cross-comparing between $2 and $4! I thought of combining the two columns into a single string, but that obviously does not work either, since the combined strings need not line up to be substrings of one another (e.g., line 2's combined "abcklm" is not a substring of line 1's combined "abcdijklm", even though each field on its own is a substring). I have a hard time grasping arrays in awk. Thanks a lot!

Last edited by yifangt; 05-08-2014 at 03:18 PM..
# 13  
Old 05-08-2014
a bit verbose and most likely not optimal:
Code:
awk '
   {idx=$2 SUBSEP $4}
   {
      for(i in a) {
        split(i,tA,SUBSEP)
        if (index(tA[1],$2) && index(tA[2],$4)) 
           next
        if (index($2, tA[1]) && index($4,tA[2])) 
           delete a[i]
      }
      a[idx]=$0
   }
END {
   for (i in a) print a[i]
}' myFile

can probably generalize this to specify any number of fields AND-ed together.....
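
A rough sketch of what that generalization might look like, with the field numbers passed in as a variable (the flds name and its space-separated format are just one possible choice):
Code:
awk -v flds="2 4" '
BEGIN { nf = split(flds, F, " ") }              # fields to be AND-ed together
{
    # build the composite key from the chosen fields
    key = $(F[1])
    for (k = 2; k <= nf; k++) key = key SUBSEP $(F[k])

    for (i in a) {
        split(i, tA, SUBSEP)
        cur_in_old = old_in_cur = 1
        for (k = 1; k <= nf; k++) {
            if (!index(tA[k], $(F[k]))) cur_in_old = 0   # this field of the new line is not inside the stored one
            if (!index($(F[k]), tA[k])) old_in_cur = 0   # this field of the stored line is not inside the new one
        }
        if (cur_in_old) next          # every chosen field of the new line is covered by a stored line
        if (old_in_cur) delete a[i]   # every chosen field of a stored line is covered by the new line
    }
    a[key] = $0
}
END { for (i in a) print a[i] }' myFile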

Last edited by vgersh99; 05-08-2014 at 04:33 PM..
# 14  
Old 05-08-2014
Quote:
Originally Posted by vgersh99
a bit verbose and most likely not optimal:
Code:
awk '
   {idx=$2 SUBSEP $4}
   {
      for(i in a) {
        split(i,tA,SUBSEP)
        if (index(tA[1],$2) && index(tA[2],$4)) 
           next
        if (index($2, tA[1]) && index($4,tA[2])) 
           delete a[i]
      }
      a[idx]=$0
   }
END {
   for (i in a) print a[i]
}' myFile

can probably generalize this to specify any number of fields AND-ed together.....
That's not correct because it only accounts for cases where one line's field pair matches another line's field pair. The case where a line's fields are substrings of fields of two different lines is unaccounted for.
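
For example, with made-up input like the lines below, line 9's $2 is contained in line 10's $2 and its $4 in line 11's $4, but no single later line covers both fields, so the code above still keeps line 9:
Code:
9  ab      idx13    xyz
10 abcd    idx14    pqr
11 efgh    idx15    wxyz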

Regards,
Alister