Filter uniq field values (non-substring)


 
# 22  
Old 05-08-2014
Quote:
Originally Posted by alister
If you add an array of keys, for the added memory, it's possible to completely bypass a scan of a and b:
Code:
awk '{
    if (($2 SUBSEP $4) in k) next
    for(i=1;i<=c;i++) {
...

Regards,
Alister
this only helps when both fields, concatenated, are an exact match of a previously seen pair - some saving, but it's hard to gauge the savings unless we can guesstimate the ratio of exact-match duplicates to the total # of records - the lookup for every record/line might not be worth it...
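For what it's worth, a complete sketch of that shortcut might look like this (purely illustrative - the three-line sample file is made up, not the real data):

```shell
# Sketch of alister's key-lookup shortcut: a verbatim repeat of a
# ($2, $4) pair skips the cache scan entirely.
cat > sample.txt <<'EOF'
1 abcd idx01 ijklm
2 abcd idx02 ijklm
3 abc idx03 klm
EOF

awk '{
    key = $2 SUBSEP $4
    if (key in k) next                # exact repeat of a seen pair: no scan
    k[key] = 1
    for (i = 1; i <= c; i++) {
        if (!(i in a)) continue
        if (index(a[i], $2) && index(b[i], $4)) next
        if (index($2, a[i]) && index($4, b[i])) {
            delete a[i]; delete b[i]; delete all[i]
        }
    }
    a[++c] = $2; b[c] = $4; all[c] = $0
}
END { for (i = 1; i <= c; i++) if (i in all) print all[i] }' sample.txt
```

Here line 2 is dropped by the key lookup alone (no scan at all), line 3 is dropped by the substring scan, and only line 1 survives.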

---------- Post updated at 06:59 PM ---------- Previous update was at 06:57 PM ----------

Quote:
Originally Posted by alister
That looks wrong to me. Every line is added to all, but none of its members are ever removed (even when members of a or b are deleted).

Regards,
Alister
see simplified version - with no deletes - just next-ing...
# 23  
Old 05-08-2014
Nevermind me. I didn't register the "next".

Regards,
Alister

---------- Post updated at 07:35 PM ---------- Previous update was at 06:59 PM ----------

Quote:
Originally Posted by vgersh99
see simplified version - with no deletes - just next-ing...
You are correct in correcting me; not every line is added. However, if, like the original problem, substrings can precede their superstrings, then your suggestion is inadequate.

Consider:
Code:
1 abcd    idx01    ijklm
2 abc    idx03    klm
3 abcd    idx05    jkl
4 cdef    idx06    ijklm
5 efgh    idx07    abcd
6 efg    idx09    abc
7 efx    idx11    abcd
8 fgh    idx12    bcd
9 fefx  blah  zabcdz

If, like the original data sample, substrings can precede their superstrings, line 7 should be excluded because both its $2 and $4 are substrings of line 9's. Your code won't catch that.

Again, I could be mistaken. yifangt has not been strictly comprehensive in describing the problem.

I hope my nitpicking isn't getting on your last nerve.

Regards,
Alister
# 24  
Old 05-09-2014
no-no, this is all good - no ill feelings here!
I see your point - and yes, we solved this case earlier - I need to revert to my previous code:
Code:
awk '{
      for(i=1;i<=c;i++) {
        if (!(i in a)) continue                 # slot was deleted earlier
        if (index(a[i],$2) && index(b[i],$4))   # new pair covered by a cached line
           next
        if (index($2,a[i]) && index($4,b[i])) { # cached pair covered by new line
           delete a[i]
           delete b[i]
           delete all[i]
        }
      }

      a[++c]=$2
      b[c]=$4
      all[c]=$0
   }
END {
   for (i=1; i<=c;i++) if (i in all) print all[i]
}' myFile

It's a bit ugly. If you have an improvement idea, I'd be grateful to see it as well.
Also, I cannot think of a non-convoluted way to avoid a full scan of the already cached entries for every new record/line (in order to improve performance)...
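As a quick sanity check (a hypothetical run, reusing the nine-line sample from post #23), the code above keeps lines 1, 4, 5 and 9, and correctly drops line 7 once line 9 arrives:

```shell
# Run the delete-based version against alister's nine-line sample.
cat > myFile <<'EOF'
1 abcd idx01 ijklm
2 abc idx03 klm
3 abcd idx05 jkl
4 cdef idx06 ijklm
5 efgh idx07 abcd
6 efg idx09 abc
7 efx idx11 abcd
8 fgh idx12 bcd
9 fefx blah zabcdz
EOF

awk '{
      for(i=1;i<=c;i++) {
        if (!(i in a)) continue
        if (index(a[i],$2) && index(b[i],$4)) next
        if (index($2,a[i]) && index($4,b[i])) {
           delete a[i]; delete b[i]; delete all[i]
        }
      }
      a[++c]=$2; b[c]=$4; all[c]=$0
   }
END { for (i=1; i<=c; i++) if (i in all) print all[i] }' myFile
# prints lines 1, 4, 5 and 9
```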

Thanks for staying on this thread!

Last edited by vgersh99; 05-09-2014 at 10:54 AM..
# 25  
Old 05-13-2014
Quote:
Originally Posted by vgersh99
[..]
If you have an improvement idea, I'd be grateful to see it as well.
[..]
Not really, but it could perhaps be reduced to something like this:
Code:
awk '
  {
    for(i in all) {
      if (index(a[i],$2) && index(b[i],$4)) next
      if (index($2,a[i]) && index($4,b[i])) delete all[i]
    }
    all[++c]=$0
    a[c]=$2
    b[c]=$4
  }

  END {
    for (i=1; i<=c; i++) if (i in all) print all[i]
  }
' file

This example does not delete the a and b array elements, so it is less memory-efficient, but the code is a bit simpler...
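For example (a hypothetical run, reusing the nine-line sample from post #23), this reduced version produces the same result as the longer one, still dropping line 7 once line 9 arrives:

```shell
# Run the reduced version against the same nine-line sample.
cat > file <<'EOF'
1 abcd idx01 ijklm
2 abc idx03 klm
3 abcd idx05 jkl
4 cdef idx06 ijklm
5 efgh idx07 abcd
6 efg idx09 abc
7 efx idx11 abcd
8 fgh idx12 bcd
9 fefx blah zabcdz
EOF

awk '
  {
    for(i in all) {
      if (index(a[i],$2) && index(b[i],$4)) next
      if (index($2,a[i]) && index($4,b[i])) delete all[i]
    }
    all[++c]=$0
    a[c]=$2
    b[c]=$4
  }
  END {
    for (i=1; i<=c; i++) if (i in all) print all[i]
  }
' file
# prints lines 1, 4, 5 and 9 - identical to the delete-based variant
```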
# 26  
Old 05-13-2014
I did not forget this discussion; the remaining challenge is speed. The script by vgersh99 took ~45 hours to finish the 76,235-row file. Probably awk is not the best choice for processing big files. Thank you all anyway!

Last edited by yifangt; 05-13-2014 at 03:09 PM..
# 27  
Old 05-13-2014
Quote:
Originally Posted by yifangt
I did not forget this discussion; the remaining challenge is speed. The script by vgersh99 took ~45 hours to finish the 76,235-row file. Probably awk is not the best choice for processing big files. Thank you all anyway!
I don't think it's a matter of the 'tool of choice', but rather of an algorithm for dealing with the cached/stored matches so as to minimize the 'full table scan' for each newly read row/line...
I couldn't think of a non-convoluted way of minimizing that full table scan...
Maybe others would have better ideas.....
# 28  
Old 05-13-2014
What awk were you using? There is sometimes quite a bit to be gained by choosing the right version of awk. If you used gawk, that would be the slowest; BSD awk should be faster, and mawk can really be surprisingly fast at times, sometimes several times faster. It is not an optimal algorithmic solution, but it might be worth exploring...
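A rough harness for that comparison might look like this (entirely hypothetical: the synthetic generator, file name and row count are made up, and only the implementations actually installed get timed):

```shell
# Generate a synthetic 4-field file, then time Scrutinizer's reduced
# script under each awk implementation found on the PATH.
N=2000          # raise toward 76235 to approximate the real workload
awk -v n="$N" 'BEGIN {
    for (i = 1; i <= n; i++)
        printf "%d s%d idx%d t%d\n", i, i % 500, i, i % 300
}' > bench.txt

for A in gawk mawk nawk awk; do
    command -v "$A" >/dev/null 2>&1 || continue
    t0=$(date +%s)
    "$A" '{
        for (i in all) {
            if (index(a[i],$2) && index(b[i],$4)) next
            if (index($2,a[i]) && index($4,b[i])) delete all[i]
        }
        all[++c]=$0; a[c]=$2; b[c]=$4
    } END { for (i=1;i<=c;i++) if (i in all) print all[i] }' bench.txt > /dev/null
    t1=$(date +%s)
    printf '%s: %ds\n' "$A" "$((t1 - t0))"
done
```

On a file this small the differences will be lost in the noise; the gap between implementations only shows up as N grows toward the real row count.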


--
Is the order of output important by the way?

Last edited by Scrutinizer; 05-13-2014 at 03:31 PM..