Filter uniq field values (non-substring) Post: 302900619

Post 302900619 by alister on Wednesday 7th of May 2014 11:01:49 PM (Shell Programming and Scripting)

05-08-2014

Since the strings being tested aren't regular expressions, using the regular expression match operator (~) is, at best, unnecessarily expensive. At worst, if the strings are allowed to contain regular expression metacharacters, it can produce an erroneous result.

I suggest using index() instead, which searches for a literal substring. For non-trivial data sets, it will also speed things up dramatically.
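A minimal sketch of the metacharacter problem, using a made-up input line (the values a.c and abc are only for illustration). The first field is used once as a dynamic regular expression and once as a literal string passed to index():

```shell
# Hypothetical input: field 1 contains the regex metacharacter "."
line='a.c abc'

# re  = 1 if $2 matches $1 treated as a regular expression
# lit = position of $1 inside $2 as a literal string (0 if absent)
result=$(printf '%s\n' "$line" | awk '{ re = ($2 ~ $1); lit = index($2, $1); print re, lit }')
echo "$result"
```

The regular expression a.c matches abc because "." stands for any character, so the ~ test reports a match, while index() correctly reports that the literal string a.c does not occur in abc.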

I tested a near-worst-case scenario: the file contains 1501 lines, and only the last line contains a string that is a substring of another. The timings below are from gawk, but testing with mawk and nawk showed similar improvements:

Code:

$ yes | awk 'NR>1500 {exit} {print NR, NR+1000} END {print NR, 25}' > 1501_1.txt
$ tail -n5 1501_1.txt 
1497 2497
1498 2498
1499 2499
1500 2500
1501 25
$ time gawk '{for(i in a) {if (i~$2) next;if ($2 ~ i) delete a[i]};a[$2]=$0}END {for (i in a) print a[i]}' 1501_1.txt | tail -n5
638 1638
269 1269
228 1228
639 1639
229 1229

real	0m10.462s
user	0m10.149s
sys	0m0.276s
$ time gawk '{for(i in a) {if (index(i,$2)) next;if (index($2, i)) delete a[i]};a[$2]=$0}END {for (i in a) print a[i]}' 1501_1.txt | tail -n5
638 1638
269 1269
228 1228
639 1639
229 1229

real	0m0.895s
user	0m0.892s
sys	0m0.004s
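For readability, here is the index() version laid out over several lines with comments, run against a small made-up sample (the file contents and the temporary file are assumptions for illustration, not the 1501-line test file). The output is piped through sort only because the order of for (i in a) is unspecified:

```shell
# Hypothetical sample: "25" is a substring of "2501", and "1001" is a
# substring of the later value "10015".
sample=$(mktemp)
printf '%s\n' '1 1001' '2 2501' '3 25' '4 77' '5 10015' > "$sample"

result=$(awk '
{
    for (i in a) {
        if (index(i, $2)) next         # $2 occurs inside a stored value: skip this line
        if (index($2, i)) delete a[i]  # a stored value occurs inside $2: drop the stored one
    }
    a[$2] = $0                         # keep the line, keyed by its second field
}
END { for (i in a) print a[i] }
' "$sample" | sort -n)

printf '%s\n' "$result"
rm -f "$sample"
```

With this sample, line 3 is skipped and line 1 is replaced by line 5, so only the lines whose second field is not a substring of any other second field remain.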

Regards,
Alister

LEARN ABOUT PLAN9

regexp

REGEXP(6)							   Games Manual 							 REGEXP(6)

NAME

       regexp - regular expression notation

DESCRIPTION

       A  regular  expression  specifies  a  set  of  strings of characters.  A member of this set of strings is said to be matched by the regular
       expression.  In many applications a delimiter character, commonly `/', bounds a regular expression.  In the following specification for  regular
       expressions the word `character' means any character (rune) but newline.

       The syntax for a regular expression e0 is

	      e3:  literal | charclass | '.' | '^' | '$' | '(' e0 ')'

	      e2:  e3
		|  e2 REP

	      REP: '*' | '+' | '?'

	      e1:  e2
		|  e1 e2

	      e0:  e1
		|  e0 '|' e1

       A literal is any non-metacharacter, or a metacharacter (one of .*+?[]()|\^$), or the delimiter preceded by `\'.

       A  charclass  is  a  nonempty string s bracketed [s] (or [^s]); it matches any character in (or not in) s.  A negated character class never
       matches newline.  A substring a-b, with a and b in ascending order, stands for the inclusive range of characters between a and  b.   In	s,
       the  metacharacters `-', `]', an initial `^', and the regular expression delimiter must be preceded by a `\'; other metacharacters have no special meaning and
       may appear unescaped.

       A `.' matches any character.

       A `^' matches the beginning of a line; `$' matches the end of the line.

       The REP operators match zero or more (*), one or more (+), zero or one (?), instances respectively of the preceding regular expression e2.

       A concatenated regular expression, e1e2, matches a match to e1 followed by a match to e2.

       An alternative regular expression, e0|e1, matches either a match to e0 or a match to e1.

       A match to any part of a regular expression extends as far as possible without preventing a match to the remainder of the  regular  expres-
       sion.

SEE ALSO

       awk(1), ed(1), sam(1), sed(1), regexp(2)

																	 REGEXP(6)
