Rewriting GNU uniq in awk

10-23-2012

Quote:

Originally Posted by alister

What does the data look like? Does it adhere to some format? Does it contain whitespace? Are certain characters guaranteed to appear? Are certain characters guaranteed to not appear? Knowing what we're dealing with might suggest alternative approaches.

Each line starts with a number, right-aligned and padded with leading spaces to fill the first 16 characters, followed by a single space and then an unquoted string of variable length that can contain any characters. A second version is sometimes used, which has a 32-character MD5 hash in place of the number, again followed by a space and then the string.

The data is sorted, so a simple comparison with the previous line is enough to find a match. It could consist of any number of lines, from just a few to tens of thousands. Similarly, any number of lines may share a duplicated first field. The initial number or hash is used to group different strings, and the strings themselves will always be unique. The lines in a duplicated group are then piped into a "while read" loop for processing.
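For illustration, here is a minimal sketch of that previous-line comparison in awk. It assumes the goal is to print every line whose leading key is shared with an adjacent line, which is roughly what GNU uniq does with -w 16 -D. The function name dup_groups, its width argument, and the sample data are made up for this example and are not taken from the real script.

```shell
# Hypothetical sample: 16-character right-aligned number, a space, then a free-form string.
sample=$(printf '%16s %s\n' 1 'alpha one' 2 'beta  two' 2 'beta three' 3 'gamma' 7 'x y' 7 'x  z' 7 'w')

# Print only the lines that belong to a group of two or more adjacent lines
# sharing the same leading key. $1 = key width (16, or 32 for the MD5 variant).
dup_groups() {
    awk -v width="${1:-16}" '
    {
        key = substr($0, 1, width)         # substr() yields a string, so the test below is a string compare
        if (NR > 1 && key == prevkey) {
            if (!printed) print prevline   # first member of the group, held back until a match shows up
            print
            printed = 1
        } else {
            printed = 0                    # new key: nothing printed for this group yet
        }
        prevkey = key
        prevline = $0
    }'
}

# Feed the duplicated groups to a "while read" loop; IFS= and -r keep the
# leading spaces and any backslashes in the line intact.
printf '%s\n' "$sample" | dup_groups 16 |
while IFS= read -r line; do
    printf 'dup: %s\n' "$line"
done
```

With the sample above, only the two lines keyed 2 and the three lines keyed 7 reach the loop. Only the previous line is kept in memory, so input size should not matter as long as the data really is sorted on the key.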

mij

UNIQ(1)                   BSD General Commands Manual                  UNIQ(1)

NAME
     uniq -- report or filter out repeated lines in a file

SYNOPSIS
     uniq [-c | -d | -u] [-i] [-f num] [-s chars] [input_file [output_file]]

DESCRIPTION
     The uniq utility reads the specified input_file comparing adjacent lines,
     and writes a copy of each unique input line to the output_file. If
     input_file is a single dash ('-') or absent, the standard input is read.
     If output_file is absent, standard output is used for output. The second
     and succeeding copies of identical adjacent input lines are not written.
     Repeated lines in the input will not be detected if they are not
     adjacent, so it may be necessary to sort the files first.

     The following options are available:

     -c      Precede each output line with the count of the number of times
             the line occurred in the input, followed by a single space.

     -d      Only output lines that are repeated in the input.

     -f num  Ignore the first num fields in each input line when doing
             comparisons. A field is a string of non-blank characters
             separated from adjacent fields by blanks. Field numbers are one
             based, i.e., the first field is field one.

     -s chars
             Ignore the first chars characters in each input line when doing
             comparisons. If specified in conjunction with the -f option, the
             first chars characters after the first num fields will be
             ignored. Character numbers are one based, i.e., the first
             character is character one.

     -u      Only output lines that are not repeated in the input.

     -i      Case insensitive comparison of lines.

ENVIRONMENT
     The LANG, LC_ALL, LC_COLLATE and LC_CTYPE environment variables affect
     the execution of uniq as described in environ(7).

EXIT STATUS
     The uniq utility exits 0 on success, and >0 if an error occurs.

COMPATIBILITY
     The historic +number and -number options have been deprecated but are
     still supported in this implementation.

SEE ALSO
     sort(1)

STANDARDS
     The uniq utility conforms to IEEE Std 1003.1-2001 (``POSIX.1'') as
     amended by Cor. 1-2002.

HISTORY
     A uniq command appeared in Version 3 AT&T UNIX.

BSD                              July 3, 2004                              BSD
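As a rough illustration of the adjacent-line behaviour described in the man page, the default mode and the -d and -u options can be approximated in awk. This is only a sketch: the function name awk_uniq and its mode argument are invented here, and the -c, -i, -f and -s options are not covered.

```shell
# $1 = mode: "" (one copy of each run), "d" (only repeated runs), "u" (only unrepeated runs)
awk_uniq() {
    awk -v mode="$1" '
    # Print the line that represents the run just finished, if the mode wants it.
    function emit() {
        if (mode == "" || (mode == "d" && count > 1) || (mode == "u" && count == 1))
            print prev
    }
    # Appending "" forces string values, so "1" and "1.0" are not treated as equal numbers.
    NR == 1          { prev = $0 ""; count = 1; next }
    ($0 "") == prev  { count++; next }
                     { emit(); prev = $0 ""; count = 1 }
    END              { if (NR > 0) emit() }'
}

# Usage: prints "a" and "c", the two lines that are repeated.
printf '%s\n' a a b c c c | awk_uniq d
```

As with uniq itself, only adjacent lines are compared, so unsorted input with scattered duplicates will not be collapsed.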
