Rewriting GNU uniq in awk


 
# 8  
10-23-2012
Quote:
Originally Posted by alister
What does the data look like? Does it adhere to some format? Does it contain whitespace? Are certain characters guaranteed to appear? Are certain characters guaranteed to not appear? Knowing what we're dealing with might suggest alternative approaches.
Each line starts with a number, right-aligned and padded with leading spaces to fill the first 16 characters, followed by a space and then an unquoted string of variable length that can include any characters. There is another version, sometimes used, that has a 32-character MD5 hash in place of the number, again followed by a space and then the string.
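
Because the string can contain any characters, splitting on whitespace fields is unsafe here; cutting the line at fixed character positions is more robust. A minimal sketch of that idea, assuming the layout described above (the function name split_record and the "|" output separator are only for illustration):

```shell
# split_record: print the key and the string of each line, separated by "|".
# klen is the key width: 16 for the padded number, 32 for the MD5 variant.
split_record() {
    awk -v klen="${1:-16}" '{
        key = substr($0, 1, klen)      # fixed-width key, padding included
        str = substr($0, klen + 2)     # everything after the single space
        sub(/^ +/, "", key)            # drop the right-alignment padding
        print key "|" str
    }'
}
```

The string is taken verbatim from position klen + 2 onward, so embedded spaces and quotes survive unchanged.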

The data is sorted, so a simple comparison with the previous line is enough to find a match. The input can be anything from just a few lines to tens of thousands, and any number of those lines may share a first field. The initial number or hash is used to group different strings; the strings themselves will always be unique. The lines in each duplicated group are then piped into a "while read" loop for processing.
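
Since the input is sorted, only the previous line needs to be remembered to emit every line of a duplicated group. The following is a rough sketch of that approach, not a drop-in replacement for uniq; it behaves roughly like GNU uniq with -D and -w. The function name dup_groups and the file name data.txt are assumptions for illustration:

```shell
# dup_groups: from sorted input, print only the lines whose first klen
# characters match those of an adjacent line (all members of each group).
dup_groups() {
    awk -v klen="${1:-16}" '
    {
        key = substr($0, 1, klen)
        if (NR > 1 && key == prevkey) {
            if (!printed) print prevline   # first line of the group, held back
            print
            printed = 1
        } else {
            printed = 0                    # a new key starts a new group
        }
        prevkey = key
        prevline = $0
    }'
}
```

It could be used as `dup_groups 16 < data.txt | while IFS= read -r line; do ...; done`, or with 32 for the MD5 variant. Only one line is held in memory at a time, so the size of the input should not matter.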
uniq(1) 							   User Commands							   uniq(1)

NAME
       uniq - report or filter out repeated lines in a file

SYNOPSIS
   /usr/bin/uniq
       /usr/bin/uniq [-c | -d | -u] [-f fields] [-s char] [input_file [output_file]]

       /usr/bin/uniq [-c | -d | -u] [-n] [+ m] [input_file [output_file]]

   ksh93
       uniq [-cdiu] [-D[delimit]] [-f fields] [-s chars] [-w chars] [input_file [output_file]]

       uniq [-cdiu] [-D[delimit]] [-n] [+m] [-w chars] [input_file [output_file]]

DESCRIPTION
   /usr/bin/uniq
       The uniq utility reads an input file comparing adjacent lines and writes one copy of each input line on the output. The second and succeeding copies of repeated adjacent input lines are not written.

       Repeated lines in the input are not detected if they are not adjacent.

   ksh93
       The uniq built-in in ksh93 is associated with the /bin or /usr/bin path. It is invoked when uniq is executed without a pathname prefix and the pathname search finds a /bin/uniq or /usr/bin/uniq executable. uniq reads an input, comparing adjacent lines, and writing one copy of each input line on the output. The second and succeeding copies of the repeated adjacent lines are not written.

       If output_file is not specified, uniq writes to standard output. If input_file is not specified, or if input_file is -, uniq reads from standard input, and the start of the file is defined as the current offset.

OPTIONS
   /usr/bin/uniq
       The following options are supported by /usr/bin/uniq:

       -c           Precedes each output line with a count of the number of times the line occurred in the input.

       -d           Suppresses the writing of lines that are not repeated in the input.

       -f fields    Ignores the first fields fields on each input line when doing comparisons, where fields is a positive decimal integer. A field is the maximal string matched by the basic regular expression:

                        [[:blank:]]*[^[:blank:]]*

                    If fields specifies more fields than appear on an input line, a null string is used for comparison.

       +m           Equivalent to -s chars with chars set to m.

       -n           Equivalent to -f fields with fields set to n.

       -s chars     Ignores the first chars characters when doing comparisons, where chars is a positive decimal integer. If specified in conjunction with the -f option, the first chars characters after the first fields fields are ignored. If chars specifies more characters than remain on an input line, a null string is used for comparison.

       -u           Suppresses the writing of lines that are repeated in the input.

   ksh93
       The following options are supported by the uniq built-in command in ksh93:

       -c
       --count
                    Outputs the number of times each line occurred along with the line.

       -d
       --repeated | duplicates
                    Outputs only duplicate lines.

       -D
       --all-repeated[=delimit]
                    Outputs all duplicate lines as a group with an empty line delimiter specified by delimit. Specify delimit as one of the following:

                    none        Do not delimit duplicate groups.
                    prepend     Prepend an empty line before each group.
                    separate    Separate each group with an empty line.

                    The value for delimit can be omitted. The default value is none.

       -f
       --skip-fields=fields
                    Skips over fields number of fields before checking for uniqueness. A field is the minimal string matching the BRE [[:blank:]]*[^[:blank:]]*.

       -i
       --ignore-case
                    Ignore case in comparisons.

       +m           Equivalent to the -s chars option, with chars set to m.

       -n           Equivalent to the -f fields option, with fields set to n.

       -s
       --skip-chars=chars
                    Skips over chars number of characters before checking for uniqueness. If specified with the -f option, the first chars after the first fields are ignored. If chars specifies more characters than are on the line, an empty string is used for comparison.

       -u
       --uniq
                    Outputs unique lines.

       -w
       --check-chars=chars
                    Skips over any specified fields and characters, then compares chars number of characters.

OPERANDS
       The following operands are supported:

       input_file     A path name of the input file. If input_file is not specified, or if the input_file is -, the standard input is used.

       output_file    A path name of the output file. If output_file is not specified, the standard output is used. The results are unspecified if the file named by output_file is the file named by input_file.

EXAMPLES
       Example 1 Using the uniq Command

       The following example lists the contents of the uniq.test file and outputs a copy of the repeated lines.

           example% cat uniq.test
           This is a test.
           This is a test.
           TEST.
           Computer.
           TEST.
           TEST.
           Software.
           example% uniq -d uniq.test
           This is a test.
           TEST.
           example%

       The next example outputs just those lines that are not repeated in the uniq.test file.

           example% uniq -u uniq.test
           TEST.
           Computer.
           Software.
           example%

       The last example outputs a report with each line preceded by a count of the number of times each line occurred in the file:

           example% uniq -c uniq.test
              2 This is a test.
              1 TEST.
              1 Computer.
              2 TEST.
              1 Software.
           example%

ENVIRONMENT VARIABLES
       See environ(5) for descriptions of the following environment variables that affect the execution of uniq: LANG, LC_ALL, LC_CTYPE, LC_MESSAGES, and NLSPATH.

EXIT STATUS
       The following exit values are returned:

       0     Successful completion.

       >0    An error occurred.

ATTRIBUTES
       See attributes(5) for descriptions of the following attributes:

   /usr/bin/uniq
       +-----------------------------+-----------------------------+
       |       ATTRIBUTE TYPE        |       ATTRIBUTE VALUE       |
       +-----------------------------+-----------------------------+
       |Availability                 |SUNWesu                      |
       +-----------------------------+-----------------------------+
       |CSI                          |Enabled                      |
       +-----------------------------+-----------------------------+
       |Interface Stability          |Committed                    |
       +-----------------------------+-----------------------------+
       |Standard                     |See standards(5).            |
       +-----------------------------+-----------------------------+

   ksh93
       +-----------------------------+-----------------------------+
       |       ATTRIBUTE TYPE        |       ATTRIBUTE VALUE       |
       +-----------------------------+-----------------------------+
       |Availability                 |SUNWcsu                      |
       +-----------------------------+-----------------------------+
       |Interface Stability          |See below.                   |
       +-----------------------------+-----------------------------+

       The ksh93 built-in binding to /bin and /usr/bin is Volatile. The built-in interfaces are Uncommitted.

SEE ALSO
       comm(1), ksh93(1), pcat(1), sort(1), uncompress(1), attributes(5), environ(5), standards(5)

SunOS 5.11                         13 Mar 2008                         uniq(1)