awk remove duplicate code

01-25-2012

Hi,

In a previous, now closed thread, I found the following awk script:

Code:

awk '{t[$1" "$2" "$3" "$4]=$5" "$6" "$7}END{for (i in t){print i,t[i]}}'

This code does a great job of removing duplicates by the first four fields from a 7-field set of columns. I would very much like to understand how it works, but I can't find anything in the awk documentation. Could someone explain it, please? Is the t[ ]= some special function?

Thanks,
Pawel

pawelrc

01-25-2012

The main { } code block, the one without END, runs once for every input line. For each line, $1, $2, $3, etc. are set to the values of that line's columns. Writing variables and strings next to each other concatenates them into one longer string.
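
To see the concatenation on its own, here is a small sketch. The sample line "a b c" and the shell variable names joined and spaced are made up for illustration; only the awk behaviour comes from the explanation above.

```shell
# Fields written side by side are glued together with nothing in between.
joined=$(printf 'a b c\n' | awk '{ print $1 $2 }')

# Putting a literal " " between them keeps the fields separated,
# which is what the script in the question does when it builds its index.
spaced=$(printf 'a b c\n' | awk '{ print $1 " " $2 }')

echo "joined: $joined"
echo "spaced: $spaced"
```

The second form matters for the script in question: without the " " separators, lines such as "ab c" and "a bc" would produce the same index.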

So for every line, it sets a value like this in the array:

Code:

t["a b c d"]="e f g"

If there's a duplicate line, setting the same index twice doesn't put two elements in the array. The previous contents of t["a b c d"], for instance, just get overwritten by the next line starting with a b c d, so the last such line wins.
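
A quick way to check the overwriting behaviour is to run the script from the question on a few lines. The input below is invented for the test, and the sort at the end is only there to make the output order predictable; it is not part of the original script.

```shell
# Hypothetical sample input: 7 fields per line.
# The first and third lines share the same first four fields.
input='a b c d 1 2 3
w x y z 4 5 6
a b c d 7 8 9'

# The script from the question, unchanged, followed by sort.
result=$(printf '%s\n' "$input" |
  awk '{t[$1" "$2" "$3" "$4]=$5" "$6" "$7}END{for (i in t){print i,t[i]}}' |
  sort)

printf '%s\n' "$result"
```

Only one "a b c d" line comes out, and it carries 7 8 9, the values from the last line with that index.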

Only once all the lines have been read does awk run the END {} block, which goes through each element in the array and prints it (in no particular order).
The syntax for (X in ARRAY) loops over every element in an array, with X being the array index and ARRAY[X] being the contents at that index.
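
If the unpredictable output order is a problem, a different and commonly used idiom may help. Note that this is not the script from the question: it prints each line the first time its index is seen, so it keeps the first duplicate rather than the last, and it keeps the input order. The array name seen and the sample input are assumptions for illustration.

```shell
# Hypothetical sample input: the first and third lines share the first four fields.
input='w x y z 4 5 6
a b c d 1 2 3
w x y z 7 8 9'

# seen[index]++ is 0 (false) the first time an index appears, so !seen[...]++
# is true only for the first line with that index; awk prints lines whose
# pattern is true.
result=$(printf '%s\n' "$input" | awk '!seen[$1" "$2" "$3" "$4]++')

printf '%s\n' "$result"
```

Since nothing is stored until END, this version also prints as it reads instead of waiting for the whole input.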

Last edited by Corona688; 01-25-2012 at 11:49 AM..

Corona688

01-25-2012

t[] is an associative array. In plainer terms, the index ( t[ index goes in here ] ) can be any string, such as several fields joined together. A lot of other languages use an integer to reference array elements. awk accepts numbers as indexes too, but it stores them as strings, and most of the time the index is a character string.
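
A minimal sketch of what that means in practice. The indexes and values here ("apple", "pear", and so on) are made up; the point is only that string and numeric indexes live in the same array.

```shell
out=$(awk 'BEGIN {
    t["apple"] = "red"   # string index
    t[1] = "one"         # numeric index, converted to the string "1"

    # t["1"] refers to the same element as t[1]
    print t["apple"], t["1"]

    # "in" tests whether an index exists without creating it
    print (("pear" in t) ? "yes" : "no")
}')

printf '%s\n' "$out"
```

No declaration or size is needed; an element comes into existence the first time it is assigned.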

jim mcnamara

01-25-2012

Thanks. That cleared it up!

pawelrc
