Removing duplicates on a single "column" (delimited file)
Hello!
I'm quite new to Linux and haven't found a script for this task; unfortunately, my knowledge of shell scripting is quite limited...
Could you help me remove duplicate lines from a file, based only on a single "column"?
For example:
The yellow (highlighted) field is duplicated on a few lines, such as the first and last ones, even though the data around it often differs.
For my purpose there is no need to keep the occurrence twice, even if the info before and after it is different... So what I need is a script (maybe awk or cut) that recognizes the same string at position 8 and, if it was already found before, deletes that whole line, but keeps every other line that does not contain a repeated string at position 8.
Ideas?
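A sketch of the usual awk idiom for this kind of task, assuming whitespace-separated fields with the key in field 8 (oldfile and newfile are placeholder names, not from the original post):

awk '!seen[$8]++' oldfile > newfile
# "seen" counts how often each field-8 value has appeared;
# a line is printed only the first time its field-8 value is seen

If the file uses another delimiter, add for example -F'|' (or -F';') before the program.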
I prefer the awk solutions suggested by Jim McNamara and MadeInGermany for your stated problem, but you could also consider this alternative for cases where you want the output sorted on the field you're using to select records:
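A minimal sketch of such a sort-based command, assuming whitespace-separated fields and field 8 as the key (not necessarily the exact command from the original reply):

sort -k8,8 -u oldfile > newfile
# -k8,8 restricts sorting and comparison to field 8; -u keeps one line per key value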
which, with your sample input in oldfile, leaves the result, deduplicated and sorted on that field, in newfile.
Hello.
System: openSUSE Leap 42.3
I have a bash script that builds a text file.
I would like the last command to run:
print_cmd -o page-left=43 -o page-right=22 -o page-top=28 -o page-bottom=43 -o font=LatinModernMono12:regular:9 some_file.txt
where:
print_cmd ::= some printing... (1 Reply)
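A sketch of how such a script's final step might look, assuming lp as the concrete print command (print_cmd is only a placeholder in the post) and reusing the -o options quoted above:

#!/bin/bash
# ... commands that build some_file.txt go here ...
outfile=some_file.txt
# collect the printing options in an array, then print as the last step
opts=(-o page-left=43 -o page-right=22 -o page-top=28 -o page-bottom=43
      -o font=LatinModernMono12:regular:9)
lp "${opts[@]}" "$outfile"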
Hi!
I have two files.
The first file, "X", has two columns:
TCP-5100 Sybase_5100
TCP-5600 Sybase_5600
Second file "Y" for example--
:services (
:AdminInfo (
:chkpf_uid ("{A2F79713-B67D-4409-83A4-A90804E983E9}")
:ClassName (rule_services)
)
:compound ()... (12 Replies)
Hi, all, I have a file that looks like:
## XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
## YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY
#AA AB AC AD AE AF AG AH AI AJ AK AL
20 60039 60039 ... (5 Replies)
How to use "mailx" command to do e-mail reading the input file containing email address, where column 1 has name and column 2 containing “To” e-mail address
and column 3 contains “cc” e-mail address to include with same email.
Sample input file, email.txt
Below is an sample code where... (2 Replies)
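A minimal sketch of one way to drive mailx from such a file, assuming email.txt holds three whitespace-separated columns (name, To address, Cc address) and using a placeholder subject and body:

# send one message per line of email.txt
while read -r name to cc
do
    echo "Hello ${name}, please see the details below." |
        mailx -s "Sample subject" -c "$cc" "$to"
done < email.txt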
Hi guys, got a bit of a bind I'm in. I'm looking to remove duplicates from a pipe-delimited file, but do so based on 2 columns. Sounds easy enough, but here's the kicker...
Column #1 is a simple ID, which is used to identify the duplicate.
Once dups are identified, I need to only keep the one... (2 Replies)
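A sketch of the usual awk approach, assuming the duplicate key is column 1 of the pipe-delimited file and that keeping the first occurrence is acceptable (the post's actual keep-rule is truncated):

awk -F'|' '!seen[$1]++' infile > outfile
# prints a line only the first time its column-1 ID is seen

If both columns form the key, use '!seen[$1,$2]++' instead.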
Hi folks,
I have a log file in the format below and am trying to get only the unique entries, based on the mnemonic, in Perl.
Could anyone please help me with the code and the logic?
Severity Mnemonic Log Message
7 CLI_SCHEDULER Logfile for scheduled CLI... (3 Replies)
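A sketch of a Perl one-liner for this, assuming the mnemonic is the second whitespace-separated field and that keeping the first line per mnemonic is what is wanted:

perl -ane 'print unless $seen{$F[1]}++' logfile
# -a splits each line into @F; a line is printed only the first time
# its second field (the mnemonic) has been seen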
Hi,
I am on a Solaris 8 machine.
If someone can help me adjust this awk one-liner (turning it into a real awk script) to get past this "event not found" error
...or
present Perl solution code that works with Perl 5.8 in the csh shell... that would be great.
******************
... (3 Replies)
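Assuming the one-liner was the usual !seen[key]++ dedup idiom, the "event not found" error comes from csh treating the ! as history expansion; moving the code into a real awk script sidesteps it. A sketch (the field used as the key is an assumption):

#!/usr/bin/nawk -f
# dedup.nawk -- keep the first line for each value of field 1;
# inside a script file the leading ! is never seen by csh
!seen[$1]++

Run it as: nawk -f dedup.nawk infile > outfile (on Solaris, use nawk or /usr/xpg4/bin/awk rather than old /usr/bin/awk).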
Hi,
I have a line in an input file, as below:
3G_CENTRAL;INDONESIA_(M)_TELKOMSEL;SPECIAL_WORLD_GRP_7_FA_2_TELKOMSEL
My expected output for that line must be:
"1-Radon1-cMOC_deg"|"LDIndex"|"3G_CENTRAL|INDONESIA_(M)_TELKOMSEL"|LAST|"SPECIAL_WORLD_GRP_7_FA_2_TELKOMSEL"
Can someone... (7 Replies)
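A sketch of an awk way to do this, assuming the input is semicolon-delimited with exactly three fields per line and that the leading "1-Radon1-cMOC_deg" and "LDIndex" fields of the output are constants:

awk -F';' '{ printf "\"1-Radon1-cMOC_deg\"|\"LDIndex\"|\"%s|%s\"|LAST|\"%s\"\n", $1, $2, $3 }' infile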
I need to create a flat file with columns delimited by "\002" (octal 2)
I tried using the simple echo.
name="Adam Smith"
age=40
address="1 main st"
city="New York"
echo ${name}"\002"${age}"\002"${address}"\002"${city} > mytmp
but it creates a delimiter with a different octal... (4 Replies)
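A sketch using printf instead of echo, keeping the variable names from the post; printf interprets \002 in its format string as a single byte with octal value 2:

printf '%s\002%s\002%s\002%s\n' "$name" "$age" "$address" "$city" > mytmp
# the fields are written separated by the literal octal-2 byte,
# which can be checked with: od -c mytmp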
Hi All,
I have working (Perl) code to combine 2 input files into a single output file using the join function. It works to a point, but has the following limitations:
1. I am restrained to 2 input files only.
2. Only the "matched" fields are written out to the "matched" output file and... (1 Reply)