Duplicate line removal matching some columns only


 
# 1  
Old 03-19-2013

I'm looking to remove duplicate rows from a CSV file with a twist.

The first row is a header.
There are 31 columns. I want to remove duplicates when the first 29 columns are identical, ignoring columns 30 and 31, BUT the duplicate that is kept should be the one with the shortest total character length of columns 30 and 31 combined.
Column 31 can contain commas that are not delimiters (the field is quoted).

Example
col1,col2,col3,...,col29,col30,"col31, may have some commas in it, but they are within quotes"

Mike
# 2  
Old 03-19-2013
Untested, just dreamed up:
Code:
awk -F, '{$30=length ($30$31) FS $30}1' file | sort -u | awk '{$30 = ""; $0=$0; $1=$1}1'

# 3  
Old 03-19-2013
try:
Code:
awk '
NR==1 { a[cnt++]=$0 }
NR>1 {
   s=""; for (i=30; i<=NF; i++) s=s","$i;
   t=$0; NF=29;
   if (! b[$0]) {
      a[cnt++]=t; b[$0]=$0; d[$0]=s;
   } else {
      if (length(s) < length(d[$0])) {a[cnt-1]=t; d[$0]=s;};
   }
}
END {for (i=0; i<cnt; i++) print a[i]}' FS="," OFS="," infile

# 4  
Old 03-19-2013
Post a sample of the input and output wanted...
# 5  
Old 03-19-2013
Quote:
Originally Posted by RudiC
Untested, just dreamed up:
Code:
awk -F, '{$30=length ($30$31) FS $30}1' file | sort -u | awk '{$30 = ""; $0=$0; $1=$1}1'

Mangles the data, loses the delimiters, and puts the header row at the bottom.

Mike
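For what it's worth, the decorate-sort-undecorate idea above can be salvaged by sorting on an explicit key column instead of the whole line, and by holding the header out of the pipeline. A sketch, not tested against the real 31-column file: `k` stands in for the 29 key columns (2 here so it can be tried on narrow toy data), and it assumes quoted commas appear only in the last field, so fields 1..k split cleanly on commas:

```shell
# toy stand-in for the real file: k=2 key columns instead of 29
cat > input.csv <<'EOF'
h1,h2,h3,h4
a,b,longer,"really, long"
a,b,short,"no"
c,d,x,"y"
EOF

k=2
tab=$(printf '\t')

result=$( { head -n 1 input.csv                  # header stays out of the sort
  tail -n +2 input.csv |
  awk -F, -v k="$k" '{
      key = $1
      for (i = 2; i <= k; i++) key = key FS $i   # fields 1..k have no quoted commas
      rest = substr($0, length(key) + 2)         # raw text after the k-th comma
      printf "%s\t%d\t%s\n", key, length(rest), $0
  }' |
  sort -t "$tab" -k1,1 -k2,2n |                  # shortest tail first within each key
  awk -F '\t' '!seen[$1]++ { print $3 }'         # first row per key wins
} )
printf '%s\n' "$result"
```

Rows come out sorted by key rather than in input order, which the thread later says is acceptable.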

---------- Post updated at 05:14 PM ---------- Previous update was at 05:13 PM ----------

Quote:
Originally Posted by rdrtx1
try:
Code:
awk '
NR==1 { a[cnt++]=$0 }
NR>1 {
   s=""; for (i=30; i<=NF; i++) s=s","$i;
   t=$0; NF=29;
   if (! b[$0]) {
      a[cnt++]=t; b[$0]=$0; d[$0]=s;
   } else {
      if (length(s) < length(d[$0])) {a[cnt-1]=t; d[$0]=s;};
   }
}
END {for (i=0; i<cnt; i++) print a[i]}' FS="," OFS="," infile

The output file is identical to the input file (number of rows and row order).

Mike
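The logic above is close, but there is one subtle bug: `a[cnt-1]` points at the most recently stored row, which is not necessarily the row that first carried this key (any unique row seen in between shifts it). Remembering each key's slot in a `pos[]` array fixes that. A sketch, with the key width parameterized as `k` (29 in the real file, 2 in the toy data below) so it can be tried on narrow input; like the original, it assumes quoted commas appear only past column `k`, and a POSIX awk where assigning `NF` rebuilds `$0`:

```shell
cat > infile <<'EOF'
h1,h2,h3,h4
a,b,longer,stuff
c,d,x,y
a,b,s,t
EOF

result=$(awk -v k=2 '
NR==1 { a[cnt++]=$0 }                        # header
NR>1 {
   s=""; for (i=k+1; i<=NF; i++) s=s","$i    # tail columns, rejoined verbatim
   t=$0; NF=k                                # $0 collapses to the key columns
   if (!($0 in pos)) {
      pos[$0]=cnt; a[cnt++]=t; d[$0]=s       # first sighting: remember its slot
   } else if (length(s) < length(d[$0])) {
      a[pos[$0]]=t; d[$0]=s                  # shorter tail: overwrite the right slot
   }
}
END { for (i=0; i<cnt; i++) print a[i] }' FS="," OFS="," infile)
printf '%s\n' "$result"
```

With the toy data, the `a,b,s,t` row correctly replaces `a,b,longer,stuff` even though `c,d,x,y` was stored in between; the original `a[cnt-1]` version would have clobbered the `c,d` row instead.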

---------- Post updated at 05:26 PM ---------- Previous update was at 05:14 PM ----------

Quote:
Originally Posted by shamrock
Post a sample of the input and output wanted...
Code:
Input:
header 1,header2,header3,header4,...,header29,header30,header31
vvv,www,xxx,yyy,...,zzz,longer,"really, darn, long, entry"
vvv,www,xxx,yyy,...,zzz,short,"not,so,long"
123,yyy,zzz,aaa,...,bbb,short ,"really, darn, long, entry"
123,yyy,zzz,aaa,...,bbb,longer ,"really, darn, long, entry"
123,yyy,456,aaa,...,bbb,short,"really, darn, long, entry"

Output: (sorting would be nice but not required unless the implementation requires it)
header 1,header2,header3,header4,...,header29,header30,header31
vvv,www,xxx,yyy,...,zzz,short,"not,so,long"
123,yyy,zzz,aaa,...,bbb,short ,"really, darn, long, entry"
123,yyy,456,aaa,...,bbb,short,"really, darn, long, entry"

When the first 29 columns of two rows are identical, keep the row where the total length of the last two columns concatenated is shortest.
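Given that sample, one order-preserving pure-awk sketch (untested on the real file; `k` would be 29 there, 2 in the toy data below; it relies on quoted commas occurring only after the key columns, so `$1..$k` split cleanly):

```shell
cat > sample.csv <<'EOF'
h1,h2,h3,h4
v,w,longer,"really, darn, long, entry"
v,w,short,"not,so,long"
1,y,short,"really, darn, long, entry"
EOF

result=$(awk -F, -v k=2 '
NR==1 { print; next }                       # header passes through untouched
{
    key = $1
    for (i = 2; i <= k; i++) key = key FS $i
    rest = substr($0, length(key) + 2)      # raw text after the k-th comma
    if (!(key in best)) ord[++n] = key      # remember first-seen key order
    if (!(key in best) || length(rest) < length(best[key]))
        best[key] = rest                    # keep the shortest tail per key
}
END { for (i = 1; i <= n; i++) print ord[i] FS best[ord[i]] }' sample.csv)
printf '%s\n' "$result"
```

Using `substr` for the tail rather than rejoining fields keeps the quoted commas in column 31 byte-for-byte intact.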


Last edited by Michael Stora; 03-19-2013 at 09:33 PM..
# 6  
Old 03-20-2013
The above proposal was meant as a starting point, to be adapted and refined. From a nearly three-year member, the reaction is highly disappointing.
# 7  
Old 03-20-2013
Quote:
Originally Posted by RudiC
The above proposal was meant as a starting point, to be adapted and refined. From a nearly three-year member, the reaction is highly disappointing.
Sorry to disappoint you. I am pretty inexperienced with awk but trying to learn. I generally turn to this forum only for awk questions, as I know how powerful it is but still find the syntax cryptic.
As far as scripting itself goes, I've written some pretty good Bash scripts over the years, but I do it so infrequently that when I start a project I need to relearn a lot of things; most come back quickly. One of these days I will have to learn awk from the bottom up.

Mike