Duplicate line removal matching some columns only


 
# 1  
Old 03-19-2013

I'm looking to remove duplicate rows from a CSV file with a twist.

The first row is a header.
There are 31 columns. I want to remove duplicates when the first 29 columns are identical, ignoring columns 30 and 31, BUT the duplicate that is kept should be the one with the shortest total character length of columns 30 and 31 combined.
Column 31 can contain commas that are not delimiters (the field is quoted).

Example
col1,col2,col3,...,col29,col30,"col31, may have some commas in it, but they are within quotes"

Mike
# 2  
Old 03-19-2013
Untested, just dreamed up:
Code:
awk -F, '{$30=length ($30$31) FS $30}1' file | sort -u | awk '{$30 = ""; $0=$0; $1=$1}1'

# 3  
Old 03-19-2013
try:
Code:
awk '
NR==1 { a[cnt++]=$0 }
NR>1 {
   s=""; for (i=30; i<=NF; i++) s=s","$i;
   t=$0; NF=29;
   if (! b[$0]) {
      a[cnt++]=t; b[$0]=$0; d[$0]=s;
   } else {
      if (length(s) < length(d[$0])) {a[cnt-1]=t; d[$0]=s;};
   }
}
END {for (i=0; i<cnt; i++) print a[i]}' FS="," OFS="," infile

# 4  
Old 03-19-2013
Post a sample of the input and output wanted...
# 5  
Old 03-19-2013
Quote:
Originally Posted by RudiC
Untested, just dreamed up:
Code:
awk -F, '{$30=length ($30$31) FS $30}1' file | sort -u | awk '{$30 = ""; $0=$0; $1=$1}1'

Mangles the data, loses the delimiters, and puts the header row at the bottom.

Mike
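For what it's worth, the decorate-sort-undecorate idea above can be salvaged by sorting on an explicit key column instead of the whole line, and by holding the header out of the pipeline. A sketch, not tested against the real 31-column file: `k` stands in for the 29 key columns (2 here so it can be tried on narrow toy data), and it assumes quoted commas appear only in the last field, so fields 1..k split cleanly on commas:

```shell
# toy stand-in for the real file: k=2 key columns instead of 29
cat > input.csv <<'EOF'
h1,h2,h3,h4
a,b,longer,"really, long"
a,b,short,"no"
c,d,x,"y"
EOF

k=2
tab=$(printf '\t')

result=$( { head -n 1 input.csv                  # header stays out of the sort
  tail -n +2 input.csv |
  awk -F, -v k="$k" '{
      key = $1
      for (i = 2; i <= k; i++) key = key FS $i   # fields 1..k have no quoted commas
      rest = substr($0, length(key) + 2)         # raw text after the k-th comma
      printf "%s\t%d\t%s\n", key, length(rest), $0
  }' |
  sort -t "$tab" -k1,1 -k2,2n |                  # shortest tail first within each key
  awk -F '\t' '!seen[$1]++ { print $3 }'         # first row per key wins
} )
printf '%s\n' "$result"
```

Rows come out sorted by key rather than in input order, which the thread later says is acceptable.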

---------- Post updated at 05:14 PM ---------- Previous update was at 05:13 PM ----------

Quote:
Originally Posted by rdrtx1
try:
Code:
awk '
NR==1 { a[cnt++]=$0 }
NR>1 {
   s=""; for (i=30; i<=NF; i++) s=s","$i;
   t=$0; NF=29;
   if (! b[$0]) {
      a[cnt++]=t; b[$0]=$0; d[$0]=s;
   } else {
      if (length(s) < length(d[$0])) {a[cnt-1]=t; d[$0]=s;};
   }
}
END {for (i=0; i<cnt; i++) print a[i]}' FS="," OFS="," infile

The output file is identical to the input file (number of rows and row order).

Mike
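The logic above is close, but there is one subtle bug: `a[cnt-1]` points at the most recently stored row, which is not necessarily the row that first carried this key (any unique row seen in between shifts it). Remembering each key's slot in a `pos[]` array fixes that. A sketch, with the key width parameterized as `k` (29 in the real file, 2 in the toy data below) so it can be tried on narrow input; like the original, it assumes quoted commas appear only past column `k`, and a POSIX awk where assigning `NF` rebuilds `$0`:

```shell
cat > infile <<'EOF'
h1,h2,h3,h4
a,b,longer,stuff
c,d,x,y
a,b,s,t
EOF

result=$(awk -v k=2 '
NR==1 { a[cnt++]=$0 }                        # header
NR>1 {
   s=""; for (i=k+1; i<=NF; i++) s=s","$i    # tail columns, rejoined verbatim
   t=$0; NF=k                                # $0 collapses to the key columns
   if (!($0 in pos)) {
      pos[$0]=cnt; a[cnt++]=t; d[$0]=s       # first sighting: remember its slot
   } else if (length(s) < length(d[$0])) {
      a[pos[$0]]=t; d[$0]=s                  # shorter tail: overwrite the right slot
   }
}
END { for (i=0; i<cnt; i++) print a[i] }' FS="," OFS="," infile)
printf '%s\n' "$result"
```

With the toy data, the `a,b,s,t` row correctly replaces `a,b,longer,stuff` even though `c,d,x,y` was stored in between; the original `a[cnt-1]` version would have clobbered the `c,d` row instead.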

---------- Post updated at 05:26 PM ---------- Previous update was at 05:14 PM ----------

Quote:
Originally Posted by shamrock
Post a sample of the input and output wanted...
Code:
Input:
header 1,header2,header3,header4,...,header29,header30,header31
vvv,www,xxx,yyy,...,zzz,longer,"really, darn, long, entry"
vvv,www,xxx,yyy,...,zzz,short,"not,so,long"
123,yyy,zzz,aaa,...,bbb,short ,"really, darn, long, entry"
123,yyy,zzz,aaa,...,bbb,longer ,"really, darn, long, entry"
123,yyy,456,aaa,...,bbb,short,"really, darn, long, entry"

Output: (sorting would be nice but not required unless the implementation requires it)
header 1,header2,header3,header4,...,header29,header30,header31
vvv,www,xxx,yyy,...,zzz,short,"not,so,long"
123,yyy,zzz,aaa,...,bbb,short ,"really, darn, long, entry"
123,yyy,456,aaa,...,bbb,short,"really, darn, long, entry"

When the first 29 columns of two rows are identical, keep the row where the total length of the last two columns concatenated is shortest.
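Given that sample, one order-preserving pure-awk sketch (untested on the real file; `k` would be 29 there, 2 in the toy data below; it relies on quoted commas occurring only after the key columns, so `$1..$k` split cleanly):

```shell
cat > sample.csv <<'EOF'
h1,h2,h3,h4
v,w,longer,"really, darn, long, entry"
v,w,short,"not,so,long"
1,y,short,"really, darn, long, entry"
EOF

result=$(awk -F, -v k=2 '
NR==1 { print; next }                       # header passes through untouched
{
    key = $1
    for (i = 2; i <= k; i++) key = key FS $i
    rest = substr($0, length(key) + 2)      # raw text after the k-th comma
    if (!(key in best)) ord[++n] = key      # remember first-seen key order
    if (!(key in best) || length(rest) < length(best[key]))
        best[key] = rest                    # keep the shortest tail per key
}
END { for (i = 1; i <= n; i++) print ord[i] FS best[ord[i]] }' sample.csv)
printf '%s\n' "$result"
```

Using `substr` for the tail rather than rejoining fields keeps the quoted commas in column 31 byte-for-byte intact.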


Last edited by Michael Stora; 03-19-2013 at 09:33 PM..
# 6  
Old 03-20-2013
The above proposal was meant as a starting point, to be adapted and refined. From a nearly three-year member, the reaction is highly disappointing.
# 7  
Old 03-20-2013
Quote:
Originally Posted by RudiC
The above proposal was meant as a starting point, to be adapted and refined. From a nearly three-year member, the reaction is highly disappointing.
Sorry to disappoint you. I am pretty inexperienced with awk but trying to learn. I generally turn to this forum only for awk questions, as I know how powerful it is but still find the syntax cryptic.
As far as scripting itself goes, I've written some pretty good Bash scripts over the years, but I do it so infrequently that when I start a project I need to relearn a lot of things; most come back quickly. One of these days I will have to learn awk from the bottom up.

Mike