perl/shell need help to remove duplicate lines from files


 
# 1  
Old 12-22-2010

Dear All,

I have multiple files, all in a single directory, each holding a number of records with more than 10 columns. Some column values are duplicated, and the duplicates may appear across different files.

I need help removing each line that contains a duplicate value, and storing the removed lines in a file with the same name plus a .dup extension.

Sample files
Input_file_001.txt
Code:
AAAAAC01            0397fa           AB2010120211200500000000200009904136515                 099999999999                 IUVSN11                                 MOB
          AAAAAA01  03981d           AB2010120211130100000007430009588004780                 888888888888888                                 GGGCZ11                     MOB                                              76457499048         3122
          BBBBBBB01  03982f           AB2010120211203400000000150009588000696                 909090909090909                                 KKKKKG11                     MOB                                              64325984725         4107
AAAAAC01            0396fa           AB2010120211200500000000200009904136515                 099999999999                 IUVSN11                                 MOB ------ contain duplicate value
          AAAAAA01  03901d           AB2010120211130100000007430009588004780                 888888888888888                                 GGGCZ11                     MOB                                              76457499048         3122 ------ contain duplicate value


Input_file_002.txt
Code:
CCCCCCA01  03981d           AB2010120211130100000007430009588004780                 11111111111118                                 GGGCZ11                     MOB                                              76457499048         3122
          BBBBBBB01  03932f           AB2010120211203400000000150009588000696                 909090909090909                                 KKKKKG11                     MOB                                              64325984725         4107 – contain duplicate values of first file

Needed output is something like this:
Input_file_001.txt
Code:
AAAAAC01            0397fa           AB2010120211200500000000200009904136515                 099999999999                 IUVSN11                                 MOB
          AAAAAA01  03981d           AB2010120211130100000007430009588004780                 888888888888888                                 GGGCZ11                     MOB                                              76457499048         3122
          BBBBBBB01  03982f           AB2010120211203400000000150009588000696                 909090909090909                                 KKKKKG11                     MOB                                              64325984725         4107

Input_file_001.txt.dup
Code:
AAAAAC01            0396fa           AB2010120211200500000000200009904136515                 099999999999                 IUVSN11                                 MOB 
          AAAAAA01  03901d           AB2010120211130100000007430009588004780                 888888888888888                                 GGGCZ11                     MOB                                              76457499048         3122


Input_file_002.txt
Code:
CCCCCCA01  03981d           AB2010120211130100000007430009588004780                 11111111111118                                 GGGCZ11                     MOB                                              76457499048         3122

Input_file_002.txt.dup
Code:
BBBBBBB01  03932f           AB2010120211203400000000150009588000696                 909090909090909                                 KKKKKG11                     MOB                                              64325984725         4107




Currently I'm using the following command to remove duplicates, but it is not able to store the duplicate lines in a .dup file:
Code:
awk '!x[substr($0,38,93), substr($0,94,141)]++' * > all_files_
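One way to keep the duplicates instead of dropping them is to redirect each repeated line inside the same awk pass: first occurrence of a key goes to one output, later occurrences go to FILENAME.dup. A minimal sketch (the demo file names, the field-2 key, and the .uniq/.dup suffixes are made up for illustration; swap in your substr($0,38,93)/substr($0,94,141) key and real file names):

```shell
# Throwaway demo inputs (names and contents invented for the example)
printf 'AAA key1 MOB\nBBB key2 MOB\nAAA key1 MOB\n' > demo_001.txt
printf 'CCC key3 MOB\nBBB key2 MOB\n'               > demo_002.txt

# Same idea as the one-liner above, but duplicates are saved, not dropped.
# seen[] is shared across all input files, so cross-file dups are caught.
# Here the key is field 2; for fixed-width data use the substr() ranges.
awk '{
    key = $2
    if (seen[key]++) print > (FILENAME ".dup")   # repeat: log it
    else             print > (FILENAME ".uniq")  # first occurrence: keep it
}' demo_001.txt demo_002.txt
```

After the run, demo_001.txt.uniq holds the first occurrence of each key, demo_001.txt.dup the repeats, and the second occurrence of key2 lands in demo_002.txt.dup even though the original is in the first file.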


# 2  
Old 12-22-2010
If we are talking about entirely duplicate lines, doing it all in memory so you can preserve order can get VM intensive. Are duplicate lines always in the same file? Here is a robust dup finder using sort:
Code:
for file in *.txt
do
  # collect one copy of each fully duplicated line in the file
  sort "$file" | uniq -d > "$file.dups"
  if [ ! -s "$file.dups" ]
  then
    rm -f "$file.dups"   # no duplicates: drop the empty report
  fi
done

You only get one copy of each dup.
# 3  
Old 12-24-2010
Duplicate lines may occur in different files.

A line needs to be identified as a duplicate when the values of substr($0,38,93) and substr($0,94,141) match.

Thank you.
# 4  
Old 12-24-2010
OK: A, the key is not the whole line, and B, duplicates occur across files; two complications. Reporting the duplicates means defining the original, especially for non-key data.
  • If two lines have identical keys but not identical payloads (the non-key fields), will file-name order plus order within the file pick a winner?
  • We need to survey all files for duplicate keys, then extract the uniques and winners to load, and the losers to report. Think of them as two important products, not picking favorites. While most days there may be no duplicates, if one day there are tons, you still want it to blast through.
  • There are two approaches to duplicate filtering. You can save every key in an associative array (a magic box that recalls by value, but possibly not robust in speed and stability at huge volume), or you can sort in key, priority order (more traditional and quite robust if you have the disk space): store just the last key, process the first line of every key, and log the others. Worked great on tape in 1960 with 16K of RAM! :-)
  • Tagging the duplicates by original file means adding the file name to every record: possible, but a bit of a luxury if not needed.
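The sort-based route in the last bullet can be sketched like this. A minimal demo with made-up two-field "key payload" records; the file names all.txt, winners.txt, and losers.txt are assumptions, and on real fixed-width data the key would be the character ranges rather than field 1:

```shell
# Demo data: records from two pretend files, already concatenated.
printf 'k2 fileA-line1\nk1 fileA-line2\nk1 fileB-line1\nk3 fileB-line2\n' > all.txt

# Stable sort on the key column so equal keys stay in input (priority) order,
# then keep the first line of every key and log the rest.
sort -s -k1,1 all.txt |
awk '{
    if ($1 == prev) print > "losers.txt"   # same key as previous line: a dup
    else            print > "winners.txt"  # first line of a new key: the winner
    prev = $1
}'
```

Because the sort is stable (-s), the fileA copy of k1 wins over the fileB copy, which is exactly the "order in file picks a winner" rule from the first bullet.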