List Duplicate

# 1
Old 06-27-2007

Hi All
This is not a class assignment. I would like to know how to write an awk script that
lists all the duplicate names in a file. Have a look below:
Sl No Name Dt of birth Location
1 aaa 1/01/1975 delhi
2 bbb 2/03/1977 mumbai
3 aaa 1/01/1976 mumbai
4 bbb 2/03/1975 chennai
5 aaa 1/01/1975 kolkatta
6 bbb 2/03/1977 bangalore

What I would like is: if the DOB is the same and the name is the same, then print all the details. I tried using the "uniq -D" command in a script, but could not succeed.
Thanks in advance for any guidance!
# 2  
Old 06-27-2007
You can do something like this:
Code:
# Sort on name (field 2) and DOB (field 3) so duplicate keys are adjacent
sort -k2,3 inputfile | \
awk '
   BEGIN { first_duplicate = 1 }
   {
     name = $2;
     dob  = $3;
     if (name == prv_name && dob == prv_dob) {
         # First duplicate in a group: also print the retained first record
         if (first_duplicate)
            print "\n" prv_rec;
         print $0;
         first_duplicate = 0;
     } else {
        # New key: remember this record in case duplicates follow
        prv_name = name;
        prv_dob  = dob;
        prv_rec  = $0;
        first_duplicate = 1;
     }
   }
'

Output for your sample data:
Code:
1 aaa 1/01/1975 delhi
5 aaa 1/01/1975 kolkatta

2 bbb 2/03/1977 mumbai
6 bbb 2/03/1977 bangalore
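
Regarding the "uniq -D" attempt: uniq -D (a GNU extension) prints all duplicated lines, but it compares entire lines, so the differing serial numbers and locations prevent any match. As a minimal sketch keyed on name and DOB only (assuming the column layout of your sample), a two-pass awk can do the same job without sorting:
Code:
# Pass 1 counts each name+DOB pair; pass 2 prints every line
# whose pair occurs more than once, keeping the input order.
awk 'NR==FNR { cnt[$2,$3]++; next } cnt[$2,$3] > 1' inputfile inputfile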

# 3  
Old 06-27-2007
Code:
nawk '{
  # build a composite key from name (field 2) and DOB (field 3);
  # SUBSEP is awk's built-in subscript separator
  idx = $2 SUBSEP $3
  # append this record to any previously seen records for the key
  arr[idx] = (idx in arr) ? arr[idx] ORS $0 : $0
  arrCnt[idx]++
}
END {
  # print only the groups seen more than once
  for (i in arr)
     if (arrCnt[i] > 1) print arr[i]
}' myInputFile

# 4  
Old 06-27-2007
The user asked for:
Quote:
Here what I would like is if the DOB is same and name is same then
print all the details.
In other words, he wants to output the lines when both the date
and the name are the same.

I have the following test data:
Code:
1 aaa 1/01/1975 delhi
2 bbb 2/03/1977 mumbai
3 aaa 1/01/1976 mumbai
4 bbb 2/03/1975 chennai
5 aaa 1/01/1975 kolkatta
6 xxx 1/01/1976 mumbai
7 bbb 2/03/1977 bangalore
8 aaa 1/01/1976 mumbai

Based on the requirement, the correct output should be:
Code:
3 aaa 1/01/1976 mumbai
6 xxx 1/01/1976 mumbai
8 aaa 1/01/1976 mumbai

Running Aigles' code:
Code:
1 aaa 1/01/1975 delhi
5 aaa 1/01/1975 kolkatta

3 aaa 1/01/1976 mumbai
8 aaa 1/01/1976 mumbai

2 bbb 2/03/1977 mumbai
7 bbb 2/03/1977 bangalore

Running Vgersh's code:
Code:
2 bbb 2/03/1977 mumbai
7 bbb 2/03/1977 bangalore
1 aaa 1/01/1975 delhi
5 aaa 1/01/1975 kolkatta
3 aaa 1/01/1976 mumbai
8 aaa 1/01/1976 mumbai

# 5  
Old 06-27-2007
Shell_Life,
'DOB and name' - not 'DOB and location'. That is, the second and third fields - not the third and fourth.
# 6  
Old 06-27-2007
Vgersh,
Thanks for clarifying.
I was under the impression that 'name' was Mumbai, Kolkatta, etc.
Great catch!
Cheers.
# 7  
Old 06-28-2007
When I read the question, I had in mind a solution using arrays, like that of vgersh99.
Eventually I tried to see whether it was easy to do without arrays, and that is the solution I posted.
vgersh99's solution is simpler and more readable.

I wanted to see the difference in performance between the two solutions on a large volume of data.
To do so, I adapted both solutions to count the duplicate file names on my system.

I built a file containing the list of all files (field 1: directory path, field 2: file name).
The resulting file contains approximately 64,000 duplicate file names.

Code:
# find / | sed 's!/\([^/]*\)$!/ \1!' > files.txt
# wc files.txt
  534733 1069473 34359804 files.txt
# head -10 files.txt
/ 
/ lost+found
/ home
/home/ lost+found
/home/ guest
/home/guest/ .sh_history
/home/ gseyjr
/home/gseyjr/ .profile
/home/ usertest
/home/usertest/ .profile
#

The solution with arrays:
Code:
$ cat dup1.sh
awk '
   { 
      Files[$2] = ($2 in Files) ? Files[$2] ORS $0 : $0; 
      FilesCnt[$2]++ 
   }
   END { 
      for (f in Files) {
         if (FilesCnt[f] > 1) {
            print Files[f];
            duplicates++;
         }
      }
      print "\nDuplicates : " duplicates;
   }
' files.txt
$ time dup1.sh > /dev/null
real    0m27.22s
user    0m26.74s
sys     0m0.40s
$

The solution without arrays:

The -T option of the sort command was required because there wasn't sufficient space available for work files on the current filesystem.
Code:
$ cat dup2.sh
sort -T /refiea/tmp -k2,2 files.txt |
awk '
   BEGIN { first_duplicate = 1 }
   {
     file = $2;
     if (file == prv_file) {
         if (first_duplicate) {
            print prv_rec;
            duplicates++
         }
         print $0;
         first_duplicate = 0;
     } else {
        prv_file = file;
        prv_rec  = $0;
        first_duplicate = 1;
     }
   }
   END {
      print "Duplicates : " duplicates;
   }
'
$ time dup2.sh > /dev/null
real    0m39.85s
user    0m2.92s
sys     0m0.10s
$

In fact, the sort alone takes more time to run than the complete solution with arrays.
Code:
$ time sort -T /refiea/tmp -k2,2 files.txt > /dev/null
real   33.06
user   32.28
sys    0.73
$

Conclusion:

The arrays win the contest: the array solution reads the data in a single pass and groups records in an in-memory hash, while the sort-based pipeline must order the entire file first, and that sort dominates its run time. (The trade-off is memory: the awk array must hold the data in RAM, while sort can spill to work files.)

Awk arrays are your friends. They are easy to use and powerful.