List Duplicate

# 1
Old 06-27-2007

Hi All
This is not a class assignment. I would like to know how to write an awk script that
lists all the duplicate names in a file. Have a look below:
Sl No Name Dt of birth Location
1 aaa 1/01/1975 delhi
2 bbb 2/03/1977 mumbai
3 aaa 1/01/1976 mumbai
4 bbb 2/03/1975 chennai
5 aaa 1/01/1975 kolkatta
6 bbb 2/03/1977 bangalore

What I would like is: if the DOB is the same and the name is the same, then print all the details. I tried using the "uniq -D" command in a script, but could not succeed.
Thanks in advance for any guidance!
# 2  
Old 06-27-2007
You can do something like this:
Code:
# Sort on name (field 2) and DOB (field 3) so duplicate keys are adjacent
sort -k2,3 inputfile | \
awk '
   BEGIN { first_duplicate = 1 }
   {
     name = $2;
     dob  = $3;
     if (name == prv_name && dob == prv_dob) {
         # First duplicate in a group: also print the retained first record
         if (first_duplicate)
            print "\n" prv_rec;
         print $0;
         first_duplicate = 0;
     } else {
        # New key: remember this record in case duplicates follow
        prv_name = name;
        prv_dob  = dob;
        prv_rec  = $0;
        first_duplicate = 1;
     }
   }
'

Output for your sample data:
Code:
1 aaa 1/01/1975 delhi
5 aaa 1/01/1975 kolkatta

2 bbb 2/03/1977 mumbai
6 bbb 2/03/1977 bangalore
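
Regarding the "uniq -D" attempt: uniq -D (a GNU extension) prints all duplicated lines, but it compares entire lines, so the differing serial numbers and locations prevent any match. As a minimal sketch keyed on name and DOB only (assuming the column layout of your sample), a two-pass awk can do the same job without sorting:
Code:
# Pass 1 counts each name+DOB pair; pass 2 prints every line
# whose pair occurs more than once, keeping the input order.
awk 'NR==FNR { cnt[$2,$3]++; next } cnt[$2,$3] > 1' inputfile inputfile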

# 3  
Old 06-27-2007
Code:
nawk '{
  # build a composite key from name (field 2) and DOB (field 3);
  # SUBSEP is awk's built-in subscript separator
  idx = $2 SUBSEP $3
  # append this record to any previously seen records for the key
  arr[idx] = (idx in arr) ? arr[idx] ORS $0 : $0
  arrCnt[idx]++
}
END {
  # print only the groups seen more than once
  for (i in arr)
     if (arrCnt[i] > 1) print arr[i]
}' myInputFile

# 4  
Old 06-27-2007
The user asked for:
Quote:
Here what I would like is if the DOB is same and name is same then
print all the details.
In other words, he wants to output the lines when both the date
and the name are the same.

I have the following test data:
Code:
1 aaa 1/01/1975 delhi
2 bbb 2/03/1977 mumbai
3 aaa 1/01/1976 mumbai
4 bbb 2/03/1975 chennai
5 aaa 1/01/1975 kolkatta
6 xxx 1/01/1976 mumbai
7 bbb 2/03/1977 bangalore
8 aaa 1/01/1976 mumbai

Based on the requirement, the correct output should be:
Code:
3 aaa 1/01/1976 mumbai
6 xxx 1/01/1976 mumbai
8 aaa 1/01/1976 mumbai

Running Aigles' code:
Code:
1 aaa 1/01/1975 delhi
5 aaa 1/01/1975 kolkatta

3 aaa 1/01/1976 mumbai
8 aaa 1/01/1976 mumbai

2 bbb 2/03/1977 mumbai
7 bbb 2/03/1977 bangalore

Running Vgersh's code:
Code:
2 bbb 2/03/1977 mumbai
7 bbb 2/03/1977 bangalore
1 aaa 1/01/1975 delhi
5 aaa 1/01/1975 kolkatta
3 aaa 1/01/1976 mumbai
8 aaa 1/01/1976 mumbai

# 5  
Old 06-27-2007
Shell_Life,
'DOB and name' - not 'DOB and location'. That is, the second and third fields - not the third and fourth.
# 6  
Old 06-27-2007
Vgersh,
Thanks for clarifying.
I was under the impression that 'name' was Mumbai, Kolkatta, etc.
Great catch!
Cheers.
# 7  
Old 06-28-2007
When I read the question, I had in mind a solution using arrays, like that of vgersh99.
Eventually I tried to see whether it was easy to do without arrays, and that is the solution I posted.
vgersh99's solution is simpler and more readable.

I wanted to see the difference in performance between the two solutions on a large volume of data.
To do so, I adapted both solutions to count the duplicate file names on my system.

I built a file containing the list of all files (field 1: directory path, field 2: file name).
The resulting file contains approximately 64,000 duplicate file names.

Code:
# find / | sed 's!/\([^/]*\)$!/ \1!' > files.txt
# wc files.txt
  534733 1069473 34359804 files.txt
# head -10 files.txt
/ 
/ lost+found
/ home
/home/ lost+found
/home/ guest
/home/guest/ .sh_history
/home/ gseyjr
/home/gseyjr/ .profile
/home/ usertest
/home/usertest/ .profile
#

The solution with arrays:
Code:
$ cat dup1.sh
awk '
   { 
      Files[$2] = ($2 in Files) ? Files[$2] ORS $0 : $0; 
      FilesCnt[$2]++ 
   }
   END { 
      for (f in Files) {
         if (FilesCnt[f] > 1) {
            print Files[f];
            duplicates++;
         }
      }
      print "\nDuplicates : " duplicates;
   }
' files.txt
$ time dup1.sh > /dev/null
real    0m27.22s
user    0m26.74s
sys     0m0.40s
$

The solution without arrays:

The -T option of the sort command was required because there wasn't sufficient space available for work files on the current filesystem.
Code:
$ cat dup2.sh
sort -T /refiea/tmp -k2,2 files.txt |
awk '
   BEGIN { first_duplicate = 1 }
   {
     file = $2;
     if (file == prv_file) {
         if (first_duplicate) {
            print prv_rec;
            duplicates++
         }
         print $0;
         first_duplicate = 0;
     } else {
        prv_file = file;
        prv_rec  = $0;
        first_duplicate = 1;
     }
   }
   END {
      print "Duplicates : " duplicates;
   }
'
$ time dup2.sh > /dev/null
real    0m39.85s
user    0m2.92s
sys     0m0.10s
$

In fact, the sort alone takes more time to run than the complete solution with arrays.
Code:
$ time sort -T /refiea/tmp -k2,2 files.txt > /dev/null
real   33.06
user   32.28
sys    0.73
$

Conclusion:

The arrays win the contest: the array solution reads the data in a single pass and groups records in an in-memory hash, while the sort-based pipeline must order the entire file first, and that sort dominates its run time. (The trade-off is memory: the awk array must hold the data in RAM, while sort can spill to work files.)

Awk arrays are your friends. They are easy to use and powerful.