List Duplicate

# 8  
Old 06-28-2007
Nicely done analysis, aigles!
# 9  
Old 06-28-2007
I created a test file with 114,688 records:
Code:
    1 aaa 1/01/1975 delhi
    2 bbb 2/03/1977 mumbai
    3 aaa 1/01/1976 mumbai
    4 bbb 2/03/1975 chennai
    5 aaa 1/01/1975 kolkatta
    6 xxx 1/01/1976 mumbai
    7 aaa 1/01/1975 delhi
...
114686 xxx 1/01/1976 mumbai
114687 bbb 2/03/1977 bangalore
114688 aaa 1/01/1976 mumbai

Running vgersh99's arrays solution:
Code:
>time nawk_array.sh > /dev/null
real    9m2.96s
user    9m2.67s
sys     0m0.08s

Running Aigles' sort solution:
Code:
>time nawk_dups.sh > /dev/null
real    0m10.22s
user    0m2.55s
sys     0m0.03s

Guys, the 'sort' command is very optimized.
Arrays work great for a small number of occurrences or when constants must be used.
Otherwise, arrays should be used with caution, especially when several
thousand occurrences are involved.
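
For comparison, the sort-based approach generally has this shape. This is only a sketch, not aigles' actual nawk_dups.sh (which is not quoted in these posts); it assumes the duplicate key is fields 2 and 3, as in the sample data, and 'infile' stands for the data file:
Code:
sort -k2,3 infile |
nawk '{
  key = $2 SUBSEP $3
  if (key == prev) {             # same key as the previous record: a duplicate group
     if (!printed) print saved   # emit the group's first record once
     print
     printed = 1
  }
  else {
     saved   = $0                # remember the candidate first record of a new group
     printed = 0
  }
  prev = key
}'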
# 10  
Old 06-28-2007
Strange... your results are the opposite of mine.

I ran my test scripts again under AIX with the same result; my input file contains 534,733 records for a total size of 34 MB.

The solution with arrays is faster than the solution with sort (and again, the sort alone takes longer than the entire arrays solution).

Any idea to explain this mystery?
# 11  
Old 06-28-2007
Aigles,
Try to create a data set similar to the one I tested against, as follows:
Code:
1) Create a file 'A' with:
aaa 1/01/1975 delhi
bbb 2/03/1977 mumbai
aaa 1/01/1976 mumbai
bbb 2/03/1975 chennai
aaa 1/01/1975 kolkatta
xxx 1/01/1976 mumbai
aaa 1/01/1975 delhi
xxx 1/01/1976 mumbai
bbb 2/03/1977 bangalore
aaa 1/01/1976 mumbai

2) Keep repeating the following process until you get over 110,000 records (each pass doubles the file):
cp A B
cat B >> A

3) After you have the number of records you want:
cat -n A > B
sed 's/^I/ /' B > A    ### Replace the ctl-I (tab) after the numbers with a space.
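
Scripted, the doubling procedure above looks roughly like this; a minimal sketch, assuming the seed file 'A' and the 110,000-record threshold from the steps above:
Code:
# Double file A until it holds more than 110,000 records
while [ "$(wc -l < A)" -le 110000 ]
do
    cp A B
    cat B >> A                      # each pass doubles A
done
cat -n A > B                        # number the records
sed "s/$(printf '\t')/ /" B > A     # turn the tab after each number into a space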

# 12  
Old 06-28-2007
Duplicate

Dear Gurus,
What a great forum I am in! I really feel proud to be a member of it!

I thank aigles, Shell_Life, and vgersh99 from the bottom of my heart, because
I had been trying with the code (including by moving the previous record) to no
avail, and I totally forgot about the array. Once again, my sincere thanks to
all of you for sparing your time and providing me with the solution. One more
clarification I would like, though I have yet to test it:
Will the array approach work with a large volume?
Or will the simple script do the job? Would you please give some detail?
Something for my deeper knowledge.
# 13  
Old 06-29-2007
Quote:
Originally Posted by Shell_Life
Aigles,
Try to create a data set similar to the one I tested against, as follows:
Code:
1) Create a file 'A' with:
aaa 1/01/1975 delhi
bbb 2/03/1977 mumbai
aaa 1/01/1976 mumbai
bbb 2/03/1975 chennai
aaa 1/01/1975 kolkatta
xxx 1/01/1976 mumbai
aaa 1/01/1975 delhi
xxx 1/01/1976 mumbai
bbb 2/03/1977 bangalore
aaa 1/01/1976 mumbai

2) Keep repeating the following process until you get over 110,000 records (each pass doubles the file):
cp A B
cat B >> A

3) After you have the number of records you want:
cat -n A > B
sed 's/^I/ /' B > A    ### Replace the ctl-I (tab) after the numbers with a space.

I confirm your results:

Code:
$ wc -l vdup.txt
  163840 vdup.txt
$ time vdup_noarrays.sh

real    0m14.68s
user    0m7.45s
sys     0m0.04s
$ time vdup_arrays.sh

real    16m51.15s
user    16m41.74s
sys     0m0.21s
$

I think the problem doesn't come from the number of elements in the array.
In this test the array contains only 5 elements, but the elements are very large (up to 1,300 KB) and modified very often.

The situation was the reverse in my previous test:
there were more than 100,000 elements with a maximum size of 200 KB and a low update rate.
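
For context, the slow pattern being described is repeated string concatenation into a few very large array elements. A minimal sketch of that shape (an assumption about the earlier array script, which is not quoted in these posts):
Code:
nawk '{
  idx = $2 SUBSEP $3
  # Appending rewrites the whole stored string every time a key repeats,
  # so a few huge, frequently updated elements cost O(n^2) work overall.
  arr[idx] = (idx in arr) ? arr[idx] ORS $0 : $0
  cnt[idx]++
}
END {
  for (i in cnt)
     if (cnt[i] > 1)
        print arr[i]
}' vdup.txt > /dev/null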
# 14  
Old 06-29-2007
Arrays win!

A little modification to vgersh99's solution and arrays win!
Code:
nawk '{
  idx = $2 SUBSEP $3                # duplicate key: fields 2 and 3
  arr[idx, ++arrCnt[idx]] = $0      # store each record under its own (key, counter) subscript
}
END {
  for (i in arrCnt)
     if (arrCnt[i] > 1)             # only keys seen more than once are duplicates
        for (c = 1; c <= arrCnt[i]; c++)
           print arr[i, c]
}' vdup.txt > /dev/null
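
The decisive change is that no array element ever grows: each record is stored under its own (key, counter) subscript instead of being concatenated onto one ever-larger string per key, so the total work stays roughly linear in the number of records.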

Code:
$ time vdup_noarrays.sh 

real   6.11
user   2.75
sys    0.03
$ time vdup_arrays.sh

real   1008.69
user   1001.02
sys    0.21
$ time vdup_arrays2.sh    # Modified solution

real   5.74
user   5.55
sys    0.15
$
