List Duplicate

# 8  
Old 06-28-2007
Nicely done analysis, aigles!
# 9  
Old 06-28-2007
I created a test file with 114,688 records:
Code:
    1 aaa 1/01/1975 delhi
    2 bbb 2/03/1977 mumbai
    3 aaa 1/01/1976 mumbai
    4 bbb 2/03/1975 chennai
    5 aaa 1/01/1975 kolkatta
    6 xxx 1/01/1976 mumbai
    7 aaa 1/01/1975 delhi
...
114686 xxx 1/01/1976 mumbai
114687 bbb 2/03/1977 bangalore
114688 aaa 1/01/1976 mumbai

Running vgersh99's arrays solution:
Code:
>time nawk_array.sh > /dev/null
real    9m2.96s
user    9m2.67s
sys     0m0.08s

Running Aigles' sort solution:
Code:
>time nawk_dups.sh > /dev/null
real    0m10.22s
user    0m2.55s
sys     0m0.03s

Guys, the 'sort' command is very optimized.
Arrays work great for a small number of occurrences or when constants must be used.
Otherwise, arrays should be used with caution, especially when several
thousand occurrences are involved.
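
For comparison, the sort-based approach generally has this shape. This is only a sketch, not aigles' actual nawk_dups.sh (which is not quoted in these posts); it assumes the duplicate key is fields 2 and 3, as in the sample data, and 'infile' stands for the data file:
Code:
sort -k2,3 infile |
nawk '{
  key = $2 SUBSEP $3
  if (key == prev) {             # same key as the previous record: a duplicate group
     if (!printed) print saved   # emit the group's first record once
     print
     printed = 1
  }
  else {
     saved   = $0                # remember the candidate first record of a new group
     printed = 0
  }
  prev = key
}'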
# 10  
Old 06-28-2007
Strange... your results are the opposite of mine.

I ran my test scripts again under AIX with the same result; my input file contains 534,733 records for a total size of 34 MB.

The solution with arrays is faster than the solution with sort (and again, the sort alone takes longer than the entire arrays solution).

Any idea to explain this mystery?
# 11  
Old 06-28-2007
Aigles,
Try to create a data set similar to the one I tested against, as follows:
Code:
1) Create a file 'A' with:
aaa 1/01/1975 delhi
bbb 2/03/1977 mumbai
aaa 1/01/1976 mumbai
bbb 2/03/1975 chennai
aaa 1/01/1975 kolkatta
xxx 1/01/1976 mumbai
aaa 1/01/1975 delhi
xxx 1/01/1976 mumbai
bbb 2/03/1977 bangalore
aaa 1/01/1976 mumbai

2) Keep repeating the following process until you get over 110,000 records (each pass doubles the file):
cp A B
cat B >> A

3) After you have the number of records you want:
cat -n A > B
sed 's/^I/ /' B > A    ### Replace the ctl-I (tab) after the numbers with a space.
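
Scripted, the doubling procedure above looks roughly like this; a minimal sketch, assuming the seed file 'A' and the 110,000-record threshold from the steps above:
Code:
# Double file A until it holds more than 110,000 records
while [ "$(wc -l < A)" -le 110000 ]
do
    cp A B
    cat B >> A                      # each pass doubles A
done
cat -n A > B                        # number the records
sed "s/$(printf '\t')/ /" B > A     # turn the tab after each number into a space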

# 12  
Old 06-28-2007
Duplicate

Dear Gurus,
What a great forum I am in! I really feel proud to be a member of it!

I thank aigles, Shell_Life, and vgersh99 from the bottom of my heart, because
I had been trying with the code (including by moving the previous record) to no
avail, and I totally forgot about the array. Once again, my sincere thanks to
all of you for sparing your time and providing me with the solution. One more
clarification I would like, though I have yet to test it:
Will the array approach work with a large volume?
Or will the simple script do the job? Would you please give some detail?
Something for my deeper knowledge.
# 13  
Old 06-29-2007
Quote:
Originally Posted by Shell_Life
Aigles,
Try to create a data set similar to the one I tested against, as follows:
Code:
1) Create a file 'A' with:
aaa 1/01/1975 delhi
bbb 2/03/1977 mumbai
aaa 1/01/1976 mumbai
bbb 2/03/1975 chennai
aaa 1/01/1975 kolkatta
xxx 1/01/1976 mumbai
aaa 1/01/1975 delhi
xxx 1/01/1976 mumbai
bbb 2/03/1977 bangalore
aaa 1/01/1976 mumbai

2) Keep repeating the following process until you get over 110,000 records (each pass doubles the file):
cp A B
cat B >> A

3) After you have the number of records you want:
cat -n A > B
sed 's/^I/ /' B > A    ### Replace the ctl-I (tab) after the numbers with a space.

I confirm your results:

Code:
$ wc -l vdup.txt
  163840 vdup.txt
$ time vdup_noarrays.sh

real    0m14.68s
user    0m7.45s
sys     0m0.04s
$ time vdup_arrays.sh

real    16m51.15s
user    16m41.74s
sys     0m0.21s
$

I think the problem doesn't come from the number of elements in the array.
In this test the array contains only 5 elements, but the elements are very large (up to 1,300 KB) and modified very often.

The situation was the reverse in my previous test:
there were more than 100,000 elements with a maximum size of 200 KB and a low update rate.
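
For context, the slow pattern being described is repeated string concatenation into a few very large array elements. A minimal sketch of that shape (an assumption about the earlier array script, which is not quoted in these posts):
Code:
nawk '{
  idx = $2 SUBSEP $3
  # Appending rewrites the whole stored string every time a key repeats,
  # so a few huge, frequently updated elements cost O(n^2) work overall.
  arr[idx] = (idx in arr) ? arr[idx] ORS $0 : $0
  cnt[idx]++
}
END {
  for (i in cnt)
     if (cnt[i] > 1)
        print arr[i]
}' vdup.txt > /dev/null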
# 14  
Old 06-29-2007
Arrays win!

A little modification to vgersh99's solution and arrays win!
Code:
nawk '{
  idx = $2 SUBSEP $3                # duplicate key: fields 2 and 3
  arr[idx, ++arrCnt[idx]] = $0      # store each record under its own (key, counter) subscript
}
END {
  for (i in arrCnt)
     if (arrCnt[i] > 1)             # only keys seen more than once are duplicates
        for (c = 1; c <= arrCnt[i]; c++)
           print arr[i, c]
}' vdup.txt > /dev/null
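
The decisive change is that no array element ever grows: each record is stored under its own (key, counter) subscript instead of being concatenated onto one ever-larger string per key, so the total work stays roughly linear in the number of records.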

Code:
$ time vdup_noarrays.sh 

real   6.11
user   2.75
sys    0.03
$ time vdup_arrays.sh

real   1008.69
user   1001.02
sys    0.21
$ time vdup_arrays2.sh    # Modified solution

real   5.74
user   5.55
sys    0.15
$
