Duplicate


 
# 1  
Old 07-08-2009
Duplicate

I am looking for a way to delete duplicate entries in a VERY large file (approx. 2 GB).

However, I need to compare several fields before deciding whether a line is a duplicate. I set up a hash in Perl, but it does not seem to work correctly.

Any help appreciated.

Of the 19 comma-separated fields, I need to check the 2nd, 5th and 10th to determine whether a line is truly a duplicate. I cannot use the whole line, as some of the other fields vary.


I have tried several iterations of hashes in Perl, for example:

my $USER=(split /,/)[1];
my $DAY=(split /,/)[4];
my $NUM=(split /,/)[9];

if (exists $seen{$USER} && exists $seen{$DAY} && exists $seen{$NUM})
.....


This does not seem to work at all. I'm a newbie at Perl, so any help is appreciated.

Thanks
G

Last edited by Goyde; 07-08-2009 at 06:50 PM.. Reason: clarity
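One possible reason the check above misfires, assuming each field value was stored as its own key in %seen (the post does not show that part): the test only asks whether each value has appeared anywhere before, not whether the three appeared together on one line. The sketch below uses made-up three-field records and hypothetical helper names to compare that approach with a single combined key.

```perl
use strict;
use warnings;

# Hypothetical records: user, day, number.
my @records = (
    [ 'alice', 'mon', '1' ],
    [ 'bob',   'tue', '2' ],
    [ 'alice', 'tue', '1' ],    # a new combination, not a duplicate
);

# Assumed approach: every field value is its own key in one hash.
sub count_flagged_separately {
    my @recs = @_;
    my %seen;
    my $flagged = 0;
    for my $r (@recs) {
        my ($user, $day, $num) = @$r;
        $flagged++
            if exists $seen{$user} && exists $seen{$day} && exists $seen{$num};
        $seen{$_} = 1 for ($user, $day, $num);
    }
    return $flagged;
}

# Combined approach: the three fields together form one key.
sub count_flagged_combined {
    my @recs = @_;
    my %seen;
    my $flagged = 0;
    for my $r (@recs) {
        my $key = join(',', @$r);
        $flagged++ if $seen{$key}++;
    }
    return $flagged;
}

print "separate keys flagged: ", count_flagged_separately(@records), "\n";
print "combined key flagged:  ", count_flagged_combined(@records), "\n";
```

With separate keys the third record is flagged even though that exact combination never occurred before; with one combined key it is not.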
# 2  
Old 07-08-2009
I guess it should be something like the code below, if I have not misunderstood you.

Code:
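use strict;
use warnings;

my %hash;    # keys already seen; must be a hash (%), not a scalar ($)
# <> reads the file named on the command line, or standard input
while (<>) {
	my @tmp = split(",", $_);
	# 2nd, 5th and 10th fields are indices 1, 4 and 9 (counting from 0);
	# the commas keep "ab","c" and "a","bc" from producing the same key
	my $key = sprintf("%s,%s,%s", $tmp[1], $tmp[4], $tmp[9]);
	if (not exists $hash{$key}) {
		print;    # first occurrence of this combination: keep the line
		$hash{$key} = 1;
	}
}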

# 3  
Old 07-09-2009
OK, that worked! I am surprised, honestly. I guess I need to understand more about hashes. Also, I am curious about the use of sprintf.


Excellent!
Thanks.

G
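On the sprintf question: with a format made only of %s placeholders, sprintf just glues the values into one string, which then serves as a single hash key. A minimal sketch of that idea follows; the helper names and sample values are made up for illustration. It also shows why a separator between the values is a safer choice. A comma works here because fields produced by splitting on "," cannot contain one.

```perl
use strict;
use warnings;

# Key built by plain concatenation, as "%s%s%s" does.
sub key_plain {
    my ($x, $y, $z) = @_;
    return sprintf("%s%s%s", $x, $y, $z);
}

# Key built with a comma between the values.
sub key_separated {
    my ($x, $y, $z) = @_;
    return join(",", $x, $y, $z);
}

# Two different field combinations (hypothetical values).
my @a = ("ab", "c",  "1");
my @b = ("a",  "bc", "1");

print "plain keys collide\n"    if key_plain(@a) eq key_plain(@b);
print "separated keys differ\n" if key_separated(@a) ne key_separated(@b);
```

Both messages print: without a separator the two combinations collapse into the same key, so the second line would be dropped as a false duplicate.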