Delete Duplicate line (not really) from the file


 
# 1  
Old 03-28-2012

I need help figuring out how to delete lines in a data file. The data file is huge. I am currently using "vi" to search for and delete the lines, which is cumbersome because saving the file takes a long time (due to its huge size).

Here is the issue. I have a data file with the following data, separated by "|":
Code:
fld1|xxx|yyy|zzz|aaa|bbb|ccc|
fld2|qqq|www|eee|rrr|ttt|yyy|
fld3|aaa|sss|ddd|fff|ggg|hhh|
fld4|zzz|xxx|ccc|vvv|bbb|nnn|
fld2|qqq|www|eee|rrr|ooo|yyy|


I want to remove the line that is almost a duplicate, which is line #5. Lines #2 and #5 are almost duplicates, but the sixth field is different. I need to search only on the 1st field of the record (which in this case is "fld2") and then delete the 2nd occurrence of the same 1st field.

Can this be done? If yes, how? The file contains around 500K rows.

Last edited by Franklin52; 04-03-2012 at 10:25 AM.. Reason: Please use code tags for data and code samples, thank you
# 2  
Old 03-28-2012
Code:
awk -F\| '{if(!y[$1]) print y[$1]=$0}' file

# 3  
Old 03-28-2012
Thanks ShamRock.

It works on the test file which I posted.

However, when I tried it on my actual file of around 5.3 million rows, it stripped out 600K rows, which seems wrong: when I load the file into my database, it complains about only 3 rows. So ideally the difference between the original file and the new file (created by redirecting the awk output) should be 3 rows. Those 3 rows are stripped out, but I am not sure why the other rows were stripped out. I checked a few of them and there were no duplicates for them in the original file.

I might be missing something, which I am investigating now. But can you explain your "awk" script? And if I need to check one more field, how do I add it to the awk script?
# 4  
Old 03-28-2012
Do you get the same issue with this?
Code:
awk -F\| '!y[$1]++' file

To just see lines it would remove:
Code:
awk -F\| 'y[$1]++' file
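
On how these one-liners work: y is an associative array keyed on the first field. y[$1]++ evaluates to the count before the increment, so it is 0 (false) the first time a key is seen and non-zero afterwards. With the leading "!", the pattern is true only on the first sighting, and a true pattern with no action prints the line. The script in post #2 does much the same thing, except that it stores the whole line in y[$1] instead of a counter. To check one more field, the key can be built from both fields. The following is only a sketch: the file name sample.txt, the array name seen, and the extra last row are made up for illustration, and it assumes fields 1 and 2 together are what should be unique.

```shell
# Sample data from post #1, plus one hypothetical extra row (the last line)
# that repeats field 1 of the fld3 row but has a different field 2.
cat > sample.txt <<'EOF'
fld1|xxx|yyy|zzz|aaa|bbb|ccc|
fld2|qqq|www|eee|rrr|ttt|yyy|
fld3|aaa|sss|ddd|fff|ggg|hhh|
fld4|zzz|xxx|ccc|vvv|bbb|nnn|
fld2|qqq|www|eee|rrr|ooo|yyy|
fld3|kkk|sss|ddd|fff|ggg|hhh|
EOF

# Key on field 1 only: keep the first line seen for each distinct $1.
by_one=$(awk -F'|' '!seen[$1]++' sample.txt)

# Key on fields 1 and 2 together. Putting FS between the parts keeps
# keys such as "a" + "bc" and "ab" + "c" from colliding.
by_two=$(awk -F'|' '!seen[$1 FS $2]++' sample.txt)

printf 'Keyed on field 1:\n%s\n\nKeyed on fields 1 and 2:\n%s\n' "$by_one" "$by_two"
```

With the combined key, the second fld2 line is still dropped because both of its first two fields repeat, while the made-up fld3|kkk row is kept.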

# 5  
Old 04-02-2012
ShamRock's suggestion worked fine.
I figured it out - it was not the first field that needed to be checked for duplicates but the second field. I changed ShamRock's awk script appropriately and it worked like a charm.

Code:
awk -F\| '{if(!y[$2]) print y[$2]=$0}' old_file > new_file

Moderator's Comments: Use code tags for code, please.

Last edited by Corona688; 04-02-2012 at 04:11 PM..
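
Since the 600K dropped rows in post #3 came from keying on the wrong field, it may help to count how many lines each candidate field would remove before rewriting the file. This is a rough sketch only: the helper name count_dups, the file sample2.txt, and its contents are hypothetical and not taken from the original data.

```shell
# Hypothetical data: field 1 repeats often, field 2 repeats only once.
cat > sample2.txt <<'EOF'
A|101|x|
A|102|y|
B|103|z|
B|101|w|
A|104|v|
EOF

# count_dups FIELD FILE
# Prints how many lines would be dropped if FIELD were used as the key,
# i.e. how many lines carry a key value that already appeared earlier.
count_dups() {
  awk -F'|' -v k="$1" 'seen[$k]++ { n++ } END { print n + 0 }' "$2"
}

echo "keying on field 1 would drop $(count_dups 1 sample2.txt) line(s)"
echo "keying on field 2 would drop $(count_dups 2 sample2.txt) line(s)"
```

If the count for a field is far above the number of rows the database rejects, that field is probably not the one that has to be unique.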