To remove duplicates from pipe delimited file


 
# 1  
Old 10-19-2013

Hi, could someone please help me remove duplicates from a pipe-delimited file, based on the first two columns?

Code:
123|asdf|sfsd|qwrer
431|yui|qwer|opws
123|asdf|pol|njio
Here my first record and last record are duplicates. As per my requirement, I want only the latest record from each set of duplicates kept in the output file.

I want the output to look like this:

Code:
431|yui|qwer|opws
123|asdf|pol|njio

My file has around 20 million records, so I need a fast solution.
# 2  
Old 10-19-2013
Code:
sort -ut '|' -k 1,2 file.txt

# 3  
Old 10-19-2013
Thanks for the reply, but it's not working. It doesn't show any error, but it doesn't give the correct result either; it just displays whatever is in the file.
# 4  
Old 10-19-2013
Can you post some real sample data and, at the same time, state your OS and version?
# 5  
Old 10-19-2013
Here is an awk solution that doesn't require sorting:

Code:
awk -F"|" '!x[$1 $2]++' file.txt

# 6  
Old 10-19-2013
This is a much harder problem than it appears at first glance.
The sort solution proposed by danmero should give just one line for each set of lines with identical values in the first two fields, but which one is printed depends on the sort order of the remaining fields. The order ginkrf requested was that, for each set of lines with identical values in the first two fields, the last line in the (unsorted) file be printed.

The awk solution proposed by mjf will print the first line of each matching set instead of the last. (And, if the first two fields yield the same key when concatenated even though the fields themselves differ, some desired output lines may be skipped. For example, if $1 is "ab" and $2 is "c" in one record, and "a" and "bc" in another, both will have the key "abc".)

Since ginkrf didn't say whether the order of the lines in the output has to match the order in which they appeared in the input, I won't try to guess at an efficient way to do what has been requested. If the order is important, the input file could be reversed, fed through mjf's awk script (with !x[$1 $2]++ changed to !x[$1,$2]++), and the output then reversed again. Depending on the output constraints, this might or might not be grossly inefficient.
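A sketch of that reverse/filter/reverse idea (assuming GNU tac is available; on BSD systems tail -r plays the same role):

```shell
# Keep the LAST line for each (field1, field2) pair, preserving relative order:
# reverse the file, keep the first occurrence of each key, then reverse again.
# The comma-joined key ($1,$2 with SUBSEP) avoids the concatenation pitfall
# described above.
tac file.txt | awk -F'|' '!x[$1,$2]++' | tac
```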

If the output order is not important, it could be done easily with an awk script, but could require almost 400 MB of virtual address space to process 20 million 20-byte records.
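For what it's worth, a minimal sketch of such an awk script (the memory estimate above assumes roughly one stored copy of each unique record):

```shell
# Later lines overwrite earlier ones, so the last occurrence of each key wins;
# note the output order from "for (k in seen)" is unspecified.
awk -F'|' '{ seen[$1, $2] = $0 } END { for (k in seen) print seen[k] }' file.txt
```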

With a better description of the input (is there anything in a record other than its position in the input that can be used to determine which of several lines with the 1st two fields matching should be printed) and the output constraints, we might be able to provide a better solution. Are there ever more than two lines with the same 1st two fields? If yes, out of the 20 million input records, how many output records do you expect to be produced? Are there likely to be lots of lines that only have one occurrence of the 1st two fields? What are the file sizes (input and output) in bytes (instead of records)? What is the longest input line in bytes?

What OS and hardware are you using? How much memory? How much swap space?

Last edited by Don Cragun; 10-20-2013 at 10:31 AM.. Reason: Fix explanation of the possible failure of mjf's awk proposal.
# 7  
Old 10-20-2013
Hi.

We once needed code that would run on a number of different systems, yet produce consistent results. We ran into the situation that the uniq utility was not consistent among those systems, so we introduced an option:
Code:
--last
    allows over-writing, effectively keeping the most-recently
    seen instance. Some versions of uniq on other *nix systems
    keep the most recent instance (Solaris); the default is
    compatibility with GNU/Linux uniq, which keeps the first
    occurrence.

By substituting this idea for the system version of uniq, we were able to produce consistent results.

I think this problem can be approached with danmero's sort idea, but with the stable option set, plus a "final filter" that eliminates duplicates. Because the file is already sorted, no additional storage is needed: in the final filter, if the key fields of the incoming record differ from those of the saved record, write out the saved line and save the new line; if the fields are the same, just save the new instance of the line. Our code was in Perl, but awk could be used just as easily.
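A rough awk rendering of that stable-sort-plus-final-filter idea (assuming a sort that supports -s for a stable sort, e.g. GNU sort; the output comes out key-sorted rather than in original input order):

```shell
# Stable sort groups equal keys while preserving their input order,
# so the last line of each group is the most recent record.
sort -s -t'|' -k1,1 -k2,2 file.txt |
awk -F'|' '
    { key = $1 FS $2 }
    key != prev && NR > 1 { print saved }   # key changed: emit last line of group
    { saved = $0; prev = key }
    END { if (NR) print saved }             # emit the final group
'
```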

Best wishes ... cheers, drl

Last edited by drl; 10-20-2013 at 08:44 AM..