To remove duplicates from pipe delimited file Post: 302865765


Post 302865765 by Don Cragun on Saturday 19th of October 2013 04:18:49 PM


This is a much harder problem than it appears at first glance.
The sort solution proposed by danmero should give just one line for each set of lines with identical values in the 1st 2 fields, but which line is printed depends on the sort order of the remaining fields. What ginkrf requested was that the last line in the (unsorted) file be printed for each set of lines with identical values in the 1st two fields.

The awk solution proposed by mjf will print the 1st line of each matching set instead of the last line of each matching set. (And, if the 1st 2 fields yield the same key when concatenated even though the fields themselves are different, some desired output lines may be skipped. For example, if $1 is "ab" and $2 is "c" in one record, and $1 is "a" and $2 is "bc" in another, both records will have the key "abc".)
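The key collision is easy to reproduce. The following is only a sketch: the two-record sample and the variable names are made up for illustration, and the only thing taken from the thread is the `!x[$1 $2]++` test and its `!x[$1,$2]++` variant.

```shell
# Hypothetical two-record sample: the first two fields differ,
# but they concatenate to the same string "abc".
sample='ab|c|first
a|bc|second'

# Key built by plain concatenation: the second record is treated as a duplicate.
concat_out=$(printf '%s\n' "$sample" | awk -F'|' '!x[$1 $2]++')

# Key built with a comma subscript (awk joins the fields with SUBSEP):
# the two records get different keys, so both are printed.
subsep_out=$(printf '%s\n' "$sample" | awk -F'|' '!x[$1,$2]++')

printf 'concatenated key:\n%s\n\nSUBSEP key:\n%s\n' "$concat_out" "$subsep_out"
```

Both forms still keep the first line of each set, not the last, so the comma subscript only fixes the collision.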

Since ginkrf didn't say whether the order of the lines in the output has to match the order in which they appeared in the input, I won't try to guess at an efficient way to do what has been requested. If the order is important, the input file could be reversed, fed through mjf's awk script (with !x[$1 $2]++ changed to !x[$1,$2]++), and the order of the output then reversed again. Depending on the output constraints this might or might not be grossly inefficient.
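A rough sketch of that reverse, filter, reverse pipeline follows. The function names and the sample records are hypothetical. It assumes that either `tac` (GNU coreutils) or `tail -r` (BSD and macOS) is available to reverse the lines; neither is required by POSIX.

```shell
# Reverse the order of lines on stdin: tac if present, otherwise tail -r.
reverse_lines() {
    if command -v tac >/dev/null 2>&1; then
        tac
    else
        tail -r
    fi
}

# Keep the last line of each ($1,$2) set. On the reversed input the
# "first seen" line is the last one in the original file; the second
# reversal restores the original relative order of the kept lines.
keep_last_in_order() {
    reverse_lines | awk -F'|' '!x[$1,$2]++' | reverse_lines
}

# Hypothetical sample input.
printf '%s\n' 'a|b|1' 'c|d|2' 'a|b|3' 'c|d|4' 'e|f|5' | keep_last_in_order
```

For this sample the surviving lines are a|b|3, c|d|4 and e|f|5, in that order. Note that awk still holds one array entry per distinct key, and both reversals have to buffer or re-read the whole input, which is where the possible inefficiency comes from.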

If the output order is not important, it could be done easily with an awk script, but that could require almost 400 MB of virtual address space to process 20 million 20-byte records.
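One way such a script might look is sketched below. The function name and sample data are hypothetical, and this is not necessarily the script the post had in mind. It stores the most recent line seen for each pair of leading fields and prints the stored lines at the end.

```shell
# Keep the last line of each ($1,$2) set; output order is unspecified.
keep_last_any_order() {
    awk -F'|' '
        { last[$1,$2] = $0 }                    # a later line overwrites an earlier one with the same key
        END { for (k in last) print last[k] }   # awk does not define the traversal order of "for (k in ...)"
    '
}

# Hypothetical sample input; the sort is only there to make the demo output stable.
printf '%s\n' 'a|b|1' 'c|d|2' 'a|b|3' 'c|d|4' 'e|f|5' | keep_last_any_order | LC_ALL=C sort
```

The array holds one full line per distinct key, so the worst case for memory is an input in which every line has a unique pair of leading fields.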

With a better description of the input and the output constraints, we might be able to provide a better solution. Is there anything in a record, other than its position in the input, that can be used to determine which of several lines with the 1st two fields matching should be printed? Are there ever more than two lines with the same 1st two fields? If yes, out of the 20 million input records, how many output records do you expect to be produced? Are there likely to be lots of lines that only have one occurrence of the 1st two fields? What are the file sizes (input and output) in bytes (instead of records)? What is the longest input line in bytes?

What OS and hardware are you using? How much memory? How much swap space?

Last edited by Don Cragun; 10-20-2013 at 10:31 AM.. Reason: Fix explanation of the possible failure of mjf's awk proposal.
