To remove duplicates from pipe delimited file: post 302865765 by Don Cragun on Saturday 19th of October 2013 04:18:49 PM
This is a much harder problem than it appears at first glance.
The sort solution proposed by danmero should give just one line for each set of lines with identical values in the 1st two fields, but which line is printed depends on the sort order of the remaining fields. What ginkrf requested was that the last line in the (unsorted) file be printed for each set of lines with identical values in the 1st two fields.

The awk solution proposed by mjf will print the 1st line of each matching set instead of the last line of each matching set. (And, because the key is the concatenation of the 1st two fields, records whose fields differ can still yield the same key, so some desired output lines may be skipped. For example, if $1 is "ab" and $2 is "c" in one record and they are "a" and "bc" in another, both records will have key "abc".)
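To make the key collision concrete, here is a minimal sketch. The two sample records and the array name `seen` are made up for illustration; only the `$1 $2` versus `$1,$2` subscripts come from the discussion above.

```shell
# Two records whose 1st two fields differ but concatenate to the same string "abc".
sample='ab|c|first
a|bc|second'

# Concatenated key: both records map to "abc", so the second one is dropped.
concat_out=$(printf '%s\n' "$sample" | awk -F'|' '!seen[$1 $2]++')

# Comma subscript: awk joins the fields with SUBSEP, so the two keys stay distinct.
subsep_out=$(printf '%s\n' "$sample" | awk -F'|' '!seen[$1,$2]++')

printf 'concatenated key:\n%s\n\nSUBSEP key:\n%s\n' "$concat_out" "$subsep_out"
```

With the concatenated key only the "first" record survives; with the comma subscript both records are printed.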

Since ginkrf didn't say whether the order of the lines in the output has to match the order in which they appeared in the input, I won't try to guess at an efficient way to do what has been requested. If the order is important, the input file could be reversed, fed through mjf's awk script (with !x[$1 $2]++ changed to !x[$1,$2]++), and the output then reversed again. Depending on the output constraints, this might or might not be grossly inefficient.
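The reverse, filter, reverse idea might look roughly like the sketch below. The sample data is invented, and `tac` is assumed to be available (it is part of GNU coreutils; on BSD-style systems `tail -r` is the usual substitute). For a real file you would feed the file to `tac` instead of using `printf`.

```shell
# Made-up sample: keys k1|a and k3|c each appear twice.
input='k1|a|old
k2|b|only
k1|a|new
k3|c|old
k3|c|new'

# After the first reversal, the first line awk sees for each key is the
# last line for that key in the original input. The second tac restores
# the original relative order of the surviving lines.
last_of_each=$(printf '%s\n' "$input" | tac | awk -F'|' '!x[$1,$2]++' | tac)

printf '%s\n' "$last_of_each"
```

Note that the surviving lines come out in the order of their last occurrence in the input, and awk still has to hold one array entry per distinct key.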

If the output order is not important, it could be done easily with an awk script, but that could require almost 400 MB of virtual address space to process 20 million 20-byte records.
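One way such an order-agnostic awk script could be written is sketched here. This is not necessarily the script I had in mind when estimating memory, and the sample data and the array name `last` are made up for illustration.

```shell
# Made-up sample: keys k1|a and k3|c each appear twice.
input='k1|a|old
k2|b|only
k1|a|new
k3|c|old
k3|c|new'

# Keep the most recent line seen for each key, then dump the array at the end.
# The whole line for every distinct key stays in memory until END.
last_unordered=$(printf '%s\n' "$input" | awk -F'|' '
    { last[$1,$2] = $0 }                    # a later line overwrites an earlier one
    END { for (k in last) print last[k] }   # for-in traversal order is unspecified
')

printf '%s\n' "$last_unordered"
```

The output contains the last line for each key, but in whatever order awk happens to walk the array, which is why this only fits the case where output order does not matter.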

With a better description of the input (is there anything in a record, other than its position in the input, that can be used to determine which of several lines with matching 1st two fields should be printed?) and the output constraints, we might be able to provide a better solution. Are there ever more than two lines with the same 1st two fields? If yes, out of the 20 million input records, how many output records do you expect to be produced? Are there likely to be lots of lines that only have one occurrence of the 1st two fields? What are the file sizes (input and output) in bytes (instead of records)? What is the longest input line in bytes?

What OS and hardware are you using? How much memory? How much swap space?

Last edited by Don Cragun; 10-20-2013 at 10:31 AM. Reason: Fix explanation of the possible failure of mjf's awk proposal.
 
