Remove duplicate values in a column (not in the file)


 
# 1  
Old 08-22-2016

Hi Gurus,

I have a file (weblog) as below:

Code:
 
 abc|xyz|123|agentcode=sample code abcdeeess,agentcode=sample code abcdeeess,agentcode=sample code abcdeeess|agentadd=abcd stereet 23343,agentadd=abcd stereet 23343
 sss|wwq|999|agentcode=sample1 code wqwdeeess,gentcode=sample1 code wqwdeeess,gentcode=sample1 code wqwdeeess|agentadd=ssss stereet sssss,agentadd=ssss stereet sssss
 awe|rez|777|agentcode=sample2 code dfsdfeess,agentcode=sample2 code dfsdfeess,agentcode=sample2 code dfsdfeess|agentadd=tttt stereet ttttt,agentadd=tttt stereet ttttt
 twe|tez|555|agentcode=sample3 code ddddddddd,dddddd,agentcode=sample3 code ddddddddd,dddddd|agentadd=tttt stereet ttttt,agentadd=tttt stereet ttttt

I want to remove the duplicate values from columns 4 and 5. The same value may repeat within a field, delimited by commas, but commas can also appear inside the data itself.
My algorithm is to take columns 1, 2, 3 (which make the record unique), then split columns 4 and 5 on commas, remove the duplicates, and join the pieces back with commas (so that commas inside the data won't be lost).
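The idea above, as a rough (untested) awk sketch; this assumes the fields to clean are always 4 and 5 and that the file is named weblog:

```shell
# Rough sketch: split fields 4 and 5 on ",", keep the first
# occurrence of each subfield, rejoin the survivors with ",".
awk '
BEGIN { FS = OFS = "|" }
{
    for (f = 4; f <= 5; f++) {
        n = split($f, parts, ",")
        out = ""
        split("", seen)                 # clear the seen array (portable)
        for (i = 1; i <= n; i++)
            if (!(parts[i] in seen)) {
                seen[parts[i]]          # mark subfield as seen
                out = (out == "") ? parts[i] : out "," parts[i]
            }
        $f = out
    }
    print
}' weblog
```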

Is there a way to do this with awk or perl?

Output should be like:

Code:
 
 abc|xyz|123|agentcode=sample code abcdeeess|agentadd=abcd stereet 23343
 sss|wwq|999|agentcode=sample1 code wqwdeeess|agentadd=ssss stereet sssss
 awe|rez|777|agentcode=sample2 code dfsdfeess|agentadd=tttt stereet ttttt
 twe|tez|555|agentcode=sample3 code ddddddddd,dddddd|agentadd=tttt stereet ttttt

# 2  
Old 08-22-2016
Does the order of the resulting strings in fields 4 and 5 matter?
In other words, does it matter if the last line of the output shown above would be:
Code:
 twe|tez|555|dddddd,agentcode=sample3 code ddddddddd|agentadd=tttt stereet ttttt

instead of:
Code:
 twe|tez|555|agentcode=sample3 code ddddddddd,dddddd|agentadd=tttt stereet ttttt

It is easier and faster if the output order can be random; but it isn't hard to keep the input order if it matters.
# 3  
Old 08-22-2016
It could probably be simplified a bit, but it's a start...
awk -f rath.awk weblogFile where rath.awk is:
Code:
BEGIN {
  FS=OFS="|"
  fA[4];fA[5]          # the fields to de-duplicate
}
# return field f with duplicate comma-separated subfields removed
function uniq(f,   s,a,at,i)
{
   s=""
   split($f, a, ",")
   for(i in a)          # collect unique subfields as array keys
       at[a[i]]
   for(i in at)         # rebuild the field (iteration order is unspecified)
     s=(!s)? i:s "," i
   return(s)
}
{
   for(i=1; i<=NF; i++)
     printf("%s%s", (i in fA)?uniq(i):$i, (i==NF)?ORS:OFS)
}
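A quick sanity check with an inline sample record; note the subfields can come out in any order, since for (i in at) iterates in an unspecified order:

```shell
# Feed one small record through the script; fields 4 and 5
# should come out de-duplicated (subfield order unspecified).
printf 'a|b|c|x,y,x|p,p\n' | awk -f rath.awk
# e.g. a|b|c|x,y|p   (or a|b|c|y,x|p, depending on the awk)
```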

# 4  
Old 08-22-2016
Hi ratheeshjulk,
On my system, with the sample input provided in post #1 in this thread, vgersh99's code produces the output:
Code:
 abc|xyz|123|agentcode=sample code abcdeeess|agentadd=abcd stereet 23343
 sss|wwq|999|gentcode=sample1 code wqwdeeess,agentcode=sample1 code wqwdeeess|agentadd=ssss stereet sssss
 awe|rez|777|agentcode=sample2 code dfsdfeess|agentadd=tttt stereet ttttt
 twe|tez|555|dddddd,agentcode=sample3 code ddddddddd|agentadd=tttt stereet ttttt

(using random order for the subfields in the fields that are being processed for duplicate entries). Different versions of awk might produce different random orders.

The following similar awk script produces output with the order of subfields maintained: the first copy of a duplicated subfield is kept in place and later copies are dropped from the output:
Code:
awk '
# remove duplicate SFS-separated subfields from field number "field",
# keeping the first occurrence of each subfield in input order
function nodup(field,	f, n, loop, seen) {
	n = split($field, f, SFS)
	seen[$field = f[1]]	# reset the field to its 1st subfield; mark it seen
	for(loop = 2; loop <= n; loop++)
		if(!(f[loop] in seen)) {
			$field = $field SFS f[loop]
			seen[f[loop]]
		}
}
BEGIN {	FS = OFS = "|"
	SFS = ","	# subfield separator
	low = 4		# first field to process
	high = 5	# last field to process
}
{	for(i = low; i <= high; i++)
		nodup(i)
}
1' weblog

and produces the output:
Code:
 abc|xyz|123|agentcode=sample code abcdeeess|agentadd=abcd stereet 23343
 sss|wwq|999|agentcode=sample1 code wqwdeeess,gentcode=sample1 code wqwdeeess|agentadd=ssss stereet sssss
 awe|rez|777|agentcode=sample2 code dfsdfeess|agentadd=tttt stereet ttttt
 twe|tez|555|agentcode=sample3 code ddddddddd,dddddd|agentadd=tttt stereet ttttt

with the same input. Note that both of our outputs keep two subfields in field 4 of the second line: the difference between agentcode and gentcode means those subfields are not duplicates, even though only one of them appears in the output you said you wanted.

You haven't said what operating system you're using. If you are using a Solaris/SunOS system, you'll need to use /usr/xpg4/bin/awk or nawk instead of awk for both of our suggestions.

My code also assumes that the fields to be processed will always be adjacent, no matter how many fields in your real input files need to be processed; vgersh99's code lets you select any set of fields (contiguous or not). If the fields you want to process in your real files are not contiguous but you need to keep the output subfields in input order, it would be easy to modify my code to use the same field-selection scheme vgersh99 used.
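For example, a sketch of that modification (untested), marking the fields to process in an fA array the way vgersh99 did while keeping the order-preserving nodup():

```shell
awk '
# keep the first occurrence of each comma-separated subfield, in order
function nodup(field,	f, n, loop, seen) {
	n = split($field, f, SFS)
	seen[$field = f[1]]
	for(loop = 2; loop <= n; loop++)
		if(!(f[loop] in seen)) {
			$field = $field SFS f[loop]
			seen[f[loop]]
		}
}
BEGIN {	FS = OFS = "|"
	SFS = ","
	fA[4]; fA[5]	# list any field numbers here, contiguous or not
}
{	for(i = 1; i <= NF; i++)
		if(i in fA)
			nodup(i)
}
1' weblog
```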
# 5  
Old 08-23-2016
Thanks, the solution worked.