Remove duplicate values in a column (not in the file)


 
# 1  
Old 08-22-2016

Hi Gurus,

I have a file (weblog) as below:

Code:
 
 abc|xyz|123|agentcode=sample code abcdeeess,agentcode=sample code abcdeeess,agentcode=sample code abcdeeess|agentadd=abcd stereet 23343,agentadd=abcd stereet 23343
 sss|wwq|999|agentcode=sample1 code wqwdeeess,gentcode=sample1 code wqwdeeess,gentcode=sample1 code wqwdeeess|agentadd=ssss stereet sssss,agentadd=ssss stereet sssss
 awe|rez|777|agentcode=sample2 code dfsdfeess,agentcode=sample2 code dfsdfeess,agentcode=sample2 code dfsdfeess|agentadd=tttt stereet ttttt,agentadd=tttt stereet ttttt
 twe|tez|555|agentcode=sample3 code ddddddddd,dddddd,agentcode=sample3 code ddddddddd,dddddd|agentadd=tttt stereet ttttt,agentadd=tttt stereet ttttt

I want to remove the duplicate values from columns 4 and 5. The same value may repeat within a field, delimited by commas, but commas can also appear inside the data itself.
My algorithm is to take columns 1, 2, 3 (which make the record unique), then split columns 4 and 5 on commas, remove the duplicates, and join the pieces back with commas (so that commas inside the data won't be lost).
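The idea above, as a rough (untested) awk sketch; this assumes the fields to clean are always 4 and 5 and that the file is named weblog:

```shell
# Rough sketch: split fields 4 and 5 on ",", keep the first
# occurrence of each subfield, rejoin the survivors with ",".
awk '
BEGIN { FS = OFS = "|" }
{
    for (f = 4; f <= 5; f++) {
        n = split($f, parts, ",")
        out = ""
        split("", seen)                 # clear the seen array (portable)
        for (i = 1; i <= n; i++)
            if (!(parts[i] in seen)) {
                seen[parts[i]]          # mark subfield as seen
                out = (out == "") ? parts[i] : out "," parts[i]
            }
        $f = out
    }
    print
}' weblog
```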

Is there a way to do this with awk or perl?

Output should be like:

Code:
 
 abc|xyz|123|agentcode=sample code abcdeeess|agentadd=abcd stereet 23343
 sss|wwq|999|agentcode=sample1 code wqwdeeess|agentadd=ssss stereet sssss
 awe|rez|777|agentcode=sample2 code dfsdfeess|agentadd=tttt stereet ttttt
 twe|tez|555|agentcode=sample3 code ddddddddd,dddddd|agentadd=tttt stereet ttttt

# 2  
Old 08-22-2016
Does the order of the resulting strings in fields 4 and 5 matter?
In other words, does it matter if the last line of the output shown above would be:
Code:
 twe|tez|555|dddddd,agentcode=sample3 code ddddddddd|agentadd=tttt stereet ttttt

instead of:
Code:
 twe|tez|555|agentcode=sample3 code ddddddddd,dddddd|agentadd=tttt stereet ttttt

It is easier and faster if the output order can be random; but it isn't hard to keep the input order if it matters.
# 3  
Old 08-22-2016
It could probably be simplified a bit, but it's a start...
awk -f rath.awk weblogFile where rath.awk is:
Code:
BEGIN {
  FS=OFS="|"
  fA[4];fA[5]          # the fields to de-duplicate
}
# return field f with duplicate comma-separated subfields removed
function uniq(f,   s,a,at,i)
{
   s=""
   split($f, a, ",")
   for(i in a)          # collect unique subfields as array keys
       at[a[i]]
   for(i in at)         # rebuild the field (iteration order is unspecified)
     s=(!s)? i:s "," i
   return(s)
}
{
   for(i=1; i<=NF; i++)
     printf("%s%s", (i in fA)?uniq(i):$i, (i==NF)?ORS:OFS)
}
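A quick sanity check with an inline sample record; note the subfields can come out in any order, since for (i in at) iterates in an unspecified order:

```shell
# Feed one small record through the script; fields 4 and 5
# should come out de-duplicated (subfield order unspecified).
printf 'a|b|c|x,y,x|p,p\n' | awk -f rath.awk
# e.g. a|b|c|x,y|p   (or a|b|c|y,x|p, depending on the awk)
```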

# 4  
Old 08-22-2016
Hi ratheeshjulk,
On my system, with the sample input provided in post #1 in this thread, vgersh99's code produces the output:
Code:
 abc|xyz|123|agentcode=sample code abcdeeess|agentadd=abcd stereet 23343
 sss|wwq|999|gentcode=sample1 code wqwdeeess,agentcode=sample1 code wqwdeeess|agentadd=ssss stereet sssss
 awe|rez|777|agentcode=sample2 code dfsdfeess|agentadd=tttt stereet ttttt
 twe|tez|555|dddddd,agentcode=sample3 code ddddddddd|agentadd=tttt stereet ttttt

(using random order for the subfields in the fields that are being processed for duplicate entries). Different versions of awk might produce different random orders.

The following similar awk script produces output with the order of subfields maintained: the first copy of a duplicated subfield is kept in place and later copies are dropped from the output:
Code:
awk '
# remove duplicate SFS-separated subfields from field number "field",
# keeping the first occurrence of each subfield in input order
function nodup(field,	f, n, loop, seen) {
	n = split($field, f, SFS)
	seen[$field = f[1]]	# reset the field to its 1st subfield; mark it seen
	for(loop = 2; loop <= n; loop++)
		if(!(f[loop] in seen)) {
			$field = $field SFS f[loop]
			seen[f[loop]]
		}
}
BEGIN {	FS = OFS = "|"
	SFS = ","	# subfield separator
	low = 4		# first field to process
	high = 5	# last field to process
}
{	for(i = low; i <= high; i++)
		nodup(i)
}
1' weblog

and produces the output:
Code:
 abc|xyz|123|agentcode=sample code abcdeeess|agentadd=abcd stereet 23343
 sss|wwq|999|agentcode=sample1 code wqwdeeess,gentcode=sample1 code wqwdeeess|agentadd=ssss stereet sssss
 awe|rez|777|agentcode=sample2 code dfsdfeess|agentadd=tttt stereet ttttt
 twe|tez|555|agentcode=sample3 code ddddddddd,dddddd|agentadd=tttt stereet ttttt

with the same input. Note that both of our outputs keep two subfields in field 4 of the second line: the difference between agentcode and gentcode means those subfields are not duplicates, even though only one of them appears in the output you said you wanted.

You haven't said what operating system you're using. If you are using a Solaris/SunOS system, you'll need to use /usr/xpg4/bin/awk or nawk instead of awk for both of our suggestions.

My code also assumes that the fields to be processed will always be adjacent, no matter how many fields in your real input files need to be processed; vgersh99's code lets you select any set of fields (contiguous or not). If the fields you want to process in your real files are not contiguous but you need to keep the output subfields in input order, it would be easy to modify my code to use the same field-selection scheme vgersh99 used.
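For example, a sketch of that modification (untested), marking the fields to process in an fA array the way vgersh99 did while keeping the order-preserving nodup():

```shell
awk '
# keep the first occurrence of each comma-separated subfield, in order
function nodup(field,	f, n, loop, seen) {
	n = split($field, f, SFS)
	seen[$field = f[1]]
	for(loop = 2; loop <= n; loop++)
		if(!(f[loop] in seen)) {
			$field = $field SFS f[loop]
			seen[f[loop]]
		}
}
BEGIN {	FS = OFS = "|"
	SFS = ","
	fA[4]; fA[5]	# list any field numbers here, contiguous or not
}
{	for(i = 1; i <= NF; i++)
		if(i in fA)
			nodup(i)
}
1' weblog
```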
# 5  
Old 08-23-2016
Thanks, the solution worked.