Remove duplicate values in a column (not in the file)
Hi Gurus,
I have a file (weblog) as shown below.
I want to remove the duplicate values from columns 4 and 5. The same value may repeat within a field, delimited by commas, and commas can also appear in the data itself.
My algorithm is to take columns 1, 2, and 3 (which make the record unique), split columns 4 and 5 on commas, remove the duplicates, and then join them back with commas (so that the commas in the record won't be lost).
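That algorithm maps directly onto awk. A minimal sketch, assuming whitespace-separated fields and that only fields 4 and 5 need processing (weblog is the input file from the description above):

awk '
{
    for (f = 4; f <= 5; f++) {
        n = split($f, part, ",")   # split the field on commas
        split("", seen)            # portable way to empty the seen[] array
        out = ""
        for (i = 1; i <= n; i++)
            if (!seen[part[i]]++)  # keep only the first copy of each subfield
                out = (out == "") ? part[i] : out "," part[i]
        $f = out                   # put the deduplicated field back
    }
    print
}' weblog

Assigning to $f rebuilds the record, so output fields end up separated by OFS (a single space by default); set OFS if the original separator must be preserved.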
Does the order of the resulting strings in fields 4 and 5 matter?
In other words, does it matter if the last line of the output shown above would be:
instead of:
It is easier and faster if the output order can be random; but it isn't hard to keep the input order if it matters.
Hi ratheeshjulk,
On my system, with the sample input provided in post #1 in this thread, vgersh99's code produces the output:
(using random order for the subfields in the fields that are being processed for duplicate entries). Different versions of awk might produce different random orders.
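For what it's worth, the "random" order comes from how awk iterates arrays: a for (s in seen) loop visits keys in an order POSIX leaves unspecified. A fragment illustrating that style (an illustration only, not vgersh99's actual code, which is not reproduced here):

n = split($f, part, ",")
split("", seen)
for (i = 1; i <= n; i++)
    seen[part[i]] = 1      # collect unique subfields as array keys
out = ""
for (s in seen)            # visit order is unspecified, hence "random"
    out = (out == "") ? s : out "," s
$f = out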
The following similar awk script produces output with the order of the subfields within each field maintained: the first copy of a duplicated subfield is kept in place, and later copies of that subfield are dropped from the output:
and produces the output:
with the same input. Note that the text marked in red is shown in both of our outputs because there is a difference between agentcode and gentcode; that difference causes both subfields to appear in the output even though that text does not appear in the output you said you wanted.
You haven't said what operating system you're using. If you are using a Solaris/SunOS system, you'll need to use /usr/xpg4/bin/awk or nawk instead of awk for both of our suggestions.
My code also assumes that the fields to be processed will always be adjacent, no matter how many fields in your real input files need to be processed; vgersh99's code lets you select any set (contiguous or non-contiguous) of fields to be processed. If the fields you want to process in your real files are not contiguous but you need to keep the output subfields in the order in which they were found in the input, it would be easy to modify my code to use the same scheme vgersh99 used to identify the fields to be processed; one such scheme is sketched below.
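For illustration, one common scheme for letting the caller pick an arbitrary set of fields (an assumption on my part; vgersh99's actual code is not shown here) is to pass the field numbers in with -v:

awk -v fieldlist="4,5,7" '
# fieldlist names the fields to process; they need not be contiguous
BEGIN { m = split(fieldlist, want, ",") }
{
    for (j = 1; j <= m; j++) {
        f = want[j]
        n = split($f, part, ",")
        split("", seen)
        out = ""
        for (i = 1; i <= n; i++)
            if (!seen[part[i]]++)
                out = (out == "") ? part[i] : out "," part[i]
        $f = out
    }
    print
}' weblog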
Dear folks
I have a map file of around 54K lines, and some of the values in the second column are the same; I want to find them and delete all of them. I looked at the usual de-duplication commands, but my case is not to keep one of the duplicate values. I want to remove all of the same... (4 Replies)
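Dropping every line whose column-2 value occurs more than once is a two-pass awk job. A sketch, assuming whitespace-separated columns and a hypothetical file name map.txt:

awk 'NR == FNR { count[$2]++; next }   # pass 1: count each column-2 value
     count[$2] == 1                    # pass 2: keep only the singletons
' map.txt map.txt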
Hello,
I have a script that is generating a tab delimited output file.
num Name PCA_A1 PCA_A2 PCA_A3
0 compound_00 -3.5054 -1.1207 -2.4372
1 compound_01 -2.2641 0.4287 -1.6120
3 compound_03 -1.3053 1.8495 ... (3 Replies)
Input
1,ABCD,no
2,system,yes
3,ABCD,yes
4,XYZ,no
5,XYZ,yes
6,pc,no
Code used to find duplicates with regard to the 2nd column:
awk -F, 'NR == 1 {p = $2; next} p == $2 {print "Line " NR ": $2 is duplicated"} {p = $2}' ./input.csv
Now is there a wise way to de-duplicate the entire line (remove... (4 Replies)
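If de-duplicating here means keeping the first line seen for each column-2 value, the usual awk idiom is:

awk -F, '!seen[$2]++' ./input.csv

On the sample input this keeps lines 1, 2, 4, and 6 and drops lines 3 and 5 (the repeated ABCD and XYZ). Note that the script quoted above only compares adjacent lines, so it misses duplicates that are not next to each other; the seen[] array catches them anywhere in the file.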
Hi Gents,
Please can you help me to get the desired output.
In the first column I have some duplicate records. The requirement is to reject the duplicate records, keeping the last occurrence. But there is a further condition: if the last occurrence is equal to value 14 or 98 in column 3 and... (2 Replies)
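The condition involving 14 and 98 is cut off above, so only the base operation can be sketched: keeping the last occurrence of each column-1 value takes two passes (data.txt is a hypothetical file name):

awk 'NR == FNR { last[$1] = FNR; next }   # pass 1: remember each key's last line number
     FNR == last[$1]                      # pass 2: print a line only if it is that last occurrence
' data.txt data.txt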
Dear Experts,
Kindly help me please,
I have a big file where there are duplicate values in columns 11 to 23; every 2 rows a new number appears, but each row has different x and y coordinates in columns 57 to 74.
Please, I would like to get a single value and the average of the x and y... (8 Replies)
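The exact column layout is unclear from the truncated description, but averaging x and y per repeated key is a standard awk pattern. A sketch in which the key column ($11), the coordinate columns ($57 and $58), and the file name bigfile are all assumptions:

awk '
{ key = $11; sx[key] += $57; sy[key] += $58; n[key]++ }
END {
    for (k in n)                          # one output line per distinct key
        printf "%s %.4f %.4f\n", k, sx[k] / n[k], sy[k] / n[k]
}' bigfile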
Hello experts,
I have a requirement where I have to implement two checks on a csv file:
1. Check to see if the value in first column is duplicate, if any value is duplicate script should exit.
2. Check to verify that the value in the second column is either "yes" or "no"; if it is anything else... (4 Replies)
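Both checks fit in a single awk pass. A sketch in which the messages, exit statuses, and the file name file.csv are illustrative:

awk -F, '
seen[$1]++                { print "duplicate value in column 1: " $1; err = 1; exit }
$2 != "yes" && $2 != "no" { print "bad value in column 2, line " NR ": " $2; err = 1; exit }
END { exit err }
' file.csv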
Hi, I've got a file that I'd like to uniquely sort based on column 2 (values in column 2 begin with "comp").
I tried: sort -t -nuk2,3 file.txt
But got:
sort: multi-character tab `-nuk2,3'
"man sort" did not help me out
Any pointers?
Input:
Output: (5 Replies)
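On the sort error above: -t consumed the next word, -nuk2,3, as its field separator, hence the "multi-character tab" complaint. The tab must be passed to -t explicitly, and since values like comp... are not numeric, -n should be dropped. A sketch (the $'\t' form assumes a bash/ksh93-style shell):

sort -t "$(printf '\t')" -u -k2,2 file.txt    # portable
sort -t $'\t' -u -k2,2 file.txt               # bash/ksh93/zsh

With -u, uniqueness is judged on the -k2,2 key, which matches "uniquely sort based on column 2".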
Hi,
I am processing a file and would like to delete duplicate records as indicated by one of its columns, e.g.
COL1 COL2 COL3
A 1234 1234
B 3k32 2322
C Xk32 TTT
A NEW XX22
B 3k32 ... (7 Replies)
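If the intent is to keep the first record for each COL1 value, the idiomatic one-liner is (change $1 to whichever column defines a duplicate; file is a placeholder name):

awk '!seen[$1]++' file

For the sample above this keeps the first A and B rows along with C and drops the later A and B records.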