Remove duplicate values in a column (not in the file)
Posted by ratheeshjulk on 08-22-2016 at 05:25 PM

Hi Gurus,

I have a file (weblog) like the one below:

Code:
 
 abc|xyz|123|agentcode=sample code abcdeeess,agentcode=sample code abcdeeess,agentcode=sample code abcdeeess|agentadd=abcd stereet 23343,agentadd=abcd stereet 23343
 sss|wwq|999|agentcode=sample1 code wqwdeeess,gentcode=sample1 code wqwdeeess,gentcode=sample1 code wqwdeeess|agentadd=ssss stereet sssss,agentadd=ssss stereet sssss
 awe|rez|777|agentcode=sample2 code dfsdfeess,agentcode=sample2 code dfsdfeess,agentcode=sample2 code dfsdfeess|agentadd=tttt stereet ttttt,agentadd=tttt stereet ttttt
 twe|tez|555|agentcode=sample3 code ddddddddd,dddddd,agentcode=sample3 code ddddddddd,dddddd|agentadd=tttt stereet ttttt,agentadd=tttt stereet ttttt

I want to remove the duplicate values from columns 4 and 5. The same value may repeat within a field, separated by commas, and a comma can also appear inside the data itself.
My algorithm is to keep columns 1, 2 and 3 (they make the record unique), split columns 4 and 5 on commas, remove the duplicates, and join the remaining values back with commas (so that a comma that belongs to the data is not lost); see the awk sketch after the expected output.

Is there a way to do this with awk or perl?

Output should look like:

Code:
 
 abc|xyz|123|agentcode=sample code abcdeeess|agentadd=abcd stereet 23343
 sss|wwq|999|agentcode=sample1 code wqwdeeess|agentadd=ssss stereet sssss
 awe|rez|777|agentcode=sample2 code dfsdfeess|agentadd=tttt stereet ttttt
 twe|tez|555|agentcode=sample3 code ddddddddd,dddddd|agentadd=tttt stereet ttttt
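
For illustration, a minimal awk sketch of that algorithm (split fields 4 and 5 on commas, keep the first occurrence of each value, rejoin them with commas; it assumes the input file is named weblog and that only fields 4 and 5 need de-duplication) could look like this:

Code:
 
 awk -F'|' -v OFS='|' '
 {
     # de-duplicate the comma-separated values inside fields 4 and 5
     for (f = 4; f <= 5; f++) {
         n = split($f, parts, ",")
         split("", seen)                # portable way to clear the lookup array per field
         out = ""
         for (i = 1; i <= n; i++) {
             if (!(parts[i] in seen)) { # keep only the first occurrence of each value
                 seen[parts[i]] = 1
                 out = (out == "" ? parts[i] : out "," parts[i])
             }
         }
         $f = out                       # rebuild the field; OFS restores the pipes on print
     }
     print
 }' weblog

Because the duplicated copies repeat the embedded comma as well, splitting and rejoining on commas also reproduces a comma that is genuinely part of the data, as in the last sample line.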

 
