Remove duplicates separated by delimiter


 
# 1  
Old 05-21-2018

This is my first post. I have been searching for three days and have come up with nothing so far. My input looks like this:

Code:
M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A2,A1-4 B4-B6,B2-B4,B4-B6,B1-B2

The output should be:
Code:
M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A4 B2-B4,B4-B6,B1-B2

Columns 6 and 7 contain strings of the form Ax-Ax and Bx-Bx respectively, separated by commas.

How can I remove the strings that are duplicated within column 6 and within column 7?
For example, if A1-A2 appears twice in column 6, I want to keep only one copy.

Code:
awk '{ while(++i<=NF) printf (!a[$i]++) ? $i FS : ","; i=split("",a); print ""}' data

I saw a question like mine on SO and tried the one-liner above, but I am stuck.

What am I doing wrong?

Code:
awk ' 
BEGIN { FS="\t" } ;
{
  split($6, valueArray,",");
  j=0;
  for (i in valueArray) 
  { 
    if (!( valueArray[i] in duplicateArray))
    {
      duplicateArray[j] = valueArray[i];
      j++;
    }
  };
  printf $1 "\t";
  for (j in duplicateArray) 
  {
    if (duplicateArray[j]) {
      printf duplicateArray[j] ",";
    }
  }
  printf "\t";
  print $8

}'

After many failed attempts I came up with the idea of breaking each row apart at the delimiters to remove duplicate fields, only to realize that I would then need to regroup all the As under column 6 and all the Bs under column 7. Back to square one!

So the solution for me would be to remove duplicates, separated by a delimiter, within a column. I also tried a Perl approach, but in vain.

Thank you for your help.

Update: please note that in this example, columns 6 and 7 are not sorted.

Last edited by enrikS; 05-22-2018 at 10:16 AM..
# 2  
Old 05-21-2018
Welcome to the forum.


Why is B4-B6 considered a duplicate? I can't see it twice or more in the input line.
# 3  
Old 05-21-2018
Regardless, how far would
Code:
awk '
function RMDUP(P1, T, TX, DL)   {for (n = split (P1, TMP, ","); n; n--)  T[TMP[n]]
                                 for (t in T)   {TX = TX DL t
                                                 DL = ","
                                                }
                                 return TX
                                }
        {$6 = RMDUP($6)
         $7 = RMDUP($7)
        }
 1
' file
M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A4 B4,B6,B1-B2,B2-B4,B4-B6

get you?
This User Gave Thanks to RudiC For This Post:
# 4  
Old 05-22-2018
Quote:
Why is B4-B6 considered a duplicate? I can't see it twice or more in the input line.

I made an error in my first post; it has been updated.

It worked. The way you wrote the code is self-explanatory; I'm baffled.
Thank you for your time.
# 5  
Old 05-22-2018
Hi enrikS,
If keeping the input order of elements in fields 6 and 7 is important and you want <tab> as your output field separator (as shown in your second code snippet), you could also try:
Code:
awk '
BEGIN {	OFS = "\t"
}
function RMDUP(input,	i, n, NoDupArray, output, ValueArray) {
	n = split(input, ValueArray, /,/)
	NoDupArray[output = ValueArray[1]]
	for(i = 2; i <= n; i++)
		if(!(ValueArray[i] in NoDupArray)) {
			output = output "," ValueArray[i]
			NoDupArray[ValueArray[i]]
		}
	return output
}
{	$6 = RMDUP($6)
	$7 = RMDUP($7)
}
1' data

In addition to the change you have already made to your original post, note also that if you want the field 6 output to be:
Code:
A1-A2,A5-A6,A1-A4

you can't have the input be:
Code:
A1-A2,A5-A6,A1-A2,A1-4

The above code produces the output:
Code:
M3	C2	V5	D5	HH:FF	A1-A2,A5-A6,A1-4	B4-B6,B2-B4,B1-B2

from the sample input you provided in post #1.

For some hints as to why your second code snippet didn't work, note that your awk code is specifying that the input field separator (FS) is a <tab> character, but there are no <tab>s in your sample input (just <space>s; no <tab>s). Therefore, your awk script is only seeing one input field; not eight. And split()ing an empty field (e.g., $6) produces an array with zero elements.
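
The field separator point is easy to check by counting fields. The following is only a minimal sketch: the helper name count_fields and the variable line are made up for this illustration, and the sample line is the space-separated one from post #1 (which has seven fields).

```shell
# Hypothetical helper: print how many fields awk sees for a given FS.
# $1 is the field separator to test; the input line is read from stdin.
count_fields() {
    awk -v FS="$1" '{ print NF }'
}

line='M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A2,A1-4 B4-B6,B2-B4,B4-B6,B1-B2'

# FS is a tab: the line contains no tabs, so it is one big field (prints 1).
printf '%s\n' "$line" | count_fields '\t'

# FS is a single space, i.e. default whitespace splitting (prints 7).
printf '%s\n' "$line" | count_fields ' '
```

With FS set to a tab, $6 is empty, which is why split() had nothing to work on.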
# 6  
Old 05-22-2018
Quote:
For some hints as to why your second code snippet didn't work, note that your awk code is specifying that the input field separator (FS) is a <tab> character, but there are no <tab>s in your sample input (just <space>s; no <tab>s). Therefore, your awk script is only seeing one input field; not eight. And split()ing an empty field (e.g., $6) produces an array with zero elements.
I spent hours trying to figure out why it was not working, all because I mistook spaces for tabs.

As for columns 6 and 7, all the strings in my real data are sorted (the ones used in this example are not). Since the order of the output did not matter, I did not mind when I ran the test this morning. But it is good to know that the order can be kept. I will edit the post to include that information.
One question regarding this part of the code:
Code:
{for (n = split (P1, TMP, ","); n; n--)  T[TMP[n]]
 for (t in T)   {TX = TX DL t
                 DL = ","
                }

I don't know how to formulate it properly, so I will just give an example:
Code:
M3    C2    A1    D5    HH:FF    A1-A2,A5-A6,A1-A4    B4-B6,B2-B4,B1-B2

Delete the array elements in which $3 is present.
In this case $3 = A1, so A1-A2 and A1-A4 must be removed.

Before I saw your method, I put the third column into a new text file and searched for those elements. I was wondering whether using your method would be less complex. Hopefully I will give it a try this weekend.
I am still learning how to write my code using different approaches. Three weeks ago I did not even know how to use Linux, lol. It has been hard to comment and ask questions on SO without being labeled [witch-hunt]. Glad I found this forum.
# 7  
Old 05-22-2018
Not sure I understand correctly, but if you want to remove all elements from the array that match / contain $3, try (with your new sample code):

Code:
awk '
function RMDUP(P1, T, TX, DL)   {for (n = split (P1, TMP, ","); n; n--)  T[TMP[n]]
                                 for (t in T) if (!(t ~ $3))    {TX = TX DL t
                                                                 DL = ","
                                                                }
                                 return TX
                                }
        {$6 = RMDUP($6)
         $7 = RMDUP($7)
        }
 1
' file
M3 C2 A1 D5 HH:FF A5-A6 B1-B2,B2-B4,B4-B6
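
One caveat with the test t ~ $3: it treats $3 as a regular expression and drops any element that merely contains that text, so with $3 = A1 an element such as A10-A12 would also be removed. If only elements with an endpoint exactly equal to $3 should go, a variant could split each element on "-" and compare the parts as strings. This is a sketch under that assumption, not part of the solution posted above; the names rm_by_key and RMKEY and the extra A10-A12 element in the sample are invented for the illustration. It keeps the input order, like the code in post #5.

```shell
# Hypothetical wrapper: reads lines on stdin, writes filtered lines to stdout.
rm_by_key() {
    awk '
    # Return list without duplicates and without any element that has
    # an endpoint exactly equal to key. Names after the gap are locals.
    function RMKEY(list, key,    n, i, elems, ends, seen, out, sep) {
        n = split(list, elems, ",")
        for (i = 1; i <= n; i++) {
            if (elems[i] in seen) continue      # already handled: duplicate
            seen[elems[i]]
            split(elems[i], ends, "-")          # ends[1]-ends[2]
            # Appending "" forces a string comparison, never a numeric one.
            if (ends[1] == (key "") || ends[2] == (key "")) continue
            out = out sep elems[i]
            sep = ","
        }
        return out
    }
    {
        $6 = RMKEY($6, $3)
        $7 = RMKEY($7, $3)
    }
    1'
}

printf '%s\n' 'M3 C2 A1 D5 HH:FF A1-A2,A5-A6,A1-A4,A10-A12 B4-B6,B2-B4,B1-B2' | rm_by_key
# prints: M3 C2 A1 D5 HH:FF A5-A6,A10-A12 B4-B6,B2-B4,B1-B2
```

Whether the regex match or the exact endpoint match is right depends on what the real data looks like.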


Last edited by RudiC; 05-22-2018 at 12:06 PM..