Remove duplicates separated by delimiter

awk, perl, solved

    #1  
Old 05-21-2018
enrikS enrikS is offline
Registered User
 
Join Date: May 2018
Last Activity: 16 July 2018, 4:14 PM EDT
Posts: 4
Thanks: 4
Thanked 0 Times in 0 Posts
Remove duplicates separated by delimiter

First post. I have been browsing for 3 days and have come up with nothing so far.

Code:
M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A2,A1-4 B4-B6,B2-B4,B4-B6,B1-B2

The output should be:
Code:
M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A4 B2-B4,B4-B6,B1-B2

Columns 6 and 7 contain strings of the form Ax-Ax and Bx-Bx respectively. The strings are separated by a comma ",".

How can I remove the strings that are duplicated within col 6 and within col 7?
For example, if A1-A2,A1-A2 is present in col 6, I want to keep only one.

Code:
awk '{ while(++i<=NF) printf (!a[$i]++) ? $i FS : ","; i=split("",a); print ""}' data

I saw a question like mine on SO, but I'm stuck.

What am I doing wrong?

Code:
awk ' 
BEGIN { FS="\t" } ;
{
  split($6, valueArray,",");
  j=0;
  for (i in valueArray) 
  { 
    if (!( valueArray[i] in duplicateArray))
    {
      duplicateArray[j] = valueArray[i];
      j++;
    }
  };
  printf $1 "\t";
  for (j in duplicateArray) 
  {
    if (duplicateArray[j]) {
      printf duplicateArray[j] ",";
    }
  }
  printf "\t";
  print $8

}'

After many failed attempts I came up with the idea of breaking up the delimiters to remove duplicate fields across each row, only to realize that I would then need to regroup all the As under col 6 and all the Bs under col 7. Back to square one!

So the solution for me would be to remove duplicates separated by a delimiter within a column. I tried a Perl approach, but in vain.

Thank you for your help.

*Update
Please note that in this example, col 6 and col 7 are not sorted.

Last edited by enrikS; 05-22-2018 at 09:16 AM..
    #2  
Old 05-21-2018
RudiC RudiC is offline Forum Staff  
Moderator
 
Join Date: Jul 2012
Last Activity: 17 July 2018, 4:35 AM EDT
Location: Aachen, Germany
Posts: 13,065
Thanks: 448
Thanked 4,012 Times in 3,689 Posts
Welcome to the forum.


Why is B4-B6 considered a duplicate? I can't see it twice or more in the input line.
    #3  
Old 05-21-2018
RudiC RudiC is offline Forum Staff  
Howsoever, how far would
Code:
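awk '
# Return P1 with duplicate comma-separated entries removed.
# T, TX, DL, TMP, n and t are extra parameters, used as local variables.
# Note: "for (t in T)" visits the keys in an unspecified order.
function RMDUP(P1,   T, TX, DL, TMP, n, t) {
        for (n = split(P1, TMP, ","); n; n--) T[TMP[n]]   # each distinct entry becomes a key of T
        for (t in T) {
                TX = TX DL t
                DL = ","        # no delimiter before the first entry
        }
        return TX
}
{
        $6 = RMDUP($6)
        $7 = RMDUP($7)
}
1
' file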
M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A4 B4,B6,B1-B2,B2-B4,B4-B6

get you?
The Following User Says Thank You to RudiC For This Useful Post:
enrikS (05-22-2018)
    #4  
Old 05-22-2018
enrikS enrikS is offline
Quote:
Why is B4-B6 considered a duplicate? I can't see it twice or more in the input line.

I made an error in the first post; it is updated now.

It worked. The way you wrote the code is self-explanatory; I'm baffled.
Thank you for your time.
    #5  
Old 05-22-2018
Don Cragun Don Cragun is online now Forum Staff  
Administrator
 
Join Date: Jul 2012
Last Activity: 17 July 2018, 5:15 AM EDT
Location: San Jose, CA, USA
Posts: 11,407
Thanks: 649
Thanked 3,970 Times in 3,393 Posts
Hi enrikS,
If keeping the input order of elements in fields 6 and 7 is important and you want <tab> as your output field separator (as shown in your second code snippet), you could also try:
Code:
awk '
BEGIN {	OFS = "\t"
}
function RMDUP(input,	i, n, NoDupArray, output, ValueArray) {
	n = split(input, ValueArray, /,/)
	NoDupArray[output = ValueArray[1]]
	for(i = 2; i <= n; i++)
		if(!(ValueArray[i] in NoDupArray)) {
			output = output "," ValueArray[i]
			NoDupArray[ValueArray[i]]
		}
	return output
}
{	$6 = RMDUP($6)
	$7 = RMDUP($7)
}
1' data

In addition to the change you have already made to your original post, note also that if you want the field 6 output to be:
Code:
A1-A2,A5-A6,A1-A4

you can't have the input be:
Code:
A1-A2,A5-A6,A1-A2,A1-4

The above code produces the output:
Code:
M3	C2	V5	D5	HH:FF	A1-A2,A5-A6,A1-4	B4-B6,B2-B4,B1-B2

from the sample input you provided in post #1.
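
For readers who want to try this on the corrected input, here is a minimal sketch of the same order-preserving idea, written with the common `seen[x]++` idiom. The wrapper name `dedup_fields`, the function name `dedup`, and the sample line (with A1-A4 in place of A1-4) are assumptions made for illustration; they do not come from the posts above.

```shell
# Hypothetical wrapper: reads lines on stdin, de-duplicates fields 6 and 7.
dedup_fields() {
    awk '
    # Drop repeated comma-separated items from s, keeping first occurrences.
    # Names after the extra spaces in the parameter list are local variables.
    function dedup(s,    n, i, parts, seen, out, sep) {
        n = split(s, parts, ",")
        for (i = 1; i <= n; i++)
            if (!seen[parts[i]]++) {      # true only the first time an item is seen
                out = out sep parts[i]
                sep = ","                 # no comma before the first item
            }
        return out
    }
    { $6 = dedup($6); $7 = dedup($7); print }
    '
}

# Assumed sample line: post #1 input with A1-4 corrected to A1-A4.
printf '%s\n' 'M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A2,A1-A4 B4-B6,B2-B4,B4-B6,B1-B2' | dedup_fields
```

With that line the sketch should print `M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A4 B4-B6,B2-B4,B1-B2`, separated by single spaces because OFS is left at its default.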

For some hints as to why your second code snippet didn't work, note that your awk code is specifying that the input field separator (FS) is a <tab> character, but there are no <tab>s in your sample input (just <space>s; no <tab>s). Therefore, your awk script is only seeing one input field; not eight. And split()ing an empty field (e.g., $6) produces an array with zero elements.
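
The effect of the field separator can be checked directly. The short sketch below counts the fields of the post #1 sample line, first with the default FS and then with FS set to a tab; the variable name `line` is only a convenience for this illustration.

```shell
line='M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A2,A1-4 B4-B6,B2-B4,B4-B6,B1-B2'

# Default FS: fields are split on runs of blanks.
printf '%s\n' "$line" | awk '{ print NF }'

# FS set to a tab: the line contains no tab, so the whole line is $1,
# $6 is empty, and split() on it returns 0.
printf '%s\n' "$line" | awk 'BEGIN { FS = "\t" } { print NF, split($6, a, ",") }'
```

The first command should print 7, and the second should print `1 0`.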
    #6  
Old 05-22-2018
enrikS enrikS is offline
Quote:
For some hints as to why your second code snippet didn't work, note that your awk code is specifying that the input field separator (FS) is a <tab> character, but there are no <tab>s in your sample input (just <space>s; no <tab>s). Therefore, your awk script is only seeing one input field; not eight. And split()ing an empty field (e.g., $6) produces an array with zero elements.
I spent hours trying to figure out why it was not working, all because I mistook spaces for tabs.

As for col 6 and col 7, all my real strings are sorted (the ones used in this example are not). Since the order of the output did not matter, I did not mind when I ran the test this morning. But it is good to know that the order can be preserved. I will edit the post to include that info.
One question regarding
Code:
{for (n = split (P1, TMP, ","); n; n--)  T[TMP[n]]
 for (t in T)   {TX = TX DL t
                 DL = ","
                }

I don't know how to formulate it properly, so I will just give an example:
Code:
M3    C2    A1    D5    HH:FF    A1-A2,A5-A6,A1-A4    B4-B6,B2-B4,B1-B2

Delete the array entries that contain $3.
In this case $3 = A1, so A1-A2 and A1-A4 must be removed.

Before I saw your method, I put the 3rd column in a new text file and searched for these entries. I was wondering whether using your method would be less complex. Hopefully I will give it a try this weekend.
I'm still learning how to write my code using different approaches. Three weeks ago I did not even know how to use Linux, lol. It has been so hard to comment and ask questions on SO without being labeled [witch-hunt]. Glad I found this forum.
    #7  
Old 05-22-2018
RudiC RudiC is offline Forum Staff  
Not sure I understand correctly, but if you want to remove all elements from the array that match / contain $3, try (with your new sample code):

Code:
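awk '
# Same as RMDUP in post #3, but entries that match $3 (used as a
# regular expression) are left out of the result.
function RMDUP(P1,   T, TX, DL, TMP, n, t) {
        for (n = split(P1, TMP, ","); n; n--) T[TMP[n]]   # each distinct entry becomes a key of T
        for (t in T)
                if (!(t ~ $3)) {
                        TX = TX DL t
                        DL = ","        # no delimiter before the first entry
                }
        return TX
}
{
        $6 = RMDUP($6)
        $7 = RMDUP($7)
}
1
' file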
M3 C2 A1 D5 HH:FF A5-A6 B1-B2,B2-B4,B4-B
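
One caveat on the match above: `t ~ $3` treats $3 as a regular expression and matches it anywhere in the entry, so a key of A1 would also drop an entry such as A10-A12. If that is not wanted, a possible variant compares each endpoint of an entry with $3 exactly. The sketch below assumes entries always have the form X-Y; the names `drop_matching` and `filter`, and the extra A10-A12 entry in the sample line, are hypothetical and only there to show the difference. It also keeps the input order.

```shell
# Hypothetical wrapper: reads lines on stdin, filters fields 6 and 7 by field 3.
drop_matching() {
    awk '
    # Keep first occurrences of the items in s, skipping any item whose
    # left or right endpoint is exactly equal to key.
    function filter(s, key,    n, i, parts, ends, seen, out, sep) {
        n = split(s, parts, ",")
        for (i = 1; i <= n; i++) {
            split(parts[i], ends, "-")                    # assumes items look like X-Y
            if (ends[1] == key || ends[2] == key) continue
            if (seen[parts[i]]++) continue                # already emitted
            out = out sep parts[i]
            sep = ","
        }
        return out
    }
    { $6 = filter($6, $3); $7 = filter($7, $3); print }
    '
}

# Assumed sample line: the post #6 example plus an extra A10-A12 entry.
printf '%s\n' 'M3 C2 A1 D5 HH:FF A1-A2,A5-A6,A1-A4,A10-A12 B4-B6,B2-B4,B1-B2' | drop_matching
```

With that line the sketch should print `M3 C2 A1 D5 HH:FF A5-A6,A10-A12 B4-B6,B2-B4,B1-B2`: A1-A2 and A1-A4 are removed, while A10-A12 survives.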


Last edited by RudiC; 05-22-2018 at 11:06 AM..
The Following User Says Thank You to RudiC For This Useful Post:
enrikS (05-28-2018)
All times are GMT -4.

Unix & Linux Forums Content Copyright©1993-2018. All Rights Reserved.