Remove duplicates separated by delimiter

05-21-2018

This is my first post. I have been searching for three days and have not found anything that works so far. My input looks like this:

Code:

M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A2,A1-4 B4-B6,B2-B4,B4-B6,B1-B2

The output should be:

Code:

M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A4 B2-B4,B4-B6,B1-B2

In columns 6 and 7 there are strings of the form Ax-Ax and Bx-Bx respectively. The strings are separated by commas (",").

How can I remove the strings that are duplicated within column 6 and within column 7?
For example, if A1-A2,A1-A2 is present in column 6, I want to keep only one A1-A2.

Code:

awk '{ while(++i<=NF) printf (!a[$i]++) ? $i FS : ","; i=split("",a); print ""}' data

I saw a question like mine on SO (the one-liner above comes from there), but I'm stuck.

What am I doing wrong?

Code:

awk ' 
BEGIN { FS="\t" } ;
{
  split($6, valueArray,",");
  j=0;
  for (i in valueArray) 
  { 
    if (!( valueArray[i] in duplicateArray))
    {
      duplicateArray[j] = valueArray[i];
      j++;
    }
  };
  printf $1 "\t";
  for (j in duplicateArray) 
  {
    if (duplicateArray[j]) {
      printf duplicateArray[j] ",";
    }
  }
  printf "\t";
  print $8

}'

After many failed attempts I came up with the idea of splitting on the delimiters to remove duplicate fields across each row, only to realize that I would then need to regroup all the As under column 6 and all the Bs under column 7. Back to square one!

So the solution for me would be to remove duplicates separated by a delimiter within a column. I also tried a Perl approach, but in vain.

Thank you for your help.

Update: please note that in this example, columns 6 and 7 are not sorted.

Last edited by enrikS; 05-22-2018 at 10:16 AM..

enrikS

05-21-2018

Welcome to the forum.

Why is B4-B6 considered a duplicate? I can't see it twice or more in the input line.

RudiC

05-21-2018

Howsoever, how far would

Code:

awk '
function RMDUP(P1, T, TX, DL)   {for (n = split (P1, TMP, ","); n; n--)  T[TMP[n]]
                                 for (t in T)   {TX = TX DL t
                                                 DL = ","
                                                }
                                 return TX
                                }
        {$6 = RMDUP($6)
         $7 = RMDUP($7)
        }
 1
' file
M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A4 B4,B6,B1-B2,B2-B4,B4-B6

get you?

RudiC

05-22-2018

Quote:

Why is B4-B6 considered a duplicate? I can't see it twice or more in the input line.

I made an error in my first post; it is updated now.

It worked. The way you wrote the code is self-explanatory; I'm baffled.
Thank you for your time.

enrikS

05-22-2018

Hi enrikS,
If keeping the input order of elements in fields 6 and 7 is important and you want <tab> as your output field separator (as shown in your second code snippet), you could also try:

Code:

awk '
BEGIN {	OFS = "\t"
}
function RMDUP(input,	i, n, NoDupArray, output, ValueArray) {
	n = split(input, ValueArray, /,/)
	NoDupArray[output = ValueArray[1]]
	for(i = 2; i <= n; i++)
		if(!(ValueArray[i] in NoDupArray)) {
			output = output "," ValueArray[i]
			NoDupArray[ValueArray[i]]
		}
	return output
}
{	$6 = RMDUP($6)
	$7 = RMDUP($7)
}
1' data

In addition to the change you have already made to your original post, note also that if you want the field 6 output to be:

Code:

A1-A2,A5-A6,A1-A4

you can't have the input be:

Code:

A1-A2,A5-A6,A1-A2,A1-4

The above code produces the output:

Code:

M3	C2	V5	D5	HH:FF	A1-A2,A5-A6,A1-4	B4-B6,B2-B4,B1-B2

from the sample input you provided in post #1.

For some hints as to why your second code snippet didn't work, note that your awk code is specifying that the input field separator (FS) is a <tab> character, but there are no <tab>s in your sample input (just <space>s; no <tab>s). Therefore, your awk script is only seeing one input field; not eight. And split()ing an empty field (e.g., $6) produces an array with zero elements.
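The effect of the field separator can be checked directly. The following is only a minimal sketch: `count_fields` is a hypothetical helper, not something from the posts above, and it prints the number of fields awk sees (NF) followed by the number of elements that split() finds in $6, for the sample line from post #1.

```shell
# Sample line from post #1, with single spaces between the fields.
line='M3 C2 V5 D5 HH:FF A1-A2,A5-A6,A1-A2,A1-4 B4-B6,B2-B4,B4-B6,B1-B2'

# Hypothetical helper: print NF and the element count of split($6, ...)
# for the field separator given as the first argument.
count_fields() {
    printf '%s\n' "$line" |
    awk -v sep="$1" '
        BEGIN { FS = sep }
        { n = split($6, parts, ","); print NF, n }'
}

count_fields ' '     # a single space selects the default blank splitting: prints "7 4"
count_fields '\t'    # tab separator, as in the second snippet: prints "1 0"
```

With the tab separator the whole line is one field, so $6 is empty and there is nothing for the loop over the split result to process.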

Don Cragun

05-22-2018

Quote:

For some hints as to why your second code snippet didn't work, note that your awk code is specifying that the input field separator (FS) is a <tab> character, but there are no <tab>s in your sample input (just <space>s; no <tab>s). Therefore, your awk script is only seeing one input field; not eight. And split()ing an empty field (e.g., $6) produces an array with zero elements.

I spent hours trying to figure out why it was not working, all because I mistook spaces for tabs.

In my real data, all the strings in columns 6 and 7 are sorted (the ones used in this example are not). As the order of the output did not matter, I did not mind when I ran the test this morning. But it is good to know that the input order can be kept. I will edit the post to include that information.

One question regarding:

Code:

{for (n = split (P1, TMP, ","); n; n--)  T[TMP[n]]
 for (t in T)   {TX = TX DL t
                 DL = ","
                }

I don't know how to phrase it properly, so I will just give an example:

Code:

M3    C2    A1    D5    HH:FF    A1-A2,A5-A6,A1-A4    B4-B6,B2-B4,B1-B2

Delete an element of the list if it contains $3.
In this case $3 = A1, so A1-A2 and A1-A4 must be removed.

Before I saw your method, I put the third column in a new text file and searched for these elements. I was wondering whether using your method would be less complex. Hopefully I will give it a try this weekend.
I'm still learning how to write my code using different approaches. Three weeks ago I did not even know how to use Linux, lol. It has been so hard to comment and ask questions on SO without being labeled [witch-hunt]. Glad I found this forum.

enrikS

05-22-2018

Not sure I understand correctly, but if you want to remove all elements from the array that match / contain $3, try (with your new sample code):

Code:

awk '
function RMDUP(P1, T, TX, DL)   {for (n = split (P1, TMP, ","); n; n--)  T[TMP[n]]
                                 for (t in T) if (!(t ~ $3))    {TX = TX DL t
                                                                 DL = ","
                                                                }
                                 return TX
                                }
        {$6 = RMDUP($6)
         $7 = RMDUP($7)
        }
 1
' file
M3 C2 A1 D5 HH:FF A5-A6 B1-B2,B2-B4,B4-B
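One thing to keep in mind with `t ~ $3`: the contents of $3 are used as a regular expression, so A1 would also match an element such as A10-A12, and the `for (t in T)` loop does not guarantee any particular output order. If an exact comparison and the input order are wanted instead, something along these lines might work. It is only a sketch under assumptions: `filter_by_key` and `rmdup_key` are hypothetical names, an element is assumed to be two endpoints joined by "-", and the A10-A12 element in the sample line is added here purely for illustration.

```shell
filter_by_key() {
    awk '
    # Hypothetical helper: drop repeated elements, and drop any element
    # whose endpoint (a part around "-") is exactly equal to key.
    # Names after "key" are local variables.
    function rmdup_key(list, key,    n, i, m, j, hit, ends, elems, seen, out, sep) {
        n = split(list, elems, ",")
        for (i = 1; i <= n; i++) {
            if (elems[i] in seen)
                continue                      # already handled this element
            seen[elems[i]]
            hit = 0
            m = split(elems[i], ends, "-")
            for (j = 1; j <= m; j++)
                if ((ends[j] "") == (key ""))  # force a string comparison
                    hit = 1
            if (!hit) {
                out = out sep elems[i]        # keep the element in input order
                sep = ","
            }
        }
        return out
    }
    {
        $6 = rmdup_key($6, $3)
        $7 = rmdup_key($7, $3)
        print
    }'
}

printf '%s\n' 'M3 C2 A1 D5 HH:FF A1-A2,A5-A6,A1-A4,A10-A12 B4-B6,B2-B4,B4-B6,B1-B2' |
filter_by_key
```

For that line, A1-A2 and A1-A4 are removed while A10-A12 is kept, and the duplicate B4-B6 is dropped from column 7.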

Last edited by RudiC; 05-22-2018 at 12:06 PM..

RudiC
