Filtering duplicates based on lookup table and rules

10-09-2014


Please help me solve the following. I have access to a Red Hat Linux cluster with 32 GB of RAM.

I have duplicate ids for variable names. In the lookup file below, ids 1 and 2 are duplicates; 3, 4 and 5 are duplicates; and 6 and 7 are duplicates. My objective is to keep only the first occurrence of each set of duplicates.

Lookup file

Code:

varid varname
1 var1
2 var1
3 varx
4 varx
5 varx
6 vary
7 vary
8 varz
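One way to make the lookup file usable is to map every varid to its varname, so that ids sharing a name can be recognised as duplicates of each other. The following is only a minimal Python sketch, assuming the lookup file is whitespace-separated with a header row as shown above; the function name `load_lookup` is made up for illustration.

```python
def load_lookup(lines):
    """Map each varid to its varname, skipping the header and blank lines."""
    id_to_name = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2 or fields[0] == "varid":
            continue
        varid, varname = fields
        id_to_name[varid] = varname
    return id_to_name


# The lookup table from the post, inlined so the sketch is self-contained.
LOOKUP = """\
varid varname
1 var1
2 var1
3 varx
4 varx
5 varx
6 vary
7 vary
8 varz
""".splitlines()

id_to_name = load_lookup(LOOKUP)

# Invert the map to list each set of duplicate ids in file order.
name_to_ids = {}
for varid, varname in id_to_name.items():
    name_to_ids.setdefault(varname, []).append(varid)

for varname, ids in name_to_ids.items():
    print(varname, ids)
```

Ids are kept as strings here because they are only compared, never used in arithmetic.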

I need to use the following rules to filter the file below per category.
1) If all duplicate ids within a category have the same value, keep the first occurrence and print its value.

example input

Code:

3;cat1;val3
4;cat1;val3

example output

Code:

3;cat1;val3

2) If the duplicate ids within a category do not all have the same value, keep the first occurrence and print its value as ambiguous.

example input

Code:

3;cat1;val1
4;cat1;val2
5;cat1;val1

example output

Code:

3;cat1;ambiguous

3) If only a single id (out of a set of duplicate ids) is present in a category, print the row as it is.
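The three rules can be read as a single decision over the rows that belong to one set of duplicate ids within one category. Below is a hedged Python sketch of that decision only; it assumes the rows have already been grouped and are in input order, and the function name `resolve` is hypothetical.

```python
def resolve(rows):
    """Collapse the rows of one duplicate set within one category.

    rows is a non-empty list of (varid, category, value) tuples in input order.
    """
    first_id, category, first_value = rows[0]
    distinct_values = {value for _, _, value in rows}
    if len(distinct_values) > 1:
        # Rule 2: the duplicates disagree, so the value becomes "ambiguous".
        return (first_id, category, "ambiguous")
    # Rules 1 and 3: a single value (one row, or several rows that agree).
    return (first_id, category, first_value)


rule1 = resolve([("3", "cat1", "val3"), ("4", "cat1", "val3")])
rule2 = resolve([("3", "cat1", "val1"), ("4", "cat1", "val2"), ("5", "cat1", "val1")])
rule3 = resolve([("2", "cat2", "val2")])

for row in (rule1, rule2, rule3):
    print(";".join(row))
```

Rules 1 and 3 end up in the same branch because a lone row trivially has only one distinct value.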

Data sample input

Code:

varid;category;value
1;cat1;val1
2;cat1;val2
3;cat1;val3
4;cat1;val3
5;cat1;val3
2;cat2;val2
3;cat2;val3
5;cat2;val3
6;cat2;val3
7;cat2;val4
8;cat2;val4

Filtered sample output

Code:

varid;category;value
1;cat1;ambiguous
3;cat1;val3
2;cat2;val2
3;cat2;val3
6;cat2;ambiguous
8;cat2;val4
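For reference, here is one possible way to get from the sample input to the sample output. It is a sketch in Python rather than a tested production script: it assumes the lookup file is whitespace-separated, the data file is semicolon-separated, both have a header row, and that an id missing from the lookup is unique (that last point is not stated in the post). The name `filter_duplicates` is made up, and `io.StringIO` stands in for the real files.

```python
import io

LOOKUP = """\
varid varname
1 var1
2 var1
3 varx
4 varx
5 varx
6 vary
7 vary
8 varz
"""

DATA = """\
varid;category;value
1;cat1;val1
2;cat1;val2
3;cat1;val3
4;cat1;val3
5;cat1;val3
2;cat2;val2
3;cat2;val3
5;cat2;val3
6;cat2;val3
7;cat2;val4
8;cat2;val4
"""


def filter_duplicates(lookup_file, data_file):
    """Return the filtered rows (header first) as a list of strings."""
    id_to_name = {}
    next(lookup_file)  # skip the lookup header
    for line in lookup_file:
        fields = line.split()
        if len(fields) == 2:
            id_to_name[fields[0]] = fields[1]

    header = next(data_file).rstrip("\n")
    # (category, varname) -> [first varid, category, first value, ambiguous?]
    groups = {}
    for line in data_file:
        line = line.strip()
        if not line:
            continue
        varid, category, value = line.split(";")
        # Assumption: an id that is not in the lookup is treated as unique.
        name = id_to_name.get(varid, ("unlisted", varid))
        key = (category, name)
        if key not in groups:
            groups[key] = [varid, category, value, False]
        elif groups[key][2] != value:
            groups[key][3] = True  # a later duplicate disagrees with the first

    result = [header]
    # Dicts keep insertion order, so groups come out in order of first appearance.
    for varid, category, value, ambiguous in groups.values():
        result.append(";".join((varid, category, "ambiguous" if ambiguous else value)))
    return result


filtered = filter_duplicates(io.StringIO(LOOKUP), io.StringIO(DATA))
print("\n".join(filtered))
```

The data is read line by line and only one entry is kept per (category, varname) pair, so memory use grows with the number of distinct pairs rather than with the number of input rows.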

Last edited by ritakadm; 10-10-2014 at 12:51 AM.. Reason: added code tags for clarity

ritakadm


10-10-2014


Could you please provide some more information? I am not able to understand your problem.

pravin27


10-10-2014


Hi Praveen,

I'm sorry I was unclear. Let me try again. Please let me know if it makes sense now.

If you look at the lookup file, the variable name var1 has two variable ids, 1 and 2, so the ids are duplicated. Similarly, varx has three ids: 3, 4 and 5. What I need in the output is just one of these duplicated ids.

The input file has a column called category. Within the same category, the ids 1 and 2 must not both appear in the output.

There is also an input column called value. Within each category, ideally all duplicated ids should have the same value.
For example, in the following lines the duplicated ids 1 and 2 within category 1 both have the value val1.

Code:

1;cat1;val1
2;cat1;val1

In this case I want to report only the first occurrence of the duplicated id, so only this line appears in the output.

Code:

1;cat1;val1

Sometimes the duplicated ids 1 and 2 do not have the same value in the input, and I want to report that as ambiguous. For example, in the following lines, in category 1, the ids 1 and 2 have different values, val1 and val2 respectively.

Code:

1;cat1;val1
2;cat1;val2

Then I report only the first duplicated id which is 1, and set the value to be ambiguous.

Code:

1;cat1;ambiguous

Sometimes within a category, only one id out of 1 and 2 is present. In that case I report the row as it is.
For example, suppose category 1 has only id 2, with the value val1.

Code:

2;cat1;val1

Then we output the same row.

Code:

2;cat1;val1

Please note that all operations are within the same category.

ritakadm


10-10-2014


According to what you are saying and using your first post input:

Code:

varid;category;value
1;cat1;val1
2;cat1;val2
3;cat1;val3
4;cat1;val3
5;cat1;val3

2;cat2;val2
3;cat2;val3
5;cat2;val3
6;cat2;val3
7;cat2;val4
8;cat2;val4

Only the following should be kept because they are the first of their respective categories:

Code:

1;cat1;val1
2;cat2;val2

Further, their values should be changed to `ambiguous' because their groups contain different values.

Code:

1;cat1;ambiguous
2;cat2;ambiguous

Unfortunately, that doesn't match your original example output:

Code:

varid;category;value
1;cat1;ambiguous
3;cat1;val3
2;cat2;val2
3;cat2;val3
6;cat2;ambiguous
8;cat2;val4

Last edited by Aia; 10-10-2014 at 11:14 AM..

Aia


10-10-2014


Let me go through each category in the original sample input.

Category 1 (cat1 in column 2)

First set of duplicates: 1 and 2

Code:

1;cat1;val1
2;cat1;val2

Duplicate ids 1 and 2 have different values val1 and val2 in cat1. So we output only the first (1) and call the value ambiguous.

Code:

1;cat1;ambiguous

Moving on to the second set of duplicates: 3, 4 and 5

Code:

3;cat1;val3
4;cat1;val3
5;cat1;val3

Duplicates 3, 4 and 5 in category 1 all have the same value, val3, so we just output the first.

Code:

3;cat1;val3

Category 2 (cat2 in column 2)

First set of duplicates: 1 and 2

Code:

2;cat2;val2

Only 2 is present, not 1, so we output the row as it is.

Code:

2;cat2;val2

Moving on to the second set of duplicates: 3, 4 and 5

Code:

3;cat2;val3
5;cat2;val3

Only 3 and 5 are present (no 4), and they have the same value, val3.

So we output the first (3) with value val3.

Code:

3;cat2;val3

Moving on to the third set of duplicates: 6 and 7

Code:

6;cat2;val3
7;cat2;val4

Both 6 and 7 are present with different values, so we output the first (6) and report the value as ambiguous.

Code:

6;cat2;ambiguous

Lastly, id 8 is not duplicated in the lookup file.

Code:

8;cat2;val4

So we report it as it is.

Code:

8;cat2;val4

So the sample input file (combined from the above steps) is:

Code:

varid;category;value
1;cat1;val1
2;cat1;val2
3;cat1;val3
4;cat1;val3
5;cat1;val3
2;cat2;val2
3;cat2;val3
5;cat2;val3
6;cat2;val3
7;cat2;val4
8;cat2;val4

Desired output (the output rows from each step, combined):

Code:

1;cat1;ambiguous
3;cat1;val3
2;cat2;val2
3;cat2;val3
6;cat2;ambiguous
8;cat2;val4

---------- Post updated at 10:23 AM ---------- Previous update was at 10:10 AM ----------

Quote:

Originally Posted by Aia

According to what you are saying and using your first post input:

Code:

varid;category;value
1;cat1;val1
2;cat1;val2
3;cat1;val3
4;cat1;val3
5;cat1;val3
 
2;cat2;val2
3;cat2;val3
5;cat2;val3
6;cat2;val3
7;cat2;val4
8;cat2;val4

Only the following should be kept because they are the first of their respective categories:

Code:

1;cat1;val1
2;cat2;val2

Further, their values should be changed to `ambiguous' because their groups contain different values.

Code:

1;cat1;ambiguous
2;cat2;ambiguous

Unfortunately, that doesn't match your original example output:

Code:

varid;category;value
1;cat1;ambiguous
3;cat1;val3
2;cat2;val2
3;cat2;val3
6;cat2;ambiguous
8;cat2;val4

Aia, this will not be the case because they are different categories.
The first row belongs to cat1 and the second row belongs to cat2, so they must be treated independently.

Code:

1;cat1;ambiguous
2;cat2;ambiguous

It should be

Code:

1;cat1;ambiguous
2;cat2;val2
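The difference between the two readings comes down to the grouping key. The Python sketch below is illustrative only: it assumes the lookup table and sample rows from the first post, and the helper name `collapse` is hypothetical. It applies the same collapsing step twice, once keyed on category alone and once keyed on category plus variable name.

```python
ID_TO_NAME = {
    "1": "var1", "2": "var1",
    "3": "varx", "4": "varx", "5": "varx",
    "6": "vary", "7": "vary",
    "8": "varz",
}

ROWS = [
    ("1", "cat1", "val1"), ("2", "cat1", "val2"),
    ("3", "cat1", "val3"), ("4", "cat1", "val3"), ("5", "cat1", "val3"),
    ("2", "cat2", "val2"), ("3", "cat2", "val3"), ("5", "cat2", "val3"),
    ("6", "cat2", "val3"), ("7", "cat2", "val4"), ("8", "cat2", "val4"),
]


def collapse(rows, key_of):
    """Keep the first row per key; mark the value ambiguous if values differ."""
    groups = {}
    for varid, category, value in rows:
        group = groups.setdefault(
            key_of(varid, category),
            {"first": (varid, category), "values": []},
        )
        group["values"].append(value)

    result = []
    for group in groups.values():
        varid, category = group["first"]
        values = group["values"]
        value = values[0] if len(set(values)) == 1 else "ambiguous"
        result.append(f"{varid};{category};{value}")
    return result


# Keyed on category only: every row in a category lands in one group.
by_category = collapse(ROWS, lambda varid, category: category)

# Keyed on (category, varname): duplicates are compared only with each other.
by_category_and_name = collapse(
    ROWS, lambda varid, category: (category, ID_TO_NAME[varid])
)

print(by_category)
print(by_category_and_name)
```

With the category-only key the result is the two ambiguous rows; with the combined key it is the six rows of the desired output.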

Last edited by ritakadm; 10-10-2014 at 12:16 PM..

ritakadm

