Filtering duplicates based on lookup table and rules
Please help me solve the following. I have access to a Red Hat Linux cluster with 32 GB of RAM.
I have duplicate ids for variable names. In the file, ids 1 and 2 are duplicates; 3, 4 and 5 are duplicates; 6 and 7 are duplicates. My objective is to use only the first occurrence of these duplicates.
Lookup file
I need to use the following rules to filter the file below per category.
1) If all duplicate ids within a category have the same value, use the first occurrence and print the value.
example input
example output
2) If the duplicates within a category do not all have the same value, print the first occurrence and print the value as ambiguous.
example input
example output
3) If only a single id (out of duplicate ids) is present in a category, then print the row as it is.
Data sample input
Filtered sample output
Last edited by ritakadm; 10-10-2014 at 12:51 AM..
Reason: added code tags for clarity
I'm sorry I was unclear. Let me try again. Please let me know if it makes sense now.
If you look at the first lookup file, the variable name var1 has two variable ids, 1 and 2.
So the ids are duplicated. Similarly, varx has three ids: 3, 4 and 5. What I need in the output is just one of these duplicated ids.
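The lookup described above can be turned into a map from id to variable name, so that ids sharing a name are recognised as duplicates of each other. This is only a rough sketch: the two-column `<variable name> <id>` layout and the function name `build_id_to_name` are assumptions, since the lookup file's exact format is not shown here.

```python
def build_id_to_name(lookup_lines):
    """Map each variable id to its variable name.

    Assumes each non-empty line is '<variable name> <id>' separated by
    whitespace. This layout is a guess, not taken from the thread.
    """
    id_to_name = {}
    for line in lookup_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, var_id = fields[0], fields[1]
        id_to_name[var_id] = name
    return id_to_name


# Hypothetical lookup content built from the description above.
lookup = ["var1 1", "var1 2", "varx 3", "varx 4", "varx 5"]
id_to_name = build_id_to_name(lookup)
print(id_to_name["2"])  # var1
```

Two ids are duplicates of each other exactly when they map to the same name, for example `id_to_name["3"] == id_to_name["5"]`.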
The input file has a column called category. Within the same category, the ids 1 and 2 must not appear together in the output.
There is also an input column called value. Within each category, ideally all duplicated ids should have the same value.
For example, in the following line the duplicated ids 1 and 2 within category 1 have the value val1.
In this case I want to report only the first occurrence of the duplicated id. So in the output, only this line appears.
Sometimes the duplicated ids 1 and 2 do not have the same value in the input, and then I want to report the value as ambiguous. For example, in the following line, in category 1, the ids 1 and 2 have different values, val1 and val2 respectively.
Then I report only the first duplicated id, which is 1, and set the value to ambiguous.
Sometimes within a category, only one id out of 1 and 2 is present. In that case I report the row as it is.
For example, category 1 has only id 2, with value val1.
Then we output the same row.
Please note that all operations are within the same category.
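The three cases just described reduce to one decision per group of duplicated ids within a category. A minimal sketch of that decision follows, assuming each row is an `(id, category, value)` tuple in input order; the tuple layout and the name `resolve_group` are illustrative, not from the thread.

```python
def resolve_group(rows):
    """Collapse the rows of one duplicate group within one category.

    rows: non-empty list of (var_id, category, value) tuples in input order.
    Returns the first row, with its value replaced by 'ambiguous' when the
    group holds more than one distinct value.
    """
    first_id, category, first_value = rows[0]
    distinct_values = {value for _, _, value in rows}
    if len(distinct_values) > 1:
        return (first_id, category, "ambiguous")
    # Covers both "all values agree" and "only one id present".
    return (first_id, category, first_value)


# Ids 1 and 2 agree on val1: keep the first row unchanged.
print(resolve_group([("1", "1", "val1"), ("2", "1", "val1")]))
# Ids 1 and 2 disagree: keep id 1 and mark the value ambiguous.
print(resolve_group([("1", "1", "val1"), ("2", "1", "val2")]))
# Only id 2 is present: the row passes through as it is.
print(resolve_group([("2", "1", "val1")]))
```

Rule 3 needs no special handling here: a group with a single row has a single distinct value, so the row is returned untouched.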
According to what you are saying and using your first post input:
Only the following should be kept, because they are the first of their respective categories:
Further, they should have their value changed to `ambiguous' because their groups contain different values.
Unfortunately, that doesn't match your original example output:
Let me go through each category in the original sample input.
Category 1 (cat1 in column 2)
First set of duplicates: 1 and 2
Duplicate ids 1 and 2 have different values, val1 and val2, in cat1. So we output only the first (1) and call the value ambiguous.
Moving on to the second set of duplicates: 3, 4 and 5
Duplicates 3, 4 and 5 in category 1 all have the same value, val3. So we just output the first.
Category 2 (cat2 in column 2)
First set of duplicates: 1 and 2
Only 2 is present, no 1. So we output it as it is.
Moving on to the second set of duplicates: 3, 4 and 5
Only 3 and 5 are present, no 4, and they have the same value, val3.
So we output the first (3) with value val3.
Moving on to the third set of duplicates: 6 and 7
Both 6 and 7 are present with different values, so we output the first (6) and report the value as ambiguous.
Lastly, the id 8 is not duplicated in the lookup file.
So we report it as it is.
So the sample input file (combined from the above steps) is:
Desired output (the red rows from each step, combined):
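The category-by-category walkthrough above can be sketched as a single pass that groups rows by (category, variable name) and then keeps one row per group. This is an illustrative sketch, not a tested solution for the real files: the `(id, category, value)` row layout, the names `filter_rows`, `vary`, `val6`, `val7` and `val8`, and the value given to id 2 in cat2 are all assumptions, because the sample data itself is not reproduced here.

```python
def filter_rows(rows, id_to_name):
    """Keep one row per (category, variable name) group.

    rows: iterable of (var_id, category, value) tuples in input order.
    id_to_name: dict mapping id -> variable name from the lookup file.
    Ids missing from the lookup are treated as their own group.
    """
    groups = {}  # dicts keep insertion order (Python 3.7+)
    for var_id, category, value in rows:
        name = id_to_name.get(var_id, var_id)
        groups.setdefault((category, name), []).append((var_id, category, value))

    result = []
    for members in groups.values():
        first_id, category, first_value = members[0]
        distinct_values = {value for _, _, value in members}
        value = first_value if len(distinct_values) == 1 else "ambiguous"
        result.append((first_id, category, value))
    return result


# Hypothetical lookup and rows mirroring the walkthrough above.
id_to_name = {"1": "var1", "2": "var1",
              "3": "varx", "4": "varx", "5": "varx",
              "6": "vary", "7": "vary"}  # id 8 has no duplicate
rows = [
    ("1", "cat1", "val1"), ("2", "cat1", "val2"),
    ("3", "cat1", "val3"), ("4", "cat1", "val3"), ("5", "cat1", "val3"),
    ("2", "cat2", "val2"),
    ("3", "cat2", "val3"), ("5", "cat2", "val3"),
    ("6", "cat2", "val6"), ("7", "cat2", "val7"),
    ("8", "cat2", "val8"),
]
for row in filter_rows(rows, id_to_name):
    print(*row)
```

With these made-up rows the output keeps id 1 in cat1 as ambiguous, id 3 in cat1 with val3, id 2 in cat2 unchanged, id 3 in cat2 with val3, id 6 in cat2 as ambiguous, and id 8 unchanged. The sketch holds every group in memory, which is a simplification; for input sorted by category it could emit each category's groups as soon as the category changes.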
---------- Post updated at 10:23 AM ---------- Previous update was at 10:10 AM ----------
Quote:
Originally Posted by Aia
According to what you are saying and using your first post input:
Only the following should be kept, because they are the first of their respective categories:
Further, they should have their value changed to `ambiguous' because their groups contain different values.
Unfortunately, that doesn't match your original example output:
Aia, that is not the case, because they are in different categories.
The first row belongs to cat1 and the second row belongs to cat2, so they must be treated independently.
It should be: