Compare within same group

10-14-2014

Registered User

1,781, 705

Join Date: May 2008

Last Activity: 10 November 2021, 5:38 PM EST

Posts: 1,781

Thanks Given: 62

Thanked 705 Times in 653 Posts

In case you do not mind using Perl:

Code:

#!/usr/bin/perl
# senhia83.pl

use strict;
use warnings;

my %group_first_line; # first line seen for each group
my %group_values;     # value found on that first line

# read every input line
while(<>) {
    chomp; # remove the trailing newline, if any
    # extract group and value
    my ($group, $value) = (split)[1,2];
    # if the group has not been seen yet, record its
    # first line and value
    if ( not exists $group_first_line{$group} ) {
        $group_first_line{$group} = $_;
        $group_values{$group} = $value;
        next; # go on to the next line
    }
    # if the value differs from the group's first value, replace it with "missing"
    if ($value ne $group_values{$group}) {
        $group_first_line{$group} =~ s/\s\w+?$/ missing/;
    }
}
# display the result (in hash order, not input order)
for my $group (keys %group_values) {
    print "$group_first_line{$group}\n";
}

Usage:
Save the code as senhia83.pl and run it.

Code:

perl senhia83.pl file

Aia

10-14-2014

Registered User

174, 45

Join Date: Oct 2014

Last Activity: 8 April 2019, 3:29 PM EDT

Posts: 174

Thanks Given: 78

Thanked 45 Times in 45 Posts

Hi Ravinder,

I tested your code with the following dataset:

Code:

 
$ cat test3
name2 id1group1 value1
name4 id1group1 value2
name1 id2group1 value2
name2 id2group1 value2
name4 id2group1 value2
name1 id1group2 value1
name2 id1group2 value2

What I am getting:

Code:

 
name4 id1group1 missing
name1 id2group1 value2
name2 id1group2 missing

The first column is not correct.
What I should get:

Code:

 
name2 id1group1 missing
name1 id2group1 value2
name1 id1group2 missing

Aia, your Perl script works great. Can it be modified slightly to use a tab-delimited input file?

senhia83

10-14-2014

Quote:

Originally Posted by senhia83

[...]
Aia, your Perl script works great. Can it be modified slightly to use a tab-delimited input file?

It already uses tabs, or any other run of whitespace, as the delimiter.

This portion does the job:

Code:

# extract group and value
    my ($group, $value) = (split)[1,2];

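Now, if you want to split on tabs only, change the split to the following

Code:

my ($group, $value) = (split /\t/)[1,2];

Also, change the substitution so that it writes a tab before "missing"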

Code:

$group_first_line{$group} =~ s/\s\w+?$/\tmissing/;

---------- Post updated at 10:21 AM ---------- Previous update was at 10:14 AM ----------

Better yet, leave the split alone and change only the substitution, so that it puts back whatever whitespace character was already there:

Code:

$group_first_line{$group} =~ s/(\s)\w+?$/$1missing/;
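The two ways of splitting only differ when a field itself contains spaces. A small awk sketch of the same distinction (awk's default field splitting, like Perl's bare split, breaks on any run of whitespace); the sample line and the helper name count_fields are made up for this illustration:

```shell
# count_fields is a hypothetical helper, not part of the scripts above.
# "ws"  -> default splitting on any run of whitespace
# "tab" -> splitting on tabs only
count_fields() {
    if [ "$1" = tab ]; then
        awk -F '\t' '{ print NF }'
    else
        awk '{ print NF }'
    fi
}

# A made-up line whose first field contains a space.
printf 'John Smith\tgroup1\tvalue1\n' | count_fields ws    # prints 4
printf 'John Smith\tgroup1\tvalue1\n' | count_fields tab   # prints 3
```

With whitespace splitting, the group would be read from the wrong column for such a line, so tab-only splitting is the safer choice if names can contain spaces.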

Aia

10-15-2014

Can the code be made more efficient? Will it help if the data is sorted by the second column? It has been churning through 140 million records for some time now.

senhia83

10-15-2014

Sorting is expensive.
The issue here is that state must be kept until all the data has been read, and that takes a lot of memory.
Sorting might help reduce the memory if, inside the loop, the lines of one group can be processed and printed and the hash then reset. But we might just be trading one burden for another.

Here's a version that reduces the memory footprint by eliminating the second hash, drops the regex substitution, and stops comparing values for a group once it has been marked as missing.

Hopefully that helps.

Code:

#!/usr/bin/perl

use strict;
use warnings;

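my %records; # group => [id, group, value] taken from the group's first line

while(<>) {
    chomp;

    my ($id, $group, $value) = split;

    # first time this group is seen: remember its first line
    if ( not exists $records{$group} ) {
        $records{$group} = [$id, $group, $value];
        next;
    }
    # group already flagged, nothing more to check
    next if $records{$group}->[2] eq "missing";
    # a differing value flags the whole group
    if ($records{$group}->[2] ne $value) {
        $records{$group}->[2] = "missing";
    }
}

# print the fields tab-separated (in hash order, not input order)
for my $group (keys %records) {
    print join("\t", @{$records{$group}}), "\n";
}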
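On the sorting idea: if the lines of each group are already adjacent (for example after a stable sort on the second column, such as GNU sort -s -k2,2, which keeps the original order within a group), only one group has to be held in memory at a time. A rough awk sketch of that, not benchmarked against the 140 million records; the function name flag_sorted_groups and the variable names are made up for this illustration:

```shell
# Assumption: all lines of a group are adjacent in the input.
flag_sorted_groups() {
    awk '
    NR == 1 || $2 != grp {                 # first line, or a new group starts
        if (NR > 1) print id, grp, val     # emit the group that just ended
        id = $1; grp = $2 ""; first = $3 ""; val = first
        next
    }
    $3 != first { val = "missing" }        # value differs from the first one
    END { if (NR > 0) print id, grp, val } # emit the last group
    ' "$@"
}
```

It reads standard input or file arguments, e.g. flag_sorted_groups file, and prints each group as soon as the next one starts, in input order. Whether the time spent sorting is worth the memory saved would have to be measured.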

Last edited by Aia; 10-15-2014 at 07:47 PM.. Reason: grammar

Aia

10-15-2014

I tested with a small set and it worked fine. Running it on the main data now; I will get back to you with fresh troubles.

senhia83

10-16-2014

Moderator

3,105, 1,603

Join Date: May 2013

Last Activity: 31 August 2020, 1:46 AM EDT

Location: Chennai

Posts: 3,105

Thanks Given: 1,269

Thanked 1,603 Times in 1,369 Posts

Quote:

Originally Posted by senhia83

Hi Ravinder,
I tested your code with the following dataset
[...]
The first column is not correct.
[...]

Hello senhia83,

Kindly try the following code. I have tested it with your input file as well as with my own test input file. I hope this helps, and I will be happy if it works for you.

Input file1:

Code:

cat group_test1
name2 id1group1 value1
name4 id1group1 value2
name1 id2group1 value2
name2 id2group1 value2
name4 id2group1 value2
name1 id1group2 value1
name2 id1group2 value2

Code as follows:

Code:

awk 'NR==1{X=$2;S[$2]=$0;} {if( X != $2 ){if(!S[$2]){S[$2]=$0;}}} {if( X == $2){if( Y != $3 ){split(S[$2],D," ");D[3]="missing";S[$2]=D[1] OFS D[2] OFS D[3];}}} {X=$2;Y=$3} END{for(u in S){print S[u]}}' group_test1

Output will be as follows.

Code:

name1 id2group1 value2
name2 id1group1 missing
name1 id1group2 missing

Now the results with my previous test file:
Input file2:

Code:

cat group_test
name2 group1 value1
name1 group2 value1
name4 group1 value2
name2 group3 value2
name3 group3 value2
name2 group2 value2
name3 group2 value1
name1 group4 value1
name2 group4 value1
name1 group4 value1
name4 group4 value2
name2 group5 value2
name3 group5 value2
name2 group5 value2
name3 group5 value1
name3 group6 value1
name3 group6 value1

Code is as follows.

Code:

awk 'NR==1{X=$2;S[$2]=$0;} {if( X != $2 ){if(!S[$2]){S[$2]=$0;}}} {if( X == $2){if( Y != $3 ){split(S[$2],D," ");D[3]="missing";S[$2]=D[1] OFS D[2] OFS D[3];}}} {X=$2;Y=$3} END{for(u in S){print S[u]}}' group_test

Output is as follows.

Code:

name2 group1 missing
name1 group2 missing
name2 group3 value2
name1 group4 missing
name2 group5 missing
name3 group6 value1

EDIT: Adding a non-one-liner form of the solution too.

Code:

awk 'NR==1 {
    X=$2; S[$2]=$0
}
{
    if (X != $2) {
        if (!S[$2]) {
            S[$2]=$0
        }
    }
}
{
    if (X == $2) {
        if (Y != $3) {
            split(S[$2], D, " ")
            D[3]="missing"
            S[$2]=D[1] OFS D[2] OFS D[3]
        }
    }
}
{ X=$2; Y=$3 }
END {
    for (u in S) { print S[u] }
}' group_test  ## Your input file name ##
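Note: as far as I can tell, the awk above compares each line with the line directly before it (X and Y), so it can miss a differing value when the lines of a group are not adjacent, and because Y is still empty on the first line, the first group always ends up flagged. A minimal sketch of a variant that compares every line with the first value stored for its group instead; the function name flag_groups and the arrays id, first, val and order are made up for this illustration:

```shell
# Works on unsorted input; keeps one small record per group in memory.
flag_groups() {
    awk '
    !($2 in first) {                 # first line of this group
        id[$2] = $1
        first[$2] = $3 ""            # "" forces string comparison later
        val[$2] = first[$2]
        order[++n] = $2              # remember first-seen order
        next
    }
    $3 != first[$2] { val[$2] = "missing" }
    END {
        for (i = 1; i <= n; i++) {
            g = order[i]
            print id[g], g, val[g]
        }
    }
    ' "$@"
}
```

Run it as flag_groups group_test. The groups come out in the order in which they first appear, separated by single spaces.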

Thanks,
R. Singh

Last edited by RavinderSingh13; 10-16-2014 at 11:53 AM.. Reason: Added a non one liner form of solution

RavinderSingh13
