Grab unique record from different files on a condition

02-07-2012

Banned

363, 7

Join Date: Jan 2012

Last Activity: 24 June 2017, 6:25 PM EDT

Posts: 363

Thanks Given: 318

Thanked 7 Times in 7 Posts

Grab unique record from different files on a condition

Hi,

I think this is the toughest prob

I have ever come across and I thankfully owe all of u for helping me cross this.

cat 1.txt

Quote:

chr1 100 200
chr1 300 400
chr1 350 467
chr1 450 700
chr2 500 600
chr2 345 765
chr3 101 300
chr3 132 456

cat 2.txt

Quote:

chr1 156 199
chr1 165 230
chr1 201 299
chr1 525 600
chr2 800 1000
chr2 534 676
chr2 200 400
chr2 100 200
chr3 200 400
chr3 500 600
chr3 400 700

K now. This is what I am looking for.

Output.txt

Quote:

chr1 300 400 1.txt
chr1 350 467 1.txt
chr1 201 299 2.txt
chr2 800 1000 2.txt
chr2 100 200 2.txt
chr3 500 600 2.txt

Here is how my output has been generated.

First, the column one of each file has to be matched to column one of other files, like chr1 to chr1, chr2 to chr2 and chr3 to chr3 only. No different column values has to be matched.

Second, if a particular range of column 2 and 3 intersects/comes in between the range of column 2 and 3 of the other file, they have to be eliminated.

Examples from the given input:

chr1 100 200(1.txt) intersects with chr1 156 199(2.txt), chr1 165 230(2.txt). So, they are eliminated.
chr1 450 700(1.txt) intersects with chr1 525 600(2.txt). So, these two are eliminated from the output.

Similarly,

chr2 500 600(1.txt) intersects with chr2 534 676(2.txt). So, it is eliminated.
chr2 345 765(1.txt) intersects with chr2 200 400(2.txt). So, it is eliminated from the output file.

Same is the case for chr3 too. My files have different number of records in each of them which are not sorted. The last column in the output file indicates the file from which the record originates. If you have any questions or suggestions, please write in the reply and I shall reply ASAP to clarify your doubt that might give me a chance to kick this problem out. All your time, patience and attention are highly appreciated.

Thanks in advance.

jacobs.smith

View Public Profile for jacobs.smith

Find all posts by jacobs.smith

02-07-2012

Registered User

3,733, 1,154

Join Date: Apr 2009

Last Activity: 3 August 2016, 11:03 AM EDT

Posts: 3,733

Thanks Given: 7

Thanked 1,154 Times in 1,124 Posts

Try:

Code:

#!/usr/bin/perl
$file1=shift;
$file2=shift;
open A,"$file1";
open B,"$file2";
while (chomp($line=<A>)){
  @temp=split / /,$line;
  $lowA{$temp[0]}{$.}=$temp[1];
  $highA{$temp[0]}{$.}=$temp[2];
}
while (chomp($line=<B>)){
  @temp=split / /,$line;
  $lowB{$temp[0]}{$.}=$temp[1];
  $highB{$temp[0]}{$.}=$temp[2];
}
for $iA (keys %lowA){
  for $jA (keys %{$lowA{$iA}}){
    for $iB (keys %{$lowB{$iA}}){
      if ($lowA{$iA}{$jA}>=$lowB{$iA}{$iB} && $lowA{$iA}{$jA}<=$highB{$iA}{$iB}){
        $elimA{$iA}{$jA}=1;
        $elimB{$iA}{$iB}=1;
      }
      if ($highA{$iA}{$jA}>=$lowB{$iA}{$iB} && $highA{$iA}{$jA}<=$highB{$iA}{$iB}){
        $elimA{$iA}{$jA}=1;
        $elimB{$iA}{$iB}=1;
      }
    }
  }
}
for $iB (keys %lowB){
  for $jB (keys %{$lowB{$iB}}){
    for $iA (keys %{$lowA{$iB}}){
      if ($lowB{$iB}{$jB}>=$lowA{$iB}{$iA} && $lowB{$iB}{$jB}<=$highA{$iB}{$iA}){
        $elimB{$iB}{$jB}=1;
        $elimA{$iB}{$iA}=1;
      }
      if ($highB{$iB}{$jB}>=$lowA{$iB}{$iA} && $highB{$iB}{$jB}<=$highA{$iB}{$iA}){
        $elimB{$iB}{$jB}=1;
        $elimA{$iB}{$iA}=1;
      }
    }
  }
}
for $i (keys %lowA){
  for $j (keys %{$lowA{$i}}){
    print "$i $lowA{$i}{$j} $highA{$i}{$j} $file1\n" if $elimA{$i}{$j}!=1;
  }
}
for $i (keys %lowB){
  for $j (keys %{$lowB{$i}}){
    print "$i $lowB{$i}{$j} $highB{$i}{$j} $file2\n" if $elimB{$i}{$j}!=1;
  }
}

Run it as: ./script.pl 1.txt 2.txt

This User Gave Thanks to bartus11 For This Post:

bartus11

View Public Profile for bartus11

Find all posts by bartus11

02-07-2012

Banned

363, 7

Join Date: Jan 2012

Last Activity: 24 June 2017, 6:25 PM EDT

Posts: 363

Thanks Given: 318

Thanked 7 Times in 7 Posts

Quote:

Originally Posted by bartus11

Try:

Code:

#!/usr/bin/perl
$file1=shift;
$file2=shift;
open A,"$file1";
open B,"$file2";
while (chomp($line=<A>)){
  @temp=split / /,$line;
  $lowA{$temp[0]}{$.}=$temp[1];
  $highA{$temp[0]}{$.}=$temp[2];
}
while (chomp($line=<B>)){
  @temp=split / /,$line;
  $lowB{$temp[0]}{$.}=$temp[1];
  $highB{$temp[0]}{$.}=$temp[2];
}
for $iA (keys %lowA){
  for $jA (keys %{$lowA{$iA}}){
    for $iB (keys %{$lowB{$iA}}){
      if ($lowA{$iA}{$jA}>=$lowB{$iA}{$iB} && $lowA{$iA}{$jA}<=$highB{$iA}{$iB}){
        $elimA{$iA}{$jA}=1;
        $elimB{$iA}{$iB}=1;
      }
      if ($highA{$iA}{$jA}>=$lowB{$iA}{$iB} && $highA{$iA}{$jA}<=$highB{$iA}{$iB}){
        $elimA{$iA}{$jA}=1;
        $elimB{$iA}{$iB}=1;
      }
    }
  }
}
for $iB (keys %lowB){
  for $jB (keys %{$lowB{$iB}}){
    for $iA (keys %{$lowA{$iB}}){
      if ($lowB{$iB}{$jB}>=$lowA{$iB}{$iA} && $lowB{$iB}{$jB}<=$highA{$iB}{$iA}){
        $elimB{$iB}{$jB}=1;
        $elimA{$iB}{$iA}=1;
      }
      if ($highB{$iB}{$jB}>=$lowA{$iB}{$iA} && $highB{$iB}{$jB}<=$highA{$iB}{$iA}){
        $elimB{$iB}{$jB}=1;
        $elimA{$iB}{$iA}=1;
      }
    }
  }
}
for $i (keys %lowA){
  for $j (keys %{$lowA{$i}}){
    print "$i $lowA{$i}{$j} $highA{$i}{$j} $file1\n" if $elimA{$i}{$j}!=1;
  }
}
for $i (keys %lowB){
  for $j (keys %{$lowB{$i}}){
    print "$i $lowB{$i}{$j} $highB{$i}{$j} $file2\n" if $elimB{$i}{$j}!=1;
  }
}

Run it as: ./script.pl 1.txt 2.txt

Great and thanks!

Works as I wished.

Also, what if there are multiple files, like say I have more than two files?

But, the condition is still the same, if a record intersects with another record in any of the other files, even one, then it should be eliminated.

Thanks for all your time.

jacobs.smith

View Public Profile for jacobs.smith

Find all posts by jacobs.smith

02-07-2012

Registered User

2,977, 644

Join Date: Oct 2010

Last Activity: 14 September 2019, 1:15 PM EDT

Location: France

Posts: 2,977

Thanks Given: 88

Thanked 644 Times in 613 Posts

-- deleted --

Last edited by ctsgnb; 02-07-2012 at 08:29 PM.. Reason: Ooops i think i missed the logic of the algo

ctsgnb

View Public Profile for ctsgnb

Find all posts by ctsgnb

02-07-2012

Registered User

628, 174

Join Date: Oct 2010

Last Activity: 2 December 2017, 5:58 AM EST

Location: Madrid, Spain

Posts: 628

Thanks Given: 8

Thanked 174 Times in 171 Posts

Hi jacobs.smith,

Other solution using perl. I think it should work with any number of input files. Give it a try. It could be more efficient, but I struggled a little to get it, so if it works, I will be happy for that:

Code:

$ cat 1.txt 
chr1 100 200
chr1 300 400
chr1 350 467
chr1 450 700
chr2 500 600
chr2 345 765
chr3 101 300
chr3 132 456
$ cat 2.txt
chr1 156 199
chr1 165 230
chr1 201 299
chr1 525 600
chr2 800 1000
chr2 534 676
chr2 200 400
chr2 100 200
chr3 200 400
chr3 500 600
chr3 400 700
$ cat script.pl
use warnings;
use strict;

die qq[Usage: perl $0 <input-files-1> <input-file-2> ...\n] unless @ARGV > 0;

my (@data);

while ( <> ) {
        my @f = split;
        next unless @f == 3;
        push @data, [ $ARGV, @f ]; 
}

for my $d ( @data ) {
        if ( grep {
                        $d->[0] ne $_->[0] &&
                        $d->[1] eq $_->[1] &&
                        ($d->[2] < $_->[2] &&
                        $d->[3] > $_->[2]
                                        ||
                        $d->[2] < $_->[3] &&
                        $d->[3] > $_->[3]
                                        ||
                        $d->[2] > $_->[2] &&
                        $d->[3] < $_->[3])

                } @data
        ) { 
                next;
        }

        printf qq[%s\n], join qq[ ], @$d[1..3], $d->[0];
}
$ perl script.pl 1.txt 2.txt
chr1 300 400 1.txt
chr1 350 467 1.txt
chr1 201 299 2.txt
chr2 800 1000 2.txt
chr2 100 200 2.txt
chr3 500 600 2.txt

Regards,
Birei

This User Gave Thanks to birei For This Post:

birei

View Public Profile for birei

Find all posts by birei

02-08-2012

Banned

363, 7

Join Date: Jan 2012

Last Activity: 24 June 2017, 6:25 PM EDT

Posts: 363

Thanks Given: 318

Thanked 7 Times in 7 Posts

Quote:

Originally Posted by birei

Code:

$ cat 1.txt 
chr1 100 200
chr1 300 400
chr1 350 467
chr1 450 700
chr2 500 600
chr2 345 765
chr3 101 300
chr3 132 456
$ cat 2.txt
chr1 156 199
chr1 165 230
chr1 201 299
chr1 525 600
chr2 800 1000
chr2 534 676
chr2 200 400
chr2 100 200
chr3 200 400
chr3 500 600
chr3 400 700
$ cat script.pl
use warnings;
use strict;

die qq[Usage: perl $0 <input-files-1> <input-file-2> ...\n] unless @ARGV > 0;

my (@data);

while ( <> ) {
        my @f = split;
        next unless @f == 3;
        push @data, [ $ARGV, @f ]; 
}

for my $d ( @data ) {
        if ( grep {
                        $d->[0] ne $_->[0] &&
                        $d->[1] eq $_->[1] &&
                        ($d->[2] < $_->[2] &&
                        $d->[3] > $_->[2]
                                        ||
                        $d->[2] < $_->[3] &&
                        $d->[3] > $_->[3]
                                        ||
                        $d->[2] > $_->[2] &&
                        $d->[3] < $_->[3])

                } @data
        ) { 
                next;
        }

        printf qq[%s\n], join qq[ ], @$d[1..3], $d->[0];
}
$ perl script.pl 1.txt 2.txt
chr1 300 400 1.txt
chr1 350 467 1.txt
chr1 201 299 2.txt
chr2 800 1000 2.txt
chr2 100 200 2.txt
chr3 500 600 2.txt

Regards,
Birei

Thanks a lot Birei. The script works for the two files I have mentioned earlier before. And I even tried using it with 3 files. The 3 files and their output has been given below just for your confirmation and my satisfaction

Thanks once again

cat 1.txt

Quote:

chr1 100 200
chr1 300 400
chr1 350 467
chr1 450 700
chr2 500 600
chr2 345 765
chr3 101 300
chr3 132 456

cat 2.txt

Quote:

chr1 156 199
chr1 165 230
chr1 201 299
chr1 525 600
chr2 800 1000
chr2 534 676
chr2 200 400
chr2 100 200
chr3 200 400
chr3 500 600
chr3 400 700

cat3.txt

Quote:

chr1 330 420
chr1 50 60
chr1 20 30
chr1 15 20
chr1 220 299
chr2 199 300
chr3 900 1000
chr3 100 200
chr3 110 200

perl newscript.pl 1.txt 2.txt 3.txt

Quote:

chr2 800 1000 2.txt
chr3 500 600 2.txt
chr1 50 60 3.txt
chr1 20 30 3.txt
chr1 15 20 3.txt
chr3 900 1000 3.txt

I find everything to be smooth. Let me know if you see anything.

Thanks

jacobs.smith

View Public Profile for jacobs.smith

Find all posts by jacobs.smith

02-08-2012

Registered User

1,203, 103

Join Date: Mar 2007

Last Activity: 28 January 2020, 10:33 PM EST

Location: Orlando, Florida

Posts: 1,203

Thanks Given: 1

Thanked 103 Times in 100 Posts

Here is a solution using shell scripts:

Code:

#!/bin/ksh
typeset -i mFromA mToA mFromB mToB
mF1='1.txt'
mF2='2.txt'
mPrevTag=''
#### sort is used to reduce the number of "grep"
sort ${mF1} | while read mTagA mFromA mToA; do
  if [[ "${mTagA}" != "${mPrevTag}" ]]; then
    grep "${mTagA}" ${mF2} > ${mF2}.tmp
  fi
  mFound="N"
  while read mTagB mFromB mToB; do
    if [[ ${mToA} -ge ${mFromB} && ${mFromA} -le ${mToB} ]]; then
      mFound="Y"
      break
    fi
  done < ${mF2}.tmp
  if [[ "${mFound}" = "N" ]]; then
    echo ${mTagA} ${mFromA} ${mToA} ${mF1}
  fi
  mPrevTag=${mTagA}
done

Shell_Life

View Public Profile for Shell_Life

Find all posts by Shell_Life

Shell Programming and Scripting

Grab unique record from different files on a condition

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

CSV File:Filter duplicate records from column1 & another column having unique record

Discussion started by: as7951

2. UNIX for Beginners Questions & Answers

Remove footer record in specific condition

Discussion started by: KK230689

3. UNIX for Dummies Questions & Answers

FTP mget will only grab files not folders

Discussion started by: mpatton

4. Shell Programming and Scripting

Output first unique record in csv file

Discussion started by: Chris LAU

5. Shell Programming and Scripting

Replace string, grab files, rename and move

Discussion started by: Blue Solo

6. Shell Programming and Scripting

compare 2 files and return unique lines in each file (based on condition)

Discussion started by: anurupa777

7. Shell Programming and Scripting

[AWK script]Counting the character in record and print them in condition

Discussion started by: Antonlee

8. Shell Programming and Scripting

Help with File processing - Adding predefined text to particular record based on condition

Discussion started by: karumudi7

9. Shell Programming and Scripting

Managing sequence to make unique record

Discussion started by: gehlnar

10. Shell Programming and Scripting

ksh scripting: Extract 1 most recent record for unique key

Discussion started by: OPTIMUS_prime