Grab unique record from different files on a condition


 
# 1  
Old 02-07-2012

Hi,

I think this is the toughest problem I have ever come across, and I would be grateful to all of you for helping me get past it.

cat 1.txt
Quote:
chr1 100 200
chr1 300 400
chr1 350 467
chr1 450 700
chr2 500 600
chr2 345 765
chr3 101 300
chr3 132 456
cat 2.txt
Quote:
chr1 156 199
chr1 165 230
chr1 201 299
chr1 525 600
chr2 800 1000
chr2 534 676
chr2 200 400
chr2 100 200
chr3 200 400
chr3 500 600
chr3 400 700
OK, now this is what I am looking for.

Output.txt
Quote:
chr1 300 400 1.txt
chr1 350 467 1.txt
chr1 201 299 2.txt
chr2 800 1000 2.txt
chr2 100 200 2.txt
chr3 500 600 2.txt
Here is how my output has been generated.

First, column one of each file has to be matched against column one of the other file: chr1 with chr1, chr2 with chr2 and chr3 with chr3 only. Records with different values in column one are never compared.

Second, if the range given by columns 2 and 3 of a record intersects, or lies inside, the range of a record in the other file, both records have to be eliminated.

Examples from the given input:

chr1 100 200 (1.txt) intersects with chr1 156 199 (2.txt) and chr1 165 230 (2.txt), so all three are eliminated.
chr1 450 700 (1.txt) intersects with chr1 525 600 (2.txt), so these two are eliminated from the output.

Similarly,

chr2 500 600 (1.txt) intersects with chr2 534 676 (2.txt), so both are eliminated.
chr2 345 765 (1.txt) intersects with chr2 200 400 (2.txt), so both are eliminated from the output file.

The same applies to chr3. My files have different numbers of records, and they are not sorted. The last column of the output file names the file each record comes from. If you have any questions or suggestions, please post them in a reply and I will answer ASAP to clear up any doubt, which might give me a chance to kick this problem out. All your time, patience and attention are highly appreciated.
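
The elimination rule described above comes down to an overlap test between two ranges on the same chromosome. Here is a minimal shell sketch of that test. It assumes start <= end in every record, and it assumes that two ranges sharing an endpoint count as intersecting, which the post does not state either way. The function name `ranges_overlap` is made up for illustration.

```shell
# ranges_overlap START1 END1 START2 END2
# Succeeds (exit status 0) when the two closed ranges share at least one point.
# Hypothetical helper, not from the thread; assumes START <= END for both ranges.
ranges_overlap() {
  [ "$1" -le "$4" ] && [ "$3" -le "$2" ]
}

# Pairs taken from the example in the post.
ranges_overlap 100 200 156 199 && echo "chr1 100 200 and chr1 156 199 intersect"
ranges_overlap 300 400 201 299 || echo "chr1 300 400 and chr1 201 299 do not intersect"
```

Each range only has to start no later than the other one ends, so containment and partial overlap are covered by the same two comparisons.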

Thanks in advance.
# 2  
Old 02-07-2012
Try:
Code:
#!/usr/bin/perl
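$file1=shift;
$file2=shift;
open A,"<",$file1 or die "Cannot open $file1: $!\n";
open B,"<",$file2 or die "Cannot open $file2: $!\n";
# Store the low and high bound of every record, keyed by chromosome and line number.
# Reading into $line first and chomping afterwards keeps a last line without a trailing newline.
while ($line=<A>){
  chomp $line;
  @temp=split ' ',$line;   # split on any run of whitespace
  next unless @temp==3;    # skip blank or malformed lines
  $lowA{$temp[0]}{$.}=$temp[1];
  $highA{$temp[0]}{$.}=$temp[2];
}
while ($line=<B>){
  chomp $line;
  @temp=split ' ',$line;
  next unless @temp==3;
  $lowB{$temp[0]}{$.}=$temp[1];
  $highB{$temp[0]}{$.}=$temp[2];
}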
for $iA (keys %lowA){
  for $jA (keys %{$lowA{$iA}}){
    for $iB (keys %{$lowB{$iA}}){
      if ($lowA{$iA}{$jA}>=$lowB{$iA}{$iB} && $lowA{$iA}{$jA}<=$highB{$iA}{$iB}){
        $elimA{$iA}{$jA}=1;
        $elimB{$iA}{$iB}=1;
      }
      if ($highA{$iA}{$jA}>=$lowB{$iA}{$iB} && $highA{$iA}{$jA}<=$highB{$iA}{$iB}){
        $elimA{$iA}{$jA}=1;
        $elimB{$iA}{$iB}=1;
      }
    }
  }
}
for $iB (keys %lowB){
  for $jB (keys %{$lowB{$iB}}){
    for $iA (keys %{$lowA{$iB}}){
      if ($lowB{$iB}{$jB}>=$lowA{$iB}{$iA} && $lowB{$iB}{$jB}<=$highA{$iB}{$iA}){
        $elimB{$iB}{$jB}=1;
        $elimA{$iB}{$iA}=1;
      }
      if ($highB{$iB}{$jB}>=$lowA{$iB}{$iA} && $highB{$iB}{$jB}<=$highA{$iB}{$iA}){
        $elimB{$iB}{$jB}=1;
        $elimA{$iB}{$iA}=1;
      }
    }
  }
}
for $i (keys %lowA){
  for $j (keys %{$lowA{$i}}){
    print "$i $lowA{$i}{$j} $highA{$i}{$j} $file1\n" if $elimA{$i}{$j}!=1;
  }
}
for $i (keys %lowB){
  for $j (keys %{$lowB{$i}}){
    print "$i $lowB{$i}{$j} $highB{$i}{$j} $file2\n" if $elimB{$i}{$j}!=1;
  }
}

Run it as: ./script.pl 1.txt 2.txt
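
The script flags a pair of records when either endpoint of one range falls inside the other range, checked in both directions. Assuming start <= end in every record, those four endpoint checks should collapse to a single comparison: lowA <= highB and lowB <= highA. The following awk sketch is not part of the script above, and its function names are made up; it simply brute-forces both forms over small ranges and counts the cases where they disagree.

```shell
# check_equivalence: print how many range pairs the two tests disagree on.
check_equivalence() {
  awk '
    # Endpoint tests as used in the Perl script, both directions in one place.
    function endpoint_test(a, b, c, d) {
      return (a >= c && a <= d) || (b >= c && b <= d) ||
             (c >= a && c <= b) || (d >= a && d <= b)
    }
    # Single-comparison form of the same test.
    function simple_test(a, b, c, d) {
      return a <= d && c <= b
    }
    BEGIN {
      max = 8
      # Enumerate every pair of ranges [al,ah] and [bl,bh] with low <= high.
      for (al = 1; al <= max; al++)
        for (ah = al; ah <= max; ah++)
          for (bl = 1; bl <= max; bl++)
            for (bh = bl; bh <= max; bh++)
              if (endpoint_test(al, ah, bl, bh) != simple_test(al, ah, bl, bh))
                mismatches++
      print mismatches + 0
    }
  '
}
```

A result of 0 from `check_equivalence` means the two forms agree on every enumerated pair, which is why a single loop with the short test would be enough.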
This User Gave Thanks to bartus11 For This Post:
# 3  
Old 02-07-2012
Great and thanks!

Works as I wished.

Also, what if there are more than two files?

The condition stays the same: if a record intersects with a record in any of the other files, even just one, it should be eliminated.

Thanks for all your time.
# 4  
Old 02-07-2012
-- deleted --

Last edited by ctsgnb; 02-07-2012 at 08:29 PM.. Reason: Ooops i think i missed the logic of the algo
# 5  
Old 02-07-2012
Hi jacobs.smith,

Here is another solution using Perl. I think it should work with any number of input files. Give it a try. It could be more efficient, but I struggled a little to get it right, so I will be happy if it works:
Code:
$ cat 1.txt 
chr1 100 200
chr1 300 400
chr1 350 467
chr1 450 700
chr2 500 600
chr2 345 765
chr3 101 300
chr3 132 456
$ cat 2.txt
chr1 156 199
chr1 165 230
chr1 201 299
chr1 525 600
chr2 800 1000
chr2 534 676
chr2 200 400
chr2 100 200
chr3 200 400
chr3 500 600
chr3 400 700
$ cat script.pl
use warnings;
use strict;

die qq[Usage: perl $0 <input-file-1> <input-file-2> ...\n] unless @ARGV > 0;

my (@data);

while ( <> ) {
        my @f = split;
        next unless @f == 3;
        push @data, [ $ARGV, @f ]; 
}

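for my $d ( @data ) {
        # Skip this record if a record from another file, on the same
        # chromosome, overlaps it. The strict comparisons mean that ranges
        # which only touch at an endpoint are not treated as overlapping,
        # while identical ranges and ranges sharing a start or end are caught.
        if ( grep {
                        $d->[0] ne $_->[0] &&
                        $d->[1] eq $_->[1] &&
                        $d->[2] < $_->[3] &&
                        $_->[2] < $d->[3]
                } @data
        ) { 
                next;
        }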

        printf qq[%s\n], join qq[ ], @$d[1..3], $d->[0];
}
$ perl script.pl 1.txt 2.txt
chr1 300 400 1.txt
chr1 350 467 1.txt
chr1 201 299 2.txt
chr2 800 1000 2.txt
chr2 100 200 2.txt
chr3 500 600 2.txt

Regards,
Birei
# 6  
Old 02-08-2012
Thanks a lot, Birei. The script works for the two files I mentioned earlier, and I also tried it with 3 files. The 3 files and the output are given below, just for your confirmation and my satisfaction.

Thanks once again

cat 1.txt
Quote:
chr1 100 200
chr1 300 400
chr1 350 467
chr1 450 700
chr2 500 600
chr2 345 765
chr3 101 300
chr3 132 456
cat 2.txt
Quote:
chr1 156 199
chr1 165 230
chr1 201 299
chr1 525 600
chr2 800 1000
chr2 534 676
chr2 200 400
chr2 100 200
chr3 200 400
chr3 500 600
chr3 400 700
cat 3.txt
Quote:
chr1 330 420
chr1 50 60
chr1 20 30
chr1 15 20
chr1 220 299
chr2 199 300
chr3 900 1000
chr3 100 200
chr3 110 200
perl newscript.pl 1.txt 2.txt 3.txt
Quote:
chr2 800 1000 2.txt
chr3 500 600 2.txt
chr1 50 60 3.txt
chr1 20 30 3.txt
chr1 15 20 3.txt
chr3 900 1000 3.txt
Everything looks fine to me. Let me know if you see anything wrong.

Thanks
# 7  
Old 02-08-2012
Here is a solution using a shell script:
Code:
#!/bin/ksh
typeset -i mFromA mToA mFromB mToB
mF1='1.txt'
mF2='2.txt'
mPrevTag=''
#### Prints only the records of mF1 that overlap nothing in mF2;
#### swap mF1 and mF2 and run again to get the records of the other file
#### sort is used to reduce the number of "grep"
sort "${mF1}" | while read mTagA mFromA mToA; do
  if [[ "${mTagA}" != "${mPrevTag}" ]]; then
    #### match the whole first field, so that chr1 does not also pick up chr10
    grep "^${mTagA}[[:space:]]" "${mF2}" > "${mF2}.tmp"
  fi
  mFound="N"
  while read mTagB mFromB mToB; do
    if [[ ${mToA} -ge ${mFromB} && ${mFromA} -le ${mToB} ]]; then
      mFound="Y"
      break
    fi
  done < "${mF2}.tmp"
  if [[ "${mFound}" = "N" ]]; then
    echo ${mTagA} ${mFromA} ${mToA} ${mF1}
  fi
  mPrevTag=${mTagA}
done
rm -f "${mF2}.tmp"
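
For comparison, the same check can be written as one awk program that reads any number of files and reports the surviving records from all of them. This is only a sketch, not the poster's script: the function name `unique_ranges` is made up, ranges sharing an endpoint are assumed to intersect (as with the `-ge`/`-le` test above), and every record is compared with every other one, so the run time grows with the square of the number of records.

```shell
# unique_ranges FILE...
# Print every "name start end" record that overlaps no record with the same
# name in any OTHER input file, followed by the file it came from.
# Hypothetical helper for illustration; comparisons are O(n^2).
unique_ranges() {
  awk '
    NF == 3 {                        # ignore blank or malformed lines
      n++
      src[n] = FILENAME              # file the record came from
      chr[n] = $1; lo[n] = $2; hi[n] = $3
    }
    END {
      for (i = 1; i <= n; i++) {
        hit = 0
        for (j = 1; j <= n && !hit; j++)
          if (src[i] != src[j] && chr[i] == chr[j] &&
              lo[i] + 0 <= hi[j] + 0 && lo[j] + 0 <= hi[i] + 0)
            hit = 1                  # overlaps a record from another file
        if (!hit)
          print chr[i], lo[i], hi[i], src[i]
      }
    }
  ' "$@"
}
```

Called as `unique_ranges 1.txt 2.txt`, it should print the six records listed under Output.txt in post #1, and adding 3.txt as a third argument should give the six records reported in post #6.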
