Unique entries based on a range of numbers.


 
# 1  
Old 01-25-2014

Hi,

I have a matrix like this:

Code:
Algorithm	predicted_gene	start_point	end_point
A	x	65	85
B	x	70	80
C	x	75	85
D	x	10	20
B	y	125	130
C	y	120	140
D	y	200	210

There are four tab-separated columns. The first column is the algorithm used for the prediction; there are four of them, A to D. The second column is the predicted target (a gene), here x or y. The third and fourth columns give the start and end of the predicted site within the gene's sequence.

I need to collapse the entries in column 2 into unique rows based on the common range in columns 3 and 4, something like this:

Code:
Algorithm	predicted_gene	start_point	end_point	Number_of_algorithms_predicting_this_site
A, B, C	x	65	85	3
D	x	10	20	1
B, C	y	120	140	2
D	y	200	210	1

Here, for example, the first line lists algorithms A, B and C, which all predict gene x at positions that fall into the same site: positions 70-80 (algorithm B) and 75-85 (algorithm C) both lie inside the range predicted by algorithm A, which is 65-85. The last column indicates how many algorithms predicted this site. In contrast, the site predicted by algorithm D for gene x does not overlap with the others, so it is presented on a separate line. The results for gene y follow the same logic.
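The grouping rule described above could be sketched roughly as follows. This is only one possible reading, not a solution tested on every case: it assumes a clean tab-separated file with one header line, and ranges that are either nested or disjoint (a partially overlapping range simply starts a new group). The function name merge_sites is made up for illustration.

```shell
#!/bin/sh
# merge_sites FILE  (hypothetical helper name, not from the thread)
# Assumes FILE is tab-separated with exactly one header line.
merge_sites() {
    tab=$(printf '\t')
    # Drop the header, then order rows by gene, start ascending, end descending,
    # so that an enclosing range always comes before the ranges nested inside it.
    tail -n +2 "$1" |
    sort -t "$tab" -k2,2 -k3,3n -k4,4nr |
    awk -F '\t' -v OFS='\t' '
        BEGIN {
            print "Algorithm", "predicted_gene", "start_point", "end_point", "Number_of_algorithms_predicting_this_site"
        }
        # Print the group collected so far, if there is one.
        function emit() {
            if (count > 0)
                print algs, gene, lo, hi, count
        }
        {
            if (count > 0 && $2 == gene && $3 >= lo && $4 <= hi) {
                # Same gene and the range lies inside the current site: join it.
                algs = algs ", " $1
                count++
            } else {
                # Different gene or range not contained: close the group, open a new one.
                emit()
                algs = $1; gene = $2; lo = $3 + 0; hi = $4 + 0; count = 1
            }
        }
        END { emit() }
    '
}
```

Called as merge_sites file, this should print the header followed by one line per site. Rows come out in sorted order, so for gene y the algorithms are listed as C, B rather than B, C.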

Hope this is clear.

Thank you in advance.

Last edited by flyfisherman; 01-25-2014 at 02:42 PM..
# 2  
Old 01-25-2014
Quote:
Originally Posted by flyfisherman
Hi,

I have a matrix like this:

Code:
Algorithm	prediction	Lower range	Upper range
A	x	65	85
B	x	70	80
C	x	75	85
D	x	10	20
B	y	125	130
C	y	120	140
D	y	200	210

I need to make the entries in column 2 unique based on the number range in columns 3 and 4, and to indicate how many and which algorithms produced the prediction, something like this:

Code:
Algorithm	Prediction	Lower range	Upper range	Repeats
A, B, C	x	65	85	3
D	x	10	20	1
B, C	y	120	140	2
D	y	200	210	1

Thank you in advance.

Your description is not clear to me. Please explain how you arrive at the expected result shown above, and describe your algorithm.
# 3  
Old 01-25-2014
Quote:
Originally Posted by Akshay Hegde
Your description is not clear to me. Please explain how you arrive at the expected result shown above, and describe your algorithm.
I apologize if it was not clear; I'll modify the original post.
# 4  
Old 01-25-2014
Here is an awk approach:
Code:
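awk '
        BEGIN {
                print "Algorithm\tPrediction\tLower range"
        }
        # Return 1 if the current algorithm ($1) is already listed for key a.
        function checkIDX(a,    n, i, T)
        {
                n = split ( I[a], T, "," )
                for ( i = 1; i <= n; i++ )
                {
                        if ( T[i] == $1 )
                                return 1
                }
                return 0
        }
        NR > 1 {
                if ( $2 in A )
                {
                        # R[2] and R[3] hold the stored start and end for this prediction.
                        split ( A[$2], R )
                        if ( $3 >= R[2] && $4 <= R[3] )
                        {
                                # The new range lies inside the stored range.
                                L[$2]++
                                if ( checkIDX($2) != 1 )
                                        I[$2] = I[$2] OFS $1
                        }
                        else if ( $3 <= R[2] && $4 >= R[3] )
                        {
                                # The new range encloses the stored range: widen it.
                                L[$2]++
                                if ( checkIDX($2) != 1 )
                                        I[$2] = I[$2] OFS $1
                                A[$2] = $2 "\t" $3 "\t" $4
                        }
                        else if ( ( $3 > R[3] ) || ( $4 < R[2] ) )
                        {
                                # Disjoint range: print the finished group and start a new one.
                                print I[$2] "\t" A[$2] "\t" L[$2]
                                A[$2] = $2 "\t" $3 "\t" $4
                                L[$2] = 1
                                I[$2] = $1
                        }
                        # Partially overlapping ranges match no branch and are skipped.
                }
                else
                {
                        # First record for this prediction.
                        A[$2] = $2 "\t" $3 "\t" $4
                        L[$2]++
                        I[$2] = $1
                }
        }
        END {
                # Print the group still open for each prediction (order is unspecified).
                for ( k in I )
                {
                        print I[k] "\t" A[k] "\t" L[k]
                }
        }
' OFS=, file

Input
Code:
Algorithm       prediction
A       x       65      85
B       x       70      80
C       x       75      85
D       x       10      20
B       y       125     130
C       y       120     140
D       y       200     210

Output
Code:
Algorithm       Prediction      Lower range
A,B,C   x       65      85      3
B,C     y       120     140     2
D       x       10      20      1
D       y       200     210     1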

# 5  
Old 01-25-2014
Thank you Yoda for your time. Actually I edited my first post, since it was said not to be clear. In the input file I have four columns, with four headers, and in the output there is one more column, so five columns, and all are tab-delimited. Could you please modify your script based on this? Thanks
# 6  
Old 01-25-2014
Quote:
Originally Posted by flyfisherman
Thank you Yoda for your time. Actually I edited my first post, since it was said not to be clear. In the input file I have four columns, with four headers, and in the output there is one more column, so five columns, and all are tab-delimited. Could you please modify your script based on this? Thanks
Add your headers by editing the print statement in the BEGIN block. Please try to learn: 99.99% of the code has already been provided, and if you can't edit a small piece of header information, there is little more I can say. Don't expect others to complete your task; put in a little effort.
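For illustration, the header change amounts to a print statement along these lines. This is a minimal standalone sketch using the five column names from the edited first post; the wrapper name print_header is made up here, and in Yoda's script only the print line inside BEGIN would need to change.

```shell
#!/bin/sh
# print_header is a hypothetical wrapper so the snippet can run on its own.
print_header() {
    awk 'BEGIN {
            # Five tab-separated headers, taken from the edited first post.
            print "Algorithm\tpredicted_gene\tstart_point\tend_point\tNumber_of_algorithms_predicting_this_site"
    }'
}

print_header
```

The tabs are written out as "\t" rather than relying on OFS, because the script above sets OFS to a comma for joining the algorithm names.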
# 7  
Old 01-26-2014
Perl approach:
Code:
#!/usr/bin/perl
use strict;
use warnings;

open my $input, "<", $ARGV[0] or die "cannot open file $ARGV[0]: $!";

my %ranges;
while (my $line = <$input>) {
  next if $. == 1;
  chomp $line;
  my ($alg, $pred, $lower, $upper) = split /[ \t]+/, $line;
  # Find a stored range that contains this start point (the gene name is not compared).
  my $range = (grep {$lower>=(split /:/, $_)[0] && $lower<=(split /:/, $_)[1]} keys %ranges)[0];
  if ( !$range ) {
    push @{$ranges{"$lower:$upper"}{algs}}, $alg;
    $ranges{"$lower:$upper"}{pred} = $pred;
    search_and_include($lower, $upper, \%ranges);
  } else {
    push @{$ranges{$range}{algs}}, $alg;
    $ranges{$range}{pred} = $pred;
  }
}

# Print the header once, before the result rows.
print "Algorithm\tpredicted_gene\tstart_point\tend_point\tNumber_of_algorithms_predicting_this_site\n";
foreach my $range (keys %ranges) {
  my $algs = join ", ", @{$ranges{$range}{algs}};
  my $algs_count = scalar @{$ranges{$range}{algs}};
  my ($lower, $upper) = split /:/, $range;
  print join "\t", $algs, $ranges{$range}{pred}, $lower, $upper, $algs_count;
  print "\n";
}

sub search_and_include {
  my ($lower_inc, $upper_inc, $ranges) = @_;
  foreach my $range (keys %ranges) {
    my ($lower, $upper) = split /:/, $range;
    if ($lower >= $lower_inc && $upper <= $upper_inc && ($lower ne $lower_inc || $upper ne $upper_inc)) {
      push @{$ranges{"$lower_inc:$upper_inc"}{algs}}, @{$ranges{$range}{algs}};
      delete $ranges{$range};
    }
  }
}

Run it like this:
Code:
./script.pl file
