Getting non unique lines from concatenated files

03-31-2011

Registered User

3,733, 1,154

Join Date: Apr 2009

Last Activity: 3 August 2016, 11:03 AM EDT

Posts: 3,733

Thanks Given: 7

Thanked 1,154 Times in 1,124 Posts

Try this script:

Code:

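#!/usr/bin/perl
use strict;
use warnings;

# Usage: script.pl input_file
open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
my (%r, %g);
while (<$in>) {
  my @F = split;                  # split the line on whitespace
  next unless @F >= 6;            # skip blank or truncated lines
  $r{$F[0]}{$F[1]} = $F[4];       # Ref: 5th field, keyed by name and position
  $g{$F[0]}{$F[1]} = $F[5];       # Gen: 6th field, keyed the same way
}
close $in;
for my $i (sort keys %r) {
  my @x = sort { $a <=> $b } keys %{ $r{$i} };   # positions in numeric order
  print "$i\n";
  print "$x[0]-$x[$#x]\n";
  print "Ref:\n";
  print $r{$i}{$_} for @x;
  print "\n\n";
  print "Gen\n";
  print $g{$i}{$_} for @x;
  print "\n\n";
}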


bartus11


04-01-2011

Registered User

164, 1

Join Date: Mar 2011

Last Activity: 6 August 2015, 12:14 AM EDT

Posts: 164

Thanks Given: 119

Thanked 1 Time in 1 Post

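Thanks for your help, Bartus. However, I noticed two things in my file that keep the code from working perfectly at the moment: some lines have empty third and fourth columns (see position 3184 below), and the positions are not always contiguous (they jump from 3194 to 5503). The sample lines are below, followed by the current output and my expected output: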

Code:

SK1.chr10       3181    20      20      C       C       1.000000        h4,h10,h21,h22,m6       3       3       3       0       20      0       0       -1      
SK1.chr10       3182    02      02      C       C       1.000000        h4,h10,h21,h22,m6       3       3       3       0       22      0       0       -1      
SK1.chr10       3183    21      21      T       T       1.000000        h4,h10,h9,h21,h22,m6    3       2       2       0       12      0       0       -1      
SK1.chr10       3184                    G       N       -1.000000       h15,h20,h21,h22,m5,m2,m6,m7,m8,m9,m10   3       1       1       1       8       4       0       -1      
SK1.chr10       3185    21      21      A       A       1.000000        h4,h10,h9,h21,h22,m6    3       2       2       0       7       0       0       -1      
SK1.chr10       3186    13      13      C       C       1.000000        h4,h10,h9,h21,h22,m6    3       2       2       0       19      0       0       -1      
SK1.chr10       3187    31      31      G       G       1.000000        h4,h10,h9,h21,h22,m6    3       2       2       0       19      0       0       -1      
SK1.chr10       3188    10      10      T       T       1.000000        h4,h10,h21,h22,m6       3       3       3       0       16      0       0       -1      
SK1.chr10       3189    01      01      T       T       1.000000        h4,h10,h15,h21,h22,m6   3       3       3       0       5       0       0       -1      
SK1.chr10       3190    12      12      G       G       1.000000        h4,h10,h15,h21,h22,m6   3       3       3       0       3       0       0       -1      
SK1.chr10       3191    23      23      A       A       1.000000        h4,h10,h15,h21,h22,m6   3       3       3       0       6       0       0       -1      
SK1.chr10       3192    31      31      T       T       1.000000        h4,h10,h21,h22,m6       3       3       3       0       6       0       0       -1      
SK1.chr10       3193    13      13      G       G       1.000000        h4,h10,h21,h22,m6       2       2       2       0       13      0       0       -1      
SK1.chr10       3194    32      32      C       C       1.000000        h4,h10,h1,h2,h21,h22,m4,m6      1       1       1       0       9       0       0       -1      
SK1.chr10       5503    21      21      C       C       1.000000        h4,h10,h1,h2,h21,h22,m4,m6      1       1       1       0       8       0       0       -1      
SK1.chr10       5504    10      10      A       A       1.000000        h4,h10,h21,h22,m6       3       3       3       0       13      0       0       -1      
SK1.chr10       5505    00      00      A       A       1.000000        h4,h10,h21,h22,m6       3       3       3       0       8       0       0       -1      
SK1.chr10       5506    03      03      A       A       1.000000        h4,h10,h21,h22,m6       4       4       4       0       12      0       0       -1

Output with provided code:

Code:

SK1.chr10
3181-5506
Ref:
CCT-1.000000ACGTTGATGCCAAA

Gen
CCTh15,h20,h21,h22,m5,m2,m6,m7,m8,m9,m10ACGTTGATGCCAAA

Expected output:

Code:

SK1.chr10
3181-3194
Ref:
CCTGACGTTGATGC

Gen:
CCTNACGTTGATGC

SK1.chr10
5503-5506
Ref:
CAAA

Gen:
CAAA

Hope there is a way out ...
Cheers and have a nice day

pawannoel


04-01-2011


Try:

Code:

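#!/usr/bin/perl
use strict;
use warnings;

open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
my (%r, %g, $start, $prev, $chr);
while (<$in>) {
  my @F = split;
  next unless @F >= 6;
  # a new range starts when the name changes or the position is not previous + 1
  $start = $F[1] if !defined($prev) || $F[0] ne $chr || $F[1] != $prev + 1;
  my $shifted = $F[4] =~ /^-1(?:\.0+)?$/;   # true when the 5th field is -1
  $r{$F[0]}{$start}{$F[1]} = $shifted ? $F[2] : $F[4];
  $g{$F[0]}{$start}{$F[1]} = $shifted ? $F[3] : $F[5];
  ($chr, $prev) = @F[0, 1];
}
close $in;
for my $i (sort keys %r) {
  for my $j (sort { $a <=> $b } keys %{ $r{$i} }) {   # ranges in position order
    my @x = sort { $a <=> $b } keys %{ $r{$i}{$j} };
    print "$i\n";
    print "$x[0]-$x[$#x]\n";
    print "Ref:\n";
    print $r{$i}{$j}{$_} for @x;
    print "\n\n";
    print "Gen:\n";
    print $g{$i}{$j}{$_} for @x;
    print "\n\n";
  }
}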
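
The test on the fifth field in the script above is needed because a plain split collapses runs of whitespace, so empty columns disappear and the remaining fields shift left. Here is a minimal sketch of that effect, assuming the input file is tab-separated (the thread does not say so); the sample line is a shortened, hypothetical version of the 3184 line:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical tab-separated line with empty 3rd and 4th columns
# (shortened from the 3184 line; the tab separator is an assumption).
my $line = join "\t", 'SK1.chr10', 3184, '', '', 'G', 'N', '-1.000000', 'h15,h20';

# split ' ' (the default) splits on runs of whitespace: the empty columns vanish.
my @collapsed = split ' ', $line;

# split /\t/ splits on every single tab: the empty columns stay in place.
my @kept = split /\t/, $line;

print "whitespace split: Ref=$collapsed[4] Gen=$collapsed[5]\n";
print "tab split:        Ref=$kept[4] Gen=$kept[5]\n";
```

With the whitespace split the fifth and sixth fields hold -1.000000 and the h15,... list, which is what showed up in the first output. With the tab split they hold G and N, so if the file really is tab-separated, splitting on /\t/ would be an alternative to testing for -1.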


bartus11


04-01-2011


That did the job as desired, thanks a ton, Bartus. Could you explain how the code works whenever you have a few moments? I'd greatly appreciate that.

Cheers Master

pawannoel


04-02-2011


The main idea behind this code is to add another hash level, keyed by the starting position of each contiguous range of positions.

Code:

#!/usr/bin/perl
use strict;
use warnings;

open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
my (%r, %g, $start, $prev, $chr);
while (<$in>) {
  my @F = split;
  next unless @F >= 6;                                  # skip blank or truncated lines
  # find the range start: a new range begins when the name changes or when the
  # 2nd column is not the previous line's 2nd column plus 1
  $start = $F[1] if !defined($prev) || $F[0] ne $chr || $F[1] != $prev + 1;
  my $shifted = $F[4] =~ /^-1(?:\.0+)?$/;               # true when the 5th field is -1
  $r{$F[0]}{$start}{$F[1]} = $shifted ? $F[2] : $F[4];  # if the 5th field is -1, take Ref from the 3rd field, otherwise from the 5th
  $g{$F[0]}{$start}{$F[1]} = $shifted ? $F[3] : $F[5];  # the same for Gen: 4th field, otherwise 6th
  ($chr, $prev) = @F[0, 1];                             # save the 1st and 2nd column values for comparison with the next line
}
close $in;
# This part is in essence the same as the old code; it just contains one more
# "for" loop (over $j) that goes through the starting positions of the ranges.
for my $i (sort keys %r) {
  for my $j (sort { $a <=> $b } keys %{ $r{$i} }) {     # sorted so the ranges print in position order
    my @x = sort { $a <=> $b } keys %{ $r{$i}{$j} };
    print "$i\n";
    print "$x[0]-$x[$#x]\n";
    print "Ref:\n";
    print $r{$i}{$j}{$_} for @x;
    print "\n\n";
    print "Gen:\n";
    print $g{$i}{$j}{$_} for @x;
    print "\n\n";
  }
}

Below you can see what the "%r" hash looks like after reading all lines:

Code:

%r = {
          'SK1.chr10' => {
                           '3181' => {
                                       '3181' => 'C',
                                       '3193' => 'G',
                                       '3182' => 'C',
                                       '3189' => 'T',
                                       '3188' => 'T',
                                       '3194' => 'C',
                                       '3185' => 'A',
                                       '3183' => 'T',
                                       '3190' => 'G',
                                       '3184' => 'G',
                                       '3191' => 'A',
                                       '3187' => 'G',
                                       '3192' => 'T',
                                       '3186' => 'C'
                                     },
                           '5503' => {
                                       '5503' => 'C',
                                       '5504' => 'A',
                                       '5506' => 'A',
                                       '5505' => 'A'
                                     }
                         }
        };
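
The range-splitting step can also be tried on its own, without the nested hash. The following is only a sketch of the same idea (compare each position with the previous one and start a new range on a gap); the helper name group_runs is made up for this illustration and is not part of the script above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group a sorted list of integers into runs of consecutive values.
# Returns a list of [start, end] array references.
# (group_runs is a hypothetical helper name used only for this sketch.)
sub group_runs {
    my @positions = @_;
    my (@runs, $prev);
    for my $pos (@positions) {
        if (defined $prev && $pos == $prev + 1) {
            $runs[-1][1] = $pos;         # consecutive: extend the current run
        } else {
            push @runs, [$pos, $pos];    # gap (or first value): start a new run
        }
        $prev = $pos;
    }
    return @runs;
}

my @runs = group_runs(3181 .. 3194, 5503 .. 5506);
print join(', ', map { "$_->[0]-$_->[1]" } @runs), "\n";
```

For the positions in the sample this prints 3181-3194, 5503-5506, the same two range starts that appear as the second-level keys in the hash dump.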

Last edited by bartus11; 04-02-2011 at 12:28 PM..


bartus11


04-02-2011


Thank you very much! By the way, we won the Cricket World Cup today, so that's why I had no questions to post.

Have a good weekend.

pawannoel


04-04-2011


Hello

First question of the week. I have a file like this:

Code:

SK1.chr15
201-339
Ref:
TTATCATATACGGTGTTAGAAGATGACGGAAATGATGAGAAATAGTCATCTAAATTAGTGGAAGCTGAAACGCAAGAATTGATAATGTAATAGGATCAATGAATACTAACATATAAAACGATGATAATAATATTTATAG

Gen:
TTATNNTANNCGGTGTTAGAAGATGACGGAAATGATGAGAAATAGTCATNTAANNTAGTGGAAGCTGAAACGCAAGAATTGATAATGTAATAGGATCAATGAATACTAACATATAAAACGATGATAATAATATTTNNAG

SK1.chr15
364-419
Ref:
CTGATTCAGTGGCGGAGGATGAACCTGATGTAATGGAAGTAGATGAACCGGAGACT

Gen:
CTGATTCAGTGGCGGAGGATGAACCTGATGTAATGGAAGTAGATGAACCGGAGACT

Now I want to grab all the lines that follow Gen: and concatenate them in order, filling the gaps with the appropriate number of "N"s based on the given coordinates. For the above example my expected output would be:

Code:

SK1.chr15
1-419 would be
200Ns ... then TTATNNTANNCGGTGTTAGAAGATGACGGAAATGATGAGAAATAGTCATNTAANNTAGTGGAAGCTGAAACGCAAGAATTGATAATGTAATAGGATCAATGAATACTAACATATAAAACGATGATAATAATATTTNNAG (which is Gen:201-339) .... then 24 Ns (which fills the gap between 339 and 364) and then CTGATTCAGTGGCGGAGGATGAACCTGATGTAATGGAAGTAGATGAACCGGAGACT(which is Gen: 364-419).

I have absolutely no idea how to accomplish this. Could you please enlighten me on this problem? The thing is that I need to reconstruct this separately for each SK1.chr*. I'd appreciate your feedback on this one.
Cheers and have a nice day

pawannoel
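
One possible way to approach the gap filling described above, offered only as a sketch and not as an answer from the thread: read the records in order, and before appending each Gen sequence, pad the sequence built so far with "N"s up to start - 1. It assumes 1-based coordinates, records sorted by start within each name, and each sequence on a single line. The name fill_gaps and the tiny sample data are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build one N-padded Gen sequence per name from records laid out as
#   name / start-end / Ref: / sequence / Gen: / sequence
# Returns a list of [name, padded_sequence] pairs in input order.
# (fill_gaps is a hypothetical helper name used only for this sketch.)
sub fill_gaps {
    my ($fh) = @_;
    my (%seq, @order, $chr, $start);
    my $expect = '';                       # 'ref' or 'gen' when the next data line is a sequence
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;                # strip newline and trailing whitespace
        next if $line eq '';
        if    ($line =~ /^Ref:?$/)      { $expect = 'ref' }
        elsif ($line =~ /^Gen:?$/)      { $expect = 'gen' }
        elsif ($expect eq 'ref')        { $expect = '' }       # reference sequence: ignored
        elsif ($expect eq 'gen') {
            unless (exists $seq{$chr}) { push @order, $chr; $seq{$chr} = '' }
            my $missing = $start - 1 - length($seq{$chr});     # positions not covered so far
            die "overlapping or unsorted range at $chr:$start\n" if $missing < 0;
            $seq{$chr} .= ('N' x $missing) . $line;
            $expect = '';
        }
        elsif ($line =~ /^(\d+)-\d+$/)  { $start = $1 }        # coordinate line
        else                            { $chr = $line }       # name line
    }
    return map { [$_, $seq{$_}] } @order;
}

# Hypothetical miniature input: ranges 3-5 and 8-9 on one name.
my $sample = "chrA\n3-5\nRef:\nACG\n\nGen:\nANG\n\nchrA\n8-9\nRef:\nTT\n\nGen:\nTT\n\n";
open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!";
for my $rec (fill_gaps($fh)) {
    my ($name, $gen) = @$rec;
    printf "%s\n1-%d\n%s\n", $name, length($gen), $gen;
}
```

For the miniature input this prints chrA, 1-9 and NNANGNNTT. Applied to the SK1.chr15 example, the same arithmetic gives 200 Ns before position 201 and 24 Ns between 339 and 364.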

