Getting non unique lines from concatenated files


 
# 106  
Old 03-31-2011
Try this script:
Code:
#!/usr/bin/perl
open I, "$ARGV[0]";
while (<I>){
  @F=split;
  $r{$F[0]}{$F[1]}=$F[4];
  $g{$F[0]}{$F[1]}=$F[5];
}
END{
  for $i (keys %r){
    @x=sort{$a <=> $b} keys %{$r{$i}};
    print "$i\n";
    print "$x[0]-$x[$#x]\n";
    print "Ref:\n";
    for $j (@x){
      print "$r{$i}{$j}";
    }
    print "\n\n";
    print "Gen\n";
    for $j (@x){
      print "$g{$i}{$j}";
    }
    print "\n\n";
  }
}

This User Gave Thanks to bartus11 For This Post:
# 107  
Old 04-01-2011
Thanks for your help, Bartus. However, I noticed one thing in my file, and that's why the code doesn't work perfectly at the moment. The strange things in the sample lines below are the row for position 3184, where the 3rd and 4th columns are empty, and the jump in position from 3194 to 5503; my expected outcome follows below that:
Code:
SK1.chr10       3181    20      20      C       C       1.000000        h4,h10,h21,h22,m6       3       3       3       0       20      0       0       -1      
SK1.chr10       3182    02      02      C       C       1.000000        h4,h10,h21,h22,m6       3       3       3       0       22      0       0       -1      
SK1.chr10       3183    21      21      T       T       1.000000        h4,h10,h9,h21,h22,m6    3       2       2       0       12      0       0       -1      
SK1.chr10       3184                    G       N       -1.000000       h15,h20,h21,h22,m5,m2,m6,m7,m8,m9,m10   3       1       1       1       8       4       0       -1      
SK1.chr10       3185    21      21      A       A       1.000000        h4,h10,h9,h21,h22,m6    3       2       2       0       7       0       0       -1      
SK1.chr10       3186    13      13      C       C       1.000000        h4,h10,h9,h21,h22,m6    3       2       2       0       19      0       0       -1      
SK1.chr10       3187    31      31      G       G       1.000000        h4,h10,h9,h21,h22,m6    3       2       2       0       19      0       0       -1      
SK1.chr10       3188    10      10      T       T       1.000000        h4,h10,h21,h22,m6       3       3       3       0       16      0       0       -1      
SK1.chr10       3189    01      01      T       T       1.000000        h4,h10,h15,h21,h22,m6   3       3       3       0       5       0       0       -1      
SK1.chr10       3190    12      12      G       G       1.000000        h4,h10,h15,h21,h22,m6   3       3       3       0       3       0       0       -1      
SK1.chr10       3191    23      23      A       A       1.000000        h4,h10,h15,h21,h22,m6   3       3       3       0       6       0       0       -1      
SK1.chr10       3192    31      31      T       T       1.000000        h4,h10,h21,h22,m6       3       3       3       0       6       0       0       -1      
SK1.chr10       3193    13      13      G       G       1.000000        h4,h10,h21,h22,m6       2       2       2       0       13      0       0       -1      
SK1.chr10       3194    32      32      C       C       1.000000        h4,h10,h1,h2,h21,h22,m4,m6      1       1       1       0       9       0       0       -1      
SK1.chr10       5503    21      21      C       C       1.000000        h4,h10,h1,h2,h21,h22,m4,m6      1       1       1       0       8       0       0       -1      
SK1.chr10       5504    10      10      A       A       1.000000        h4,h10,h21,h22,m6       3       3       3       0       13      0       0       -1      
SK1.chr10       5505    00      00      A       A       1.000000        h4,h10,h21,h22,m6       3       3       3       0       8       0       0       -1      
SK1.chr10       5506    03      03      A       A       1.000000        h4,h10,h21,h22,m6       4       4       4       0       12      0       0       -1

Output with provided code:
Code:
SK1.chr10
3181-5506
Ref:
CCT-1.000000ACGTTGATGCCAAA

Gen
CCTh15,h20,h21,h22,m5,m2,m6,m7,m8,m9,m10ACGTTGATGCCAAA

Expected output:
Code:
SK1.chr10
3181-3194
Ref:
CCTGACGTTGATGC

Gen:
CCTNACGTTGATGC

SK1.chr10
5503-5506
Ref:
CAAA

Gen:
CAAA

Hope there is a way out ...
Cheers and have a nice day
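Why the first script picks up "-1.000000" and the h-list for position 3184 can be shown with a small sketch. A whitespace split treats any run of blanks as a single separator, so when the 3rd and 4th columns are empty every later field moves two places to the left. The helper name fields_4_5 and the shortened sample lines are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split a line on whitespace and return fields 5 and 6
# (indices 4 and 5), which is where the first script expects Ref and Gen.
sub fields_4_5 {
    my ($line) = @_;
    my @F = split ' ', $line;   # runs of blanks count as ONE separator
    return @F[4, 5];
}

# Shortened sample lines; the second one has empty 3rd and 4th columns.
my $full  = "SK1.chr10  3183  21  21  T  T  1.000000  h4,h10";
my $empty = "SK1.chr10  3184          G  N  -1.000000 h15,h20";

for my $line ($full, $empty) {
    my ($ref, $gen) = fields_4_5($line);
    print "Ref=$ref Gen=$gen\n";
}
```

If the real file is tab-delimited (which cannot be told from the post), splitting on /\t/ instead would keep the empty columns in place, and no special case would be needed.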
# 108  
Old 04-01-2011
Try:
Code:
#!/usr/bin/perl
open I, "$ARGV[0]";
while (<I>){
  @F=split;
  $start=$F[1] if $F[1]-1!=$prev;
  $r{$F[0]}{$start}{$F[1]}=($F[4]=="-1")?$F[2]:$F[4];
  $g{$F[0]}{$start}{$F[1]}=($F[4]=="-1")?$F[3]:$F[5];
  $prev=$F[1];
}
END{
  for $i (sort keys %r){
    for $j (sort {$a <=> $b} keys %{$r{$i}}){   # sort so the ranges print in ascending order
      @x=sort{$a <=> $b} keys %{$r{$i}{$j}};
      print "$i\n";
      print "$x[0]-$x[$#x]\n";
      print "Ref:\n";
      for $k (@x){
        print "$r{$i}{$j}{$k}";
      }
      print "\n\n";
      print "Gen:\n";
      for $k (@x){
        print "$g{$i}{$j}{$k}";
      }
      print "\n\n";
    }
  }
}
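A variation on the approach above, as a sketch rather than a replacement: it keeps the ranges in an array in input order instead of nested hashes, runs under strict and warnings, and also starts a new range when the name in the first column changes. The sub name collect_ranges and the shortened sample lines are made up for illustration, and it assumes the input is already sorted by position within each name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group lines into runs of consecutive positions per sequence name.
# Returns a list of [name, first, last, ref_string, gen_string].
sub collect_ranges {
    my @lines = @_;
    my (@ranges, $prev_name, $prev_pos);
    for my $line (@lines) {
        my @F = split ' ', $line;
        next unless @F >= 6;                      # skip blank or short lines
        my ($name, $pos) = @F[0, 1];
        # On a line with empty 3rd/4th columns the 5th field holds -1.000000.
        # A pattern match avoids the "isn't numeric" warning that a numeric
        # comparison against a letter would raise under "use warnings".
        my $shifted = $F[4] =~ /^-1(?:\.0+)?$/;
        my ($ref, $gen) = $shifted ? @F[2, 3] : @F[4, 5];
        my $new_run = !defined $prev_name
                   || $name ne $prev_name
                   || $pos != $prev_pos + 1;
        push @ranges, [$name, $pos, $pos, '', ''] if $new_run;
        my $cur = $ranges[-1];
        $cur->[2]  = $pos;                        # extend the end of the run
        $cur->[3] .= $ref;
        $cur->[4] .= $gen;
        ($prev_name, $prev_pos) = ($name, $pos);
    }
    return @ranges;
}

# Shortened sample lines in the same layout as the thread's data.
my @sample = split /\n/, <<'END';
SK1.chr10  3183  21  21  T  T  1.000000   h4,h10
SK1.chr10  3184          G  N  -1.000000  h15,h20
SK1.chr10  3185  21  21  A  A  1.000000   h4,h10
SK1.chr10  5503  21  21  C  C  1.000000   h4,h10
END

for my $r (collect_ranges(@sample)) {
    printf "%s\n%d-%d\nRef:\n%s\n\nGen:\n%s\n\n", @$r;
}
```

Because the runs are stored in an array, the output follows the order of the input file and does not depend on hash ordering.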

# 109  
Old 04-01-2011
That did the job as desired ... thanks a ton, Bartus ... could you explain how the code works whenever you have a few moments? I'd greatly appreciate that.
Cheers Master
# 110  
Old 04-02-2011
The main idea behind this code is to introduce another level in the hash, keyed by the starting number of each run of consecutive positions.
Code:
#!/usr/bin/perl
open I, "$ARGV[0]";
while (<I>){
  @F=split;
  $start=$F[1] if $F[1]-1!=$prev;                      # start a new range whenever the 2nd column is not exactly one more than its value on the previous line
  $r{$F[0]}{$start}{$F[1]}=($F[4]=="-1")?$F[2]:$F[4];  # if the 5th field is numerically equal to -1 (columns 3 and 4 were empty, so everything shifted left by two), take Ref from the 3rd field, otherwise from the 5th
  $g{$F[0]}{$start}{$F[1]}=($F[4]=="-1")?$F[3]:$F[5];  # the same for Gen: 4th field on a shifted line, otherwise the 6th
  $prev=$F[1];                                         # save the 2nd column value for comparison with the next line
}
END{                                                   # this part is in essence the same as the old code; it just contains one more "for" loop (over $j) that goes through the starting numbers of the ranges
  for $i (sort keys %r){
    for $j (sort {$a <=> $b} keys %{$r{$i}}){          # sort so the ranges print in ascending order
      @x=sort{$a <=> $b} keys %{$r{$i}{$j}};
      print "$i\n";
      print "$x[0]-$x[$#x]\n";
      print "Ref:\n";
      for $k (@x){
        print "$r{$i}{$j}{$k}";
      }
      print "\n\n";
      print "Gen:\n";
      for $k (@x){
        print "$g{$i}{$j}{$k}";
      }
      print "\n\n";
    }
  }
}

Below you can see what the "%r" hash looks like after reading all lines:
Code:
%r = {
          'SK1.chr10' => {
                           '3181' => {
                                       '3181' => 'C',
                                       '3193' => 'G',
                                       '3182' => 'C',
                                       '3189' => 'T',
                                       '3188' => 'T',
                                       '3194' => 'C',
                                       '3185' => 'A',
                                       '3183' => 'T',
                                       '3190' => 'G',
                                       '3184' => 'G',
                                       '3191' => 'A',
                                       '3187' => 'G',
                                       '3192' => 'T',
                                       '3186' => 'C'
                                     },
                           '5503' => {
                                       '5503' => 'C',
                                       '5504' => 'A',
                                       '5506' => 'A',
                                       '5505' => 'A'
                                     }
                         }
        };
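A dump in this style can be produced with the core Data::Dumper module; the post does not say which tool was actually used, so this is only a sketch of one way to inspect a nested hash. The sub name demo_hash is made up, and the hash holds just three of the entries shown above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Sortkeys = 1;   # print keys in sorted order, not hash order

# Build a tiny stand-in for %r: name => range start => position => base.
sub demo_hash {
    my %r;
    $r{'SK1.chr10'}{3181}{3181} = 'C';
    $r{'SK1.chr10'}{3181}{3182} = 'C';
    $r{'SK1.chr10'}{5503}{5503} = 'C';
    return \%r;
}

print Dumper(demo_hash());
```

Without Sortkeys the inner keys come out in arbitrary order, as in the listing above, which is why the script sorts the positions before printing them.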


Last edited by bartus11; 04-02-2011 at 12:28 PM..
# 111  
Old 04-02-2011
Thank you very much ... by the way, we won the Cricket World Cup today!! ... so that's why I had no questions to post today ... Have a good weekend
# 112  
Old 04-04-2011
Hello,
First question of the week ... I have a file like this:
Code:
SK1.chr15
201-339
Ref:
TTATCATATACGGTGTTAGAAGATGACGGAAATGATGAGAAATAGTCATCTAAATTAGTGGAAGCTGAAACGCAAGAATTGATAATGTAATAGGATCAATGAATACTAACATATAAAACGATGATAATAATATTTATAG

Gen:
TTATNNTANNCGGTGTTAGAAGATGACGGAAATGATGAGAAATAGTCATNTAANNTAGTGGAAGCTGAAACGCAAGAATTGATAATGTAATAGGATCAATGAATACTAACATATAAAACGATGATAATAATATTTNNAG

SK1.chr15
364-419
Ref:
CTGATTCAGTGGCGGAGGATGAACCTGATGTAATGGAAGTAGATGAACCGGAGACT

Gen:
CTGATTCAGTGGCGGAGGATGAACCTGATGTAATGGAAGTAGATGAACCGGAGACT

Now what I want to do is grab the sequence line that follows each Gen: and concatenate them in order, but I also want to fill the gaps with the appropriate number of "N"s depending on the coordinates given. So for the above example my expected output would be
Code:
SK1.chr15
1-419 would be
200Ns ... then TTATNNTANNCGGTGTTAGAAGATGACGGAAATGATGAGAAATAGTCATNTAANNTAGTGGAAGCTGAAACGCAAGAATTGATAATGTAATAGGATCAATGAATACTAACATATAAAACGATGATAATAATATTTNNAG (which is Gen:201-339) .... then 24 Ns (which fills the gap between 339 and 364) and then CTGATTCAGTGGCGGAGGATGAACCTGATGTAATGGAAGTAGATGAACCGGAGACT(which is Gen: 364-419).

I have absolutely no idea how to accomplish this. Can you please enlighten me on this problem? The thing is that I need to reconstruct this separately for each SK1.chr*. I'd appreciate your feedback on this one.
Cheers and have a nice day
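One possible way to approach the gap filling described above, offered only as a sketch under several assumptions: each block consists of a name line, a "start-end" line, "Ref:" plus one sequence line, and "Gen:" plus one sequence line; blocks for the same name appear in ascending coordinate order; and the wanted result is one N-padded Gen string per name starting at coordinate 1. The sub name fill_gaps, the output layout, and the tiny sample are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return a hash ref: name => Gen sequence padded with Ns from coordinate 1.
sub fill_gaps {
    my @lines = @_;
    my (%seq, $name, $start);
    my $state = '';                       # '', 'ref' or 'gen'
    for my $line (@lines) {
        (my $l = $line) =~ s/\s+$//;      # drop newline and trailing blanks
        next if $l eq '';
        if    ($l eq 'Ref:')     { $state = 'ref' }
        elsif ($l eq 'Gen:')     { $state = 'gen' }
        elsif ($state eq 'ref')  { $state = '' }      # skip the Ref sequence
        elsif ($state eq 'gen')  {
            $seq{$name} = '' unless defined $seq{$name};
            # Ns needed to reach the coordinate just before this block.
            my $pad = $start - 1 - length $seq{$name};
            die "ranges for $name overlap or are out of order\n" if $pad < 0;
            $seq{$name} .= ('N' x $pad) . $l;
            $state = '';
        }
        elsif ($l =~ /^(\d+)-\d+$/) { $start = $1 }   # coordinate line
        else                        { $name  = $l }   # name line
    }
    return \%seq;
}

# Tiny made-up sample: Gen covers 3-5 and 8-9, so 1-2 and 6-7 become Ns.
my @sample = split /\n/, <<'END';
SK1.chrX
3-5
Ref:
ACG

Gen:
ANG

SK1.chrX
8-9
Ref:
TT

Gen:
TT
END

my $filled = fill_gaps(@sample);
for my $n (sort keys %$filled) {
    printf "%s\n1-%d\n%s\n", $n, length($filled->{$n}), $filled->{$n};
}
```

With the SK1.chr15 example from the post, the same arithmetic would give 200 Ns before the 201-339 block and 24 Ns between 339 and 364.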
 