Getting non-unique lines from concatenated files


 
# 57  
Old 03-26-2011
Code:
#!/usr/bin/perl
open I, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";           # open file with filename passed to script as first argument ($ARGV[0])
while (<I>){                                                         # sequentially read lines of that file and put them into the $_ variable
  $pos=((split "[\t ]+",$_)[4]);                                     # extract position number from the current line stored in $_
  $cov=((split "=",(split ";",(split "[\t ]+",$_)[8])[2])[1]);       # extract coverage value in the same way
  $s{$pos}+=$cov;                                                    # add coverage value for this position
  $n{$pos}++;                                                        # count number of occurrences for this position
  $s_tot+=$cov;                                                      # add coverage value to the total sum
  $n_tot++;                                                          # count number of coverage values in total
}
for $i (keys %s){print "Mean_Coverage_for_position_$i = " . ($s{$i}/$n{$i}) . "\n"}  # calculate and print average coverage for each position
print "Mean_Coverage = " . ($s_tot/$n_tot) . "\n";                   # calculate and print the global average
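To run it, save the code to a file (say coverage.pl - the name is just an example), make it executable, and pass one of your SNP files as the first argument, e.g. ./coverage.pl yourfile.gff. It prints one Mean_Coverage_for_position_ line per distinct position, followed by the overall Mean_Coverage line.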

# 58  
Old 03-26-2011
Thanks Bartus,
I have another question. For each chromosome in field[0] (which can be chr1-16, chrm, or a scplasm entry), I would like to count how many different entries are present in field[3].
And of course I would like to do this on multiple files, please.

Sample file:

Code:
chr01   levure5 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=91;refAlleleCounts=44;refAlleleStarts=28;refAlleleMeanQV=19;novelAlleleCounts=40;novelAlleleStarts=25;novelAlleleMeanQV=18;diColor1=22;diColor2=11;het=1;flag=
chr01   levure6 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=62;refAlleleCounts=29;refAlleleStarts=19;refAlleleMeanQV=19;novelAlleleCounts=32;novelAlleleStarts=20;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01   levure7 SNP     12745   12745   0.000000        .       .       genotype=S;reference=C;coverage=24;refAlleleCounts=9;refAlleleStarts=8;refAlleleMeanQV=23;novelAlleleCounts=13;novelAlleleStarts=12;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chrm    levure7 SNP     86086   86086   0.000000        .       .       genotype=A;reference=G;coverage=18;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=18;novelAlleleStarts=4;novelAlleleMeanQV=13;diColor1=21;diColor2=21;het=0;flag=h4,h10,;gene;ID=Q0182;Name=Q0182;Alias=ORF11
chrm    levure8 SNP     86086   86086   0.000000        .       .       genotype=A;reference=G;coverage=20;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=19;novelAlleleStarts=4;novelAlleleMeanQV=14;diColor1=21;diColor2=21;het=0;flag=h4,h10,h9,;gene;ID=Q0182;Name=Q0182;Alias=ORF11
chrm    levure5 SNP     98064   98064   0.000000        .       .       genotype=A;reference=G;coverage=61;refAlleleCounts=8;refAlleleStarts=4;refAlleleMeanQV=11;novelAlleleCounts=52;novelAlleleStarts=6;novelAlleleMeanQV=16;diColor1=20;diColor2=20;het=0;flag=h4,h10,
scplasm1        levure5 SNP     6153    6153    0.000000        .       .       genotype=C;reference=T;coverage=5752;refAlleleCounts=7;refAlleleStarts=7;refAlleleMeanQV=3;novelAlleleCounts=5588;novelAlleleStarts=58;novelAlleleMeanQV=20;diColor1=22;diColor2=22;het=0;flag=h4,h10,h9,;region;ID=scplasm1;dbxref=NCBI:NC_001398

Expected Output:
Code:
chr01[TAB]1
chrm[TAB]2
scplasm[TAB]1

Thank you very much for all your help, input and valuable comments on the code.

Have a nice weekend.
# 59  
Old 03-26-2011
Code:
#!/bin/bash
for i in "$@"; do                       # iterate over all filenames passed on the command line
  echo "$i:"
  perl -ane '$h{$F[0]}{$F[3]}++;END{for $i (keys %h){@x=keys %{$h{$i}};print "$i\t" . ($#x+1) . "\n"}}' "$i"
done

Run it as usual: ./script.sh file*
Explanation tomorrow.
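(In case it helps while waiting for the explanation: in perl -ane, -e supplies the code on the command line, -n wraps it in a while (<>) loop that reads the input line by line, and -a autosplits each line on whitespace into the @F array, so $F[0] is the first field and $F[3] the fourth.)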
# 60  
Old 03-26-2011
Whenever you have time, Bartus .... sorry for so many questions, and thanks a lot for your help.
Have a good weekend.
# 61  
Old 03-27-2011
I'll just explain the new pieces of code: $h{$F[0]}{$F[3]}++ - this creates a "hash of hashes", with the first key being the first field of your file and the second key being the fourth field. This technique is described in the "Intermediate Perl" book. So after running this code over your sample file, the %h hash structure looks like this:
Code:
 %h = {
          'scplasm1' => {
                          '6153' => 1
                        },
          'chr01' => {
                       '12745' => 3
                     },
          'chrm' => {
                      '86086' => 2,
                      '98064' => 1
                    }
        };

As you can see, for each $F[0] field, all the $F[3] values are present as keys of the underlying hash. The values stored in those inner hashes are the numbers of occurrences of each $F[3] value. The code would also work without storing that information, using $h{$F[0]}{$F[3]}=1 instead; the hash would then look like this:
Code:
%h = {
          'scplasm1' => {
                          '6153' => 1
                        },
          'chr01' => {
                       '12745' => 1
                     },
          'chrm' => {
                      '86086' => 1,
                      '98064' => 1
                    }
        };

Now all we have to do is print the number of keys present in each inner hash for each main hash key - the $F[0] field. To do this we iterate over the $F[0] values using for $i (keys %h){. Then we assign the inner hash keys to the @x array with @x=keys %{$h{$i}};. So for each $i value, @x will look like this:
Code:
$i = scplasm1
@x = [
          '6153'
        ];

$i = chr01
@x = [
          '12745'
        ];

$i = chrm
@x = [
          '86086',
          '98064'
        ];

Now all we have to do is print $i and the respective number of elements in the @x array: print "$i\t" . ($#x+1) . "\n"
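As a side note (not part of bartus11's one-liner), the same count can be obtained without the temporary @x array by calling scalar keys directly on each inner hash. A minimal standalone sketch, assuming whitespace-separated files passed on the command line like the sample above:
Code:
#!/usr/bin/perl
use strict;
use warnings;

my %h;
while (<>) {
    my @F = split;                     # split on whitespace, like perl -a does
    $h{ $F[0] }{ $F[3] }++;            # hash of hashes: outer key = chromosome, inner key = position
}
for my $chr (sort keys %h) {
    my $n = scalar keys %{ $h{$chr} }; # number of distinct positions for this chromosome
    print "$chr\t$n\n";
}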
# 62  
Old 03-27-2011
Thanks Bartus ... that was really nice of you ... Thank you ever so much for the in-depth explanation .... It's really clear ... the only part I didn't understand was
Code:
. ($#x+1) .



Have a nice Sunday.

---------- Post updated at 12:14 PM ---------- Previous update was at 11:32 AM ----------

I reckon it's like that because in Perl indexing starts from zero, so zero means one, one means two, etc. ... that's the need for the +1, and $#x means the number of elements in the @x array, as you said .... am I right about the +1?
# 63  
Old 03-27-2011
Yes. To be precise, $#x is the last index of the @x array, so to get the number of elements in that array it needs to be incremented by 1.
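For anyone reading later, a tiny self-contained illustration of that point (the array contents are just an example):
Code:
#!/usr/bin/perl
use strict;
use warnings;

my @x = ('86086', '98064');   # two elements, as in the chrm case above
print $#x, "\n";              # prints 1 -> last index of @x
print $#x + 1, "\n";          # prints 2 -> number of elements
print scalar(@x), "\n";       # prints 2 -> the same count, written more explicitly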
 