Getting non unique lines from concatenated files


 
# 50  
Old 03-25-2011
Thanks, Bartus.
I should have figured that out; it was simple. Anyway, I made it more interesting by using the other version of the Perl one-liner, like this:
Code:
#!/bin/sh
# Count how often each value of the second key=value pair in column 9 occurs, per file.
for i in "$@"; do
echo "$i"
echo "##########"
perl -nle '$h{((split "=",(split ";",(split "[\t ]+",$_)[8])[0])[1])}++;END{for $i (keys %h){print "$i=$h{$i}"}}' "$i"
printf '##########\n\n\n'
done

What could I add to report the average of all the coverage values? Coverage is the pair selected by the index [2] in the code below (it was highlighted in red in the original post). An example file follows the code.
Code:
perl -nle '$h{((split "=",(split ";",(split "[\t ]+",$_)[8])[2])[1])}++;END{for $i (keys %h){print "$i=$h{$i}"}}' $i
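
For reference, the nested split reads from the inside out: [8] picks the ninth whitespace-separated column, [2] the third key=value pair in that column, and [1] the value after the = sign. Below is a minimal shell sketch of the same lookup, assuming the column layout of the example file. nth_attr_value is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# nth_attr_value (hypothetical helper): read lines on stdin and print the
# value of the N-th key=value pair in column 9. N is 1-based here, so
# N=3 corresponds to index [2] in the Perl one-liner.
nth_attr_value() {
    awk '{ print $9 }' |      # ninth whitespace-separated column
        cut -d ';' -f "$1" |  # N-th key=value pair
        cut -d '=' -f 2       # part after the "=" sign
}

# When a pair index and a file name are given, print one value per line.
if [ "$#" -eq 2 ]; then
    nth_attr_value "$1" < "$2"
fi
```

Saved as a script (the name is up to you), it would be called with the pair index and a file name, for example 3 for coverage.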


file example:
Code:
chr01    levure5    SNP    12745    12745    0.000000    .    .    genotype=S;reference=C;coverage=91;refAlleleCounts=44;refAlleleStarts=28;refAlleleMeanQV=19;novelAlleleCounts=40;novelAlleleStarts=25;novelAlleleMeanQV=18;diColor1=22;diColor2=11;het=1;flag=
chr01    levure6    SNP    12745    12745    0.000000    .    .    genotype=S;reference=C;coverage=62;refAlleleCounts=29;refAlleleStarts=19;refAlleleMeanQV=19;novelAlleleCounts=32;novelAlleleStarts=20;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01    levure7    SNP    12745    12745    0.000000    .    .    genotype=S;reference=C;coverage=24;refAlleleCounts=9;refAlleleStarts=8;refAlleleMeanQV=23;novelAlleleCounts=13;novelAlleleStarts=12;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01    levure8    SNP    12745    12745    0.000000    .    .    genotype=S;reference=C;coverage=37;refAlleleCounts=18;refAlleleStarts=17;refAlleleMeanQV=18;novelAlleleCounts=18;novelAlleleStarts=13;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01    levure5    SNP    16254    16254    0.000000    .    .    genotype=R;reference=G;coverage=111;refAlleleCounts=82;refAlleleStarts=41;refAlleleMeanQV=18;no

Thanks and have a nice day.

---------- Post updated at 05:18 AM ---------- Previous update was at 04:48 AM ----------

@Bartus11

Hi Bartus,
Yesterday you told me how to make scripts run in a loop, by putting $1 (for one file) or $i from a for loop (for multiple files) at the end of the Perl one-liner inside the sh script.

Now one of the codes you provided takes the file names at the beginning of the script, not at the end. Can I still use the for loop in this case? Your code was:
Code:
perl -l -0e 'BEGIN{@f=(file_3,file_4,file_1);$N=$#f}for $i (0..$N){for $j (0..$i-1,$i+1..$N){open I,"<$f[$j]";$a.=<I>}open O,">files${i}.tmp";print O $a;$a=""};
for $i (0..$N){print "$f[$i] unique\n";system "bash -c \"comm -23 <(sort $f[$i]) <(sort files$i.tmp);rm -f files$i.tmp\"";print "\n##############\n"}'

I'm proposing the following inside the sh script (my change, the ($i) in the BEGIN block, was highlighted in red in the original post). Please correct me.
Code:
#!/bin/sh
for i in $*; do
echo "$i"
perl -l -0e 'BEGIN{@f=($i);$N=$#f}for $i (0..$N){for $j (0..$i-1,$i+1..$N){open I,"<$f[$j]";$a.=<I>}open O,">files${i}.tmp";print O $a;$a=""};
for $i (0..$N){print "$f[$i] unique\n";system "bash -c \"comm -23 <(sort $f[$i]) <(sort files$i.tmp);rm -f files$i.tmp\"";print "\n##############\n"}'
done

Cheers
# 51  
Old 03-25-2011
Try this:
Code:
#!/bin/sh
perl -l -s0e 'BEGIN{@f=split / /,$f;$N=$#f}for $i (0..$N){for $j (0..$i-1,$i+1..$N){open I,"<$f[$j]";$a.=<I>}open O,">files${i}.tmp";print O $a;$a=""};
for $i (0..$N){print "$f[$i] unique\n";system "bash -c \"comm -23 <(sort $f[$i]) <(sort files$i.tmp);rm -f files$i.tmp\"";print "\n##############\n"}' -- -f="$*"

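The -s option allows easy passing of variable values to the Perl one-liner (that is the -- -f="$*" part). All the filenames are passed in as the single variable $f, which is then split on spaces into the @f array: @f=split / /,$f. Because of that split, filenames that contain spaces will not work with this script.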
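
To see what "$*" hands over, here is a small sketch in plain sh. join_args and count_pieces are hypothetical helper names, and "my file" is a made-up filename used only to show the problem with spaces.

```shell
#!/bin/sh
# join_args (hypothetical helper): print the single string that "$*"
# expands to, i.e. all arguments joined by one space.
join_args() {
    printf '%s\n' "$*"
}

# count_pieces (hypothetical helper): count how many pieces a string has
# when it is cut at every single space, as split / /,$f would cut it.
count_pieces() {
    printf '%s\n' "$1" | tr ' ' '\n' | grep -c ''
}

joined=$(join_args file_3 file_4 "my file")
echo "$joined"            # one string: file_3 file_4 my file
count_pieces "$joined"    # 4 pieces, although only 3 names were passed
```

With names that contain no spaces, the number of pieces equals the number of files, which is the case the script above relies on.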
# 52  
Old 03-25-2011
Thanks a lot, Bartus.
What about calculating the average coverage in field 9, from my question above? Any ideas?

Another thing: you gave me code to find the lines common to 2 files, shown below, but it is not Perl code.

Can I still make it run over several files with a similar
Quote:
-- -f="$*" and @f=split / /, $f
construct? Could you guide me on this?

Code:
x=`sort file_1`;for i in file_2 file_3; do x=`comm -12 <(echo "$x") <(sort $i)`; done; echo "$x"

Basically, I want to make it executable and give the filenames on the command line rather than in the code itself, so that the comparison stays flexible.

Cheers, have a nice day.
# 53  
Old 03-25-2011
To convert that comm one-liner into a script:
Code:
#!/bin/bash
# The first file seeds the result; each further file narrows it to the common lines.
first=$1
shift
x=$(sort "$first"); for i in "$@"; do x=$(comm -12 <(echo "$x") <(sort "$i")); done; echo "$x"

Notice that this time /bin/bash is used, not /bin/sh, because the <( ) process substitution is not part of POSIX sh. As for calculating the average, please provide sample input data and the desired output.
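
If bash is not available, the same intersection can be sketched in plain sh by keeping the running result in a temporary file and letting comm read the other file from standard input. This is only a sketch under that assumption. common_lines is a hypothetical function name, and mktemp, although widely available, is not required by POSIX.

```shell
#!/bin/sh
# common_lines (hypothetical helper): print the lines that occur in every
# file given as an argument, without using process substitution.
common_lines() {
    tmp=$(mktemp) || return 1
    first=$1
    shift
    sort "$first" > "$tmp"                 # running result, kept sorted
    for f in "$@"; do
        # comm -12 keeps only lines present in both inputs; "-" is stdin
        sort "$f" | comm -12 "$tmp" - > "$tmp.new" && mv "$tmp.new" "$tmp"
    done
    cat "$tmp"
    rm -f "$tmp"
}

# Run only when file names are given on the command line.
if [ "$#" -gt 0 ]; then
    common_lines "$@"
fi
```

It would be run the same way as the bash version, with all file names on the command line.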
# 54  
Old 03-25-2011
OK, sure. The sample file is:
Code:
chr01    levure5    SNP    12745    12745    0.000000    .    .    genotype=S;reference=C;coverage=91;refAlleleCounts=44;refAlleleStarts=28;refAlleleMeanQV=19;novelAlleleCounts=40;novelAlleleStarts=25;novelAlleleMeanQV=18;diColor1=22;diColor2=11;het=1;flag=
chr01    levure6    SNP    12745    12745    0.000000    .    .    genotype=S;reference=C;coverage=62;refAlleleCounts=29;refAlleleStarts=19;refAlleleMeanQV=19;novelAlleleCounts=32;novelAlleleStarts=20;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01    levure7    SNP    12745    12745    0.000000    .    .    genotype=S;reference=C;coverage=24;refAlleleCounts=9;refAlleleStarts=8;refAlleleMeanQV=23;novelAlleleCounts=13;novelAlleleStarts=12;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01    levure8    SNP    12745    12745    0.000000    .    .    genotype=S;reference=C;coverage=37;refAlleleCounts=18;refAlleleStarts=17;refAlleleMeanQV=18;novelAlleleCounts=18;novelAlleleStarts=13;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=

Expected output is

Code:
Mean_Coverage = 53

Cheers, and thanks for the other answer.

---------- Post updated at 09:07 AM ---------- Previous update was at 09:00 AM ----------

I could make it more complicated, with one mean per position.
Sample input:
Code:
chr01    levure5    SNP    12745    12745    0.000000    .    .    genotype=S;reference=C;coverage=91;refAlleleCounts=44;refAlleleStarts=28;refAlleleMeanQV=19;novelAlleleCounts=40;novelAlleleStarts=25;novelAlleleMeanQV=18;diColor1=22;diColor2=11;het=1;flag=
chr01    levure6    SNP    12745    12745    0.000000    .    .    genotype=S;reference=C;coverage=62;refAlleleCounts=29;refAlleleStarts=19;refAlleleMeanQV=19;novelAlleleCounts=32;novelAlleleStarts=20;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01    levure7    SNP    12745    12745    0.000000    .    .    genotype=S;reference=C;coverage=24;refAlleleCounts=9;refAlleleStarts=8;refAlleleMeanQV=23;novelAlleleCounts=13;novelAlleleStarts=12;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01    levure8    SNP    12745    12745    0.000000    .    .    genotype=S;reference=C;coverage=37;refAlleleCounts=18;refAlleleStarts=17;refAlleleMeanQV=18;novelAlleleCounts=18;novelAlleleStarts=13;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01    levure5    SNP    16254    16254    0.000000    .    .    genotype=R;reference=G;coverage=111;refAlleleCounts=82;refAlleleStarts=41;refAlleleMeanQV=18;novelAlleleCounts=28;novelAlleleStarts=9;novelAlleleMeanQV=18;diColor1=10;diColor2=32;het=1;flag=
chr01    levure6    SNP    16254    16254    0.000000    .    .    genotype=R;reference=G;coverage=96;refAlleleCounts=72;refAlleleStarts=38;refAlleleMeanQV=17;novelAlleleCounts=24;novelAlleleStarts=6;novelAlleleMeanQV=15;diColor1=10;diColor2=32;het=1;flag=
chr01    levure7    SNP    16254    16254    0.000000    .    .    genotype=R;reference=G;coverage=32;refAlleleCounts=20;refAlleleStarts=18;refAlleleMeanQV=19;novelAlleleCounts=12;novelAlleleStarts=5;novelAlleleMeanQV=17;diColor1=10;diColor2=32;het=1;flag=
chr01    levure8    SNP    16254    16254    0.000000    .    .    genotype=R;reference=G;coverage=45;refAlleleCounts=33;refAlleleStarts=25;refAlleleMeanQV=20;novelAlleleCounts=10;novelAlleleStarts=6;novelAlleleMeanQV=19;diColor1=10;diColor2=32;het=1;flag=
chr01    levure5    SNP    16511    16511    0.000000    .    .    genotype=A;reference=G;coverage=42;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=35;novelAlleleStarts=16;novelAlleleMeanQV=19;diColor1=12;diColor2=12;het=0;flag=h4,h10,h9,
chr01    levure6    SNP    16511    16511    0.000000    .    .    genotype=A;reference=G;coverage=32;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=23;novelAlleleStarts=11;novelAlleleMeanQV=17;

Expected output

Code:
Mean_Coverage_for_position _12745 = 53
Mean_Coverage_for_position _16254 = 71
Mean_Coverage_for_position _16511 = 37

Can you provide both versions if possible?
# 55  
Old 03-25-2011
Try this script:
Code:
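#!/usr/bin/perl
use strict;
use warnings;

open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
my (%s, %n, $s_tot, $n_tot);
while (<$in>) {
  my @f = split /[\t ]+/;                           # whitespace-separated columns
  next unless defined $f[8];                        # skip blank or short lines
  my $pos = $f[4];                                  # column 5: SNP position
  my $cov = (split "=", (split ";", $f[8])[2])[1];  # third key=value pair: coverage
  $s{$pos} += $cov; $n{$pos}++;                     # per-position sum and count
  $s_tot   += $cov; $n_tot++;                       # overall sum and count
}
close $in;
for my $i (sort { $a <=> $b } keys %s) {
  print "Mean_Coverage_for_position _$i = " . ($s{$i} / $n{$i}) . "\n";
}
print "Mean_Coverage = " . ($s_tot / $n_tot) . "\n" if $n_tot;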

Run it like this: ./script.pl file
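
As a possible alternative, here is a minimal awk sketch wrapped in a shell function. Instead of relying on coverage being the third pair, it looks the pair up by its coverage= key. It assumes the position is in column 5 and the attributes in column 9, as in the sample. mean_coverage is a hypothetical function name.

```shell
#!/bin/sh
# mean_coverage (hypothetical helper): print the mean coverage per position
# and the overall mean for the given files (or for stdin if none are given).
mean_coverage() {
    awk '
    {
        n = split($9, kv, ";")                     # key=value pairs of column 9
        for (i = 1; i <= n; i++) {
            if (kv[i] ~ /^coverage=/) {
                cov = substr(kv[i], 10)            # text after "coverage="
                if (!($5 in cnt)) order[++np] = $5 # remember first-seen order
                sum[$5] += cov; cnt[$5]++
                tot += cov; ntot++
            }
        }
    }
    END {
        for (j = 1; j <= np; j++) {
            p = order[j]
            printf "Mean_Coverage_for_position _%s = %s\n", p, sum[p] / cnt[p]
        }
        if (ntot) printf "Mean_Coverage = %s\n", tot / ntot
    }' "$@"
}

# Run only when file names are given on the command line.
if [ "$#" -gt 0 ]; then
    mean_coverage "$@"
fi
```

No rounding is applied, so for the four-line sample above it prints 53.5 rather than 53.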
# 56  
Old 03-25-2011
Yeah, that was a nice one. Could you comment the code to explain what each line is doing? I'd appreciate it very much.

Cheers and have a nice weekend

++
 