Getting non unique lines from concatenated files

03-21-2011

What is the field separator in your file? Multiple spaces? Or TAB maybe? Anyway try this:

Code:

perl -lne '$h{$.}=$_;END{@o=sort{(split "=",(split ";",(split "[\t ]+",$h{$a})[8])[2])[1]<=>(split "=",(split ";",(split "[\t ]+",$h{$b})[8])[2])[1]}keys %h; for $i (@o){print $h{$i}}}' file

bartus11

03-21-2011

This worked perfectly with the \t introduced in the new code. Yes, my fields are separated by tabs. Awesome stuff, Bartus11! Once again, if you could please break the code down and comment on it, that would make it simpler for me to understand and apply elsewhere.

This code has a special <=> operator. What does it do?

Thank you very much again. Have a nice evening.

pawannoel

03-21-2011

After giving it a little thought, I made a little optimisation to that code:

Code:

perl -ne '$h[$.]=$_;END{print sort{(split "=",(split ";",(split "[\t ]+",$a)[8])[2])[1]<=>(split "=",(split ";",(split "[\t ]+",$b)[8])[2])[1]}@h}' file

I replaced hash (associative array) with regular array (@h), as it is enough for this task.
$h[$.]=$_ : load each line into array @h, indexed by line number
print : print whatever the sort function returns
sort{...}@h : sort the @h array (read up on sorting arrays in Perl, as it is too big a subject for a short post)
split "[\t ]+",$a : split the first element of the pair being compared, using runs of TABs and spaces as the field separator
(split "[\t ]+",$a)[8] : take the 9th field from the list returned by that split
(split ";",(split "[\t ]+",$a)[8])[2] : split that 9th field using ";" as separator and take the 3rd field from the resulting list
(split "=",(split ";",(split "[\t ]+",$a)[8])[2])[1] : split that field (it now contains something like coverage=43) using "=" as separator and take the 2nd field, so this whole expression cuts the value of "coverage" out of the line
The same happens with the second element of the pair ($b): (split "=",(split ";",(split "[\t ]+",$b)[8])[2])[1]
When both values have been extracted, the comparison itself takes place by means of the "<=>" operator. It compares its two operands as numbers and returns -1, 0 or 1 depending on whether the left one is smaller than, equal to or greater than the right one, which is exactly what "sort" needs to order the array numerically. You can read more about that operator in "Learning Perl".

bartus11

03-21-2011

Thank you very much always Bartus11 ...
Your explanation is very good

... I will definitely try to follow the Learning Perl book in more detail
Cheers

pawannoel

03-24-2011

@Bartus11
Hi, hope you are well.
Can I ask you another question about the same data set, for which you have kindly provided the other answers?

Code:

chr01   levure5 SNP     12745   12745   0.000000        .        .        genotype=S;reference=C;coverage=91;refAlleleCounts=44;refAlleleStarts=28;refAlleleMeanQV=19;novelAlleleCounts=40;novelAlleleStarts=25;novelAlleleMeanQV=18;diColor1=22;diColor2=11;het=1;flag=
chr01   levure6 SNP     12745   12745   0.000000        .       .        genotype=S;reference=C;coverage=62;refAlleleCounts=29;refAlleleStarts=19;refAlleleMeanQV=19;novelAlleleCounts=32;novelAlleleStarts=20;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01   levure7 SNP     12745   12745   0.000000        .       .        genotype=S;reference=C;coverage=24;refAlleleCounts=9;refAlleleStarts=8;refAlleleMeanQV=23;novelAlleleCounts=13;novelAlleleStarts=12;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01   levure8 SNP     12745   12745   0.000000        .       .        genotype=S;reference=C;coverage=37;refAlleleCounts=18;refAlleleStarts=17;refAlleleMeanQV=18;novelAlleleCounts=18;novelAlleleStarts=13;novelAlleleMeanQV=20;diColor1=11;diColor2=22;het=1;flag=
chr01   levure5 SNP     16254   16254   0.000000        .       .        genotype=R;reference=G;coverage=111;refAlleleCounts=82;refAlleleStarts=41;refAlleleMeanQV=18;novelAlleleCounts=28;novelAlleleStarts=9;novelAlleleMeanQV=18;diColor1=10;diColor2=32;het=1;flag=
chr01   levure6 SNP     16254   16254   0.000000        .       .        genotype=R;reference=G;coverage=96;refAlleleCounts=72;refAlleleStarts=38;refAlleleMeanQV=17;novelAlleleCounts=24;novelAlleleStarts=6;novelAlleleMeanQV=15;diColor1=10;diColor2=32;het=1;flag=
chr01   levure7 SNP     16254   16254   0.000000        .       .        genotype=R;reference=G;coverage=32;refAlleleCounts=20;refAlleleStarts=18;refAlleleMeanQV=19;novelAlleleCounts=12;novelAlleleStarts=5;novelAlleleMeanQV=17;diColor1=10;diColor2=32;het=1;flag=
chr01   levure8 SNP     16254   16254   0.000000        .       .        genotype=R;reference=G;coverage=45;refAlleleCounts=33;refAlleleStarts=25;refAlleleMeanQV=20;novelAlleleCounts=10;novelAlleleStarts=6;novelAlleleMeanQV=19;diColor1=10;diColor2=32;het=1;flag=
chr01   levure5 SNP     16511   16511   0.000000        .       .        genotype=A;reference=G;coverage=42;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=35;novelAlleleStarts=16;novelAlleleMeanQV=19;diColor1=12;diColor2=12;het=0;flag=h4,h10,h9,
chr01   levure6 SNP     16511   16511   0.000000        .       .        genotype=A;reference=G;coverage=32;refAlleleCounts=0;refAlleleStarts=0;refAlleleMeanQV=0;novelAlleleCounts=23;novelAlleleStarts=11;novelAlleleMeanQV=17;diColor1=12;diColor2=12;het=0;flag=h4,h10,h9,

Last time you helped me sort the data according to any ;-delimited parameter of choice in the last field of each line. This time I want to know details about the genotype parameter in this field.
So, taking the above example, what I want is the count of each type of genotype. My expected output for the above would be:

Code:

S=4
R=4
A=2

Genotype is always denoted by a capital A-Z letter, so I reckon the regex can be restricted to that pattern.

Would be nice if you can help out on this.

Have a nice day.

Cheers

pawannoel

03-24-2011

Code:

egrep -iow '(http[s]*[:][/]+|www[.])[^"\<>]*' url.txt

Is this command logically incorrect for matching a URL pattern inside a file and displaying only the URLs in the terminal?

Please point out the error in my syntax.

an2up

03-24-2011

This uses most of the code from the last solution:

Code:

perl -nle '$h{((split "=",(split ";",(split "[\t ]+",$_)[8])[0])[1])}++;END{for $i (keys %h){print "$i=$h{$i}"}}' file

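Cascaded splits are used to get the genotype value: element [0] of the ;-separated 9th field is something like genotype=S, and element [1] after splitting that on "=" is the letter itself. That letter is then used as a key in the hash %h, whose values hold the number of occurrences. In the "END" section the contents of %h are printed. Note that "keys %h" returns the keys in no particular order, so the lines may not come out in the order the genotypes appear in the file.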

bartus11
