Find a matched pattern and perform comparison on numbers next to it

10-01-2019

Hi,

I have been trying to extract rows that contain the pattern "cov" and whose value next to it is greater than 3. The "cov" pattern may appear in either $3 or $4 (when using ";" as the field separator). Below is an example.

Input file:

Code:

ENST00000652609.1|ENSG00000230590.10|OTTHUMG00000021850.6|OTTHUMT00000503925.1|FTX-232|FTX|2334|	StringTie	exon	385	622	1000	.	.	gene_id "SRR5206792.443"; transcript_id "SRR5206792.443.1"; exon_number "1"; cov "2.749580";
lnc-NAA50-3:1	StringTie	transcript	1	859	1000	.	.	gene_id "SRR5206792.445"; transcript_id "SRR5206792.445.1"; cov "25.269176"; FPKM "81.295151"; TPM "72.137390";
lnc-NAA50-3:1	StringTie	exon	1	859	1000	.	.	gene_id "SRR5206792.445"; transcript_id "SRR5206792.445.1"; exon_number "1"; cov "25.269176";
lnc-DCAF1-1:1	StringTie	transcript	1	2446	1000	.	.	gene_id "SRR5206792.446"; transcript_id "SRR5206792.446.1"; cov "19.228128"; FPKM "61.860096"; TPM "54.891655";
lnc-DCAF1-1:1	StringTie	exon	1	2446	1000	.	.	gene_id "SRR5206792.446"; transcript_id "SRR5206792.446.1"; exon_number "1"; cov "19.228128";

In the sample file, the "cov" value in the first row is less than 3, so that row should be excluded. The output should look like this:

Code:

lnc-NAA50-3:1	StringTie	transcript	1	859	1000	.	.	gene_id "SRR5206792.445"; transcript_id "SRR5206792.445.1"; cov "25.269176"; FPKM "81.295151"; TPM "72.137390";
lnc-NAA50-3:1	StringTie	exon	1	859	1000	.	.	gene_id "SRR5206792.445"; transcript_id "SRR5206792.445.1"; exon_number "1"; cov "25.269176";
lnc-DCAF1-1:1	StringTie	transcript	1	2446	1000	.	.	gene_id "SRR5206792.446"; transcript_id "SRR5206792.446.1"; cov "19.228128"; FPKM "61.860096"; TPM "54.891655";
lnc-DCAF1-1:1	StringTie	exon	1	2446	1000	.	.	gene_id "SRR5206792.446"; transcript_id "SRR5206792.446.1"; exon_number "1"; cov "19.228128";

I know how to search for the pattern, but I do not know how to test whether the value is > 3. Below is one of my attempts:

Code:

awk 'BEGIN{FS=";"] $0~/cov/ && $3 || $4 >3 {print}' input file

I tried several times to combine the comparison with the pattern match, but failed. I would appreciate your help and advice. Thanks.

Last edited by bunny_merah19; 10-01-2019 at 02:39 AM..

bunny_merah19

10-01-2019

Try (untested)

Code:

awk -F";" 'match ($0, /cov[ ".0-9]*;/) {split (substr ($0, RSTART, RLENGTH), T, "\""); if (T[2] <= 3) next} 1' file
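For anyone reading along, here is the same idea spread over several lines with comments. It is only a sketch: the function name `filter_cov` and the variable `min` are made up for illustration, and it assumes each line carries at most one `cov "number";` attribute. Comparing a whole field such as ` cov "25.269176"` with 3 would be a string comparison in awk, which is why the number is isolated first.

```shell
# filter_cov [file...]: print lines whose cov value is greater than 3.
# Lines without a cov attribute are printed unchanged, as in the one-liner.
filter_cov() {
    awk -v min=3 '
    match($0, /cov "[0-9.]+";/) {
        # The matched text looks like: cov "25.269176";
        # Splitting it on double quotes leaves the number in T[2].
        split(substr($0, RSTART, RLENGTH), T, "\"")
        if (T[2] + 0 <= min + 0) next   # +0 forces a numeric comparison
    }
    { print }
    ' "$@"
}
```

It could be called as `filter_cov input.gtf > filtered.gtf`, where both file names are placeholders.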

Last edited by RudiC; 10-01-2019 at 10:11 AM..

RudiC

10-01-2019

Quote:

Originally Posted by RudiC

Try (untested)

Code:

awk -F";" 'match ($0, /cov[ ".0-9]*;/) {split (substr ($0, RSTART, RLENGTH), T, "\""); if (T[2] <= 3) next} 1' file

Thanks so much. It works perfectly!

I am studying your code, especially this part:

Code:

{split (substr ($0, RSTART, RLENGTH), T, "\""); if (T[2] <= 3) next}

This is good to know, as I have much more data with similar conditions to work on. Again, thanks so much. I really appreciate it.
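To see what each step of that block produces, a small sketch such as the following can help. The sample line is shortened from the input earlier in the thread, and the function name `show_parts` and the printed labels are only for illustration.

```shell
# show_parts: read lines on stdin and print what each step extracts.
show_parts() {
    awk 'match($0, /cov[ ".0-9]*;/) {
        piece = substr($0, RSTART, RLENGTH)   # the text the regex matched
        n = split(piece, T, "\"")             # cut it at every double quote
        print "matched: " piece
        print "pieces:  " n
        print "T[2]:    " T[2]                # the number between the quotes
    }'
}

printf '%s\n' 'gene_id "SRR5206792.445"; exon_number "1"; cov "25.269176";' | show_parts
```

For the sample line this reports `cov "25.269176";` as the matched text, 3 pieces, and 25.269176 in T[2], which is the value that is then compared with 3.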

bunny_merah19

10-03-2019

Quote:

Originally Posted by bunny_merah19

This is good to know, as I have much more data with similar conditions to work on. Again, thanks so much. I really appreciate it.

With a slight change it should also work nicely for string data values:

Code:

$ echo ' wrongkey  "some data"; key "more data";' | 
     awk 'match($0, /[ ;]key +"[^"]*" *;/) {split(substr($0,RSTART,RLENGTH), T, "\""); print T[2]}'
more data
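Building on this, the lookup could be wrapped in an awk function so that any key can be read from a line. This is a rough sketch and not part of the posts above: the names `get_attr` and `attr` are hypothetical, values are assumed to be double-quoted and terminated by `;`, and the key is assumed to contain no regex metacharacters.

```shell
# get_attr KEY [file...]: print the quoted value that follows KEY on each line.
get_attr() {
    key=$1; shift
    awk -v key="$key" '
    # Return the quoted value that follows k, or "" if k is absent.
    function attr(line, k,    T) {
        # (^|[ ;\t]) keeps "key" from matching inside "wrongkey".
        if (!match(line, "(^|[ ;\t])" k " +\"[^\"]*\" *;")) return ""
        split(substr(line, RSTART, RLENGTH), T, "\"")
        return T[2]
    }
    { print attr($0, key) }
    ' "$@"
}
```

Unlike the pattern it is based on, this variant also accepts a key at the very start of the line; a missing key yields an empty output line.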

Chubler_XL

10-03-2019

Quote:

Originally Posted by RudiC

Try (untested)

Code:

awk -F";" 'match ($0, /cov[ ".0-9]*;/) {split (substr ($0, RSTART, RLENGTH), T, "\""); if (T[2] <= 3) next} 1' file

Quote:

Originally Posted by Chubler_XL

With a slight change it should also work nicely for string data values:

Code:

$ echo ' wrongkey  "some data"; key "more data";' | 
     awk 'match($0, /[ ;]key +"[^"]*" *;/) {split(substr($0,RSTART,RLENGTH), T, "\""); print T[2]}'
more data

This is great! Thanks so much.

bunny_merah19

12-02-2019

Hi,

The output that I need to generate has changed slightly. Let's say I have the input data below:

Code:

SIN3A-2:2    StringTie    transcript    15    2652    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; cov "2.846695"; FPKM "9.158292"; TPM "8.126626";
SIN3A-2:2    StringTie    exon    15    536    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "1"; cov "1.019540";
SIN3A-2:2    StringTie    exon    725    1045    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "2"; cov "2.834891";
SIN3A-2:2    StringTie    exon    1268    1509    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "3"; cov "5.954821";
SIN3A-2:2    StringTie    exon    1867    1990    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "4"; cov "3.971774";
SIN3A-2:2    StringTie    exon    2344    2465    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "5"; cov "3.590164";
SIN3A-2:2    StringTie    exon    2567    2652    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "6"; cov "2.558140";
SIN3A-2:2    StringTie    transcript    3744    4813    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; cov "6.767245"; FPKM "21.771355"; TPM "19.318848";
SIN3A-2:2    StringTie    exon    3744    3851    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "1"; cov "12.069445";
SIN3A-2:2    StringTie    exon    3937    4093    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "2"; cov "13.160297";
SIN3A-2:2    StringTie    exon    4211    4813    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "3"; cov "4.153071";
SIN3A-2:5    StringTie    transcript    6    818    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; cov "5.941011"; FPKM "19.113222"; TPM "16.960150";
SIN3A-2:5    StringTie    exon    6    315    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "1"; cov "5.615038";
SIN3A-2:5    StringTie    exon    510    607    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "2"; cov "7.288415";
SIN3A-2:5    StringTie    exon    782    818    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "3"; cov "5.103339";

The first line with "transcript" in $3 has a "cov" value less than 3. Therefore, the lines following it (with "exon" in $3) need to be removed as well, even though some of them have a cov greater than 3.

The output file should be like below:

Code:

SIN3A-2:2    StringTie    transcript    3744    4813    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; cov "6.767245"; FPKM "21.771355"; TPM "19.318848";
SIN3A-2:2    StringTie    exon    3744    3851    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "1"; cov "12.069445";
SIN3A-2:2    StringTie    exon    3937    4093    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "2"; cov "13.160297";
SIN3A-2:2    StringTie    exon    4211    4813    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "3"; cov "4.153071";
SIN3A-2:5    StringTie    transcript    6    818    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; cov "5.941011"; FPKM "19.113222"; TPM "16.960150";
SIN3A-2:5    StringTie    exon    6    315    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "1"; cov "5.615038";
SIN3A-2:5    StringTie    exon    510    607    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "2"; cov "7.288415";
SIN3A-2:5    StringTie    exon    782    818    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "3"; cov "5.103339";

I tried to adapt the code given earlier, but the output file still retains the lines with "exon". Below is one of my attempts:

Code:

awk -F"[\t;]" '$3 ~/transcript/ {if(match ($0,/cov[ ".0-9]*;/)) {split (substr ($0, RSTART, RLENGTH), T, "\""); if (T[2] <= 3) next}     
        SRC=$1
        OUT=""
        }                
$1==SRC {OUT= OUT ORS $0} 

{print}'  inputfile > outputfile

Can anyone please help and tell me what I did wrong? Thanks.

bunny_merah19

12-02-2019

Try

Code:

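awk  '
# On each transcript line, extract its cov value and remember whether it exceeds 3.
$3 == "transcript"      {match ($0, /cov[ ".0-9]*;/)
                         split (substr ($0, RSTART, RLENGTH), T, "\"")
                         PR =  (T[2] > 3)
                        }
# Print while the flag is set; exon lines inherit the flag of the preceding transcript.
PR
' file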
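In the earlier attempt, `next` only skips the transcript line itself and the final `{print}` still runs for every other line, so the decision made on the transcript line has to be remembered somehow. The flag above remembers it by position, which relies on each exon line directly following its transcript line. If that ordering cannot be guaranteed, a two-pass variant keyed on transcript_id may be safer. This is a sketch under assumptions: the names `keep_covered` and `attr` are made up, the input must be a regular file because it is read twice, and every line is assumed to carry a `transcript_id "..."` attribute.

```shell
# keep_covered FILE: print all lines of transcripts whose own cov is above 3.
keep_covered() {
    awk '
    # Return the quoted value that follows key k on the line, or "".
    function attr(line, k,    T) {
        if (!match(line, k " \"[^\"]*\";")) return ""
        split(substr(line, RSTART, RLENGTH), T, "\"")
        return T[2]
    }
    # Pass 1: remember the ids of transcripts whose cov exceeds 3.
    NR == FNR {
        if ($3 == "transcript" && attr($0, "cov") + 0 > 3)
            keep[attr($0, "transcript_id")] = 1
        next
    }
    # Pass 2: print every line that belongs to a remembered transcript.
    (attr($0, "transcript_id") in keep)
    ' "$1" "$1"
}
```

It could be run as `keep_covered inputfile > outputfile`. The exon cov values play no part in the decision, matching the requirement that only the transcript line decides.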

RudiC
