Find a matched pattern and perform comparison on numbers next to it


 
# 1  
Old 10-01-2019

Hi,

I have been trying to extract the rows that contain the attribute "cov" and whose value next to it is > 3. The "cov" attribute may appear in either $3 or $4 (when using ";" as the field separator). Below is an example:

input file
Code:
ENST00000652609.1|ENSG00000230590.10|OTTHUMG00000021850.6|OTTHUMT00000503925.1|FTX-232|FTX|2334|	StringTie	exon	385	622	1000	.	.	gene_id "SRR5206792.443"; transcript_id "SRR5206792.443.1"; exon_number "1"; cov "2.749580";
lnc-NAA50-3:1	StringTie	transcript	1	859	1000	.	.	gene_id "SRR5206792.445"; transcript_id "SRR5206792.445.1"; cov "25.269176"; FPKM "81.295151"; TPM "72.137390";
lnc-NAA50-3:1	StringTie	exon	1	859	1000	.	.	gene_id "SRR5206792.445"; transcript_id "SRR5206792.445.1"; exon_number "1"; cov "25.269176";
lnc-DCAF1-1:1	StringTie	transcript	1	2446	1000	.	.	gene_id "SRR5206792.446"; transcript_id "SRR5206792.446.1"; cov "19.228128"; FPKM "61.860096"; TPM "54.891655";
lnc-DCAF1-1:1	StringTie	exon	1	2446	1000	.	.	gene_id "SRR5206792.446"; transcript_id "SRR5206792.446.1"; exon_number "1"; cov "19.228128";

From the sample file, the "cov" value for the first row is less than 3. Therefore, it should be excluded and the output should be like below.
Code:
lnc-NAA50-3:1	StringTie	transcript	1	859	1000	.	.	gene_id "SRR5206792.445"; transcript_id "SRR5206792.445.1"; cov "25.269176"; FPKM "81.295151"; TPM "72.137390";
lnc-NAA50-3:1	StringTie	exon	1	859	1000	.	.	gene_id "SRR5206792.445"; transcript_id "SRR5206792.445.1"; exon_number "1"; cov "25.269176";
lnc-DCAF1-1:1	StringTie	transcript	1	2446	1000	.	.	gene_id "SRR5206792.446"; transcript_id "SRR5206792.446.1"; cov "19.228128"; FPKM "61.860096"; TPM "54.891655";
lnc-DCAF1-1:1	StringTie	exon	1	2446	1000	.	.	gene_id "SRR5206792.446"; transcript_id "SRR5206792.446.1"; exon_number "1"; cov "19.228128";

I know how to search for the pattern but do not know how to test whether the value is > 3. Below is one of my attempts:
Code:
awk 'BEGIN{FS=";"] $0~/cov/ && $3 || $4 >3 {print}' input file

I tried a couple of times to combine the comparison with the pattern matching but failed. I would appreciate your kind help and advice. Thanks.

Last edited by bunny_merah19; 10-01-2019 at 02:39 AM..
# 2  
Old 10-01-2019
Try (untested)
Code:
awk -F";" 'match ($0, /cov[ ".0-9]*;/) {split (substr ($0, RSTART, RLENGTH), T, "\""); if (T[2] <= 3) next} 1' file
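For anyone reading this later: the attempt in post #1 cannot work as written. Apart from the stray "]", the condition parses as (($0 ~ /cov/) && $3) || ($4 > 3), so the whole text of $3 is tested for being non-empty instead of the number after "cov" being compared. Below is the same one-liner spread over several lines with comments. It is only a sketch: the two sample lines are shortened, made-up records and the variable names sample and kept are assumptions for the demonstration.

```shell
# Hypothetical two-line sample: the first record has cov below 3, the second above.
sample='chrA StringTie exon 1 10 1000 . . gene_id "g1"; exon_number "1"; cov "2.749580";
chrB StringTie transcript 1 20 1000 . . gene_id "g2"; cov "25.269176"; FPKM "81.295151";'

kept=$(printf '%s\n' "$sample" | awk '
match($0, /cov[ ".0-9]*;/) {                     # locate the text: cov "NUMBER";
    split(substr($0, RSTART, RLENGTH), T, "\"")  # T[2] is the text between the quotes
    if (T[2] <= 3) next                          # looks numeric, so it is compared as a number
}
1                                                # print every line that was not skipped
')
printf '%s\n' "$kept"
```

Because the whole line is searched with match(), it does not matter whether "cov" ends up in $3 or $4, and the -F";" option is not actually needed.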


Last edited by RudiC; 10-01-2019 at 10:11 AM..
# 3  
Old 10-01-2019
Quote:
Originally Posted by RudiC
Try (untested)
Code:
awk -F";" 'match ($0, /cov[ ".0-9]*;/) {split (substr ($0, RSTART, RLENGTH), T, "\""); if (T[2] <= 3) next} 1' file

Thanks so much. It works perfectly!

I am studying this part of your code:
Code:
{split (substr ($0, RSTART, RLENGTH), T, "\""); if (T[2] <= 3) next}

This is very good to know, as I have much more data with similar conditions to work on. Again, thanks so much. I really appreciate it.
# 4  
Old 10-03-2019
Quote:
Originally Posted by bunny_merah19
This is very good to know, as I have much more data with similar conditions to work on. Again, thanks so much. I really appreciate it.
With a slight change it should also work nicely for string data values:

Code:
$ echo ' wrongkey  "some data"; key "more data";' | 
     awk 'match($0, /[ ;]key +"[^"]*" *;/) {split(substr($0,RSTART,RLENGTH), T, "\""); print T[2]}'
more data
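
If several keys have to be read from the same line, the match/split step can be wrapped in a small function. This is only a sketch: getval() is a hypothetical helper name, not something from the posts above, and the extra cov "2.5" field in the test line is made up for the demonstration.

```shell
val=$(echo ' wrongkey  "some data"; key "more data"; cov "2.5";' | awk '
# getval() is a hypothetical helper: returns the quoted value that follows key
function getval(line, key,    T) {
    # the key must start the line or follow a blank or a semicolon
    if (match(line, "(^|[ ;])" key " +\"[^\"]*\" *;")) {
        split(substr(line, RSTART, RLENGTH), T, "\"")
        return T[2]
    }
    return ""                                   # key not present on this line
}
{
    print getval($0, "key")                     # string value
    print getval($0, "cov") + 0                 # adding 0 forces a number
}
')
printf '%s\n' "$val"
```

Requiring a blank, a semicolon or the start of the line in front of the key is what keeps "wrongkey" from being picked up.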

# 5  
Old 10-03-2019
This is great! Thanks so much.
# 6  
Old 12-02-2019
Hi,


There is a slight change in the output that I need to generate. Let's say I have the input data below:

Code:
SIN3A-2:2    StringTie    transcript    15    2652    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; cov "2.846695"; FPKM "9.158292"; TPM "8.126626";
SIN3A-2:2    StringTie    exon    15    536    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "1"; cov "1.019540";
SIN3A-2:2    StringTie    exon    725    1045    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "2"; cov "2.834891";
SIN3A-2:2    StringTie    exon    1268    1509    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "3"; cov "5.954821";
SIN3A-2:2    StringTie    exon    1867    1990    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "4"; cov "3.971774";
SIN3A-2:2    StringTie    exon    2344    2465    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "5"; cov "3.590164";
SIN3A-2:2    StringTie    exon    2567    2652    1000    +    .    gene_id "R1792.2978"; transcript_id "R1792.2978.1"; exon_number "6"; cov "2.558140";
SIN3A-2:2    StringTie    transcript    3744    4813    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; cov "6.767245"; FPKM "21.771355"; TPM "19.318848";
SIN3A-2:2    StringTie    exon    3744    3851    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "1"; cov "12.069445";
SIN3A-2:2    StringTie    exon    3937    4093    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "2"; cov "13.160297";
SIN3A-2:2    StringTie    exon    4211    4813    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "3"; cov "4.153071";
SIN3A-2:5    StringTie    transcript    6    818    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; cov "5.941011"; FPKM "19.113222"; TPM "16.960150";
SIN3A-2:5    StringTie    exon    6    315    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "1"; cov "5.615038";
SIN3A-2:5    StringTie    exon    510    607    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "2"; cov "7.288415";
SIN3A-2:5    StringTie    exon    782    818    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "3"; cov "5.103339";

The first line with "transcript" in $3 has a "cov" value of less than 3. Therefore, the lines that follow it (with "exon" in $3) need to be removed as well, even though some of them have a cov greater than 3.



The output file should be like below:


Code:
SIN3A-2:2    StringTie    transcript    3744    4813    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; cov "6.767245"; FPKM "21.771355"; TPM "19.318848";
SIN3A-2:2    StringTie    exon    3744    3851    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "1"; cov "12.069445";
SIN3A-2:2    StringTie    exon    3937    4093    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "2"; cov "13.160297";
SIN3A-2:2    StringTie    exon    4211    4813    1000    +    .    gene_id "R1792.2979"; transcript_id "R1792.2979.1"; exon_number "3"; cov "4.153071";
SIN3A-2:5    StringTie    transcript    6    818    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; cov "5.941011"; FPKM "19.113222"; TPM "16.960150";
SIN3A-2:5    StringTie    exon    6    315    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "1"; cov "5.615038";
SIN3A-2:5    StringTie    exon    510    607    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "2"; cov "7.288415";
SIN3A-2:5    StringTie    exon    782    818    1000    +    .    gene_id "R1792.2981"; transcript_id "R1792.2981.1"; exon_number "3"; cov "5.103339";

I tried to adapt the code given earlier, but the output file still retains the "exon" lines. Below is one of my attempts:


Code:
awk -F"[\t;]" '$3 ~/transcript/ {if(match ($0,/cov[ ".0-9]*;/)) {split (substr ($0, RSTART, RLENGTH), T, "\""); if (T[2] <= 3) next}     
        SRC=$1
        OUT=""
        }                
$1==SRC {OUT= OUT ORS $0} 

{print}'  inputfile > outputfile

Can anyone please help and tell me what I did wrong? Thanks.
# 7  
Old 12-02-2019
Try
Code:
awk  '
$3 == "transcript"      {match ($0, /cov[ ".0-9]*;/)
                         split (substr ($0, RSTART, RLENGTH), T, "\"")
                         PR =  (T[2] > 3)
                        }
PR
' file
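
The attempt in post #6 keeps the exon lines because "next" only skips the transcript line itself, while the final {print} still runs for every following line. The fix is to remember the decision made on the transcript line and apply it to the lines that follow. Below is a commented sketch of that idea on a shortened, made-up sample; the names sample, kept and keep are assumptions for the demonstration, and unlike the code above it also drops the group when a transcript line has no cov field at all.

```shell
# Hypothetical sample: group g1 has transcript cov below 3, group g2 above.
sample='t1 StringTie transcript 15 2652 1000 + . gene_id "g1"; transcript_id "g1.1"; cov "2.846695"; FPKM "9.158292";
t1 StringTie exon 1268 1509 1000 + . gene_id "g1"; transcript_id "g1.1"; exon_number "3"; cov "5.954821";
t1 StringTie transcript 3744 4813 1000 + . gene_id "g2"; transcript_id "g2.1"; cov "6.767245"; FPKM "21.771355";
t1 StringTie exon 3744 3851 1000 + . gene_id "g2"; transcript_id "g2.1"; exon_number "1"; cov "12.069445";'

kept=$(printf '%s\n' "$sample" | awk '
$3 == "transcript" {                     # a new group starts on every transcript line
    keep = 0                             # default: drop the group
    if (match($0, /cov[ ".0-9]*;/)) {
        split(substr($0, RSTART, RLENGTH), T, "\"")
        keep = (T[2] > 3)                # 1 if the transcript cov is above 3, else 0
    }
}
keep                                     # print the line while the current group is kept
')
printf '%s\n' "$kept"
```

The exon with cov 5.954821 is dropped together with its transcript, which is the behaviour asked for in post #6.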
