Find matched pattern and print all based on certain conditions


 
# 1  
Old 10-08-2019

Hi,

I am trying to extract data based on certain conditions. My sample input file is below:

Code:
lnc-2:1	OnePiece	tra_law	500	688	1	.	.	g_id "R792.8417"# tra_law_id "R792.8417.1"# g_line "2.711647"# KM "8.723820"#
lnc-2:1	OnePiece	room	500	510	1	.	.	g_id "R792.8417"# tra_law_id "R792.8417.1"# room_number "1"# g_line "2.711647"#
lnc-2:1	OnePiece	room	540	588	1	.	.	g_id "R792.8417"# tra_law_id "R792.8417.1"# room_number "2"# g_line "2.711647"#
lnc-2:1	OnePiece	room	620	650	1	.	.	g_id "R792.8417"# tra_law_id "R792.8417.1"# room_number "3"# g_line "2.711647"#
lnc-2:1	OnePiece	room	660	688	1	.	.	g_id "R792.8417"# tra_law_id "R792.8417.1"# room_number "4"# g_line "2.711647"#
lnc-1:3	OnePiece	tra_law	1	3601	1	.	.	g_id "R792.8416"# tra_law_id "R792.8416.1"# g_line "36.370155"# KM "117.008842"#
lnc-1:3	OnePiece	room	1	601	1	.	.	g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "1"# g_line "36.370155"#
lnc-1:3	OnePiece	room	1020	3001	1	.	.	g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "2"# g_line "36.370155"#
lnc-1:3	OnePiece	room	3400	3601	1	.	.	g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "3"# g_line "36.370155"#
lnc-9:1	OnePiece	tra_law	1743	2314	1	.	.	g_id "R792.8419"# tra_law_id "R792.8419.1"# g_line "27.213287"# KM "87.549683"#
lnc-9:1	OnePiece	room	1743	2314	1	.	.	g_id "R792.8419"# tra_law_id "R792.8419.1"# room_number "1"# g_line "27.213287"#
lnc-16:4	OnePiece	tra_law	25408	63025	1	-	.	g_id "R792.8420"# tra_law_id "R792.8420.1"# g_line "357.721802"# KM "1150.850586"#
lnc-16:4	OnePiece	room	25408	25528	1	-	.	g_id "R792.8420"# tra_law_id "R792.8420.1"# room_number "1"# g_line "765.276733"#
lnc-16:4	OnePiece	room	62888	63025	1	-	.	g_id "R792.8420"# tra_law_id "R792.8420.1"# room_number "2"# g_line "0.372920"#

I want output in which, whenever all conditions are met, every line with the same name in $1 is printed. The conditions are as follows:

1) "tra_law" is found in $3 and $5 - $4 (of the tra_law line) is > 200. All the following lines associated with it should then be printed.
2) Then the room_number entries in the last column should be counted; only groups with at least 3 rooms are kept.

The output should look like this:

Code:
lnc-1:3	OnePiece	tra_law	1	3601	1	.	.	g_id "R792.8416"# tra_law_id "R792.8416.1"# g_line "36.370155"# KM "117.008842"#
lnc-1:3	OnePiece	room	1	601	1	.	.	g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "1"# g_line "36.370155"#
lnc-1:3	OnePiece	room	1020	3001	1	.	.	g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "2"# g_line "36.370155"#
lnc-1:3	OnePiece	room	3400	3601	1	.	.	g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "3"# g_line "36.370155"#

As you can see, only lnc-1:3 meets the conditions. For lnc-2:1, the tra_law span is less than 200 (688 - 500 = 188), so it is omitted. lnc-9:1 and lnc-16:4 have tra_law spans > 200, but both are omitted as well because they have fewer than 3 room_number entries.
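
The arithmetic above is easy to check mechanically. The sketch below prints, for each tra_law line, its name, the $5 - $4 span, and the number of room lines that follow it. The `sample` and `summarize` names are made up for illustration, and the rows are an abbreviated copy of the sample input (first five columns only).

```shell
# Abbreviated, TAB-separated copy of the sample input: name, source, type, start, end.
# printf reuses the format string for every group of four arguments.
sample() {
  printf '%s\tOnePiece\t%s\t%s\t%s\n' \
    lnc-2:1 tra_law 500 688      lnc-2:1 room 500 510 \
    lnc-2:1 room 540 588         lnc-2:1 room 620 650 \
    lnc-2:1 room 660 688 \
    lnc-1:3 tra_law 1 3601       lnc-1:3 room 1 601 \
    lnc-1:3 room 1020 3001       lnc-1:3 room 3400 3601 \
    lnc-9:1 tra_law 1743 2314    lnc-9:1 room 1743 2314 \
    lnc-16:4 tra_law 25408 63025 lnc-16:4 room 25408 25528 \
    lnc-16:4 room 62888 63025
}

# One summary line per tra_law block: name, span ($5 - $4), number of rooms.
summarize() {
  awk -F'\t' '
    $3 == "tra_law"   { name[++n] = $1; delta[n] = $5 - $4; rooms[n] = 0; next }
    $3 == "room" && n { rooms[n]++ }
    END { for (i = 1; i <= n; i++) print name[i], delta[i], rooms[i] }
  '
}

sample | summarize
# lnc-2:1 188 4
# lnc-1:3 3600 3
# lnc-9:1 571 1
# lnc-16:4 37617 2
```

Only lnc-1:3 has both a span over 200 and at least 3 rooms.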

I tried to use awk for this. My code is below:

Code:
awk -F"\t" 'NR>2 {$20=$5-$4; if ($20>200 && $3 ~/tra_law/) print $0}' inputfile | awk '{NF--NF};1' > outputfile

I got the results for condition 1, but it did not print the following lines associated with each match. Also, I do not know how to check condition 2. I would prefer condition 2 to be in a separate awk command, as I might need to use the two separately in different situations. I tried many times but failed. I'd appreciate your help. Thanks.
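
One possible way to keep the two conditions in separate awk commands, as a sketch only: each stage treats a tra_law line as the start of a block, so the stages can be used alone or chained. The names `sample`, `by_delta`, `by_rooms` and `emit` are hypothetical, the thresholds are passed as optional arguments, and the data is an abbreviated copy of the sample input. Lines before the first tra_law line (such as headers) are dropped by both stages.

```shell
# Abbreviated, TAB-separated copy of the sample input: name, source, type, start, end.
sample() {
  printf '%s\tOnePiece\t%s\t%s\t%s\n' \
    lnc-2:1 tra_law 500 688      lnc-2:1 room 500 510 \
    lnc-2:1 room 540 588         lnc-2:1 room 620 650 \
    lnc-2:1 room 660 688 \
    lnc-1:3 tra_law 1 3601       lnc-1:3 room 1 601 \
    lnc-1:3 room 1020 3001       lnc-1:3 room 3400 3601 \
    lnc-9:1 tra_law 1743 2314    lnc-9:1 room 1743 2314 \
    lnc-16:4 tra_law 25408 63025 lnc-16:4 room 25408 25528 \
    lnc-16:4 room 62888 63025
}

# Condition 1: keep a tra_law line and the lines after it only if $5 - $4 > min.
by_delta() {
  awk -F'\t' -v min="${1:-200}" '
    $3 == "tra_law" { keep = ($5 - $4 > min) }
    keep
  '
}

# Condition 2: buffer each block and print it only if it has at least min room lines.
by_rooms() {
  awk -F'\t' -v min="${1:-3}" '
    function emit() { if (rooms >= min) printf "%s", buf; buf = ""; rooms = 0 }
    $3 == "tra_law" { emit() }
    $3 == "room"    { rooms++ }
                    { buf = buf $0 ORS }
    END             { emit() }
  '
}

sample | by_delta 200 | by_rooms 3    # prints the four lnc-1:3 lines
```

Run on its own, `sample | by_rooms 3` would also keep the lnc-2:1 block (4 rooms), which is why both stages are needed for the output requested above.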

Last edited by bunny_merah19; 10-08-2019 at 07:47 AM..
# 2  
Old 10-08-2019
Help me out: what does {NF--NF} do in your second awk script? An uncommon construct, at least to me...

For your problem, how about
Code:
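awk -F"\t" '
# A tra_law line spanning more than 200 starts a new block. Before starting
# it, print the previous block if it holds more than 3 lines
# (1 tra_law line + at least 3 room lines).
NR > 2           &&
($5-$4) > 200    &&
$3 ~ /tra_law/  {if (CNT > 3) print substr (OUT, 2)
                 SRC = $1
                 OUT = ""
                 CNT = 0
                }
# Collect every line of the current block. OUT starts with ORS, which
# substr (OUT, 2) strips when printing.
$1 == SRC       {OUT = OUT ORS $0
                 CNT ++
                }
END             {if (CNT > 3) print substr (OUT, 2)
                }

' file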

lnc-1:3    OnePiece    tra_law    1    3601    1    .    .    g_id "R792.8416"# tra_law_id "R792.8416.1"# g_line "36.370155"# KM "117.008842"#
lnc-1:3    OnePiece    room       1     601    1    .    .    g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "1"# g_line "36.370155"#
lnc-1:3    OnePiece    room    1020    3001    1    .    .    g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "2"# g_line "36.370155"#
lnc-1:3    OnePiece    room    3400    3601    1    .    .    g_id "R792.8416"# tra_law_id "R792.8416.1"# room_number "3"# g_line "36.370155"#


Last edited by RudiC; 10-08-2019 at 05:17 AM..
# 3  
Old 10-08-2019
Hi RudiC,

Thanks so much... your code works great on my real data.

As you noticed, I created $20 to hold the result of $5 - $4. I needed to remove $20 (the last column) from the final output, so I used {NF--NF} for that purpose. It was just my silly way, I guess.
# 4  
Old 10-08-2019
Well, yes, I noticed the necessary removal of $20. So NF-- would remove it (if it is the last field). What is the trailing NF for?


And, why $20 when your entire file has only 9 fields per line consistently?
And - as I see now - why the NR>2 condition? You normally use this to eliminate e.g. headers. But your file seems to have valid data in lines 1 and 2?
# 5  
Old 10-08-2019
Quote:
Originally Posted by RudiC
Well, yes, I noticed the necessary removal of $20. So NF-- would remove it (if it is the last field). What is the trailing NF for?

--> I found ways of removing the last column in this Unix & Linux Stack Exchange question:
"How to delete the last column of a file in Linux"

Code:
awk 'NF{NF-=1};1' <in >out
or:

awk 'NF{NF--};1' <in >out
or:

awk 'NF{--NF};1' <in >out
Although this looks like voodoo, it works. There are three parts to each of these awk commands.

The first is NF, which is a precondition for the second part. NF is a variable containing the number of fields in a line. In awk, a value is true if it is not 0 or the empty string "". Hence, the second part (where NF is decremented) only happens if NF is not 0.

The second part (either NF-=1, NF-- or --NF) just subtracts one from the NF variable. This prevents the last field from being printed: when you change NF (removing the last field in this case), awk reconstructs $0 by joining the remaining fields with OFS, a single space by default, so $0 no longer contains the last field.

The final part is 1. It's not magical; it's just an expression that evaluates to true. If an awk pattern evaluates to true and has no associated action, awk's default action is print $0.
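
Because $0 is rebuilt with OFS, a TAB-separated file comes out space-separated unless OFS is set as well. Here is a minimal sketch that keeps TABs; `drop_last` is a made-up name, and it relies on the awk in use rebuilding $0 when NF is decremented, which gawk and mawk do but, per the gawk manual, some other awk versions do not. As far as I can tell, my original `NF--NF` is read as `NF--` followed by `NF`, i.e. a post-decrement whose value is concatenated with NF and then discarded, so only the decrement has any effect.

```shell
# Remove the last TAB-separated field from every non-empty line,
# keeping TAB as the output separator.
drop_last() {
  awk 'BEGIN { FS = OFS = "\t" } NF { NF-- } 1'
}

printf 'a\tb\tc\n' | drop_last    # prints a<TAB>b
```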

And, why $20 when your entire file has only 9 fields per line consistently?
-> My real data has more fields than that. What I posted here is just a sample; the real data is quite long and messy.

And - as I see now - why the NR>2 condition? You normally use this to eliminate e.g. headers. But your file seems to have valid data in lines 1 and 2?
-> I do have 2 header lines in my real data that need to be omitted.
# 6  
Old 11-27-2019
Hi,

I noticed that the code does not work for the input below.

Code:
CL-AS1:4	OnePiece	tra_law	4721	4962	1	.	.	g_id "R06794.16434"; tra_law_id "R06794.16434.1"; g_line "6.980716"; KM "4.794062"; PM"4.235367";
CL-AS1:4	OnePiece	room		4721	4962	1	.	.	g_id "R06794.16434"; tra_law_id "R06794.16434.1"; room_number "1"; g_line "6.980716";
CL-AS1:4	OnePiece	tra_law	5085	5285	1	.	.	g_id "R06794.16435"; tra_law_id "R06794.16435.1"; g_line "4.355471"; KM "2.991154"; PM"2.642568";
CL-AS1:4	OnePiece	room		5085	5285	1	.	.	g_id "R06794.16435"; tra_law_id "R06794.16435.1"; room_number "1"; g_line "4.355471";
CL-AS1:4	OnePiece	tra_law	6800	24864	1	-	.	g_id "R06794.16436"; tra_law_id "R06794.16436.1"; g_line "5.995821"; KM "4.117677"; PM"3.637807";
CL-AS1:4	OnePiece	room		6800	7033	1	-	.	g_id "R06794.16436"; tra_law_id "R06794.16436.1"; room_number "1"; g_line "6.462393";
CL-AS1:4	OnePiece	room		24831	24864	1	-	.	g_id "R06794.16436"; tra_law_id "R06794.16436.1"; room_number "2"; g_line "2.784706";
CL-AS1:4	OnePiece	tra_law	8440	8785	1	.	.	g_id "R06794.16437"; tra_law_id "R06794.16437.1"; g_line "7.209587"; KM "4.951241"; PM"4.374228";
CL-AS1:4	OnePiece	room		8440	8785	1	.	.	g_id "R06794.16437"; tra_law_id "R06794.16437.1"; room_number "1"; g_line "7.209587";

I got the output below. By rights, all of the data above should be thrown out.

Code:
CL-AS1:4	OnePiece	tra_law	4721	4962	1	.	.	g_id "R06794.16434"; tra_law_id "R06794.16434.1"; g_line "6.980716"; KM "4.794062"; PM"4.235367";
CL-AS1:4	OnePiece	room		4721	4962	1	.	.	g_id "R06794.16434"; tra_law_id "R06794.16434.1"; room_number "1"; g_line "6.980716";
CL-AS1:4	OnePiece	tra_law	5085	5285	1	.	.	g_id "R06794.16435"; tra_law_id "R06794.16435.1"; g_line "4.355471"; KM "2.991154"; PM"2.642568";
CL-AS1:4	OnePiece	room		5085	5285	1	.	.	g_id "R06794.16435"; tra_law_id "R06794.16435.1"; room_number "1"; g_line "4.355471";

I tried to play around with the given code, but it didn't work. I need your kind help again. Thanks.
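
A quick look at the spans of the tra_law lines in this input hints at what goes wrong. The sketch below uses only the start and end columns of the four tra_law lines above (`newdata` and `deltas` are illustrative names) and marks which lines pass the `($5-$4) > 200` test that starts a new block in the code from post #2.

```shell
# The four tra_law lines from the input above, reduced to five TAB-separated columns.
newdata() {
  printf 'CL-AS1:4\tOnePiece\ttra_law\t%s\t%s\n' \
    4721 4962   5085 5285   6800 24864   8440 8785
}

# Print start, end, span, and whether the span passes the > 200 test.
deltas() {
  awk -F'\t' '$3 == "tra_law" {
    d = $5 - $4
    print $4, $5, d, (d > 200 ? "resets" : "skipped")
  }'
}

newdata | deltas
# 4721 4962 241 resets
# 5085 5285 200 skipped
# 6800 24864 18064 resets
# 8440 8785 345 resets
```

The second tra_law line spans exactly 200, so it does not start a new block. It and its room line share the same $1 as the previous block and are appended to it, which pushes the line count to 4 and gets all four lines printed.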
# 7  
Old 11-27-2019
Well, it looks like any tra_law line should reset the counters, regardless of the $5 - $4 delta value. Try this small adaptation.

Code:
awk -F"\t" '
NR > 2           &&
$3 ~ /tra_law/  {if ((CNT > 3) && (DELTA > 200)) print substr (OUT, 2)
                 DELTA = $5 - $4
                 SRC = $1
                 OUT = ""
                 CNT = 0
                }
$1 == SRC       {OUT = OUT ORS $0 
                 CNT ++
                }
END             {if  ((CNT > 3) && (DELTA > 200))  print substr (OUT, 2)
                }

' file

Your new file seems to have multiple <TAB> chars as field separators?
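
If the doubled <TAB>s are really in the file and not an artifact of pasting, the field numbers shift on those lines. A small sketch of the difference, using one room line from post #6 cut down to six columns (`split_single` and `split_runs` are made-up names):

```shell
# One room line with a doubled TAB between "room" and the start position.
line=$(printf 'CL-AS1:4\tOnePiece\troom\t\t4721\t4962')

# Every single TAB separates fields, so the doubled TAB yields an empty $4.
split_single() { awk -F'\t'  '{ print NF, "[" $4 "]", "[" $5 "]" }'; }

# A run of TABs counts as one separator.
split_runs()   { awk -F'\t+' '{ print NF, "[" $4 "]", "[" $5 "]" }'; }

printf '%s\n' "$line" | split_single   # 6 [] [4721]
printf '%s\n' "$line" | split_runs     # 5 [4721] [4962]
```

The code above only reads $4 and $5 on tra_law lines, which have single TABs in the posted data, so it should not be affected. Note that `-F'\t+'` would also swallow fields that are genuinely empty, so it is only a safe choice if no column can be blank.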