Find key pattern and print selected lines for each record


 
# 1  
Old 05-13-2015

Hi,

I need help with a complicated file that I am working on. I want to extract the important information from a very large, space-delimited file that holds hundreds of thousands of records. An example of the input file's content is below:

Code:
##
ID    Ser402             Old;         23 mins .
ACC   P669GM;
DAT   MAY-2014, the old episode.
TOS   Japanes Anime. one piece
TMA   Pirates; animation; cartoon.
POT   DownloadID=5445;
HEW   StreamID=792; watchop (eu).
HEW   AnotherOnlineID=823; narutowire (same).
COM   -@- Simple Comment: Ace died and Luffy is miserable. 
COM      None of his nakama was with him {SOV:000250}.
COM   -@- Full Comment: Host channel {SOV:000305}; Multi-chanel
COM      streaming {SOV:000305}.
COM   -@- Another Comment: Belongs to the same server.
COM      {SOV:000305}.
COM  -----------------------------------------------------------------------
COM   Can be watched online, see http://www.watchop.eu
DOR   Data; packet; -; Unknown; Anime.
DOR   TDP; TDP:0034; PPQ:host for sub channel; ASA:Subchannel.
DOR   TDP; TDP:0021; PPQ:internal channel; ASA:Unknown.
PPE   Torrent unapplicable;
KAW   Complete episode; Early release; Host channel;
KAW   Repeat; subchannel; subchannel host.
FEA   link          1    20         unavailable
FEA                                /F3184.
FEA   TOP_CHAN      1      1       unavailable (will be determined).
FEA   SUBCHAN       2      18      at 9 (confirmed!).
FEA   TOP_CHAN      19     117     unavailable (No info).
FEA   SUBCHAN       118    138     at 10 (confirmed!).
FEA   TOP_CHAN      139    145     unavailable (will be determined).
FEA   SUBCHAN       146    166     at 12 (confirmed!).
FEA   TOP_CHAN      167    269     unavailable (the source is unknown).
FEA   REP           1      146     A.
FEA   CAD           75     75      by host.
FEA                                {undetermined}.
SYN   synopsis for this episode is unavailable.
##
ID    MOV10               NewMov;         90 mins.
ACC   PPDFB1;
TOS   Japanes Anime. Naruto shippuden
TMA   Ninja; shinobi, konoha; hokage; Pain.
CC    Distributed under the Creative License
CC   -----------------------------------------------------------------------
DOR   Data; packet; -; Unknown; Anime movie.
DOR   movie; new movie; 90 mins only
DOR   MOVID; 299; -.
DOR   MOV3D; -; 1.
PPE   10; torrent
KAW   new movie; Complete movie.
FEA   Null         1    683        Unknown
FEA                                /F82.
FEA   mov       62    124       (SOV:005).
FEA   mov      155    259       (SOV:005).
FEA   mov      346    376       (SOV:025).
SYN   In this episode, Dresrossa has been surrounded by a cage known as birdcage by doflamingo.
      Luffy is moving towards the palace to defeat Doflamingo. 
##

All the records in this file are separated by “##”. What I need is output that shows only the needed info for records whose KAW line matches the pattern “subchannel” or “subchannel host”. In the example input, only the first record has this pattern, so the output should look like this:

Code:
##
ID       Ser402
ACC	  P669GM
TOS     Japanes Anime. one piece
TMA     Pirates; animation; cartoon.
COM    -@- Full Comment: Host channel {SOV:000305}; Multi-chanel
COM       streaming {SOV:000305}.
DOR     TDP; TDP:0034; PPQ:host for sub channel; ASA:Subchannel.
DOR     TDP; TDP:0021; PPQ:internal channel; ASA:Unknown.
KAW     Complete episode; Early release; Host channel;
KAW     Repeat; subchannel; subchannel host.
FEA      link          1    20         unavailable
FEA                                /F3184.
FEA      TOP_CHAN     1      1       unavailable (will be determined).
FEA      SUBCHAN       2      18      at 9 (confirmed!).
FEA      TOP_CHAN     19     117     unavailable (No info).
FEA      SUBCHAN       118    138     at 10 (confirmed!).
FEA      TOP_CHAN      139    145     unavailable (will be determined).
FEA      SUBCHAN        146    166     at 12 (confirmed!).
FEA      TOP_CHAN      167    269     unavailable (the source is unknown).
FEA      REP                    1      146     A.
FEA      CAD                   75     75      by host.
FEA                                                   {undetermined}.
TT        3
##

As shown above, for lines starting with COM, I want only the one containing "-@- Full Comment" and the COM line following it, if any (bold, in blue). I also need only the DOR lines that are followed by TDP (bold, in red). Finally, the last line of each record should be a new line named “TT”, whose value is the total number of occurrences of the pattern “FEA SUBCHAN” in that record.

I have no idea how to print only selected lines. I used the code below to find the key pattern, but it prints every line of each matched record. I need only the selected lines shown in the sample output above.

Code:
awk '/##/{if(l)print s;l=0;s=$0;next}/subchannel/{l=1}{s=s RS $0}END{if(l)print s}' inputfile

I would appreciate your kind help. Thanks.
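
A side note on the one-liner above: /subchannel/ is tested against every line of a record, so a COM or DOR line that mentions the word would also select the record. A minimal sketch that keeps the same buffer-and-flush idea but lets only a KAW line set the flag might look like this (the function name kaw_records and the variable names hit and rec are made up for illustration):

```shell
# Sketch only: kaw_records, hit and rec are illustrative names.
kaw_records() {
  awk '
  /^##/ {                          # record separator
     if (hit) print rec            # flush the previous record if it matched
     hit = 0; rec = $0; next
  }
  { rec = rec RS $0 }              # buffer every line of the current record
  $1 == "KAW" && /subchannel/ { hit = 1 }   # the key must sit on a KAW line
  END { if (hit) print rec }       # last record when there is no final ##
  ' "$@"
}
```

Called as `kaw_records inputfile`, this still prints every line of a matching record; choosing which lines to keep is a separate step.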
# 2  
Old 05-14-2015
How about this

Code:
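awk '
BEGIN{
   # Tags whose lines are kept as-is; referencing keep[x] creates the key
   for(i=split("ID ACC TOS TMA KAW FEA TT", k);i;i--) keep[k[i]];
}
$1 in keep     { s=s "\n" $0 }
/^DOR[ \t]+TDP/{ s=s "\n" $0 }
# Keep the Full Comment line plus the line after it (assumed to be its continuation)
$1=="COM" && /-@- Full Comment/ {
   s=s "\n" $0; getline
   s=s "\n" $0
}
/^FEA.*SUBCHAN/ { tt++ }
# At each separator, print the buffered record only if a KAW line matched
$1=="##"&&s {
   if(prn) print "##" s "\nTT   " tt
   s=""
   tt=prn=0
}
/^KAW.*subchannel/ {prn++}
END { print "##" } ' infile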

These 2 Users Gave Thanks to Chubler_XL For This Post:
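
The script above appends whatever line follows the Full Comment line, and it prints the ID and ACC lines whole, whereas the sample output shows only the tag and the second field (with the trailing semicolon dropped from ACC). A possible variant, as a sketch only: it tracks the comment with a flag instead of getline, and it assumes that every comment starts with -@- as the second field and that a COM line of dashes ends the comment block. Both assumptions come from the sample, not from the real file. The names extract_subchannel, emit, com, acc and any are hypothetical.

```shell
# Sketch only: extract_subchannel, emit, com, acc and any are illustrative names.
extract_subchannel() {
  awk '
  # Print the buffered record if a KAW line matched, then reset the state.
  function emit() {
     if (prn) { print "##" s; print "TT    " tt; any = 1 }
     s = ""; tt = prn = com = 0
  }
  $1 == "##" { emit(); next }
  $1 == "COM" {
     # Assumption: each comment starts with -@- and a dashed line ends the block.
     if ($2 == "-@-")        com = ($0 ~ /-@- Full Comment/)
     else if ($2 ~ /^--+$/)  com = 0
     if (com) s = s "\n" $0   # Full Comment line and its continuation lines
     next
  }
  { com = 0 }                 # any non-COM line ends a comment
  $1 == "ID"  { s = s "\nID    " $2; next }
  $1 == "ACC" { acc = $2; sub(/;$/, "", acc); s = s "\nACC   " acc; next }
  $1 == "TOS" || $1 == "TMA" || $1 == "KAW" || $1 == "FEA" { s = s "\n" $0 }
  /^DOR[ \t]+TDP/     { s = s "\n" $0 }
  /^KAW.*subchannel/  { prn = 1 }
  /^FEA[ \t]+SUBCHAN/ { tt++ }
  END { emit(); if (any) print "##" }   # also covers a file with no final ##
  ' "$@"
}
```

It would be called as `extract_subchannel infile`. The closing ## is printed only when at least one record was selected, which differs from the script above, where it is printed unconditionally.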
# 3  
Old 05-14-2015

Hi Chubler_XL,

The code worked perfectly on my real data! So the split function can be used to get selected lines. Thank you very much!
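
To be precise about the role of split here: it only turns the tag string into an array, once, in BEGIN. The selecting itself is done by the `$1 in keep` test on each input line. A small sketch of that idiom on its own, with a made-up tag list and a hypothetical function name (select_tags):

```shell
# Sketch only: select_tags and the tag list "ID KAW" are illustrative.
select_tags() {
  awk '
  BEGIN {
     n = split("ID KAW", k)                 # k[1]="ID", k[2]="KAW"
     for (i = 1; i <= n; i++) keep[k[i]]    # referencing keep[x] creates the key
  }
  $1 in keep                                # no action given: print the line
  ' "$@"
}
```

For example, `printf 'ID one\nCOM two\nKAW three\n' | select_tags` prints the ID and KAW lines and drops the COM line.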