Search for sequential pattern

09-22-2016

input file:

Code:

(input file can be many millions of lines long)

I want to search the example input file above, and when I find 4 sequential rows with values of 1,2,3,4 return those values and the two previous ones.
In this case it should return

Code:

1,A,1,2,3,4

I know this can be done on various platforms, but I'd like to use awk in this case. I'm fairly certain I'll end up using a six element array, but y'all will probably figure this out before I do. Thanks in advance, brain too old to figure this stuff out anymore...

---------- Post updated at 07:47 PM ---------- Previous update was at 04:47 PM ----------

I started down the path of using grep to pull out the rows that I need, 2 before the match and 3 after the match. I was going to simplify the match to only finding the first entry that I needed, and filter the extra ones out later. After that it was a simple matter of formatting. That is, until the case where we had matching overlaps, like so.

Say I'm looking for rows with 1,2,3,4 - then I was only going to grep on "1", and extract the leading and following rows. Even if I got a lot of entries that were not a perfect match, I can easily filter those out. Here is the case that ruined it.

Code:

The grep will misbehave because it refuses to print a row with the value "1" more than once. In this case the "1" belongs to the before part of one selection and the after part of another, and grep only reports it once. So unless there is a way of telling grep not to do this, I can't use grep....
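A small reproduction of the overlap case, as a sketch: the nine sample rows and the `sample` and `windows` helper names are made up for illustration, and the `-x`, `-B` and `-A` options are assumed to behave as in GNU grep. grep merges overlapping context ranges, so each row is printed once; the awk part keeps the last six rows in a ring buffer and prints one full window per matching row instead.

```shell
# Hypothetical sample: two rows equal to "1" whose context ranges overlap.
sample() { printf '%s\n' A B 1 2 1 2 3 4 Z; }

# grep prints rows A through 4 once each, as a single merged block.
sample | grep -x -B2 -A3 1

# windows: print every 6-row window (2 rows before, 3 after) around a "1".
windows() {
  awk '
    { buf[NR % 6] = $0 }                      # ring buffer of the last 6 rows
    NR >= 6 && buf[(NR - 3) % 6] == "1" {     # the row 3 back is a match
      out = buf[(NR - 5) % 6]
      for (i = NR - 4; i <= NR; i++) out = out "," buf[i % 6]
      print out
    }
  '
}

sample | windows
```

Windows cut off by the start or end of the file are not reported in this sketch.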

Moderator's Comments:

Please wrap all code, files, input & output/errors in CODE tags.
It makes it far easier to read and preserves multiple spaces for indenting or fixed-width data.

Last edited by rbatte1; 09-23-2016 at 08:30 AM.. Reason: Added CODE tags

cedenker

09-22-2016

How about this using awk

Code:

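awk '
  # A and B always hold the two rows before the current row (C)
  {A=B; B=C; C=$0}
  # N is the next value expected; N==5 means 1,2,3,4 was just completed
  N==5 { print F " found from row " NR-6 ; N=x; exit}
  # current row is the expected value: append it and expect the next one
  N&&$0==N { F=F","N++; next}
  # a 1 starts (or restarts) a candidate: remember the two rows before it
  $0==1 { F=A","B",1";N=2;next}
  # anything else breaks the sequence (x is unset, so N becomes empty)
  {N=x}
  # the match ended on the last row of the file
  END { if (N==5) print F " found from row " NR-5 }
' infile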

Chubler_XL

09-23-2016

I can sort of follow this. Is it hardcoded to use "1,2,3,4" for the search criteria? Or at least 4 sequential numbers?
I need to have a little flexibility in selecting the 4 values to search for (I used 1,2,3,4 just as an oversimplified example).

I have confirmed that it works great for 1,2,3,4.......

Thanks for the first response!

Last edited by cedenker; 09-23-2016 at 12:35 AM.. Reason: clarify my follow up question

cedenker

09-23-2016

If you are looking for different strings (not "1" thru "4") a slightly different solution is required:

Code:

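awk '
  # M[1..L] holds the strings to find, in order
  BEGIN{ L=split("one,two,three,four", M, ",") }
  {A=B; B=C; C=$0}
  # N indexes the next string expected; N==L+1 means all L were just matched
  N==L+1 { print F " found from row " NR-L-2 ; N=x; exit}
  N&&$0==M[N] { F=F","M[N++]; next}
  $0==M[1] { F=A","B","M[1];N=2;next}
  {N=x}
  # the match ended on the last row of the file
  END { if (N==L+1) print F " found from row " NR-L-1 }
' infile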

This version now searches for "one", "two", "three" and then "four" and can be easily converted to search for your list of specific strings. The split function builds an array M[], which is used to match each line.
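To see what the array looks like, here is a tiny sketch (the `show_split` wrapper name is only for illustration): `split` returns the number of pieces and fills `M[1]` through `M[L]`.

```shell
show_split() {
  awk 'BEGIN {
    L = split("one,two,three,four", M, ",")   # L is the number of pieces, here 4
    for (i = 1; i <= L; i++) print i, M[i]    # index and the string stored there
  }'
}

show_split
```

Changing the search list is then a matter of editing the first argument of `split`.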

Chubler_XL

09-23-2016

Initial test works fine. Let me add some of the other things I oversimplified into the script and see if I can break it. Thanks!

---------- Post updated 09-23-16 at 12:16 AM ---------- Previous update was 09-22-16 at 11:04 PM ----------

I should have made this part of the initial requirement, but thought I could add it in myself after the original problem was solved. I can't wrap my head around what the script is actually doing, so can't really add to it unfortunately.

The additional requirement is as follows.
Extra column in the input file.

Code:

1  cow
2  bird
3  horse
4  one
5  two
6  three
7  four
8  fff

The additional output would be the value in column 1 for the initial row of the match. In this case the output (looking for one,two,three,four) should be:

Code:

2,bird,horse,one,two,three,four

So I understood enough to read $2 instead of $0, and the script works the same now, just basically ignoring the first of the two input columns. I'm assuming all we need is a 2nd array to store the first column values, updating itself at the same time the 1st array updates. Then when it comes time to print out, just print the first array element of the 1st column.

I should have included this in the initial requirement, sorry about that....
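A minimal sketch of that idea, under stated assumptions: instead of a second array it keeps the previous two rows in plain variables (`k1`, `k2`, `v1`, `v2`, all hypothetical names), the search list is hard-coded as in the earlier reply, and the `find_seq` wrapper name is made up. Unlike the earlier scripts it does not exit after the first match.

```shell
# find_seq: read two-column rows from the named files (or stdin) and print
# col1 of the row two before the match, the two preceding values, then the match.
find_seq() {
  awk '
    BEGIN { L = split("one,two,three,four", M, ",") }
    {
      if (n > 0 && $2 == M[n + 1]) {        # next expected string: extend the match
        out = out "," $2; n++
      } else if ($2 == M[1]) {              # first string: start or restart a match
        out = k1 "," v1 "," v2 "," $2; n = 1
      } else {
        n = 0                               # anything else breaks the sequence
      }
      if (n == L) { print out; n = 0 }      # all L strings seen in consecutive rows
      k1 = k2; v1 = v2                      # shift the two-row history
      k2 = $1; v2 = $2
    }
  ' "$@"
}

printf '%s\n' '1 cow' '2 bird' '3 horse' '4 one' '5 two' '6 three' '7 four' '8 fff' | find_seq
```

A match that starts in the first two rows of the file prints empty fields for the missing history.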

Moderator's Comments:

Please use CODE tags as required by forum rules!

Last edited by RudiC; 09-23-2016 at 06:03 AM.. Reason: Added CODE tags.

cedenker

09-23-2016

Wouldn't it be easier using grep?

For instance (assuming every line consists of exactly one character, as in your example, and that the line terminator is just a newline character), the following command would work:

Code:

grep -zo '....1.2.3.4' your_data.txt

Last edited by rovf; 09-23-2016 at 04:01 AM.. Reason: Removing unnecessary -E switch

rovf

09-23-2016

Hello cedenker,

Let's say our Input_file is as follows, where I am assuming that the strings one, two, etc. could come in any order.

Code:

cat Input_file
1 cow
2 bird
3 horse
4 one
5 two
6 three
7 four
8 fff
9 one
10 two
11 one
12 two
13 one
14 two
15 three
16 four
11 one
12 two
13 three
14 one

Then the following will be the code.

Code:

awk 'BEGIN{num=split("one,two,three,four", A,",");for(i=1;i<=num;i++){B[A[i]]=i}} {;while(($2 in B) && ++e == B[$2]){A[FNR]=$2;W=W?W OFS $2:$2;getline;};A[FNR]=$2;if(e>=4){print FNR-6,A[FNR-6],A[FNR-5],W};e=W=""}' OFS=,   Input_file

Output will be as follows.

Code:

2,bird,horse,one,two,three,four
11,one,two,one,two,three,four

EDIT: Adding a non-one liner form of solution too now.

Code:

awk 'BEGIN{
                num=split("one,two,three,four", A,",");
                for(i=1;i<=num;i++){
                                        B[A[i]]=i
                                   }
          }
          {;
                while(($2 in B) && ++e == B[$2]){
                                                        A[FNR]=$2;
                                                        W=W?W OFS $2:$2;
                                                        getline;
                                                };
                A[FNR]=$2;
                if(e>=4){
                                print FNR-6,A[FNR-6],A[FNR-5],W
                        };
                e=W=""
          }
    ' OFS=,   Input_file

So it takes care of the rule that the strings one, two, three, four should come consecutively, and if fewer than all 4 of them match it doesn't print anything. Please do let us know how it goes and if this helps you.
EDIT2: Improving the above code by removing array A inside the while loop.

Code:

awk 'BEGIN{num=split("one,two,three,four", A,",");for(i=1;i<=num;i++){B[A[i]]=i}} {A[++q]=$2;while(($2 in B) && ++e == B[$2]){;W=W?W OFS $2:$2;getline;};if(e>=4){print FNR-6,A[q-2],A[q-1],W};e=W=""}' OFS=,   Input_file
####OR a non-one liner form of solution too as follows.
awk 'BEGIN{
        num=split("one,two,three,four", A,",");
        for(i=1;i<=num;i++){
                B[A[i]]=i
        }
     }
     {
        A[++q]=$2;
        while(($2 in B) && ++e == B[$2]){
                W=W?W OFS $2:$2;
                getline;
        };
        # A[q] is the row that started the match, so the two rows before it are A[q-2] and A[q-1]
        if(e>=4){
                print FNR-6,A[q-2],A[q-1],W
        };
        e=W=""
     }
    ' OFS=,   Input_file

Thanks,
R. Singh

Last edited by RavinderSingh13; 09-23-2016 at 06:30 AM.. Reason: Adding a non-one liner form of solution too now.

RavinderSingh13
