fuzzy sequence match in a text file


 
# 1  
Old 09-12-2012

Hi Forum:

I have struggled with this and ended up doing it by eye.

Basically, I am looking for sequences of dates inside a file.
If a date in the sequence repeats 2-3 times, or one date is skipped once, it is still considered a match.

Code:
input text file:

Sep 6 A
Sep 6 A
Sep 10 A
Sep 7 B
Sep 8 B
Sep 9 B
Sep 10 B
Sep 11 B
Sep 7 C
Sep 7 C
Sep 7 C
Sep 11 C
Sep 8 D
Sep 9 E
Sep 7 F
Sep 8 F
Sep 9 F
Sep 10 F
Sep 11 F
Sep 7 G
Sep 8 G
Sep 9 G
Sep 10 G
Sep 7 H
Sep 8 H
Sep 9 H
Sep 7 I
Sep 8 I
Sep 8 I
Sep 9 I
Sep 10 I
Sep 7 J
Sep 7 J
Sep 7 J
Sep 9 J
Sep 10 J

Desired filtered output:

Sep 7 B
Sep 8 B
Sep 9 B
Sep 10 B
Sep 11 B
Sep 7 F
Sep 8 F
Sep 9 F
Sep 10 F
Sep 11 F
Sep 7 G
Sep 8 G
Sep 9 G
Sep 10 G
Sep 7 H
Sep 8 H
Sep 9 H
Sep 7 I
Sep 8 I
Sep 8 I
Sep 9 I
Sep 10 I
Sep 7 J
Sep 7 J
Sep 7 J
Sep 9 J
Sep 10 J


Cheers!!
Chirish.
# 2  
Old 09-12-2012
Try this:

Code:
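awk '
{ min=day                          # a repeated day is allowed
  max=skip?day+1:day+2             # one skipped day is allowed per run
  if($1==mth && $2+0>=min && $2+0<=max) {
    if($2+0>min)diff++             # count distinct days in the run
    skip=skip||$2+0==day+2         # remember that the skip has been used
    day=$2+0
    out=start out"\n"$0            # append the line to the buffered run
    start=""
    next
  }
  if(diff>2) printf "%s\n",out     # run ended: print it if it has 3+ distinct days
  mth=$1                           # start a new run at the current line
  start=$0
  day=$2+0
  diff=1; skip=0; out="" }
END {if(diff>2) printf "%s\n",out}' infile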

Edit: renamed variables for clarity.

Last edited by Chubler_XL; 09-12-2012 at 09:12 PM..
# 3  
Old 09-12-2012
Wow, Chubler_XL, I stand in awe... After thirty years or so of using Unix, Linux, and awk (among others; this is both my work AND my hobby), I am completely stupefied by:
Code:
skip=skip||$2+0==day+2

and
Code:
max=skip?day+1:day+2

Please don't get me wrong: it is amazing to copy your code, paste it into my terminal and see the expected output... It's like hearing "Bazinga!" in the background!
Would you please give us mere mortals a bit of explanation?
# 4  
Old 09-12-2012
Apologies, this site seems focused on the smallest number of characters, not clarity.

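The first expression, skip=skip||$2+0==day+2, written out longhand:

Code:
if(skip != 0 || $2+0 == day + 2) skip=1

# or
if (skip == 0) {
    if ($2+0 == day + 2) skip=1
}

And the second, max=skip?day+1:day+2:

Code:
if(skip == 0) max=day+2; else max=day+1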
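
For readers who want the whole script in the longhand style, here is a minimal sketch, not the poster's own rewrite. It is meant to behave like the script in post #2. The wrapper name fuzzy_seq is hypothetical, and the start/out bookkeeping has been folded into a single out buffer.

```shell
# Hypothetical wrapper: reads the file named as its argument, or stdin if none.
fuzzy_seq() {
  awk '
  {
    # Window of acceptable days for the next line of the current run.
    min = day
    if (skip == 0) max = day + 2   # one skipped day is still allowed
    else           max = day + 1   # the single skip has been used

    if ($1 == mth && $2+0 >= min && $2+0 <= max) {
      if ($2+0 > min) diff++          # count distinct days, not repeats
      if ($2+0 == day + 2) skip = 1   # this line jumped over one day
      day = $2+0
      out = out "\n" $0               # buffer the line
      next
    }

    # The run is broken: print it if it covered at least 3 distinct days.
    if (diff > 2) print out
    mth = $1; day = $2+0
    diff = 1; skip = 0; out = $0      # start a new run at this line
  }
  END { if (diff > 2) print out }
  ' "$@"
}
```

Usage would be fuzzy_seq infile. Like the original, this sketch never compares the label in the third column. In the sample data, "Sep 8 D" and "Sep 9 E" are treated as one run, which is harmless only because that run covers just two distinct days.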


Last edited by Chubler_XL; 09-12-2012 at 10:41 PM..
# 5  
Old 09-14-2012
I don't follow the logic behind
Code:
if(skip != 0 || $2+0 == day + 2) skip=1

# or
if (skip == 0) {
    if ($2+0 == day + 2) skip =1
}

being equivalent... The first statement is "if ((skip is not 0) OR (($2+0) equals (day + 2)))", but your second expression is "if (skip equals 0) THEN if (($2+0) equals (day + 2))", so it is NOT equivalent to the first one. In the first statement, "(skip is not 0)" alone suffices to set "skip=1", but in the second expression it does not. As I see it, the second expression is equivalent to "if ((skip equals 0) AND (($2+0) equals (day + 2)))", which is clearly different from the first statement.
# 6  
Old 09-15-2012
Quote:
Originally Posted by hexram
I don't follow the logic behind
Code:
if(skip != 0 || $2+0 == day + 2) skip=1

# or
if (skip == 0) {
    if ($2+0 == day + 2) skip =1
}

being equivalent...
Try building a truth table for each and you will see:

Skip   $2+0 == day+2   Skip Result
1      T               1
1      F               1
0      T               1
0      F               0
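
The table can also be generated rather than built by hand. The following is only a sketch: truth_table is a hypothetical helper name, and cond stands in for the test $2+0 == day+2 (1 for true, 0 for false). It runs both forms from post #4 on every combination of skip in {0, 1} and cond in {0, 1}.

```shell
# Hypothetical helper: prints the result of both forms for each combination.
truth_table() {
  awk 'BEGIN {
    print "skip cond first second"
    for (skip = 1; skip >= 0; skip--)
      for (cond = 1; cond >= 0; cond--) {
        a = skip; if (a != 0 || cond) a = 1          # first form
        b = skip; if (b == 0) { if (cond) b = 1 }    # second form
        print skip, cond, a, b
      }
  }'
}
truth_table
```

The "first" and "second" columns agree on all four rows, matching the table above. Note that this only covers skip values of 0 and 1.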
# 7  
Old 09-15-2012
Quote:
Originally Posted by Chubler_XL
Try building a truth table for each and you will see
hexram is correct. Those two expressions are not logically equivalent (you cannot manipulate one into the form of the other using boolean identities/properties/theorems). They behave differently when skip is not zero. However, the difference in behavior does not affect the outcome if the non-zero value is 1.

Regards,
Alister
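
The point about non-zero values can be checked with a small sketch. The helper name compare_forms and its two arguments are assumptions made for illustration: the first is the starting value of skip, the second is 1 if the day jumped by two and 0 otherwise.

```shell
# Hypothetical helper: shows what each form leaves in skip.
compare_forms() {
  awk -v start="$1" -v jumped="$2" 'BEGIN {
    a = start + 0; if (a != 0 || jumped + 0) a = 1           # first form
    b = start + 0; if (b == 0) { if (jumped + 0) b = 1 }     # second form
    printf "start=%d jumped=%d first=%d second=%d\n", start, jumped, a, b
  }'
}
compare_forms 1 0
compare_forms 5 0
```

Starting from 1, both forms leave 1. Starting from 5, the first form normalizes the value to 1 while the second leaves 5 untouched. In the script from post #2, skip is only ever set to 0 or to the result of a logical expression, and it is only read in a truth test, so the difference never shows up there.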