Fuzzy Match Logic for Numerical Values


 
# 1  
Old 03-31-2009

I have searched the internet (including these forums) without success; perhaps I'm not using the right search terms.

What I'm looking for is a function (preferably in C) that analyzes the similarity of two numerical or near-numerical values and returns either a true/false result (match/no match) or a return code identifying the type of near-match relationship it found. The first form would require the caller to define the acceptable tolerance or conditions in advance (through arguments, environment variables, or similar). With the second form, the caller would evaluate the return code afterwards to decide whether the relationship between the two numbers falls within the application's tolerances.

For example, each of these could be a different return code, and the caller could determine whether or not the relationship counts as a "match" for the purposes of the application:
2401 and 2410 (trailing juxtaposition)
2401 and 4201 (leading juxtaposition)
2401 and 2041 (embedded juxtaposition)
2479 and 24799 (substring match/trailing dupe)
12 and 12A (this is what I meant above by "near-numerical" values)
etc.
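
The return-code idea above could be sketched roughly as follows. This is only an illustration, not existing library code: the enum names and the function classify_pair() are hypothetical, the values are compared as strings, and "juxtaposition" is taken to mean exactly one pair of adjacent characters swapped.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical return codes, one per relationship listed in the post. */
enum fuzzy_rel {
    FZ_NONE = 0,
    FZ_EXACT,
    FZ_SWAP_LEADING,   /* 2401 vs 4201  */
    FZ_SWAP_EMBEDDED,  /* 2401 vs 2041  */
    FZ_SWAP_TRAILING,  /* 2401 vs 2410  */
    FZ_TRAILING_DUPE,  /* 2479 vs 24799 */
    FZ_ALPHA_SUFFIX    /* 12 vs 12A     */
};

/* Classify the relationship between two values given as strings. */
static enum fuzzy_rel classify_pair(const char *a, const char *b)
{
    size_t la, lb;

    assert(a != NULL && b != NULL);
    la = strlen(a);
    lb = strlen(b);

    if (la == lb) {
        size_t i, first = la, last = 0, ndiff = 0;

        /* Record the first and last differing positions. */
        for (i = 0; i < la; i++) {
            if (a[i] != b[i]) {
                if (first == la)
                    first = i;
                last = i;
                ndiff++;
            }
        }
        if (ndiff == 0)
            return FZ_EXACT;
        /* Exactly two adjacent positions differ and their characters are swapped. */
        if (ndiff == 2 && last == first + 1 &&
            a[first] == b[last] && a[last] == b[first]) {
            if (first == 0)
                return FZ_SWAP_LEADING;
            if (last == la - 1)
                return FZ_SWAP_TRAILING;
            return FZ_SWAP_EMBEDDED;
        }
        return FZ_NONE;
    }

    /* Make a the shorter string so only one direction needs checking. */
    if (la > lb) {
        const char *ts = a;
        size_t tl = la;
        a = b; b = ts;
        la = lb; lb = tl;
    }

    /* b is a plus exactly one extra trailing character. */
    if (la > 0 && lb == la + 1 && strncmp(a, b, la) == 0) {
        if (b[la] == b[la - 1])
            return FZ_TRAILING_DUPE;
        if (isalpha((unsigned char)b[la]))
            return FZ_ALPHA_SUFFIX;
    }
    return FZ_NONE;
}
```

With this sketch, classify_pair("2401", "2410") yields FZ_SWAP_TRAILING while classify_pair("2401", "24") yields FZ_NONE, so the caller can decide per code which relationships to accept.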

Any information that would help me find some code that accomplishes this would be greatly appreciated.
~Marcus
# 2  
Old 03-31-2009
I'm guessing agrep is what you want. agrep == approximate grep

ftp://ftp.cs.arizona.edu/agrep/
# 3  
Old 03-31-2009
I first came across this concept in PHP. It's called the Levenshtein distance (often referred to as the 'Levenshtein algorithm'), and it measures the minimum number of single-character changes (insertions, deletions, or substitutions) required to transform one sequence into another. Wikipedia has a nice article going over a generic algorithm.
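
For reference, a minimal C sketch of the standard two-row dynamic-programming version is shown here. The function names levenshtein() and similarity_percent() are made up for this example, and the percentage formula (100 * (maxlen - distance) / maxlen) is an assumption about how a distance is commonly turned into a similarity score, not something defined by the algorithm itself.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Levenshtein distance with unit costs; returns -1 if allocation fails. */
static int levenshtein(const char *a, const char *b)
{
    size_t la, lb, i, j;
    int *prev, *curr, *tmp, result;

    assert(a != NULL && b != NULL);
    la = strlen(a);
    lb = strlen(b);

    prev = malloc((lb + 1) * sizeof *prev);
    curr = malloc((lb + 1) * sizeof *curr);
    if (prev == NULL || curr == NULL) {
        free(prev);
        free(curr);
        return -1;
    }

    /* Distance from the empty prefix of a to each prefix of b. */
    for (j = 0; j <= lb; j++)
        prev[j] = (int)j;

    for (i = 1; i <= la; i++) {
        curr[0] = (int)i;
        for (j = 1; j <= lb; j++) {
            int sub = prev[j - 1] + (a[i - 1] != b[j - 1]); /* substitute or keep */
            int del = prev[j] + 1;                          /* delete from a */
            int ins = curr[j - 1] + 1;                      /* insert into a */
            int best = sub < del ? sub : del;
            curr[j] = best < ins ? best : ins;
        }
        tmp = prev; prev = curr; curr = tmp;
    }

    result = prev[lb]; /* after the final swap, prev holds the last row */
    free(prev);
    free(curr);
    return result;
}

/* Assumed normalization: percentage of the longer length left unchanged. */
static int similarity_percent(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    size_t maxlen = la > lb ? la : lb;
    int d = levenshtein(a, b);

    if (d < 0)
        return -1;
    if (maxlen == 0)
        return 100;
    return (int)(100 * (maxlen - (size_t)d) / maxlen);
}
```

Note that a swap of two adjacent digits costs 2 here (two substitutions), the same as dropping two digits.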
# 4  
Old 04-01-2009
Hi Corona. Thanks for your feedback. We tried LD (Levenshtein distance), but it lacked the intelligence we need. For example, it reported the numbers 2401 and 24 as being 50% alike. A human can see at a glance that we wouldn't want to count this as a "match": John Doe at 2401 Main St and John Doe at 24 Main St are probably not the same person. Yet 2401 and 2410 is a strong possibility. If we did not find John Doe at 2401 Main but did find a John Doe at 2410, we'd call it a match. Again, LD gives these two numbers a similarity of 50%. So in order to catch 2401 and 2410 (a simple juxtaposition) we had to set our threshold at 50, and then we get really crappy matches like 2401 and 24.

In order to catch 310 and 301 with LD, we'd have to lower our threshold to 33%, but then you really let in the junk!

I've no doubt LD has its applications (in fact, we do use it for street names), but it's not so hot for fuzzy number matches. Thank you for your thoughts though.
~Marcus
# 5  
Old 04-16-2009
Perhaps LD could be adjusted by giving a higher cost to adding characters than to swapping them.
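
One way to try this is sketched below, under some assumptions: "swapping" is read as transposing two adjacent characters, which plain LD does not model, so the sketch uses the optimal string alignment variant (Levenshtein plus adjacent transpositions) with configurable costs. The struct edit_costs, the function weighted_osa(), the FZ_MAXLEN limit, and the example weights are all hypothetical choices for illustration.

```c
#include <assert.h>
#include <string.h>

#define FZ_MAXLEN 32 /* arbitrary cap so the table can live on the stack */

/* Hypothetical cost table: make ins_del large to penalize length changes. */
struct edit_costs {
    int ins_del;   /* adding or dropping one character */
    int subst;     /* replacing one character with another */
    int transpose; /* swapping two adjacent characters */
};

static int min2(int x, int y)
{
    return x < y ? x : y;
}

/* Weighted optimal-string-alignment distance; returns -1 if input is too long. */
static int weighted_osa(const char *a, const char *b, const struct edit_costs *c)
{
    int d[FZ_MAXLEN + 1][FZ_MAXLEN + 1];
    size_t la, lb, i, j;

    assert(a != NULL && b != NULL && c != NULL);
    la = strlen(a);
    lb = strlen(b);
    if (la > FZ_MAXLEN || lb > FZ_MAXLEN)
        return -1;

    /* Converting to or from the empty string takes only insertions/deletions. */
    for (i = 0; i <= la; i++)
        d[i][0] = (int)i * c->ins_del;
    for (j = 0; j <= lb; j++)
        d[0][j] = (int)j * c->ins_del;

    for (i = 1; i <= la; i++) {
        for (j = 1; j <= lb; j++) {
            int best = d[i - 1][j - 1] + (a[i - 1] == b[j - 1] ? 0 : c->subst);

            best = min2(best, d[i - 1][j] + c->ins_del);
            best = min2(best, d[i][j - 1] + c->ins_del);
            /* Adjacent transposition: "ab" in a lines up with "ba" in b. */
            if (i > 1 && j > 1 &&
                a[i - 1] == b[j - 2] && a[i - 2] == b[j - 1])
                best = min2(best, d[i - 2][j - 2] + c->transpose);
            d[i][j] = best;
        }
    }
    return d[la][lb];
}
```

With example weights of ins_del = 3, subst = 2, transpose = 1, the pairs 2401/2410 and 310/301 each cost 1, while 2401/24 costs 6, so a single threshold can separate the swapped-digit cases from the truncated one.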