Quote:
Originally Posted by
DGPickett
Google does not index using a "fuzzy" algorithm, as I recall.
Google indexes, as I recall, using a Bayesian classifier.
There is a difference (quite a difference) between indexing and retrieval with a fuzzy algorithm and indexing with a Bayesian classifier.
---------- Post updated at 17:18 ---------- Previous update was at 17:14 ----------
OBTW, on fuzzy search, read this reference:
Quote:
Fuzzy search
Fuzzy search searches for words that are spelled in a similar way to the search term.
Example
SELECT AUTHOR, TITLE
FROM DB2EXT.TEXTTAB
WHERE CONTAINS(COMMENT,
'fuzzy form of 80 "pullitzer"') =1
In this example, the search could find an occurrence of the misspelled word pulitzer.
The match level, in the example “80”, specifies the desired degree of accuracy. Use fuzzy search when misspellings are possible in the document. This is often the case when an Optical Character Recognition device, or phonetic input creates the document. Use values between 1 and 100 to show the degree of fuzziness, where 100 is an exact match and anything below 80 is increasingly "fuzzy".
Note: If the fuzzy search does not provide the appropriate degree of accuracy, search for parts of a term using character masking.
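To get a feel for what a match level like 80 means, here is a rough sketch in Python. It is only an illustration: it uses the standard library's difflib similarity ratio scaled to 0-100, which is an assumption on my part and almost certainly not how DB2 computes its own match level. The names similarity_percent and fuzzy_contains are made up for this example.

```python
from difflib import SequenceMatcher


def similarity_percent(term: str, word: str) -> float:
    """Return a 0-100 similarity score between two words.

    Assumption: difflib's ratio stands in for a "match level".
    This is NOT the scoring DB2 uses internally.
    """
    return 100.0 * SequenceMatcher(None, term.lower(), word.lower()).ratio()


def fuzzy_contains(text: str, term: str, level: float) -> bool:
    """True if any whitespace-separated word in text scores at least `level`."""
    return any(similarity_percent(term, word) >= level for word in text.split())


if __name__ == "__main__":
    comment = "winner of the pulitzer prize"
    # The misspelled search term still finds the correctly spelled word.
    print(fuzzy_contains(comment, "pullitzer", 80))
    # At level 100 only an exact match qualifies.
    print(fuzzy_contains(comment, "pullitzer", 100))
```

Lowering the level lets in more distant spellings, which is the same trade-off the quoted text describes.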
I think you can easily find a "fuzzy" indexer to run on Linux.
If you find one (in PHP), let me know. I may implement fuzzy search as an additional capability on this site.
---------- Post updated at 17:21 ---------- Previous update was at 17:18 ----------
OBTW, as a side-note, you could probably use a Bayesian classifier to assist in building a fuzzy searcher or indexer. I've not looked into this, but a bit of Google'ing around might yield some useful peach fuzz.
---------- Post updated at 17:25 ---------- Previous update was at 17:21 ----------
Here is something interesting.....
Approximate/fuzzy string search in PHP
Quote:
This PHP class, approximate-search.php, provides non-exact text search (often called fuzzy search or approximate matching).
It allows you to specify a Levenshtein edit distance threshold, i.e. an error limit for a match. For example, a search for kamari with a threshold of 1 error would match kamari, kammari, kaNari and kamar but not kaNar.
The code is optimized for repeated searching of the same string, e.g. walking through rows of a database.
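The threshold idea in that quote can be sketched in a few lines. This is a minimal Python sketch and not the approximate-search.php class itself: it compares whole words with the textbook dynamic-programming Levenshtein distance, whereas the PHP class searches inside longer text and is tuned for repeated searches. The function names levenshtein and within_threshold are my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    # Row for the empty prefix of `a`: turning "" into b[:j] costs j inserts.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" costs i deletes
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def within_threshold(term: str, candidate: str, threshold: int) -> bool:
    """True if candidate is at most `threshold` edits away from term."""
    return levenshtein(term, candidate) <= threshold


if __name__ == "__main__":
    for word in ["kamari", "kammari", "kaNari", "kamar", "kaNar"]:
        print(word, within_threshold("kamari", word, 1))
```

With a threshold of 1, the first four candidates pass and kaNar fails, because it needs two edits (one substitution plus one deletion). PHP also ships a built-in levenshtein() function if you want to try the same comparison there.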