Shell Programming and Scripting: Post 302468623 by Bashingaway on Wednesday 3rd of November 2010 10:59:21 AM
Is there a 'fuzzy search' facility in Linux?

I have over 10 million documents that I want to search against a list of known keywords. However, the documents were produced using a technique that isn't perfect in how it presents the data, so keywords may contain wrong characters.

Is there a fuzzy keyword search available in Linux, or can anyone think of a way of doing it that isn't horrendously time-consuming?

Example Keyword

Banana

So the search, case-insensitive, would be for...

Banana
Banan*
Bana*a
Ban*na
Ba*ana
B*nana
*anana

Bana**
Ban*n*
Ba*an*
B*nan*
*anan*
Ban**a
Ba*a*a
B*na*a
*ana*a

and so on.....
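
To make the list above concrete, here is a minimal sketch that generates it mechanically. It assumes each `*` stands for exactly one unknown character and that at most two characters per keyword are replaced; the function name `wildcard_variants` is made up for illustration.

```python
from itertools import combinations


def wildcard_variants(word, max_wild=2):
    """Return every variant of `word` with up to `max_wild` characters
    replaced by '*' (assumption: '*' means exactly one unknown character)."""
    variants = []
    for k in range(max_wild + 1):
        # Choose which k positions of the word become wildcards.
        for positions in combinations(range(len(word)), k):
            chars = list(word)
            for p in positions:
                chars[p] = "*"
            variants.append("".join(chars))
    return variants


if __name__ == "__main__":
    for variant in wildcard_variants("Banana"):
        print(variant)
```

For a six-letter keyword such as Banana this produces the exact word, 6 one-wildcard variants and 15 two-wildcard variants.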

With 500 keywords and an average of 10 characters per word, that's tens of thousands of 'fuzzy searches' per page to cover all the permutations (a 10-character word has 56 variants with up to two * characters, so roughly 28,000 in total). For words above 9 characters you'd probably want to allow more than two * characters per word, which ramps up the number of searches even more.
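
A rough sketch of the arithmetic, plus one possible way around it. If each `*` stands for exactly one character (an assumption), the whole pattern list for a keyword is equivalent to asking whether a same-length stretch of text differs from the keyword in at most two positions, so one pass per keyword can replace one search per pattern. The names `pattern_count` and `fuzzy_find` are hypothetical, and this does not handle inserted or dropped characters.

```python
from math import comb


def pattern_count(length, max_wild=2):
    """Number of variants of a word of `length` characters with up to
    `max_wild` single-character wildcards (including the exact word)."""
    return sum(comb(length, k) for k in range(max_wild + 1))


def fuzzy_find(keyword, text, max_mismatch=2):
    """Yield (offset, matched_text) for every window of len(keyword) in
    `text` that differs from `keyword` in at most `max_mismatch` positions.
    Case-insensitive; assumes lower-casing does not change string length
    (true for ASCII text)."""
    kw = keyword.lower()
    low = text.lower()
    n = len(kw)
    for i in range(len(low) - n + 1):
        mismatches = 0
        for a, b in zip(kw, low[i:i + n]):
            if a != b:
                mismatches += 1
                if mismatches > max_mismatch:
                    break  # too many differences, give up on this window
        else:
            # Inner loop finished without exceeding the mismatch budget.
            yield i, text[i:i + n]


if __name__ == "__main__":
    print(pattern_count(10), 500 * pattern_count(10))
    print(list(fuzzy_find("Banana", "I ate a bAnxna today")))
```

By this count a 10-character keyword has 56 variants, so 500 such keywords come to 28,000 patterns per page, whereas the window comparison needs only one scan of the page per keyword.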

Ideas please?
 

tracker-search(1)                User Commands                tracker-search(1)

NAME
       tracker-search - Search all content for keywords

SYNOPSIS
       tracker-search [OPTION...] EXPRESSION [EXPRESSION...]

DESCRIPTION
       tracker-search searches all indexed content for EXPRESSION. The
       resource in which EXPRESSION matches must exist (see --all for more
       information). All results are returned in ascending order. In all
       cases, if no EXPRESSION is given for an argument (like --folders for
       example) then ALL items in that category are returned instead.

       EXPRESSION
              One or more terms to search. The default operation is a
              logical AND. For logical OR operations, see -r.

OPTIONS
       -?, --help-all
              Display all help options available.

       -l, --limit=N
              Limit search to N results. The default is 10 or 512 with
              --disable-snippets.

       -o, --offset=N
              Offset the search results by N. For example, start at item
              number 10 in the results. The default is 0.

       -r, --or-operator
              Use OR for search terms instead of AND (the default)

       -d, --detailed
              Show the unique URN associated with each search result. This
              does not apply to --music-albums and --music-artists.

       -a, --all
              Show results which might not be available. This might be
              because a removable media is not mounted for example. Without
              this option, resources are only shown if they exist. This
              option applies to all command line switches except
              --music-albums and --music-artists.

       --disable-snippets
              Results are shown with snippets. Snippets are context around
              the word that was searched for in the first place. This gives
              some idea of if the resource found is the right one. Snippets
              require Full Text Search to be compile time enabled AND to not
              be disabled with --disable-fts. Using --disable-snippets only
              shows the resources which matched, no context is provided
              about where the match occurred.

       --disable-fts
              If Full Text Search (FTS) is available, this option allows it
              to be disabled for one off searches. This returns results
              slightly using particular properties to match the search terms
              (like nie:title) instead of looking for the search terms
              amongst ALL properties. It is more limiting to do this, but
              sometimes searching without FTS can yield better results if
              the FTS ranking is off.

       --disable-color
              This disables any ANSI color use on the command line. By
              default this is enabled to make it easier to see results.

       -f, --files
              Search for files of any type matching EXPRESSION (optional).

       -s, --folders
              Search for folders matching EXPRESSION (optional).

       -m, --music
              Search for music files matching EXPRESSION (optional).

       --music-albums
              Search for music albums matching EXPRESSION (optional).

       --music-artists
              Search for music artists matching EXPRESSION (optional).

       -i, --images
              Search for images matching EXPRESSION (optional).

       -v, --videos
              Search for videos matching EXPRESSION (optional).

       -t, --documents
              Search for documents matching EXPRESSION (optional).

       -e, --emails
              Search for emails matching EXPRESSION (optional). Returns a
              list of subjects for emails found.

       -c, --contacts
              Search for contacts matching EXPRESSION (optional). Returns a
              list of names and email addresses found.

       --software
              Search for software installed matching EXPRESSION (optional).
              Returns a list of desktop files and application titles found.

       --software-categories
              Search for software categories matching EXPRESSION (optional).
              Returns a list of urns and their categories (e.g. Settings,
              Video, Utility, etc).

       --feeds
              Search through RSS feed information matching EXPRESSION
              (optional). Returns a list of those found.

       -b, --bookmarks
              Search through bookmarks matching EXPRESSION (optional).
              Returns a list titles and links for each bookmark found.

       -V, --version
              Print version.

SEE ALSO
       tracker-store(1), tracker-stats(1), tracker-tag(1), tracker-info(1).

GNU                               July 2009                   tracker-search(1)