indexing list of words in a file


 
# 1  
Old 11-24-2011
indexing list of words in a file

Hey all,

I'm currently working on a project and want to index the words in a webpage.
There would be one file with the webpage content and one file with a list of words. I want an output file containing true or false for each word, showing whether that word exists in the webpage.

Example:

Webpage content (data.html):


Code:
References

   1. http://console.online.net/
   2. http://webmail.online.net/
   3. http://console.online.net/assistance/
   4. http://www.online.net/
   5. http://www.online.net/nom-de-domaine/comparatif-des-extensions-geographiques.xhtml
   6. http://www.online.net/nom-de-domaine/comparatif-des-extensions-geographiques.xhtml
   7. http://console.online.net/commande/index/
   8. http://www.online.net/hebergement-mutualise/comparatif-des-offres-pour-site-internet.xhtml
   9. http://www.online.net/hebergement-mutualise/comparatif-des-offres-pour-site-internet.xhtml
  10. http://www.online.net/hebergement-mutualise/offre-online-basic.xhtml
  11. http://www.online.net/hebergement-mutualise/offre-online-pro.xhtml
  12. http://www.online.net/hebergement-mutualise/offre-online-illimite.xhtml
  13. http://www.online.net/serveur-dedie/comparatif-offres-serveur-dedie.xhtml
  14. http://www.online.net/serveur-dedie/comparatif-serveur-dedie-start.xhtml
  15. http://www.online.net/serveur-dedie/offre-dedibox-sc.xhtml
  16. http://www.online.net/serveur-dedie/offre-dedibox-classic.xhtml
  17. http://www.online.net/serveur-dedie/offre-dedibox-dc.xhtml
  18. http://www.online.net/serveur-dedie/offre-dedibox-qc.xhtml
  19. http://www.online.net/serveur-dedie/comparatif-serveur-dedie-pro.xhtml
  20. http://www.online.net/serveur-dedie/offre-dedibox-pro-r210.xhtml
  21. http://www.online.net/serveur-dedie/offre-dedibox-pro-r410.xhtml
  22. http://www.online.net/serveur-dedie/offre-dedibox-pro-r510.xhtml
  23. http://www.online.net/serveur-dedie/offre-dedibox-storage.xhtml
  24. http://www.online.net/serveur-dedie/offre-dedibox-housing-dedirack.xhtml
  25. http://www.online.net/serveur-dedie/offre-dedibox-housing-dedirack.xhtml
  26. http://www.iliad-entreprises.fr/
  27. http://www.online.net/infogerance-serveur/infogerance-serveur-dedie.xhtml
  28. http://www.iliad-datacenter.fr/
  29. https://console.online.net/commande/server/?server=110
  30. http://www.online.net/
  31. http://console.online.net/assistance/
  32. http://twitter.com/online_fr
  33. http://www.online.net/hebergement-mutualise/comparatif-des-offres-pour-site-internet.xhtml
  34. https://console.online.net/commande/index/
  35. http://www.online.net/serveur-dedie/comparatif-serveur-dedie-start.xhtml
  36. https://console.online.net/commande/server/?server=110
  37. http://www.online.net/fiche-tarifaire.pdf
  38. http://www.online.net/cgv.pdf
  39. http://www.online.net/document-legal/mentions-legales.xhtml
  40. http://www.online.net/

List of words (words.dat):

Code:
online
hebergement
ftp
35
php
.fr
.se

Output file (output.dat), with true or false showing whether each word exists:

Code:
true
false
true
false
true
false

Thanks :)
# 2  
Old 11-24-2011
You can try something like this:
Code:
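awk '
# words.dat: remember each word in input order
NR==FNR{if(NF)w[++n]=$1; next}
# data.html: index() matches literally, so "." in ".fr" is not a wildcard
{for(i=1;i<=n;i++)if(index($0,w[i]))f[i]=1}
END{for(i=1;i<=n;i++)print w[i],(f[i]?"true":"false")}
' words.dat data.html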
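
If the output should be just one true/false per line, in the same order as words.dat, a plain shell loop around grep is another option. This is only a minimal sketch, assuming the words are meant as literal strings rather than regular expressions; the function name index_words is made up for illustration.

```shell
#!/bin/sh
# index_words WORDLIST PAGE
# Print "true" or "false" for each line of WORDLIST, in order, depending on
# whether that line occurs anywhere in PAGE as a literal substring.
# (index_words is a hypothetical helper name, not something from this thread.)
index_words() {
    while IFS= read -r word || [ -n "$word" ]; do
        # Skip empty lines: an empty pattern would match every line.
        [ -n "$word" ] || continue
        # -F: fixed string, -q: no output, -e: pattern may start with a dash
        if grep -qF -e "$word" "$2"; then
            echo true
        else
            echo false
        fi
    done < "$1"
}

# Usage: index_words words.dat data.html > output.dat
```

This runs grep once per word, so it is fine for a short list but slower than a single awk pass for a long one.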

# 3  
Old 11-24-2011
I read the problem as determining whether each word from the list exists in the content of the webpage:
Code:
#! /usr/bin/perl

use strict;
use warnings;
use LWP::Simple;

$\ = "\n";    # output record separator: every print ends with a newline

my $url  = shift(@ARGV);

unless (defined $url && '' lt $url) {
    print STDERR $0, ': missing url';
    exit(1);
}

my $content = get($url);

unless (defined $content) {
    print STDERR $url, ': has no content';
    exit(1);
}

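# The remaining arguments (the word list) are read line by line via <>.
while (<>) {
    chomp;
    # index() returns -1 when $_ is not a substring of $content.
    print index($content, $_) < 0 ? 'false' : 'true';
}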

which would be invoked as:
Code:
perl hindex http://www.online.net/serveur-dedie/offre-dedibox-qc.xhtml wordlist

which results in:
Code:
true
true
false
false
false
true
false

Note that this returns 'true' or 'false' depending on whether the word occurs as a sequence of characters anywhere in the raw content of the webpage; it does not split the content into words, strip HTML tags, and the like. You would need something like HTML::Parser to do that:
Code:
#! /usr/bin/perl

use strict;
use warnings;

use LWP::Simple;
use HTML::Parser;

$\ = "\n";

my %WORDS = ();

sub text {
    my $text = shift(@_);
    return unless defined $text;
    foreach my $w (split ' ', $text) { $WORDS{lc $w}++; }
}

my $url = shift(@ARGV);

unless (defined $url && '' lt $url) {
    print STDERR 'USAGE: ', $0, ' <url> [<wordlist>]';
    exit(1);
}

my $content = get($url);

unless (defined $content) {
    print STDERR $url, ': has no content';
    exit(1);
}

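# parse() may hold back trailing text; eof() flushes it to the handler.
HTML::Parser->new(text_h => [ \&text, 'text' ])->parse($content)->eof;

while (<>) {
    chomp;
    # Keys were stored lowercased, so lowercase the lookup as well.
    print defined $WORDS{lc $_} ? 'true' : 'false';
}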

Which, when invoked, returns:
Code:
true
false
false
false
false
false
false

Please note that this example does not skip over the contents of <script> tags and the like.
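
For a page that has already been saved as plain text, the same whole-token lookup can be sketched with awk instead of Perl. The assumptions here are that tokens are separated by whitespace and compared case-insensitively, as in the script above, and that the word list is not empty; the function name match_tokens is hypothetical.

```shell
#!/bin/sh
# match_tokens WORDLIST TEXTFILE   (hypothetical helper name)
# Print "true" or "false" per WORDLIST line, in order, depending on whether
# the word equals a whole whitespace-separated token of TEXTFILE, ignoring
# case. Assumes WORDLIST is not empty (the NR==FNR test relies on that).
match_tokens() {
    awk '
        # First file: remember the words in input order, lowercased.
        NR == FNR { if (NF) word[++n] = tolower($1); next }
        # Second file: record every lowercased token as an array key.
        { for (i = 1; i <= NF; i++) seen[tolower($i)] = 1 }
        END {
            for (i = 1; i <= n; i++)
                print ((word[i] in seen) ? "true" : "false")
        }
    ' "$1" "$2"
}

# Usage: match_tokens words.dat data.html > output.dat
```

With the data.html from the first post, a word such as online should come out false with this approach, since it only appears inside URLs and never as a token of its own.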