indexing list of words in a file: Post 302576282 by m.d.ludwig on Thursday 24th of November 2011 07:31:28 AM
I read the problem as: determine whether each word from the list exists in the content of the webpage:
Code:
#! /usr/bin/perl

use strict;
use warnings;
use LWP::Simple;

$\ = "\n";

my $url  = shift(@ARGV);

unless (defined $url && length $url) {
    print STDERR $0, ': missing url';
    exit(1);
}

my $content = get($url);

unless (defined $content) {
    print STDERR $url, ': has no content';
    exit(1);
}

while (<>) {
    chomp;
    print index($content, $_) < 0 ? 'false' : 'true';
}

which would be invoked as:
Code:
perl hindex http://www.online.net/serveur-dedie/offre-dedibox-qc.xhtml wordlist

which results in:
Code:
true
true
false
false
false
true
false
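
As a quick offline illustration of what index() is doing in that loop, here is a minimal sketch. The markup, the word list and the found() helper are all made up for the example; nothing is fetched from the network:

```perl
#! /usr/bin/perl

use strict;
use warnings;

$\ = "\n";

# Made-up page content standing in for what get($url) would return.
my $content = '<div class="offer">Dedicated servers</div>';

# Hypothetical helper: report whether $word occurs anywhere in
# $content as a raw, case-sensitive substring.
sub found {
    my ($content, $word) = @_;
    return index($content, $word) < 0 ? 'false' : 'true';
}

foreach my $word (qw(servers serve class Servers)) {
    print $word, ': ', found($content, $word);
}
# servers: true
# serve: true    (a substring of "servers")
# class: true    (matches inside the tag)
# Servers: false (index is case-sensitive)
```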

Note that this prints 'true' or 'false' depending on whether the given sequence of characters occurs anywhere in the raw content of the webpage. It does not split the content into words, strip HTML tags, or the like. You would need something like HTML::Parser to do that:
Code:
#! /usr/bin/perl

use strict;
use warnings;

use LWP::Simple;
use HTML::Parser;

$\ = "\n";

my %WORDS = ();

sub text {
    my $text = shift(@_);
    return unless defined $text;
    foreach my $w (split ' ', $text) { $WORDS{lc $w}++; }
}

my $url = shift(@ARGV);

unless (defined $url && length $url) {
    print STDERR 'USAGE: ', $0, ' <url> [<wordlist>]';
    exit(1);
}

my $content = get($url);

unless (defined $content) {
    print STDERR $url, ': has no content';
    exit(1);
}

# call eof() after parse() so that any buffered trailing text is delivered
HTML::Parser->new(text_h => [ \&text, 'text' ])->parse($content)->eof;

while (<>) {
    chomp;
    # the page words were stored lowercased, so lowercase the lookup as well
    print exists $WORDS{lc $_} ? 'true' : 'false';
}

Which, when invoked, returns:
Code:
true
false
false
false
false
false
false

Please note that this example does not skip over the contents of <script> tags and the like.
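
One way to skip those, as far as I know, is HTML::Parser's ignore_elements() method, which suppresses both the tags and the content of the named elements. The following is only a sketch under that assumption: page_words() is a made-up helper, the HTML is a made-up sample rather than the page above, and 'dtext' is used instead of 'text' so that entities are decoded:

```perl
#! /usr/bin/perl

use strict;
use warnings;

use HTML::Parser;

# Hypothetical helper: collect the lowercased words of the text in $html,
# ignoring anything inside <script> and <style> elements.
sub page_words {
    my $html  = shift(@_);
    my %words = ();

    my $parser = HTML::Parser->new(
        api_version => 3,
        text_h      => [
            sub { $words{lc $_}++ foreach split ' ', shift(@_); },
            'dtext',
        ],
    );

    # suppress both the tags and the content of these elements
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;    # flush any text still buffered by the parser

    return \%words;
}

# Made-up sample page, so nothing is fetched from the network.
my $html = '<html><head><script>var secret = 1;</script></head>'
         . '<body><p>Dedicated server offer</p></body></html>';

my $words = page_words($html);

$\ = "\n";

foreach my $w (qw(server secret)) {
    print $w, ': ', exists $words->{$w} ? 'true' : 'false';
}
```

With the sample above I would expect 'server' to be reported as found and 'secret' (which only appears inside the script) as not found. To use it with the earlier script, replace the sample string with the result of get($url).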
 
