08-20-2011
Hi,
I did not use any dedicated tool (I think sed and awk are the best tools for parsing the Wikipedia XML dump); I just used a simple regular-expression technique to parse and extract the Wikipedia articles from the one huge file available for download. The problem was that it took days to parse the entire dump, so I thought: why not parallelize the whole thing so that it runs faster?
Even after parsing, a lot of preprocessing still needs to be done, which I feel is easy: work out certain heuristics and then run sed or awk based on those heuristics.
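A rough sketch of the split-then-parallelize idea follows. It is only an illustration under stated assumptions, not the actual script from this thread: it assumes each <page> and </page> tag sits on its own line in the dump, that xargs supports -0 and -P (GNU and BSD xargs do), and the function names, chunk file names, and chunk size are all made up for the example. Title extraction stands in for whatever per-article processing is really needed.

```shell
# split_pages DUMP OUTDIR [PAGES_PER_CHUNK]
# Cut the dump into chunk files that each hold whole <page> elements.
# Assumption: <page> and </page> each appear on a line of their own.
split_pages() {
    local dump="$1"
    local outdir="$2"
    local per_chunk="${3:-1000}"
    mkdir -p "$outdir"
    awk -v dir="$outdir" -v n="$per_chunk" '
        /<page>/ {
            # Start a new chunk file every n pages.
            if (count % n == 0) {
                if (out) close(out)
                out = sprintf("%s/chunk_%05d.xml", dir, count / n)
            }
            inpage = 1
        }
        inpage     { print > out }
        /<\/page>/ { inpage = 0; count++ }
    ' "$dump"
}

# process_parallel CHUNKDIR [JOBS]
# Run one sed per chunk, JOBS at a time. Each job writes to its own
# .titles file so that parallel output never interleaves.
process_parallel() {
    local dir="$1"
    local jobs="${2:-4}"
    find "$dir" -name 'chunk_*.xml' -print0 |
        xargs -0 -n 1 -P "$jobs" sh -c \
            'sed -n "s:.*<title>\(.*\)</title>.*:\1:p" "$1" > "$1.titles"' _
}
```

Called as `split_pages dump.xml chunks 1000` and then `process_parallel chunks 4`, this leaves one .titles file next to each chunk. Splitting on page boundaries first is what makes the parallel step safe, since no article is ever cut in half between two workers.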
If you are looking for tools to parse the Wikipedia XML dump, you may look here:
Experiments on the English Wikipedia — gensim
Wikipedia Preprocessor (WikiPrep)
Hope this helps.
XML::LibXML::RelaxNG(3) User Contributed Perl Documentation XML::LibXML::RelaxNG(3)
NAME
XML::LibXML::RelaxNG - RelaxNG Schema Validation
SYNOPSIS
use XML::LibXML;
$doc = XML::LibXML->new->parse_file($url);
$rngschema = XML::LibXML::RelaxNG->new( location => $filename_or_url );
$rngschema = XML::LibXML::RelaxNG->new( string => $xmlschemastring );
$rngschema = XML::LibXML::RelaxNG->new( DOM => $doc );
eval { $rngschema->validate( $doc ); };
DESCRIPTION
The XML::LibXML::RelaxNG class is a tiny frontend to libxml2's RelaxNG implementation. Currently it supports only schema parsing and
document validation.
METHODS
new
$rngschema = XML::LibXML::RelaxNG->new( location => $filename_or_url );
$rngschema = XML::LibXML::RelaxNG->new( string => $xmlschemastring );
$rngschema = XML::LibXML::RelaxNG->new( DOM => $doc );
The constructor of XML::LibXML::RelaxNG may be called with exactly one of three parameters. The parameter tells the class from which
source it should generate a validation schema. It is important that each schema has only a single source.
The location parameter parses a schema from the filesystem or a URL.
The string parameter parses the schema from the given XML string.
The DOM parameter parses the schema from a pre-parsed XML::LibXML::Document.
Note that the constructor will die() if the schema does not meet the constraints of the RelaxNG specification.
validate
eval { $rngschema->validate( $doc ); };
This function validates a (parsed) document against the given RelaxNG schema. The argument of this function should be an
XML::LibXML::Document object. If this function succeeds, it will return 0, otherwise it will die() and report the errors found. Because
of this, validate() should always be called inside an eval block.
AUTHORS
Matt Sergeant, Christian Glahn, Petr Pajas
VERSION
2.0008
COPYRIGHT
2001-2007, AxKit.com Ltd.
2002-2006, Christian Glahn.
2006-2009, Petr Pajas.
perl v5.16.2 2012-10-22 XML::LibXML::RelaxNG(3)