Post 302503322 by shoaibjameel123 on Thursday 10th of March 2011 07:48:38 AM
Extract text between two specified "constant" texts using awk

Hi All,
As the title suggests, this question has been asked several times, and I have done a lot of Googling on it.

I have a Wikipedia dump in XML format. All of the topics are stored in a single XML file, and I need to split them into a separate file per topic. After carefully going through the XML file, I found that each topic sits between <page> and </page> tags. I want to use awk to extract each topic and its description into its own file: the first topic goes into 1.dat, the second into 2.dat, and so on until the end of the file.
This is how the Wikipedia XML file looks:
HTML Code:
<page>
<title>APRIL</title>
.........(text contents that I need to extract and store in 1.dat including the <title> tag)
</page>
<page>
<title>August</title>
....(text contents that I need to store in 2.dat including the <title> tag)
</page>
so on.......

I tried the following, but it created havoc.
Code:
awk '</page>/{s++}print > "s.dat" s}' wiki.xml
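
For what it's worth, the one-liner above cannot parse as written: the pattern is missing its opening slash (and the slash inside </page> needs escaping), the print sits outside any braces, and there is a stray closing brace. Also, "s.dat" s concatenates the literal string s.dat with the counter, which would give names like s.dat1 rather than 1.dat. Below is a minimal sketch of one way it could be done, not a definitive answer. It assumes the <page> and </page> tags each sit on their own line, as in the sample, and that the tags themselves should be left out of the output. The heredoc only builds a tiny stand-in for the real dump, and the variable names n, out and inpage are made up for illustration.

```shell
# Build a tiny sample input (a hypothetical stand-in for the real dump).
cat > wiki.xml <<'EOF'
<page>
<title>APRIL</title>
text for April
</page>
<page>
<title>August</title>
text for August
</page>
EOF

# Start a new output file at each <page> and write lines until </page>.
awk '
/<page>/   { n++; out = n ".dat"; inpage = 1; next }     # opening tag: pick the next file name
/<\/page>/ { if (inpage) close(out); inpage = 0; next }  # closing tag: finish the current file
inpage     { print > out }                               # body lines, including <title>
' wiki.xml
```

Calling close() after each page matters on a large dump, since awk could otherwise run out of open file descriptors. If the <page> tags should be kept in the output, the two next statements would have to be replaced with explicit print calls.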


Last edited by shoaibjameel123; 03-10-2011 at 08:58 AM.