The UNIX and Linux Forums  


#1 (permalink)
11-10-2008
shridhard, Registered User
Join Date: Nov 2008
Posts: 5

Extracting a part of XML File

Hi Guys,

I have a very large XML feed (2.7 MB) that crashes the server when it is parsed. To reduce the load on the server, I have a cron job running every 5 minutes that fetches the file from the feed host and keeps it on the local machine.

This does not solve the problem, as the whole file still gets loaded on the server. The file looks something like this:

<?xml version="1.0" standalone="no"?>
<IRXML CorpMasterID="">
<NewsReleases PubDate="20081104" PubTime="16:48:03">
<NewsCategory Category="">
<NewsRelease ReleaseID="" DLU="20081104 16:47:00" ArchiveStatus="Current"
RNSSource="">
<Title></Title>
<ExternalURL/>
<Date Date="20081104" Time="16:33:00">11/4/2008 4:33:00 PM</Date>
<ContentNetworkingLinks/>
<Categories>
<Category></Category>
</Categories>
</NewsRelease>
<NewsRelease ReleaseID="" DLU="20081104 09:19:00" ArchiveStatus="Current"
RNSSource="">
<Title></Title>
<ExternalURL/>
<Date Date="20081104" Time="09:01:00">11/4/2008 9:01:00 AM</Date>
<ContentNetworkingLinks/>
<Categories>
<Category></Category>
</Categories>
</NewsRelease>

I want to write a shell script which will extract only the part starting from
<NewsRelease> till </NewsRelease>
Something like:

<NewsRelease ReleaseID="" DLU="20081104 09:19:00" ArchiveStatus="Current"
RNSSource="">
<Title></Title>
<ExternalURL/>
<Date Date="20081104" Time="09:01:00">11/4/2008 9:01:00 AM</Date>
<ContentNetworkingLinks/>
<Categories>
<Category></Category>
</Categories>
</NewsRelease>

There is one more problem: on Unix, the downloaded file contains no carriage returns (line breaks), so the complete file appears to be on one line.

Any help would be appreciated. Thanks,
Shridhar
#2 (permalink)
11-10-2008
wempy, Registered User
Join Date: Jun 2006
Location: Harpenden, UK
Posts: 208

Code:
sed -n '/<NewsRelease R/,/<\/NewsRelease>/p' xmldump >outputfile

#3 (permalink)
11-10-2008
wempy, Registered User
Join Date: Jun 2006
Location: Harpenden, UK
Posts: 208

Regarding the end-of-line problem, what format is the file currently in, i.e. does it use LF, CR/LF or CR as its end-of-line marker?
Which tool to use depends on the format.
To go from DOS to Unix, use dos2unix, or open the file in vim, run :set fileformat=unix and save it.
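
If you are not sure which end-of-line marker the file uses, counting the CR and LF bytes is one quick way to find out. This is only a rough sketch: eol_type is a made-up helper name, not an existing tool, and it cannot tell a clean CR/LF file from one with mixed endings.

```shell
# eol_type is a hypothetical helper: classify a file's line endings
# by counting its CR and LF bytes.
eol_type() {
    # tr -cd keeps only the named byte; wc -c then counts what is left
    cr=$(tr -cd '\r' < "$1" | wc -c | tr -d ' ')
    lf=$(tr -cd '\n' < "$1" | wc -c | tr -d ' ')
    if [ "$cr" -eq 0 ] && [ "$lf" -eq 0 ]; then
        echo "none"            # no line breaks at all: one long line
    elif [ "$cr" -eq 0 ]; then
        echo "LF"              # Unix style
    elif [ "$lf" -eq 0 ]; then
        echo "CR"              # old Mac style
    else
        echo "CRLF or mixed"   # DOS style, or a mixture
    fi
}

# Possible follow-ups, depending on the answer:
#   tr -d '\r'   < xmldump > xmldump.unix    # CR/LF -> LF
#   tr '\r' '\n' < xmldump > xmldump.unix    # CR    -> LF
```

If the answer is "none", the file has no line breaks to convert, and the conversion tools will not change anything.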
#4 (permalink)
11-11-2008
shridhard, Registered User
Join Date: Nov 2008
Posts: 5

copying the complete file

Thanks for the reply.

There seems to be a problem with the command. It runs, but when I look at the outputfile, it is a complete copy of the XML feed.
I don't think the file format is the problem, because I do not see ^M in the file.
I think the problem could be the multiple occurrences of "NewsRelease" in the file.

Also my requirement is that, I need the first 5 occurrences of <NewsRelease> ... </NewsRelease> from the XMLFeed to another file, as I need to Parse the first 5 news releases to HTML using XSL.

Please let me know if this is possible.

Thanks again.
Shridhar
#5 (permalink)
11-12-2008
summer_cherry, Forum Advisor, Registered User
Join Date: Jun 2007
Location: Beijing China
Posts: 1,089

Hope this helps.

It will print only the first five parts enclosed by <NewsRelease and /NewsRelease>.

Code:
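awk '/<NewsRelease[ >]/,/<\/NewsRelease>/{
# [ >] keeps the <NewsReleases> wrapper line from starting the range
# n counts the NewsRelease blocks that have already been closed
if(n<5)
	print
if(index($0,"</NewsRelease>")!=0)
	n++
}' filename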
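
A variant of the same idea, in case the number of releases should be adjustable: this sketch stops reading as soon as enough blocks have been printed, which may help with a large feed. first_releases is a made-up helper name, and the sketch assumes every tag is already on its own line.

```shell
# first_releases is a hypothetical helper: print the first N NewsRelease
# blocks, then stop reading the input.
# Usage: first_releases N [file...]
first_releases() {
    count=$1
    shift
    awk -v max="$count" '
        BEGIN                { if (max < 1) exit }   # nothing requested
        /<NewsRelease[ >]/   { inside = 1 }          # opening tag, not <NewsReleases>
        inside               { print }
        /<\/NewsRelease>/    { inside = 0; if (++n >= max) exit }
    ' "$@"
}

# Usage sketch (filename is the name used in the command above):
#   first_releases 5 filename > first5.xml
```

The output contains only the NewsRelease blocks, so it has no single root element and would need a wrapper before an XSL processor will accept it.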

#6 (permalink)
11-12-2008
shridhard, Registered User
Join Date: Nov 2008
Posts: 5

Thanks got it almost working

Thanks for the reply, it worked. I have to add a few more things to make it work completely.

Warm Regards,
Shridhar
#7 (permalink)
11-12-2008
fpmurphy, Forum Staff, Moderator
Join Date: Dec 2003
Location: Florida
Posts: 1,935

Quote:
Also my requirement is that, I need the first 5 occurrences of <NewsRelease> ... </NewsRelease> from the XMLFeed to another file, as I need to Parse the first 5 news releases to HTML using XSL.

Why not extract the first 5 releases using XSLT? For example:

Code:
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <xsl:output method="xml"/>

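  <!-- XSLT 1.0 built-in templates do not forward parameters, so select NewsReleases directly -->
  <xsl:template match="/">
    <xsl:apply-templates select="//NewsReleases">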
      <xsl:with-param name="mycount" select="5"/>
    </xsl:apply-templates>
  </xsl:template>

  <xsl:template match="NewsReleases">
    <xsl:param name="mycount"/>
      <xsl:element name="NewsReleases">
      <xsl:attribute name="PubDate">
         <xsl:value-of select="@PubDate"/>
      </xsl:attribute>
      <xsl:attribute name="PubTime">
         <xsl:value-of select="@PubTime"/>
      </xsl:attribute>
      <xsl:text>&#xA;</xsl:text>
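      <!-- The parentheses make position() count across the whole document, not per parent element -->
      <xsl:for-each select="(//NewsRelease)[position() &lt;= $mycount]">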
        <xsl:copy-of select="."/>
      </xsl:for-each>
      <xsl:text>&#xA;</xsl:text>
      </xsl:element>
  </xsl:template>

</xsl:stylesheet>

This assumes that your irXML document is well-formed XML, which is not the case for the sample document you supplied (the closing </NewsCategory>, </NewsReleases> and </IRXML> tags are missing).