The UNIX and Linux Forums  
#1 (permalink)
06-11-2009
Jerrad
Registered User
Join Date: May 2006
Posts: 7
regex/shell script to Parse through XML Records

Hi All,

I have been working on something that doesn't seem to have a clear regex solution and I just wanted to run it by everyone to see if I could get some insight into the method of solving this problem.

I have a flat text file that contains billing records for users; however, the records are stored as XML, with each record starting at <record> and ending at </record>.

What I am trying to do is search for a user's ID and extract the complete record for that user.

Sample Data

Quote:
<record>
<recId>xxxxxxxxxxxxxxx</recId>
<created>Wed Dec 17 06:00:16 2008</created>
<userid>jondoe</userid>
<domain>xxxxxxxxxxxxxxxxxxxx</domain>
<type>260</type>
<nasIP>xxxxxxxxxxxxxxxx</nasIP>
<portType>18</portType>
<radIP>xxxxxxxxxxxxxxx</radIP>
<userIP>0.0.0.0</userIP>
<delta>7598</delta>
<gmtOffset>0</gmtOffset>
<bytesIn>3159</bytesIn>
<bytesOut>563</bytesOut>
<packetsIn>52</packetsIn>
<packetsOut>19</packetsOut>
<proxyAuthIPAddr>0</proxyAuthIPAddr>
<proxyAcctIPAddr>xxxxxxxxxxxxxxx</proxyAcctIPAddr>
<proxyAcctAck>1</proxyAcctAck>
<termCause>17</termCause>
<clientIPAddr>xxxxxxxxxxxxxxx</clientIPAddr>
<entityID>955</entityID>
<entityCtxt>1</entityCtxt>
<backupMethod>L</backupMethod>
<sessionCountInfo></sessionCountInfo>
<clientID>xxxxxxxxxxxxxxx</clientID>
<sessionID>xxxxxxxxxxxxxxxxxxxxxx</sessionID>
<nasID>xxxxx</nasID>
<nasVendor>xxxxxx</nasVendor>
<nasModel>xxxxxxxxxxxx</nasModel>
<nasPort>xxxxxxxx</nasPort>
<billingID></billingID>
<startDate>2008/12/17 03:57:06</startDate>
<callingNumber>xxxxxxxxxxxxxxx</callingNumber>
<calledNumber></calledNumber>
<radiusAttr>xxxxxxxxxxxxxxxx</radiusAttr>
<startAttr></startAttr>
<auditID>xxxxxxxxxxxxxxxxxxxxxxxx</auditID>
<seqNum>0</seqNum>
<accountName></accountName>
</record><record>
<recId>xxxxxxxxxxxxxxx</recId>
<created>Wed Dec 17 06:00:16 2008</created>
<userid>janedoe</userid>
<domain>xxxxxxxxxxxxxxxxxxxx</domain>
<type>260</type>
<nasIP>xxxxxxxxxxxxxxxx</nasIP>
<portType>18</portType>
<radIP>xxxxxxxxxxxxxxx</radIP>
<userIP>0.0.0.0</userIP>
<delta>7598</delta>
<gmtOffset>0</gmtOffset>
<bytesIn>3159</bytesIn>
<bytesOut>563</bytesOut>
<packetsIn>52</packetsIn>
<packetsOut>19</packetsOut>
<proxyAuthIPAddr>0</proxyAuthIPAddr>
<proxyAcctIPAddr>xxxxxxxxxxxxxxx</proxyAcctIPAddr>
<proxyAcctAck>1</proxyAcctAck>
<termCause>17</termCause>
<clientIPAddr>xxxxxxxxxxxxxxx</clientIPAddr>
<entityID>955</entityID>
<entityCtxt>1</entityCtxt>
<backupMethod>L</backupMethod>
<sessionCountInfo></sessionCountInfo>
<clientID>xxxxxxxxxxxxxxx</clientID>
<sessionID>xxxxxxxxxxxxxxxxxxxxxx</sessionID>
<nasID>xxxxx</nasID>
<nasVendor>xxxxxx</nasVendor>
<nasModel>xxxxxxxxxxxx</nasModel>
<nasPort>xxxxxxxx</nasPort>
<billingID></billingID>
<startDate>2008/12/17 03:57:06</startDate>
<callingNumber>xxxxxxxxxxxxxxx</callingNumber>
<calledNumber></calledNumber>
<radiusAttr>xxxxxxxxxxxxxxxx</radiusAttr>
<startAttr></startAttr>
<auditID>xxxxxxxxxxxxxxxxxxxxxxxx</auditID>
<seqNum>0</seqNum>
<accountName></accountName>
</record><record>
What I would like to be able to do is search for jondoe and have it print all records for jondoe.

The output would be the following. There could be multiple records in the file for this user, so it would need to write each record to a text file or standard output as it finds it.

Quote:
<record>
<recId>xxxxxxxxxxxxxxx</recId>
<created>Wed Dec 17 06:00:16 2008</created>
<userid>jondoe</userid>
<domain>xxxxxxxxxxxxxxxxxxxx</domain>
<type>260</type>
<nasIP>xxxxxxxxxxxxxxxx</nasIP>
<portType>18</portType>
<radIP>xxxxxxxxxxxxxxx</radIP>
<userIP>0.0.0.0</userIP>
<delta>7598</delta>
<gmtOffset>0</gmtOffset>
<bytesIn>3159</bytesIn>
<bytesOut>563</bytesOut>
<packetsIn>52</packetsIn>
<packetsOut>19</packetsOut>
<proxyAuthIPAddr>0</proxyAuthIPAddr>
<proxyAcctIPAddr>xxxxxxxxxxxxxxx</proxyAcctIPAddr>
<proxyAcctAck>1</proxyAcctAck>
<termCause>17</termCause>
<clientIPAddr>xxxxxxxxxxxxxxx</clientIPAddr>
<entityID>955</entityID>
<entityCtxt>1</entityCtxt>
<backupMethod>L</backupMethod>
<sessionCountInfo></sessionCountInfo>
<clientID>xxxxxxxxxxxxxxx</clientID>
<sessionID>xxxxxxxxxxxxxxxxxxxxxx</sessionID>
<nasID>xxxxx</nasID>
<nasVendor>xxxxxx</nasVendor>
<nasModel>xxxxxxxxxxxx</nasModel>
<nasPort>xxxxxxxx</nasPort>
<billingID></billingID>
<startDate>2008/12/17 03:57:06</startDate>
<callingNumber>xxxxxxxxxxxxxxx</callingNumber>
<calledNumber></calledNumber>
<radiusAttr>xxxxxxxxxxxxxxxx</radiusAttr>
<startAttr></startAttr>
<auditID>xxxxxxxxxxxxxxxxxxxxxxxx</auditID>
<seqNum>0</seqNum>
<accountName></accountName>
</record>
I started with a regex that tries to grab <record>, then jondoe, then </record>:

<record>(\s|\S)+jondoe(\s|\S)+</record>

However, this selects all of the records rather than just the matching one. Even if I could extract only the portion I want, I am not sure how to have it remember where it left off and keep working through the file without creating duplicates.

Since this is being performed on Solaris 10, I wasn't able to use some of the more advanced grep features such as the context options, grep -B x -A x.

Thanks in advance for any help you can provide
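
A note on why the pattern above grabs too much: (\s|\S)+ is greedy, so it runs from the first <record> in the file to the last </record>, as long as the name appears somewhere in between. The sketch below shows one way to keep a match inside a single record, assuming Perl is available on the box. The function name extract_with_regex is made up for illustration, and the whole file is read into memory, so treat it as a sketch for modest file sizes rather than a finished tool.

```shell
#!/bin/sh
# extract_with_regex USERID FILE   (hypothetical helper name)
# Prints every complete <record>...</record> block whose userid is USERID.
extract_with_regex() {
    # Pass the id through the environment so it is never spliced into Perl source.
    USERID="$1" perl -0777 -ne '
        my $u = quotemeta $ENV{USERID};   # treat the id as literal text
        # (?:(?!<record>).)*?  any characters, but never across a new <record>
        # .*?</record>         stop at the first closing tag, not the last
        print "$&\n" while m{<record>(?:(?!<record>).)*?<userid>$u</userid>.*?</record>}sg;
    ' "$2"
}
```

Called as extract_with_regex jondoe file3, it prints one block per matching record. The /g loop is what "remembers where it left off": each search resumes after the previous match, so no record is printed twice.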
#2 (permalink)
06-11-2009
edgarvm
Registered User
Join Date: May 2009
Posts: 26
Maybe you could try XPath; you can find a Perl module for XML processing on cpan.org.
#3 (permalink)
06-11-2009
ghostdog74 (Forum Advisor)
Registered User
Join Date: Sep 2006
Posts: 2,527
Does "</record><record>" always appear together like this, or on separate lines?
#4 (permalink)
06-12-2009
casman46
Registered User
Join Date: Oct 2008
Posts: 3
Using the sample data, I obtained the requested output with this script:

Code:
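#!/usr/bin/ksh
# Usage: script userid
# Needs gawk: not every awk accepts a multi-character RS.

gawk -v name="$1" '
BEGIN {
   # Everything up to each closing tag is one awk record.
   RS = "</record>"; ORS = "</record>"
   # Match the whole element so "jondoe" does not also select "jondoe2".
   want = "<userid>" name "</userid>"
}

# Print the record only if it contains the requested userid element.
index($0, want) > 0
' file3 > awk.out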
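
If gawk is not installed, here is a rough line-based sketch that avoids the multi-character RS. It assumes an awk that understands -v (on Solaris that usually means nawk or /usr/xpg4/bin/awk rather than the old /usr/bin/awk), and it assumes record tags sit either on their own line or glued together as "</record><record>" like in the sample. The function name extract_records is made up for illustration.

```shell
#!/bin/sh
# extract_records USERID FILE   (hypothetical helper name)
# Prints every complete record whose <userid> element is exactly USERID.
extract_records() {
    # Step 1: split "</record><record>" so each tag starts its own line.
    # The backslash-newline in the replacement is the portable way to emit a newline.
    sed 's|</record><record>|</record>\
<record>|g' "$2" |
    # Step 2: buffer one record at a time and print it only if the id was seen.
    awk -v want="<userid>$1</userid>" '
        /<record>/   { buf = ""; keep = 0; inrec = 1 }   # start a fresh buffer
        inrec        { buf = buf $0 "\n"; if (index($0, want)) keep = 1 }
        /<\/record>/ { if (inrec && keep) printf "%s", buf; inrec = 0 }
    '
}
```

Run it as extract_records jondoe file3 > awk.out. A dangling <record> at the end of the file is never closed, so it is never printed.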
#5 (permalink)
06-12-2009
fpmurphy (Forum Staff)
Moderator
Join Date: Dec 2003
Location: Florida
Posts: 1,923
An XSL stylesheet is the easiest way to process your records. Consider the following sample set of records:
Code:
<records>
   <record>
       <recId>1</recId>
       <created>Wed Dec 10 06:00:16 2008</created>
       <userid>joebloggs</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
   <record>
       <recId>2</recId>
       <created>Wed Dec 17 06:00:16 2008</created>
       <userid>jondoe</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
   <record>
       <recId>3</recId>
       <created>Wed Jan 19 06:00:16 2008</created>
       <userid>jjhollis</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
   <record>
       <recId>4</recId>
       <created>Mon Dec 22 16:30:17 2008</created>
       <userid>jondoe</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
</records>
which is a valid and well-formed XML document containing 4 records.

Using the following XSL stylesheet with xsltproc:
Code:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- pass in userid as -param userid "'jondoe'"  -->
<xsl:param name="userid" />

<xsl:output method="xml" indent="yes" />

<xsl:template match="records">
<records>
   <xsl:apply-templates select="record" />
</records>
</xsl:template>

<xsl:template match="record">
   <xsl:if test="userid=$userid">
       <xsl:copy-of select="." />
   </xsl:if>
</xsl:template>

</xsl:stylesheet>
you can output all the records for "jondoe" to stdout as follows:
Code:
$ xsltproc --param userid "'jondoe'" file42.xsl file42.xml
<?xml version="1.0"?>
<records>
  <record>
       <recId>2</recId>
       <created>Wed Dec 17 06:00:16 2008</created>
       <userid>jondoe</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
  <record>
       <recId>4</recId>
       <created>Mon Dec 22 16:30:17 2008</created>
       <userid>jondoe</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
</records>
$
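
One gap worth noting: the stylesheet needs a well-formed document with a single root element, while the file in the first post is a bare run of records that ends in a dangling <record>. A minimal sketch of a wrapper that adds the <records> root and drops any unclosed trailing record is below. The function name wrap_records and the file name billing.txt are made up for illustration, and the sketch assumes record tags are either on their own line or glued together as "</record><record>".

```shell
#!/bin/sh
# wrap_records FILE   (hypothetical helper name)
# Emits FILE's complete records inside a single <records> root element.
wrap_records() {
    echo '<records>'
    # Put every <record> tag at the start of its own line.
    sed 's|</record><record>|</record>\
<record>|g' "$1" |
    # Copy only records that are actually closed; an unclosed tail is dropped.
    awk '
        /<record>/   { buf = ""; inrec = 1 }
        inrec        { buf = buf $0 "\n" }
        /<\/record>/ { if (inrec) printf "%s", buf; inrec = 0 }
    '
    echo '</records>'
}

# Possible use, assuming your xsltproc accepts "-" for standard input
# (if it does not, redirect to a temporary file first):
# wrap_records billing.txt | xsltproc --param userid "'jondoe'" file42.xsl -
```

This only fixes the missing root. If any field contains a bare & or <, the result will still not parse as XML and would need escaping first.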
#6 (permalink)
06-12-2009
Jerrad
Registered User
Join Date: May 2006
Posts: 7
Thanks for all the replies guys, I will try some of the suggestions you made and see what I can come up with.
The UNIX and Linux Forums Content Copyright ©1993-2009. All Rights Reserved.