regex/shell script to Parse through XML Records


 
# 1  
Old 06-11-2009

Hi All,

I have been working on something that doesn't seem to have a clear regex solution and I just wanted to run it by everyone to see if I could get some insight into the method of solving this problem.

I have a flat text file that contains billing records for users. The records are stored as XML, with each record starting at <record> and ending at </record>.

What I am trying to do is search for a user's ID and extract every complete record for that user.

Sample Data

Quote:
<record>
<recId>xxxxxxxxxxxxxxx</recId>
<created>Wed Dec 17 06:00:16 2008</created>
<userid>jondoe</userid>
<domain>xxxxxxxxxxxxxxxxxxxx</domain>
<type>260</type>
<nasIP>xxxxxxxxxxxxxxxx</nasIP>
<portType>18</portType>
<radIP>xxxxxxxxxxxxxxx</radIP>
<userIP>0.0.0.0</userIP>
<delta>7598</delta>
<gmtOffset>0</gmtOffset>
<bytesIn>3159</bytesIn>
<bytesOut>563</bytesOut>
<packetsIn>52</packetsIn>
<packetsOut>19</packetsOut>
<proxyAuthIPAddr>0</proxyAuthIPAddr>
<proxyAcctIPAddr>xxxxxxxxxxxxxxx</proxyAcctIPAddr>
<proxyAcctAck>1</proxyAcctAck>
<termCause>17</termCause>
<clientIPAddr>xxxxxxxxxxxxxxx</clientIPAddr>
<entityID>955</entityID>
<entityCtxt>1</entityCtxt>
<backupMethod>L</backupMethod>
<sessionCountInfo></sessionCountInfo>
<clientID>xxxxxxxxxxxxxxx</clientID>
<sessionID>xxxxxxxxxxxxxxxxxxxxxx</sessionID>
<nasID>xxxxx</nasID>
<nasVendor>xxxxxx</nasVendor>
<nasModel>xxxxxxxxxxxx</nasModel>
<nasPort>xxxxxxxx</nasPort>
<billingID></billingID>
<startDate>2008/12/17 03:57:06</startDate>
<callingNumber>xxxxxxxxxxxxxxx</callingNumber>
<calledNumber></calledNumber>
<radiusAttr>xxxxxxxxxxxxxxxx</radiusAttr>
<startAttr></startAttr>
<auditID>xxxxxxxxxxxxxxxxxxxxxxxx</auditID>
<seqNum>0</seqNum>
<accountName></accountName>
</record><record>
<recId>xxxxxxxxxxxxxxx</recId>
<created>Wed Dec 17 06:00:16 2008</created>
<userid>janedoe</userid>
<domain>xxxxxxxxxxxxxxxxxxxx</domain>
<type>260</type>
<nasIP>xxxxxxxxxxxxxxxx</nasIP>
<portType>18</portType>
<radIP>xxxxxxxxxxxxxxx</radIP>
<userIP>0.0.0.0</userIP>
<delta>7598</delta>
<gmtOffset>0</gmtOffset>
<bytesIn>3159</bytesIn>
<bytesOut>563</bytesOut>
<packetsIn>52</packetsIn>
<packetsOut>19</packetsOut>
<proxyAuthIPAddr>0</proxyAuthIPAddr>
<proxyAcctIPAddr>xxxxxxxxxxxxxxx</proxyAcctIPAddr>
<proxyAcctAck>1</proxyAcctAck>
<termCause>17</termCause>
<clientIPAddr>xxxxxxxxxxxxxxx</clientIPAddr>
<entityID>955</entityID>
<entityCtxt>1</entityCtxt>
<backupMethod>L</backupMethod>
<sessionCountInfo></sessionCountInfo>
<clientID>xxxxxxxxxxxxxxx</clientID>
<sessionID>xxxxxxxxxxxxxxxxxxxxxx</sessionID>
<nasID>xxxxx</nasID>
<nasVendor>xxxxxx</nasVendor>
<nasModel>xxxxxxxxxxxx</nasModel>
<nasPort>xxxxxxxx</nasPort>
<billingID></billingID>
<startDate>2008/12/17 03:57:06</startDate>
<callingNumber>xxxxxxxxxxxxxxx</callingNumber>
<calledNumber></calledNumber>
<radiusAttr>xxxxxxxxxxxxxxxx</radiusAttr>
<startAttr></startAttr>
<auditID>xxxxxxxxxxxxxxxxxxxxxxxx</auditID>
<seqNum>0</seqNum>
<accountName></accountName>
</record><record>
What I would like to be able to do is search for jondoe and have it print all records for jondoe.

The output would look like the following. There could be multiple records in the file for this user, so each matching record would need to be written to a text file or to standard output as it is found.

Quote:
<record>
<recId>xxxxxxxxxxxxxxx</recId>
<created>Wed Dec 17 06:00:16 2008</created>
<userid>jondoe</userid>
<domain>xxxxxxxxxxxxxxxxxxxx</domain>
<type>260</type>
<nasIP>xxxxxxxxxxxxxxxx</nasIP>
<portType>18</portType>
<radIP>xxxxxxxxxxxxxxx</radIP>
<userIP>0.0.0.0</userIP>
<delta>7598</delta>
<gmtOffset>0</gmtOffset>
<bytesIn>3159</bytesIn>
<bytesOut>563</bytesOut>
<packetsIn>52</packetsIn>
<packetsOut>19</packetsOut>
<proxyAuthIPAddr>0</proxyAuthIPAddr>
<proxyAcctIPAddr>xxxxxxxxxxxxxxx</proxyAcctIPAddr>
<proxyAcctAck>1</proxyAcctAck>
<termCause>17</termCause>
<clientIPAddr>xxxxxxxxxxxxxxx</clientIPAddr>
<entityID>955</entityID>
<entityCtxt>1</entityCtxt>
<backupMethod>L</backupMethod>
<sessionCountInfo></sessionCountInfo>
<clientID>xxxxxxxxxxxxxxx</clientID>
<sessionID>xxxxxxxxxxxxxxxxxxxxxx</sessionID>
<nasID>xxxxx</nasID>
<nasVendor>xxxxxx</nasVendor>
<nasModel>xxxxxxxxxxxx</nasModel>
<nasPort>xxxxxxxx</nasPort>
<billingID></billingID>
<startDate>2008/12/17 03:57:06</startDate>
<callingNumber>xxxxxxxxxxxxxxx</callingNumber>
<calledNumber></calledNumber>
<radiusAttr>xxxxxxxxxxxxxxxx</radiusAttr>
<startAttr></startAttr>
<auditID>xxxxxxxxxxxxxxxxxxxxxxxx</auditID>
<seqNum>0</seqNum>
<accountName></accountName>
</record>
I started with a regex that tries to match <record>, then jondoe, then </record>: <record>(\s|\S)+jondoe(\s|\S)+</record>

However, this matches every record in the file rather than just the one I want. Even if I could extract just the portion I want, I am not sure how to have it remember where it left off and keep working through the file without creating duplicates.
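
For reference, the pattern selects everything because + is greedy: (\s|\S)+ can cross record boundaries, so the match runs from the first <record> to the last </record> in the file. Below is a minimal sketch of the same effect, using awk's match() (which returns the leftmost-longest match) on made-up one-line records. The variable names and the data are assumptions for illustration only.

```shell
#!/bin/sh
# Three one-line records (made-up sample data); only the middle one is jondoe.
text='<record><userid>alice</userid></record>
<record><userid>jondoe</userid></record>
<record><userid>carol</userid></record>'

# match() returns the leftmost-longest match, so the pattern stretches from
# the first <record> to the last </record> instead of stopping at one record.
matched=$(printf '%s\n' "$text" | awk '
    { buf = buf $0 "\n" }                 # slurp the whole input into buf
    END {
        if (match(buf, /<record>.*jondoe.*<\/record>/))
            printf "%s\n", substr(buf, RSTART, RLENGTH)
    }')

# Prints all three records, not just the one for jondoe.
printf '%s\n' "$matched"
```

Working through the file one record at a time avoids the problem, because no single match is ever allowed to span two records.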

Since this is running on Solaris 10, I wasn't able to use some of the more advanced grep features such as grep -B(x) -A(x).

Thanks in advance for any help you can provide
# 2  
Old 06-11-2009
Maybe you could try XPath. You can find a Perl module for XML processing on cpan.org.
# 3  
Old 06-11-2009
Does "</record><record>" always appear together like this, or can the two tags be on separate lines?
# 4  
Old 06-12-2009
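Using the sample data, I obtained the requested output with this script:

Code:
#!/usr/bin/ksh
# Usage: script userid
# Prints every <record>...</record> block whose <userid> equals the argument.
# A multi-character RS needs gawk.

gawk -v name="$1" '
BEGIN {
   RS = "</record>"; ORS = "</record>\n"
}

# Compare the whole element so that "doe" does not also select "jondoe".
index($0, "<userid>" name "</userid>") > 0 {
   sub(/^[ \t\r\n]+/, "")   # drop whitespace left between records
   print
}
' file3 > awk.out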
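
If gawk is not available (a stock Solaris 10 install may only have nawk and /usr/xpg4/bin/awk), the same idea can be sketched without a multi-character RS. This is only a sketch under some assumptions: the function name extract_records, the temporary file name, and the sample records are made up for illustration, and it assumes <record> and </record> never appear inside other element content. It buffers lines from <record> to </record> and prints the buffer only when the exact <userid> element was seen, so it copes with </record><record> on one line or on separate lines.

```shell
#!/bin/sh
# extract_records USERID FILE  (hypothetical helper name)
# Print each <record>...</record> block in FILE whose <userid> is exactly USERID.
# Uses only POSIX awk features; on Solaris 10 try nawk or /usr/xpg4/bin/awk.
extract_records() {
    awk -v name="$1" '
    BEGIN { want = "<userid>" name "</userid>" }
    {
        line = $0
        # Put a break after every closing tag so "</record><record>"
        # is handled as two separate pieces.
        gsub(/<\/record>/, "&\n", line)
        n = split(line, piece, "\n")
        for (i = 1; i <= n; i++) {
            if (piece[i] == "") continue
            # A new record starts: reset the buffer and the match flag.
            if (index(piece[i], "<record>")) { buf = ""; inrec = 1; hit = 0 }
            if (inrec) {
                buf = buf piece[i] "\n"
                if (index(piece[i], want)) hit = 1
            }
            # The record ends: print it only if the userid matched.
            if (index(piece[i], "</record>")) {
                if (inrec && hit) printf "%s", buf
                inrec = 0
            }
        }
    }' "$2"
}

# Made-up sample data in the same layout as the original post.
sample=/tmp/records.$$
trap 'rm -f "$sample"' EXIT
cat > "$sample" <<'EOF'
<record>
<recId>1</recId>
<userid>jondoe</userid>
</record><record>
<recId>2</recId>
<userid>janedoe</userid>
</record><record>
<recId>3</recId>
<userid>jondoe</userid>
</record>
EOF

extract_records jondoe "$sample"
```

Because each record is printed as soon as its closing tag is read, the script never needs to remember a position in the file, and no record can be written twice.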

# 5  
Old 06-12-2009
An XSL stylesheet is the easiest way to process your records. Consider the following sample set of records:
Code:
<records>
   <record>
       <recId>1</recId>
       <created>Wed Dec 10 06:00:16 2008</created>
       <userid>joebloggs</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
   <record>
       <recId>2</recId>
       <created>Wed Dec 17 06:00:16 2008</created>
       <userid>jondoe</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
   <record>
       <recId>3</recId>
       <created>Wed Jan 19 06:00:16 2008</created>
       <userid>jjhollis</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
   <record>
       <recId>4</recId>
       <created>Mon Dec 22 16:30:17 2008</created>
       <userid>jondoe</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
</records>

which is a well-formed XML document containing four records inside a single <records> root element.

Using the following XSL stylesheet with xsltproc:
Code:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- pass in userid as -param userid "'jondoe'"  -->
<xsl:param name="userid" />

<xsl:output method="xml" indent="yes" />

<xsl:template match="records">
<records>
   <xsl:apply-templates select="record" />
</records>
</xsl:template>

<xsl:template match="record">
   <xsl:if test="userid=$userid">
       <xsl:copy-of select="." />
   </xsl:if>
</xsl:template>

</xsl:stylesheet>

you can output all the records for "jondoe" to stdout as follows:
Code:
$ xsltproc --param userid "'jondoe'" file42.xsl file42.xml
<?xml version="1.0"?>
<records>
  <record>
       <recId>2</recId>
       <created>Wed Dec 17 06:00:16 2008</created>
       <userid>jondoe</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
  <record>
       <recId>4</recId>
       <created>Mon Dec 22 16:30:17 2008</created>
       <userid>jondoe</userid>
       <domain>xxxxxxxxxxxxxxxxxxxx</domain>
   </record>
</records>
$
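
Note that the file described in the first post has no single root element, so xsltproc would likely reject it as it stands. Below is a minimal sketch of wrapping it first, assuming the file contains only complete <record>...</record> blocks and no XML declaration. The helper name wrap_records and the file names are hypothetical.

```shell
#!/bin/sh
# wrap_records FILE (hypothetical helper): write FILE to stdout
# enclosed in one <records> root element.
wrap_records() {
    echo '<records>'
    cat "$1"
    echo '</records>'
}

# Made-up stand-in for the raw billing file: records with no root element.
raw=/tmp/billing.$$
trap 'rm -f "$raw" "$raw.xml"' EXIT
cat > "$raw" <<'EOF'
<record>
<recId>1</recId>
<userid>jondoe</userid>
</record><record>
<recId>2</recId>
<userid>janedoe</userid>
</record>
EOF

wrap_records "$raw" > "$raw.xml"
cat "$raw.xml"

# With xsltproc installed and the stylesheet above saved as file42.xsl:
#   xsltproc --param userid "'jondoe'" file42.xsl "$raw.xml"
```

The wrapped copy can then be passed to the stylesheet in place of file42.xml.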

# 6  
Old 06-12-2009
Thanks for all the replies, guys. I will try some of your suggestions and see what I can come up with.