Need Help in extracting data from XML File


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Need Help in extracting data from XML File
# 1  
Old 12-23-2014
Need Help in extracting data from XML File

Hi All
My input file is an XML and it has some tags and data rows at end.

Starting of data rows is <rs:data> and ending of data rows is </rs:data>.

Within sample data rows (2 rows) shown below, I want to extract data value after equal to sign (until space or "/" sign).

So if XML data rows look like this
Code:
 
<z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' /> 
<z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' />

So output should be 2 rows with data as shown below:
Code:
'John Doe','XYZ Hospital','Heart Health'
'Jane Doe','XYZ Hospital','Maternity'



Complete XML Input file looks like this:

-----------------------------------------------------------------------------------------
Code:
<?xml version="1.0" encoding="utf-8"?>
<xml xmlns:s='uuid:XYZ'
xmlns:dt='uuid:ABC'
xmlns:rs='urn:schemas-microsoft-com:rowset'
xmlns:z='#RowsetSchema'>
<s:Schema id='RowsetSchema'>
<s:ElementType name='row' content='eltOnly' rs:CommandTimeout='30'>
<s:AttributeType name='First_Name' rs:name='First_Name' rs:number='1'>
<s:datatype dt:type='string' dt:maxLength='100' />
</s:AttributeType>
<s:AttributeType name='Hospital_Name' rs:name='Hospital_Name' rs:number='2'>
<s:datatype dt:type='string' dt:maxLength='100' />
</s:AttributeType>
<s:AttributeType name='Dept_Name' rs:name='Dept_Name' rs:number='3'>
<s:datatype dt:type='string' dt:maxLength='100' />
</s:AttributeType>
</s:ElementType>
</s:Schema>
<rs:data>
<z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' /> 
<z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' /> 
</rs:data>
</xml>

-----------------------------------------------------------------------------------------

Appreciate your help in advance.


Moderator's Comments:
Mod Comment Please use code tags next time for your code and data. Thanks
# 2  
Old 12-23-2014
Code:
$ sed -n "/^<z:/s/[^']*\('[^']*'\)[^']*\('[^']*'\)[^']*\('[^']*'\).*/\1,\2,\3/p" file
'John Doe','XYZ Hospital','Heart Health'
'Jane Doe','XYZ Hospital','Maternity'

This User Gave Thanks to anbu23 For This Post:
# 3  
Old 12-23-2014
Hello vx04,

Following may help you in same.
Code:
awk -vs1="'" -F"' " '{if($0 ~ /<rs:data>/){B=1;getline}} {if($0 ~ /<\/rs:data>/){B=0}}  B{for(i=1;i<=NF;i++){gsub(/.*\=||[/><]/,X,$i);A=A?A s1","$i:$i};gsub(/\, $/,X,A);gsub(/\,$/,X,A);print A;A=""}'  Input_file

Output will be as follows.
Code:
'John Doe','XYZ Hospital','Heart Health'
'Jane Doe','XYZ Hospital','Maternity'

EDIT: Adding a non oneliner form for same.
Code:
awk -vs1="'" -F"' " '
                        {if($0 ~ /<rs:data>/)
                                                {B=1;getline}
                        }
                        {if($0 ~ /<\/rs:data>/)
                                                {B=0}
                        }
                        B{for(i=1;i<=NF;i++)     {
                                                 gsub(/.*\=||[/><]/,X,$i);
                                                 A=A?A s1","$i:$i};
                                                 gsub(/\, $/,X,A);
                                                 gsub(/\,$/,X,A);
                                                 print A;A=""
                        }
                    '  Input_file


Thanks,
R. Singh

Last edited by RavinderSingh13; 12-23-2014 at 09:00 AM.. Reason: Added a non oneliner form for solution
This User Gave Thanks to RavinderSingh13 For This Post:
# 4  
Old 12-23-2014
Gawk

Code:
 gawk '/^<z/{s="";  while (match($0, /\047([^\047]*)\047(.*)/, x)) { s = sprintf("%s%s\047%s\047",s,(s?OFS:""),x[1] ); $0 = x[2] } print s }' OFS=, infile

---------- Post updated at 09:38 PM ---------- Previous update was at 09:23 PM ----------

Code:
awk -F"'"  '/^<z/{s=""; for (i=2; i<=NF; i+=2) s = sprintf("%s%s\047%s\047",s,(s?OFS:""),$i ); print s }' OFS=, infile

This User Gave Thanks to Akshay Hegde For This Post:
# 5  
Old 12-23-2014
This takes care of
- no data outside the <rs:data> ... </rs:data> region
- several tagged data fields can share the same line
Code:
sed -n "/<rs:data>/,/<\/rs:data>/{s/^.*<rs:data>//;s/<\/rs:data>.*$//;s/^<z[^=']*=//;s/ [^=']*=/,/g;s/[^']*$//;p}" file4
'John Doe','XYZ Hospital','Heart Health'
'Jane Doe','XYZ Hospital','Maternity'

This User Gave Thanks to RudiC For This Post:
# 6  
Old 12-23-2014
Hi Guys
Thanks for taking time to reply. I wanted to mention that we are using AIX 6.1 and none of the answers seemed to work.

Here is the error I get on each


Input File
==========


Code:
==> cat Hospital.xml
<?xml version="1.0" encoding="utf-8"?>
<xml xmlns:s='uuid:XYZ'
     xmlns:dt='uuid:ABC'
     xmlns:rs='urn:schemas-microsoft-com:rowset'
     xmlns:z='#RowsetSchema'>
<s:Schema id='RowsetSchema'>
   <s:ElementType name='row' content='eltOnly' rs:CommandTimeout='30'>
      <s:AttributeType name='First_Name' rs:name='First_Name' rs:number='1'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Hospital_Name' rs:name='Hospital_Name' rs:number='2'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
      <s:AttributeType name='Dept_Name' rs:name='Dept_Name' rs:number='3'>
         <s:datatype dt:type='string' dt:maxLength='100' />
      </s:AttributeType>
   </s:ElementType>
</s:Schema>
<rs:data>
   <z:First_Name='John Doe' Hospital_Name='XYZ Hospital' Dept_Name='Heart Health' />
   <z:First_Name='Jane Doe' Hospital_Name='XYZ Hospital' Dept_Name='Maternity' />
</rs:data>
</xml>

I am using AIX 6.1 version.


=======

1st Solution by anbu23 does not bring any results.

Code:
==> cat scr1.ksh
sed -n "/^<z:/s/[^']*\('[^']*'\)[^']*\('[^']*'\)[^']*\('[^']*'\).*/\1,\2,\3/p" Hospital.xml

Code:
==> ./scr1.ksh > scr1_out.txt

-rw-rw-rw-    1 user1 group1            0 Dec 23 10:16 scr1_out.txt

=========================================================================

2nd Solution by RavinderSingh13 gives regular expression editor error

Code:
==> cat scr2.ksh
awk -vs1="'" -F"' " '{if($0 ~ /<rs:data>/){B=1;getline}} {if($0 ~ /<\/rs:data>/){B=0}}  B{for(i=1;i<=NF;i++){gsub(/.*\=||[/><]/,X,$i);A=A?A s1","$i:$i};gsub(/\, $/,X,A);gsub(/\,$/,X,A);print A;A=""}' Hospital.xml

==> ./scr2.ksh
Code:
awk: 0602-521 There is a regular expression error.
        [] imbalance
 The source line number is 1.
 The error context is
                {if($0 ~ /<rs:data>/){B=1;getline}} {if($0 ~ /<\/rs:data>/){B=0}}  >>>  B{for(i=1;i<=NF;i++){gsub(/.*\=||[/> <<< <]/,X,$i);A=A?A s1","$i:$i};gsub(/\, $/,X,A);gsub(/\,$/,X,A);print A;A=""}

=========================================================================

Alternate solution by RavinderSingh13 also gives regular expression error

Code:
==> cat scr3.ksh
awk -vs1="'" -F"' " '
                        {if($0 ~ /<rs:data>/)
                                                {B=1;getline}
                        }
                        {if($0 ~ /<\/rs:data>/)
                                                {B=0}
                        }
                        B{for(i=1;i<=NF;i++)     {
                                                 gsub(/.*\=||[/><]/,X,$i);
                                                 A=A?A s1","$i:$i};
                                                 gsub(/\, $/,X,A);
                                                 gsub(/\,$/,X,A);
                                                 print A;A=""
                        }
                    '  Hospital.xml



Code:
==> ./scr3.ksh
awk: 0602-521 There is a regular expression error.
        [] imbalance
 The source line number is 9.
 The error context is
                                                                 >>>  gsub(/.*\=||[/> <<< <]/,X,$i);



=========================================================================

3rd Solution by Akhsay Hegde does not give any error but no output


Code:
==> cat scr4.ksh
awk -F"'"  '/^<z/{s=""; for (i=2; i<=NF; i+=2) s = sprintf("%s%s\047%s\047",s,(s?OFS:""),$i ); print s }' OFS=, Hospital.xml

Code:
==> ./scr4.ksh

No output

=========================================================================

4th Solution by RudiC gives error that function can not be parsed.


Code:
==> cat scr5.ksh
sed -n "/<rs:data>/,/<\/rs:data>/{s/^.*<rs:data>//;s/<\/rs:data>.*$//;s/^<z[^=']*=//;s/ [^=']*=/,/g;s/[^']*$//;p}" Hospital.xml

Code:
==> ./scr5.ksh
sed: Function /<rs:data>/,/<\/rs:data>/{s/^.*<rs:data>//;s/<\/rs:data>.*$//;s/^<z[^=']*=//;s/ [^=']*=/,/g;s/[^']*$//;p} 
     cannot be parsed.

=========================================================================

Last edited by Scrutinizer; 12-23-2014 at 12:39 PM.. Reason: CODE tags
# 7  
Old 12-23-2014
Try:

Code:
sed -n "/rs:data/,/\\rs:data/{//d; s/[^'=]*=//g; s/''/','/g; s/[^']*$//;p;}" Hospital.xml

or:
Code:
awk '$0~s{p=1-p; next} p{print FS $2, $4, $6 FS}' FS=\' OFS="','" s=rs:data Hospital.xml


Last edited by Scrutinizer; 12-23-2014 at 12:43 PM..
This User Gave Thanks to Scrutinizer For This Post:
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. UNIX for Dummies Questions & Answers

Extracting data between continuous non empty xml tags

Hi, I need help in extracting only the phone numbers between the continuous non empty xml tags in unix. I searched through a lot of forum but i did not get exact result for my query. Please help Given below is the sample pipe delimited file. I have a lot of tags before and after... (6 Replies)
Discussion started by: zen01234
6 Replies

2. Shell Programming and Scripting

Extracting the tag name from an xml file

Hi, My requirement is something like this, I have a xml file that contains some tags and nested tags, <n:tag_name1> <n:sub_tag1>val1</n:sub_tag1> <n:sub_tag2>val2</n:sub_tag2> </n:tag_name1> <n:tag_name2> <n:sub_tag1>value</n:sub_tag1> ... (6 Replies)
Discussion started by: Little
6 Replies

3. Shell Programming and Scripting

Reading XML file and extracting value

Dear All, I am reading one XML file to extract value from the particular tag:- Sample xml is below:- <KeyValuePairs> <Key>TestString</Key> <Value>Test12_Pollings</Value> </KeyValuePairs> I want to read the value for the KEY tag and there will be multiple key tags :- awk... (4 Replies)
Discussion started by: sharsour
4 Replies

4. Shell Programming and Scripting

Extracting content from xml file

Hello All, Hope you are doing well!!!!! I have a small code in the below format in xml file: <UML:ModelElement.taggedValue> <UML:TaggedValue tag="documentation" value="This sequence&#xA;&#xA;HLD_EA_0001X&#xA;HLD_DOORS_002X"/> <UML:TaggedValue tag="documentation" value="This... (11 Replies)
Discussion started by: suvendu4urs
11 Replies

5. Shell Programming and Scripting

Need help in extracting data from xml file

Hello, This is my first post in here, so excuse me if I sound too noob here! I need to extract the path "/apps/mp/installedApps/V61/HRO/hrms_01698_A_qa.ear" from the below xml extract. The path will always appear with the key "binariesURL" <deployedObject... (6 Replies)
Discussion started by: abhishek2386
6 Replies

6. UNIX for Dummies Questions & Answers

Extracting data from an xml file

Hello, Please can someone assist. I have the following xml file: <?xml version="1.0" encoding="utf-8" ?> - <PUTTRIGGER xmlns:xsd="http://www.test.org/2001/XMLSchema" xmlns:xsi="http://www.test.org/2001/XMLSchema-instance" APPLICATIONNUMBER="0501160" ACCOUNTNAME="Mrs S Test"... (15 Replies)
Discussion started by: Dolph
15 Replies

7. UNIX for Dummies Questions & Answers

Extracting values from an XML file

Hello People, I have an xml file from which I need to extract the values of the parameters using UNIX shell commands. Ex : Input is like : <Name>Roger</Name> or <Address>MI</Address> I need the output as just : Roger or MI with the tags removed. Please help. (1 Reply)
Discussion started by: sushant172
1 Replies

8. Shell Programming and Scripting

Extracting a part of XML File

Hi Guys, I have a very large XML feed (2.7 MB) which crashes the server at the time of parsing. Now to reduce the load on the server I have a cron job running every 5 min.'s. This job will get the file from the feed host and keep it in the local machine. This does not solve the problem as... (9 Replies)
Discussion started by: shridhard
9 Replies

9. Shell Programming and Scripting

Extracting Data from xml file

Hi ppl out there... Can anyone help me with the shell script to extract data from an xml file. My xml file looks like : - <servlet> <servlet-name>FrontServlet</servlet-name> <display-name>FrontServlet</display-name> ... (3 Replies)
Discussion started by: nishana
3 Replies

10. Shell Programming and Scripting

extracting XML file using sed

Hello folks I want to extract data between certain tag in XML file using 'sed' <xml> ......... .......... <one>XXXXXXXXXXXXXXXXXXXX</one> ...... Anyone ?Thank you (7 Replies)
Discussion started by: pujansrt
7 Replies
Login or Register to Ask a Question