Parsing XML


 
Thread Tools Search this Thread
Top Forums Shell Programming and Scripting Parsing XML
# 1  
Old 09-19-2011
Parsing XML

Learned People,

Hello !

Till today, for the most part, all of the tricky questions/situations that I encountered were already posted by other folks and all I had to do was peruse through these one at a time and I could find some sort of an answer and all I had to do was add some minor tweaks and it would fly.

Today, I ran into this situation where in Im parsing XML data that is stored in flat files. I found a perl based answer on this forums that sounds like I can use it but there is no perl installation on the machine & what ever I can do, I tried it and it is not working. Also, this is not a well-formed xml and most of it is one continuos line of 2million+ characters and even when the file size runs into a gig or 1.2g, it only has 20,000 lines.

Here is an example below -

Sample input file ------
HTML Code:
$head -1 ROAMERCDRFILE

MDN>9168761121</MDN>\n<HomeSId>NA</HomeSId>\n<ESN>NA</ESN>\n</HOMESID_Data>\n<HOMESID_Data>\n<roamer>CDMA BC6</roamer>\n<Subroamer>Mod</Subroamer>\n<Subsubroamer>Port A BC6 CH 25 Mid2 Power Phase Error Peak</Subsubroamer>\n<ServSid>CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Phase Error Peak</ServSid>\n<SPCS_Priority>1</SPCS_Priority>\n<Value>0</Value>\n<MSN>Degree</MSN>\n<HomeSId>NA</HomeSId>\n<ESN>NA</ESN>\n</HOMESID_Data>\n<HOMESID_Data>\n<roamer>CDMA BC6</roamer>\n<Subroamer>Mod</Subroamer>\n<Subsubroamer>Port A BC6 CH 25 Mid2 Power Phase Error RMS</Subsubroamer>\n<ServSid>CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Phase Error RMS</ServSid>\n<SPCS_Priority>1</SPCS_Priority>\n<Value>1.87</Value>\n<MSN>Degree</MSN>\n<HomeSId>NA</HomeSId>\n<ESN>NA</ESN>\n</HOMESID_Data>\n<HOMESID_Data>\n<roamer>CDMA BC6</roamer>\n<Subroamer>Mod</Subroamer>\n<Subsubroamer>Port A BC6 CH 25 Mid2 Power Freq Error</Subsubroamer>\n<ServSid>CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Freq Error</ServSid>\n<SPCS_Priority>1</SPCS_Priority>\n<Value>4.47</Value>\n<MSN>Hz</MSN>\n<HomeSId>150</HomeSId>\n<ESN>-150</ESN>\n</HOMESID_Data>\n<HOMESID_Data>\n<roamer>CDMA BC6</roamer>\n<Subroamer>Mod</Subroamer>\n<Subsubroamer>Port A BC6 CH 25 Mid2 Power Transmit Time Error</Subsubroamer>\n<ServSid>CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Transmit Time Error</ServSid>\n<SPCS_Priority>1</SPCS_Priority>\n<Value>0</Value>\n<MSN>us</MSN>\n<HomeSId>NA</HomeSId>\n<ESN>NA</ESN>\n</HOMESID_Data>\n<HOMESID_Data>\n<roamer>CDMA BC6</roamer>\n<Subroamer>Power</Subroamer>\n<Subsubroamer>Port A BC6 CH 25 Mid1 Power</Subsubroamer>\n<ServSid>CDMA BC6 Power Port A BC6 CH 25 Mid1Power</ServSid>\n<SPCS_Priority>1</SPCS_Priority>\n<Value>-10.35</Value>\n<MSN>dBm</MSN>\n<HomeSId>-7</HomeSId>\n<ESN>-13</ESN>\n</HOMESID_Data>\n<HOMESID_Data>\n<roamer>CDMA BC6</roamer>\n<Subroamer>Mod</Subroamer>\n<Subsubroamer>Port A BC6 CH 25 Mid1 Power Rho</Subsubroamer>\n<ServSid>CDMA BC6 Mod Port A BC6 CH 25 Mid1 Power Rho</ServSid>\n<SPCS_Priority>1</SPCS_Priority>\n<Value>99.86</Value>\n<MSN>N/A</MSN>\n<HomeSId>100</HomeSId>\n<ESN>98</ESN>\n
What Im looking for is -
HTML Code:
9168761121
	NA
	Mod
	CDMA BC6
	0
	NA
            ....
            .....
            ...
This is what I stitched together from copying stuff around ....

HTML Code:
awk 'BEGIN{FS="<|>"}
{print ESN, SubRoamer, Roamer, Value
ESN=""
SubRoamer =""          
Roamer =""
Value=""
} 
{for(i=1;i<=NF;i++) {if($i=="ESN"){ESN=$(i+1);continue}}}
{for(i=1;i<=NF;i++) {if($i=="SubRoamer"){SubRoamer =$(i+1); continue}}}
{for(i=1;i<=NF;i++) {if($i=="Roamer"){Roamer =$(i+1); continue}}}
{for(i=1;i<=NF;i++) {if($i=="Value"){Value =$(i+1); continue}}}
END {print ESN, SubRoamer, Roamer, Value}' ROAMERCDRFILE
& it print the following ...
HTML Code:
98   99.86
which is wrong.

Also, I pieced together a very laborious looking thing as below -

HTML Code:
cat ROAMERCDRFILE  | tr '>' '\n' | egrep 'ESN|Subroamer|roamer|Value' | sed 's/ESN//g' | sed 's/Subroamer//g' | sed 's/roamer//g' | sed 's/Value//g' | sed 's/<//g' |sed 's/>//g' | sed 's/\///g' | sed 's/\\//g'
& it is showing records as below -


HTML Code:
n
NA
n
CDMA BC6
n
Mod
n
Port A BC6 CH 25 Mid2 Power Phase Error Peak
n
0
n
NA
n
CDMA BC6
n
Mod
n
Port A BC6 CH 25 Mid2 Power Phase Error RMS
n
1.87
n
NA
n
But then, that is taking lots and lots of time and so far, I havent seen result pouring out not even once.

Any and all help would be such a huge relief.

Please help !

regards,
Manohar
# 2  
Old 09-19-2011
Is this gonna be work for you?

Code:
echo '.....' | awk 'BEGIN {FS=">";RS="<"}{$0=$2;sub(/\\/,"");print}'


O/P:
Code:
9168761121
n
NA
n
NA
n
n
n
CDMA BC6
n
Mod
n
Port A BC6 CH 25 Mid2 Power Phase Error Peak
n
CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Phase Error Peak
n
1
n
0
n
Degree
n
NA
n
NA
n
n
n
CDMA BC6
n
Mod
n
Port A BC6 CH 25 Mid2 Power Phase Error RMS
n
CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Phase Error RMS
n
1
n
1.87
n
Degree
n
NA
n
NA
n
n
n
CDMA BC6
n
Mod
n
Port A BC6 CH 25 Mid2 Power Freq Error
n
CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Freq Error
n
1
n
4.47
n
Hz
n
150
n
-150
n
n
n
CDMA BC6
n
Mod
n
Port A BC6 CH 25 Mid2 Power Transmit Time Error
n
CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Transmit Time Error
n
1
n
0
n
us
n
NA
n
NA
n
n
n
CDMA BC6
n
Power
n
Port A BC6 CH 25 Mid1 Power
n
CDMA BC6 Power Port A BC6 CH 25 Mid1Power
n
1
n
-10.35
n
dBm
n
-7
n
-13
n
n
n
CDMA BC6
n
Mod
n
Port A BC6 CH 25 Mid1 Power Rho
n
CDMA BC6 Mod Port A BC6 CH 25 Mid1 Power Rho
n
1
n
99.86
n
N/A
n
100
n
98
n

# 3  
Old 09-19-2011
anchal_khare,

Thanks a ton for the reply.

Yes, it is doing it but I had to obfuscate some of the underlying stuff in here for obvious reasons of confidentiality and hence its not fetching the records but it is showing some of the records in here that Im looking for.

So, can you please please explain to me as to what is going on in here, as I understand the field separators and the substitution component a little bit of it but why are you doing this - $0=$2 and what are we gaining out of this ? The reason as to why I ask you this is SO that I can totally tailor my solution instead of posting in here...

awk 'BEGIN {FS=">";RS="<"}{$0=$2;sub(/\\/,"");print}'

Thanks again !

regards,
Manohar.
# 4  
Old 09-19-2011
RS means record separator (each line).
Setting it to "<" was actually means the input line is being splitted after each "<" character. So that you will get values like ..

Code:
Subroamer>Power
/Subroamer>\n
Subsubroamer>Port A BC6 CH 25 Mid1 Power
ServSid>CDMA BC6 Power Port A BC6 CH 25 Mid1Power

As you know, We are interested in the later part ( after the ">" character), so FS (field separator ">" is used and in this way fetching second field ($2).

$0=$2, is replacing everything in the line with $2 so that processing could be simpler.
You could skip that part. but you need to do something like this '..... {sub(/\\/,"",$2);print $2}'.
# 5  
Old 09-19-2011
anchal_khare,

I got it the moment I posted reply to my question. That is so nice that you are helping me out here. Thank you and thank you.
I have a bigger question now. So, now this would just strip out the head & tail as well & display the data in its elemental nature...equivalent of just showing up the data all line by line.

So, how do I determine what header tag is related to what data element in here ? I cant just start at data element x & go about as these are optional fields in the db and as such, the cell site might not even post that value field in there. Is there any other way ? Im sorry to throw back another question right at you like this but I just dont know where else to go or what else should I do in here.

with DUE THANKS,

regards,
Anil
# 6  
Old 09-19-2011
I worked on the output you have shown in your requirement (#1).
You didn't mention that you are interested in element names.

About element names, as you mentioned that its not a well formed XML string, therefore I am not sure about the correctness of any solution until you tell us some specific criteria.


With the given input , you can do something like..

Code:
echo '......' | awk 'BEGIN {FS=">";RS="<"} !/^\//{printf "%-20s%-20s\n", $1,$2}'


MDN                 9168761121          
HomeSId             NA                  
ESN                 NA                  
HOMESID_Data        \n                  
roamer              CDMA BC6            
Subroamer           Mod                 
Subsubroamer        Port A BC6 CH 25 Mid2 Power Phase Error Peak
ServSid             CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Phase Error Peak
SPCS_Priority       1                   
Value               0                   
MSN                 Degree              
HomeSId             NA                  
ESN                 NA                  
HOMESID_Data        \n                  
roamer              CDMA BC6            
Subroamer           Mod                 
Subsubroamer        Port A BC6 CH 25 Mid2 Power Phase Error RMS
ServSid             CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Phase Error RMS
SPCS_Priority       1                   
Value               1.87                
MSN                 Degree              
HomeSId             NA                  
ESN                 NA                  
HOMESID_Data        \n                  
roamer              CDMA BC6            
Subroamer           Mod                 
Subsubroamer        Port A BC6 CH 25 Mid2 Power Freq Error
ServSid             CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Freq Error
SPCS_Priority       1                   
Value               4.47                
MSN                 Hz                  
HomeSId             150                 
ESN                 -150                
HOMESID_Data        \n                  
roamer              CDMA BC6            
Subroamer           Mod                 
Subsubroamer        Port A BC6 CH 25 Mid2 Power Transmit Time Error
ServSid             CDMA BC6 Mod Port A BC6 CH 25 Mid2 Power Transmit Time Error
SPCS_Priority       1                   
Value               0                   
MSN                 us                  
HomeSId             NA                  
ESN                 NA                  
HOMESID_Data        \n                  
roamer              CDMA BC6            
Subroamer           Power               
Subsubroamer        Port A BC6 CH 25 Mid1 Power
ServSid             CDMA BC6 Power Port A BC6 CH 25 Mid1Power
SPCS_Priority       1                   
Value               -10.35              
MSN                 dBm                 
HomeSId             -7                  
ESN                 -13                 
HOMESID_Data        \n                  
roamer              CDMA BC6            
Subroamer           Mod                 
Subsubroamer        Port A BC6 CH 25 Mid1 Power Rho
ServSid             CDMA BC6 Mod Port A BC6 CH 25 Mid1 Power Rho
SPCS_Priority       1                   
Value               99.86               
MSN                 N/A                 
HomeSId             100                 
ESN                 98

Hope it helps.
Login or Register to Ask a Question

Previous Thread | Next Thread

10 More Discussions You Might Find Interesting

1. Shell Programming and Scripting

XML parsing

I have an xml file where the format looks like below <SESSIONCOMPONENT REFOBJECTNAME ="pre_session_command" REUSABLE ="NO" TYPE ="Pre-session command"> <TASK DESCRIPTION ="" NAME ="pre_session_command" REUSABLE ="NO" TYPE ="Command" VERSIONNUMBER ="1"> ... (8 Replies)
Discussion started by: r_t_1601
8 Replies

2. Shell Programming and Scripting

XML Parsing :

HI I want to parse below file in to two output :- Input :- ?xml version="1.0" encoding="UTF-8"?> <bulkCmConfigDataFile xmlns:un="utranNrm.xsd" <configData dnPrefix="Undefined"> <xn:SubNetwork id="ONRM_ROOT_MO_R"> <xn:MeContext id="C136"> ... (3 Replies)
Discussion started by: asavaliya
3 Replies

3. Shell Programming and Scripting

XML parsing

i have xml output in below format... <AlertsResponse> <Alert id="11216" name="fgdfg"> <AlertActionLog timestamp="1356521629778" user="admin" detail="Recovery Alert"/> </Alert> <Alert id="11215" name="gdfg <AlertActionLog timestamp="1356430119840" user=""... (12 Replies)
Discussion started by: vivek d r
12 Replies

4. Shell Programming and Scripting

XML: parsing of the Google contacts XML file

I am trying to parse the XML Google contact file using tools like xmllint and I even dived into the XSL Style Sheets using xsltproc but I get nowhere. I can not supply any sample file as it contains private data but you can download your own contacts using this script: #!/bin/sh # imports... (9 Replies)
Discussion started by: ripat
9 Replies

5. Shell Programming and Scripting

Parsing XML

I am trying to parse an xml file and trying to grab certain values and inserting them into database table. I have the following xml that I am parsing: <dd:service name="locator" link="false"> <dd:activation mode="manual" /> <dd:run mode="direct_persistent" proxified="false" managed="true"... (7 Replies)
Discussion started by: $criptKid617
7 Replies

6. UNIX for Advanced & Expert Users

XML Parsing

I had a big XML and from which I have to make a layout as below *TOTAL+CB | *CB+FX | CS |*IR | *TOTAL | -------------------------------------------------------------------------------------------------- |CB FX | | | | DMFXNY EMSGFX... (6 Replies)
Discussion started by: manas_ranjan
6 Replies

7. Shell Programming and Scripting

XML parsing

I have a xml file attached. I need to parse parameterId and its value My output should be like 151515 38 151522 32769 and so on.. Please help me. Its urgent (6 Replies)
Discussion started by: LavanyaP
6 Replies

8. Shell Programming and Scripting

XML Parsing

Hi, Need a script to parse the following xml file content <tag1 Name="val1"> <abc Name="key"/> <abc Name="pass">*********</abc> </tag1> <tag2 Name="Core"> <Host Name="a.b.c"> <tag1 Name="abc"> <abc Name="ac">None</abc> ... (4 Replies)
Discussion started by: Mavericc
4 Replies

9. Shell Programming and Scripting

parsing xml

I want to use wget comment to parse an xml parse that exist in an online website. How can I connect it using shell script through Unix and how can I parse it?? (1 Reply)
Discussion started by: walnut
1 Replies

10. Programming

XML parsing

Hi I want to take an XML file and transform it into a pipe-delimited format. What is the best tool to use for this? I have libxml2 which seems to be the best xml parser around. The xml file will have the following format. <Txn> <Date>120504</Date> <id>99</id> <Items> <Item>... (1 Reply)
Discussion started by: handak9
1 Replies
Login or Register to Ask a Question