Reading block by block in XML


 
# 1  
Old 07-25-2012

Hi,

Can you please help me with the requirement below?
The file contains only one big line. I need to parse it block by block (by a particular tag, 'Val' in the case below) to extract different parameters.

Example:-
Portion of the Input string:-

<?xml version="1.1" encoding="UTF-8"?> <Data><Val Ti="1342750845538" Du="0" De="blackberry8520_ver1RIM" Db="encyclopedia" Pdb="" Uq="0" Dq="0" qry="sdsds?q=dsds" ab="dsds" Dc="4" Te=" Ca="xxx" Sc="320.240" Us="" Cd="X"</Val><Val Ti="1342750845538" Du="0" De="blackberry8520_ver1RIM" Db="home" Pdb="" Uq="0" Dq="0" qry="sdsds?q=dsds&dsdsds=dsds&ss?" ab="dsds" Dc="4" Te=" Ca="xxx" Sc="320.240" Us="" Cd="X"</Val> ..../>

Output:-
If the value of the Db parameter in a <Val> block/tag is not empty, I need to show both the Db value and the corresponding qry value.

This should be the output for the input above:

encyclopedia -> sdsds?q=dsds
home -> sdsds?q=dsds&dsdsds=dsds&ss?
....
....
Thanks in advance.

KM
# 2  
Old 07-25-2012
Hi,

You should check out xsltproc; it's built to solve that.
# 3  
Old 07-25-2012
You could write a simple awk programme to extract the bits you need.

Code:
awk '
    /^Val.* Db="[^"]+"/ {
        gsub( "^Val ", "" );
        gsub( "=\"", "<" );
        gsub( "\" *", ">" );
        la = split( $0, a, ">" );
        for( i = 1; i <= la; i++ )
        {
            split( a[i], b, "<" );
            h[b[1]] = b[2];
        }

        printf( "%s -> %s\n", h["Db"], h["qry"] );
        delete h;
    }' RS="[<>]"   input-file >output-file

It makes a few assumptions about your input (and assumes you have GNU awk, since RS is a regular expression here) which might be wrong, but it works on the small sample you posted and thus might work across all of your input.
This User Gave Thanks to agama For This Post:
# 4  
Old 07-26-2012
Thanks a lot, agama. It's working as I expected.
Could you please explain the code? I am a newbie to Linux, so it would be very helpful if you could explain it.

Thanks again.

KM
# 5  
Old 07-27-2012
First, the very last line sets the record separator variable (RS) to be either the greater-than or less-than symbol. That splits all of the input file into records based on either of those rather than a newline. An important thing to note is that awk removes those symbols from the input as it uses them to split the input into records.
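To see the record splitting in isolation, here is a minimal sketch. The sample string is made up for illustration (it is not the poster's data), and it assumes an awk that accepts a regular expression as RS, such as gawk or mawk.

```shell
# Illustrative one-line input (hypothetical, not taken from the thread).
sample='<Data><Val Db="home" qry="a?b=c"</Val>'

# Print every record awk sees when RS is the regex [<>].
# Assumption: the awk in use accepts a regex RS (gawk and mawk do).
records=$(printf '%s' "$sample" | awk '{ printf "[%s]\n", $0 }' RS='[<>]')

printf '%s\n' "$records"
```

Each bracketed line is one record. The < and > characters themselves are gone, and empty brackets mark the blank records that fall between adjacent separators.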

Awk processes records, and the programme is applied to each record. For more details about awk and the general syntax of an awk programme, it is best to have a peek at this:
Awk - A Tutorial and Introduction - by Bruce Barnett

Comments in-line below should explain things more...


Code:
awk '
    /^Val.* Db="[^"]+"/ {   # run this block for every record that starts with "Val" and has a non-empty Db attribute (the space before Db keeps Pdb from matching)
        gsub( "^Val ", "" );  # replace the Val and trailing space with nothing
        gsub( "=\"", "<" );   # replace all =" with a less-than symbol
        gsub( "\" *", ">" );   # replace each quote, plus any spaces that follow it, with a greater-than sym
        la = split( $0, a, ">" );  # split the record into array a based on greater-than sym
        for( i = 1; i <= la; i++ ) # for each token in a (something like Db<foo)
        {
            split( a[i], b, "<" );   # split it into two components (name and value)
            h[b[1]] = b[2];       # save the pair in a hash keyed on the name
        }

        printf( "%s -> %s\n", h["Db"], h["qry"] );  # print the two values that are interesting
        delete h;   # reset the hash
    }' RS="[<>]"



So, for the first bits of your input (<?xml version="1.1" encoding="UTF-8"?> <Data>), awk sees several records:

Code:
?xml version="1.1" encoding="UTF-8"?
 
Data

(Notice that the blanks between greater and less than symbols end up being blank records; not important, but interesting.) None of these records match our desired record, and they are discarded.

The first record that matches looks initially like:
Code:
Val Ti="1342750845538" Du="0" De="blackberry8520_ver1RIM" Db="encyclopedia" Pdb="" Uq="0" Dq="0" qry="sdsds?q=dsds" ab="dsds" Dc="4" Te="" Ca="xxx" Sc="320.240" Us="" Cd="X"



After substitutions it becomes:
Code:
Ti<1342750845538>Du<0>De<blackberry8520_ver1RIM>Db<encyclopedia>Pdb<>Uq<0>Dq<0>qry<sdsds?q=dsds>ab<dsds>Dc<4>Te<>Ca<xxx>Sc<320.240>Us<>Cd<X>



The split into 'a' using the greater than symbol as the separator yields these tokens in the array:
Code:
a[1]= Ti<1342750845538
a[2]= Du<0
a[3]= De<blackberry8520_ver1RIM
a[4]= Db<encyclopedia
a[5]= Pdb<
a[6]= Uq<0
a[7]= Dq<0
a[8]= qry<sdsds?q=dsds
a[9]= ab<dsds
a[10]= Dc<4
a[11]= Te<
a[12]= Ca<xxx
a[13]= Sc<320.240
a[14]= Us<
a[15]= Cd<X

While your sample data didn't contain any spaces between the double quotes (e.g. Db="foo bar") the bracketing and splitting would have preserved them.
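A quick way to check the claim about embedded spaces is to run just the three substitutions on a made-up record. The attribute values here are hypothetical; only the gsub calls come from the programme above.

```shell
# Hypothetical record with spaces inside the quoted values.
pairs=$(printf '%s\n' 'Val Db="foo bar" qry="x?y=1 2"' | awk '
{
    gsub( "^Val ", "" );   # drop the tag name
    gsub( "=\"", "<" );    # =" becomes <
    gsub( "\" *", ">" );   # closing quote and any following spaces become >
    print;
}')

printf '%s\n' "$pairs"   # prints: Db<foo bar>qry<x?y=1 2>
```

Only the spaces that directly follow a closing quote are swallowed; spaces inside a value survive and end up in the hash unchanged.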

The tokens in the array 'a' can then be split, and placed into the hash 'h'. So a[8] is split into 'qry' and 'sdsds?q=dsds' and then can be referenced by name (e.g. h["qry"]).

Hope this helps you understand a bit more.

I also noticed this odd bit in your sample data: Te=" Ca="xxx". I'm not an XML expert, but this looks like invalid syntax, so I treated it as Te="".
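For the curious, this sketch feeds the unbalanced quote through the same substitutions and split to show what ends up in the hash. The input is a shortened, made-up record that keeps the odd Te=" Ca="xxx" fragment, the selector is simplified to /^Val /, and it again assumes an awk that accepts a regex RS.

```shell
# Shortened, hypothetical record that keeps the unbalanced Te=" fragment.
out=$(printf '%s' '<Val Db="home" qry="a?b=c" Te=" Ca="xxx" Cd="X"</Val>' | awk '
    /^Val / {
        gsub( "^Val ", "" );
        gsub( "=\"", "<" );
        gsub( "\" *", ">" );
        la = split( $0, a, ">" );
        for( i = 1; i <= la; i++ )
        {
            split( a[i], b, "<" );
            h[b[1]] = b[2];      # for the token "Te< Ca<xxx" this stores " Ca" under Te
        }
        printf( "Db=[%s] qry=[%s] Te=[%s] Ca=[%s]\n", h["Db"], h["qry"], h["Te"], h["Ca"] );
        delete h;
    }' RS='[<>]')

printf '%s\n' "$out"   # prints: Db=[home] qry=[a?b=c] Te=[ Ca] Ca=[]
```

So the stray quote garbles Te and loses Ca, but Db and qry still come through intact, which is why the output you wanted is unaffected.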
This User Gave Thanks to agama For This Post: