Reading block by block in XML

07-25-2012

Hi,

Can you please help me with the requirement below?
There is only one big line in the file. I need to parse it block by block (a particular tag, 'Val' in the case below) and pull different parameters out of each block.

Example:-
Portion of the Input string:-
<?xml version="1.1" encoding="UTF-8"?> <Data><Val Ti="1342750845538" Du="0" De="blackberry8520_ver1RIM" Db="encyclopedia" Pdb="" Uq="0" Dq="0" qry="sdsds?q=dsds" ab="dsds" Dc="4" Te=" Ca="xxx" Sc="320.240" Us="" Cd="X"</Val><Val Ti="1342750845538" Du="0" De="blackberry8520_ver1RIM" Db="home" Pdb="" Uq="0" Dq="0" qry="sdsds?q=dsds&dsdsds=dsds&ss?" ab="dsds" Dc="4" Te=" Ca="xxx" Sc="320.240" Us="" Cd="X"</Val> ..../>

Output:-
If the value of the Db parameter in a <Val> block/tag is not null (not empty), I need to show both Db and the corresponding qry parameter value.

This should be the output for the sample above:-

encyclopedia -> sdsds?q=dsds
home -> sdsds?q=dsds&dsdsds=dsds&ss?
....
....
Thanks in advance.

KM

kmajumder

07-25-2012

Hi,

You should check out xsltproc; it's built for this kind of job.

Chirel

07-25-2012

You could write a simple awk programme to extract the bits you need.

Code:

awk '
    /^Val.*Db="[^"]+"/ {
        gsub( "^Val ", "" );
        gsub( "=\"", "<" );
        gsub( "\" *", ">" );
        la = split( $0, a, ">" );
        for( i = 1; i <= la; i++ )
        {
            split( a[i], b, "<" );
            h[b[1]] = b[2];
        }

        printf( "%s -> %s\n", h["Db"], h["qry"] );
        delete h;
    }' RS="[<>]"   input-file >output-file

It makes a few assumptions about your input (and that you have GNU awk, since RS is a regular expression) which might be wrong, but it works on the small sample you posted and thus might work across all of your input.
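If GNU awk is not available, one possible workaround is to let tr turn every less-than and greater-than symbol into a newline and then run the same logic with awk's default record separator. This is only a sketch: the function name extract_db_qry is made up for illustration, and it assumes the input really is a single line with no newlines inside a tag.

```shell
# Hypothetical helper (not from the original post): same extraction without a regex RS.
# Assumption: the file has no newlines inside a tag.
extract_db_qry() {
    tr '<>' '\n\n' < "$1" | awk '
        /^Val.*Db="[^"]+"/ {
            sub( "^Val ", "" );        # drop the leading tag name
            gsub( "=\"", "<" );        # =" becomes <
            gsub( "\" *", ">" );       # closing quote plus any spaces becomes >
            n = split( $0, a, ">" );   # one name<value token per element
            for( k in h ) delete h[k]; # portable way to empty the hash
            for( i = 1; i <= n; i++ )
            {
                split( a[i], b, "<" );
                h[b[1]] = b[2];
            }
            printf( "%s -> %s\n", h["Db"], h["qry"] );
        }'
}
```

It would be called as extract_db_qry input-file >output-file.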

agama

07-26-2012

Thanks a lot, agama. It's working as I expected.
Could you please explain the code? I am a newbie to Linux, so it would be very helpful if you could walk me through it.

Thanks again.

KM

kmajumder

07-27-2012

First, the very last line sets the record separator variable (RS) to be either the greater-than or less-than symbol. That splits all of the input file into records based on either of those rather than a newline. An important thing to note is that awk removes those symbols from the input as it uses them to split the input into records.
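To see the record splitting on its own, something like the following can be tried. It is a small sketch, not part of the original solution: show_records is a made-up helper name, the input string is a shortened stand-in for the real data, and it needs an awk that accepts a regular expression in RS (gawk and mawk do).

```shell
# Hypothetical helper: print every record awk sees, wrapped in [ ],
# when RS is set to "either < or >" after the programme text.
show_records() {
    printf '%s' "$1" | awk '{ printf "[%s]\n", $0 }' RS='[<>]'
}

show_records '<Data><Val Db="home" qry="a?b=c"</Val>'
```

The brackets make the empty records visible, and none of the printed records contains a < or > any more.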

Awk processes records, and the programme is applied to each record. For more details about awk and the general syntax of an awk programme, it is best to have a peek at this:
Awk - A Tutorial and Introduction - by Bruce Barnett

Comments in-line below should explain things more...

Code:

awk '
    /^Val.*Db="[^"]+"/ {   # execute this block of code for all records that start with "Val" and also contain a Db field that is not empty
        gsub( "^Val ", "" );  # replace the Val and trailing space with nothing
        gsub( "=\"", "<" );   # replace all =" with a less-than symbol
        gsub( "\" *", ">" );   # replace every quote, plus any spaces that follow it, with a greater-than sym
        la = split( $0, a, ">" );  # split the record into array a based on greater-than sym
        for( i = 1; i <= la; i++ ) # for each token in a (something like Db<foo) 
        {
            split( a[i], b, "<" );   # split it into two components (name and value) 
            h[b[1]] = b[2];       # save the pair in a hash keyed on the name
        }

        printf( "%s -> %s\n", h["Db"], h["qry"] );  # print the two values that are interesting
        delete h;   # reset the hash
    }' RS="[<>]"

So, for the first bits of your input (<?xml version="1.1" encoding="UTF-8"?> <Data>), awk treats it as several records:

Code:

?xml version="1.1" encoding="UTF-8"?
 
Data

(Notice that the blanks between greater and less than symbols end up being blank records; not important, but interesting.) None of these records match our desired record, and they are discarded.

The first record that matches looks initially like:

Code:

Val Ti="1342750845538" Du="0" De="blackberry8520_ver1RIM" Db="encyclopedia" Pdb="" Uq="0" Dq="0"    qry="sdsds?q=dsds" ab="dsds" Dc="4" Te="" Ca="xxx" Sc="320.240" Us="" Cd="X"

After substitutions it becomes:

Code:

Ti<1342750845538>Du<0>De<blackberry8520_ver1RIM>Db<encyclopedia>Pdb<>Uq<0>Dq<0>qry<sdsds?q=dsds>ab<dsds>Dc<4>Te<>Ca<xxx>Sc<320.240>Us<>Cd<X>
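The three substitutions can be checked in isolation on a shorter record. This is just an illustrative sketch; rewrite_record is a made-up name and the record is a cut-down version of the sample.

```shell
# Hypothetical helper: apply only the three substitutions to one record.
rewrite_record() {
    printf '%s\n' "$1" | awk '{
        gsub( "^Val ", "" );   # drop the leading tag name
        gsub( "=\"", "<" );    # =" becomes <
        gsub( "\" *", ">" );   # closing quote plus any spaces becomes >
        print;
    }'
}

rewrite_record 'Val Db="home" Pdb="" qry="a?b=c"'
# prints: Db<home>Pdb<>qry<a?b=c>
```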

The split into 'a' using the greater than symbol as the separator yields these tokens in the array:

Code:

a[1]= Ti<1342750845538
a[2]= Du<0
a[3]= De<blackberry8520_ver1RIM
a[4]= Db<encyclopedia
a[5]= Pdb<
a[6]= Uq<0
a[7]= Dq<0
a[8]= qry<sdsds?q=dsds
a[9]= ab<dsds
a[10]= Dc<4
a[11]= Te<
a[12]= Ca<xxx
a[13]= Sc<320.240
a[14]= Us<
a[15]= Cd<X

While your sample data didn't contain any spaces inside the quoted values (e.g. Db="foo bar"), the bracketing and splitting would have preserved them.

The tokens in the array 'a' can then be split, and placed into the hash 'h'. So a[8] is split into 'qry' and 'sdsds?q=dsds' and then can be referenced by name (e.g. h["qry"]).
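The two-level split and the lookup by name can also be tried by themselves. Again only a sketch: lookup_attr is a made-up helper, and it expects a record that has already been through the substitutions.

```shell
# Hypothetical helper: usage is lookup_attr 'Name<value>Name<value>' Name
lookup_attr() {
    printf '%s\n' "$1" | awk -v want="$2" '{
        n = split( $0, a, ">" );       # tokens such as qry<a?b=c
        for( i = 1; i <= n; i++ )
        {
            split( a[i], b, "<" );     # b[1] is the name, b[2] the value
            h[b[1]] = b[2];            # hash keyed on the name
        }
        print h[want];
    }'
}

lookup_attr 'Db<home>Pdb<>qry<a?b=c>' qry
# prints: a?b=c
```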

Hope this helps you understand a bit more.

I also noticed this odd bit in your sample data: Te=" Ca="xxx". I'm not an XML expert, but this looks like invalid syntax. I treated it as Te="".

agama
