XML Fields comparison using awk script

01-29-2016

Hello All,

I have many zipped XML files (tgz format; example file name: file_rec.trx.2016-01-23.000123.exc.85sesdzd45wsds5299c8f2994f7.tgz) that look like the sample below, and I need to verify two numbers in them: RecordNumber and the sequence number inside EnrolData (only the sequence number, NOT the whole field).
For every record the two should be equal, but because of an error, in some records the record number is NOT the same as EnrolData's sequence number. I need to find out which records these are and which files they are in. Could someone please help me? I have tried the following awk script, but with no luck.

XML Format:

Code:

<XXXXXXXXXXXXX>
    <RecordNumber>12345</RecordNumber>
    <XXXXXX>XXXXXX</XXXXXX>
    <XXXXXX>XXXXXX</XXXXXX>
    <XXXXXX>XXXXXX</XXXXXX>
    <XXXXXXXXXXXXX><![CDATA[XXXXXXXXXXXXXX:XXXXXXXXXXXXX XXXX XXXXXX]]></XXXXXXXXXXXXX>
    <EnrolData><![CDATA[E0000003350000000012345Part1              XXXXXX
	XXXXXXXXXXXXXXXX                                            XXXXXXXXXXXXXXX:XXXXXXXXXXXXXXXXXXXXXXXXXXX.XXXXXXXXXXXXXXXXXXXXXXXXXXX.XXX   
	XXXXXXXXXXXXXXX  
	XXXX                                                                                                                                                      
	
XXXXXXXXXXXXXXXXX                    XXXX                                XXXXXXXXXXXXX.XXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXX.XXXXXX                                                        XXXXXXXXXXXXXX                                      
        XXXX                          XXXX                          XXXXXXXXXXXXX                XXXX                          XXXXXXXXXXXXX       
		XXXX                          XXXXXXXXXXXXX                X                            
		XXXXXXXXXXXXX                                                                                             		
		XXXXXXXXXXXXXXXXXXX.XXXXXXXXXXXXXXXXXXXX.XXX                           XXX
]]></EnrolData>
</XXXXXXXXXXXXX>

Script that I am trying:

Code:

#!/bin/sh
for file in $(ls file_rec.trx.{4}(\d)-{2}(\d)-{2}(\d).{1,}(\d).exc.*.tgz)
do
awk'
/<RecordNumber>/ {
        getline
        while ( $0 !~ /<\/RecordNumber>/ ) {
                rNumber = $1
                getline
        }
        nextline
}

/<EnrolData><\!\[CDATA\[/ {
        getline
        while ($0 !~ "\]\]><\/EnrolData>" ) {
                # Here I actually need the substring from
                # "E0000003350000000012345Part1              XXXXXX", but the
                # record number may not have a fixed number of digits, and neither
                # may the number between E and Part1. The one sure thing is that the
                # sequence number always comes right before Part1.
                eData=substr($1,19,5)
                getline
        }
        nextline
}
{
if (rNumber==eData){
# Here I need to print in the format: <filename> : <RecordNumber> - <EnrolData sequence number>
print "$file - $(rNumber) - $(eData)"
}' $file

Last edited by VasuKukkapalli; 01-29-2016 at 02:44 PM..

VasuKukkapalli

01-29-2016

So - what would the EnrolData number be? If it's NOT 3350000000012345 - what is it?

RudiC

01-29-2016

Hello Rudi, thank you for checking this for me. Here is the answer:

E00000033500000000 - A different string, which you may ignore. Unfortunately, the length of this string may change.
12345 - This is the actual sequence number that we need to compare with RecordNumber. In other words, this is the sequence number that must be equal to the record number.
Part1 - This is another string, and it is FIXED: it is the same in every record and every file.

---------- Post updated at 01:39 PM ---------- Previous update was at 01:28 PM ----------

To be clearer: in the XML below, the two numbers (12345, in the two XML tags RecordNumber and EnrolData) must be equal, but for some reason, in some records, they are not the same. Also, the string Part1 is the same for all records and all files.
So I need to find out in which files, and for which records, the two numbers differ.

Code:

<XXXXXXXXXXXXX>
    <RecordNumber>12345</RecordNumber>
    <XXXXXX>XXXXXX</XXXXXX>
    <XXXXXX>XXXXXX</XXXXXX>
    <XXXXXX>XXXXXX</XXXXXX>
    <XXXXXXXXXXXXX><![CDATA[XXXXXXXXXXXXXX:XXXXXXXXXXXXX XXXX XXXXXX]]></XXXXXXXXXXXXX>
    <EnrolData><![CDATA[E0000003350000000012345Part1              XXXXXX
	XXXXXXXXXXXXXXXX                                            XXXXXXXXXXXXXXX:XXXXXXXXXXXXXXXXXXXXXXXXXXX.XXXXXXXXXXXXXXXXXXXXXXXXXXX.XXX   
	XXXXXXXXXXXXXXX  
	XXXX                                                                                                                                                      
	
XXXXXXXXXXXXXXXXX                    XXXX                                XXXXXXXXXXXXX.XXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXX.XXXXXX                                                        XXXXXXXXXXXXXX                                      
        XXXX                          XXXX                          XXXXXXXXXXXXX                XXXX                          XXXXXXXXXXXXX       
		XXXX                          XXXXXXXXXXXXX                X                            
		XXXXXXXXXXXXX                                                                                             		
		XXXXXXXXXXXXXXXXXXX.XXXXXXXXXXXXXXXXXXXX.XXX                           XXX
]]></EnrolData>
</XXXXXXXXXXXXX>

Hope this clarifies.

VasuKukkapalli

01-30-2016

So - would 102345 and 10002345 or even 10002 be valid numbers? If so, how do we discriminate them from the above (3350...012345)?

---------- Post updated at 10:44 ---------- Previous update was at 10:43 ----------

We need something to tell where to stop ignoring digits and start counting them...
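
To illustrate the point, here is a minimal sketch using the sample line from the first post (the variable names `line` and `digits` are only for illustration). Matching every digit in front of Part1 returns the ignorable prefix and the sequence number as one run, so a regex on its own cannot tell where the sequence number starts.

```shell
#!/bin/sh
# Sample EnrolData line copied from the first post.
line='    <EnrolData><![CDATA[E0000003350000000012345Part1              XXXXXX'

# Grab the whole run of digits that sits directly before "Part1".
# RLENGTH - 5 drops the five characters of "Part1" from the match.
digits=$(printf '%s\n' "$line" |
    awk 'match($0, /[0-9]+Part1/) { print substr($0, RSTART, RLENGTH - 5) }')

# Prints the prefix digits and the sequence number glued together.
printf '%s\n' "$digits"
```

Some extra piece of information, such as the expected length of the number, is needed to cut this run in the right place.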

RudiC

02-01-2016

Hello RudiC, unfortunately that is exactly my problem: there is no rule for this. The record number's length is not fixed, but the number right before Part1 is the one that we need to compare with the record number. I am not sure how to extract it.

VasuKukkapalli

02-01-2016

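Well, assuming that using the record number's length is as good an approach as any other, try

Code:

awk '
# Reset both values at the start of each input file.
FNR == 1                                {RN = ED = 0}
# "<RecordNumber>" is 14 characters long; the number follows it.
match ($0, /<RecordNumber>[^<]*/)       {RN=substr($0, RSTART+14, RLENGTH-14); print FILENAME; print RN}
# Take as many characters before "Part1" (5 characters) as RN is long.
match ($0, /<EnrolData>.*Part1/)        {ED=substr($0, RSTART+RLENGTH-5-length(RN), length(RN)); print ED}
' file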
file
12345
12345
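
Building on the length-based idea above, the following is a rough sketch of how the check might be run over the .tgz archives themselves and report only the mismatches. It is not a tested solution for the real data. The function names `check_stream` and `scan_archives` are made up for this example. It assumes that `tar` supports `-z` and `-O` (extract to standard output) and that `<RecordNumber>` and its value sit on one line, as in the first sample. Because it takes exactly as many characters before Part1 as the record number has, it cannot notice a sequence number that only differs by extra leading digits.

```shell
#!/bin/sh
# check_stream LABEL: read XML on stdin and report every record whose
# RecordNumber differs from the characters right before "Part1".
check_stream() {
    awk -v label="$1" '
    # "<RecordNumber>" is 14 characters long; the digits follow it.
    match($0, /<RecordNumber>[0-9]+/) {
        rn = substr($0, RSTART + 14, RLENGTH - 14)
    }
    # Cut length(rn) characters that end just before "Part1".
    match($0, /<EnrolData>.*Part1/) {
        ed = substr($0, RSTART + RLENGTH - 5 - length(rn), length(rn))
        # Both values are strings, so this is a string comparison.
        if (ed != rn) print label " : " rn " - " ed
        rn = ""    # do not carry the number over to the next record
    }'
}

# scan_archives FILE...: stream each archive through check_stream
# without unpacking it to disk; arguments that are not files are skipped.
scan_archives() {
    for file in "$@"; do
        [ -f "$file" ] || continue
        tar -xzOf "$file" | check_stream "$file"
    done
}
```

It could be called as `scan_archives file_rec.trx.*.tgz`; each line of output has the form `<filename> : <RecordNumber> - <EnrolData sequence number>`, and archives without mismatches print nothing.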

RudiC
