XML Fields comparison using awk script


 
# 1  
Old 01-29-2016
Linux XML Fields comparison using awk script

Hello All,

I have many zipped XML files (tgz format; example file name: file_rec.trx.2016-01-23.000123.exc.85sesdzd45wsds5299c8f2994f7.tgz) that look like the sample below. I need to verify two numbers against each other: RecordNumber and EnrolData (only the sequence number, NOT the whole value).
For every record the two should be equal, but due to an error the record number in some records is NOT the same as the EnrolData sequence number. I need to find out which records are affected and in which files. Could someone please help me? I have tried the following awk script, but no luck.

XML Format:
Code:
<XXXXXXXXXXXXX>
    <RecordNumber>12345</RecordNumber>
    <XXXXXX>XXXXXX</XXXXXX>
    <XXXXXX>XXXXXX</XXXXXX>
    <XXXXXX>XXXXXX</XXXXXX>
    <XXXXXXXXXXXXX><![CDATA[XXXXXXXXXXXXXX:XXXXXXXXXXXXX XXXX XXXXXX]]></XXXXXXXXXXXXX>
    <EnrolData><![CDATA[E0000003350000000012345Part1              XXXXXX
	XXXXXXXXXXXXXXXX                                            XXXXXXXXXXXXXXX:XXXXXXXXXXXXXXXXXXXXXXXXXXX.XXXXXXXXXXXXXXXXXXXXXXXXXXX.XXX   
	XXXXXXXXXXXXXXX  
	XXXX                                                                                                                                                      
	
XXXXXXXXXXXXXXXXX                    XXXX                                XXXXXXXXXXXXX.XXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXX.XXXXXX                                                        XXXXXXXXXXXXXX                                      
        XXXX                          XXXX                          XXXXXXXXXXXXX                XXXX                          XXXXXXXXXXXXX       
		XXXX                          XXXXXXXXXXXXX                X                            
		XXXXXXXXXXXXX                                                                                             		
		XXXXXXXXXXXXXXXXXXX.XXXXXXXXXXXXXXXXXXXX.XXX                           XXX
]]></EnrolData>
</XXXXXXXXXXXXX>

Script that I am trying:
Code:
#!/bin/sh
# Look for records whose RecordNumber differs from the EnrolData sequence number.
for file in file_rec.trx.*.exc.*.tgz
do
    # Stream the archive contents to awk; the file name is passed in with -v.
    tar -xzOf "$file" | awk -v file="$file" '
    /<RecordNumber>/ {
        # The value sits between the tags on the same line.
        rNumber = $0
        gsub(/.*<RecordNumber>|<\/RecordNumber>.*/, "", rNumber)
        next
    }

    /<EnrolData><!\[CDATA\[/ {
        # Drop everything up to and including "CDATA[" from the first field.
        data = $1
        sub(/.*CDATA\[/, "", data)
        # Here I actually need the substring from "E0000003350000000012345Part1              XXXXXX",
        # but the record number may not have a fixed number of digits, and the number
        # between E and Part1 may not have a fixed number of digits either.
        # The one sure thing is that the sequence number always sits right before Part1.
        eData = substr(data, 19, 5)
        if (rNumber != eData) {
            # Wanted format: <filename> : <RecordNumber> - <EnrolData sequence number>
            print file " : " rNumber " - " eData
        }
    }'
done


Last edited by VasuKukkapalli; 01-29-2016 at 02:44 PM..
# 2  
Old 01-29-2016
So, what is the EnrolData sequence number? If it's NOT 3350000000012345, what is it?
# 3  
Old 01-29-2016
Hello Rudi, thank you for checking this for me. Here is the answer:

E00000033500000000 - A different string that you may ignore. Unfortunately, its length may change.
12345 - This is the actual sequence number that we need to compare with RecordNumber. In other words, this is the sequence number that must be equal to the record number.
Part1 - This is another string; it is FIXED for all records in all files.

---------- Post updated at 01:39 PM ---------- Previous update was at 01:28 PM ----------

To be clearer: in the XML below, the two numbers (12345 in the two XML tags RecordNumber and EnrolData) must be equal, but for some reason they differ in some records. The string Part1 is the same for all records in all files.
So I need to find out in which files, and for which records, the two numbers differ.

Code:
<XXXXXXXXXXXXX>
    <RecordNumber>12345</RecordNumber>
    <XXXXXX>XXXXXX</XXXXXX>
    <XXXXXX>XXXXXX</XXXXXX>
    <XXXXXX>XXXXXX</XXXXXX>
    <XXXXXXXXXXXXX><![CDATA[XXXXXXXXXXXXXX:XXXXXXXXXXXXX XXXX XXXXXX]]></XXXXXXXXXXXXX>
    <EnrolData><![CDATA[E0000003350000000012345Part1              XXXXXX
	XXXXXXXXXXXXXXXX                                            XXXXXXXXXXXXXXX:XXXXXXXXXXXXXXXXXXXXXXXXXXX.XXXXXXXXXXXXXXXXXXXXXXXXXXX.XXX   
	XXXXXXXXXXXXXXX  
	XXXX                                                                                                                                                      
	
XXXXXXXXXXXXXXXXX                    XXXX                                XXXXXXXXXXXXX.XXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXX.XXXXXX                                                        XXXXXXXXXXXXXX                                      
        XXXX                          XXXX                          XXXXXXXXXXXXX                XXXX                          XXXXXXXXXXXXX       
		XXXX                          XXXXXXXXXXXXX                X                            
		XXXXXXXXXXXXX                                                                                             		
		XXXXXXXXXXXXXXXXXXX.XXXXXXXXXXXXXXXXXXXX.XXX                           XXX
]]></EnrolData>
</XXXXXXXXXXXXX>

Hope this clarifies.
# 4  
Old 01-30-2016
So, would 102345 and 10002345 or even 10002 be valid numbers? If so, how do we tell them apart from the digits in front of them (3350...012345)?

---------- Post updated at 10:44 ---------- Previous update was at 10:43 ----------

We need something to tell where to stop ignoring digits and start counting them...
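
To illustrate the ambiguity with the sample record: a plain regular expression can only return the whole run of digits in front of Part1, not the sequence number on its own. A minimal sketch, where `digits_before_part1` is a hypothetical helper name and the input line is the sample from post #1:

```shell
# Hypothetical helper: print the full run of digits that directly precedes "Part1".
digits_before_part1() {
    # match() sets RSTART/RLENGTH; drop the trailing 5 characters ("Part1").
    awk 'match($0, /[0-9]+Part1/) { print substr($0, RSTART, RLENGTH - 5) }'
}

# Sample EnrolData line from the thread.
printf '%s\n' '<EnrolData><![CDATA[E0000003350000000012345Part1   XXXXXX' | digits_before_part1
```

For the sample this prints all 22 digits after the E, so some extra rule (such as a known length) is still needed to cut out the sequence number.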
# 5  
Old 02-01-2016
Hello RudiC, unfortunately that is exactly my problem: there is no rule for this. The record number length is not fixed, but the number right before Part1 is the number that we need to compare with the record number. I am not sure how to extract it.
# 6  
Old 02-01-2016
Well, assuming that the sequence number has as many digits as the record number is as good an approach as any other. Try
Code:
awk '
FNR == 1                                {RN = ED = 0}
match ($0, /<RecordNumber>[^<]*/)       {RN=substr($0, RSTART+14, RLENGTH-14); print FILENAME; print RN}
match ($0, /<EnrolData>.*Part1/)        {ED=substr($0, RSTART+RLENGTH-5-length(RN), length(RN)); print ED}
' file
file
12345
12345
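
Building on the same length assumption, the comparison and the requested output format (<filename> : <RecordNumber> - <EnrolData sequence number>) could be sketched as below. This is only a sketch: `report_mismatches` is a hypothetical helper name, the glob is guessed from the example file name in post #1, RecordNumber is assumed to sit on one line, and `tar -xzOf` assumes a tar that can write archive members to stdout (GNU tar and bsdtar do).

```shell
# Hypothetical helper: read XML on stdin and print one line per mismatching record.
# $1 is the label printed in front of each mismatch (for example the archive name).
report_mismatches() {
    awk -v label="$1" '
    match($0, /<RecordNumber>[^<]*/) {
        # "<RecordNumber>" is 14 characters long.
        RN = substr($0, RSTART + 14, RLENGTH - 14)
    }
    match($0, /<EnrolData>.*Part1/) {
        # Assumption: take as many characters before "Part1" as RecordNumber has digits.
        ED = substr($0, RSTART + RLENGTH - 5 - length(RN), length(RN))
        if (ED != RN) print label " : " RN " - " ED
    }'
}

# Assumed usage: stream every archive to the helper without unpacking to disk.
for f in file_rec.trx.*.exc.*.tgz
do
    [ -e "$f" ] || continue      # skip the literal pattern when nothing matches
    tar -xzOf "$f" | report_mismatches "$f"
done
```

Records whose two numbers agree produce no output, so an empty result means no mismatch was found under this assumption. A sequence number with more or fewer digits than the record number would need a different rule.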
