Top Forums Programming Python: Parsing and comparing XMLs with minidom Post 302672149 by Bloomy on Monday 16th of July 2012 03:47:38 AM
Python: Parsing and comparing XMLs with minidom

Hi there!

I'd like to parse and compare 2 XML files with the minidom parser as follows:

I have 2 XML files with loads of data. One is in English (the source file), the other one the corresponding French translation (the target file).
E.g.:
source file:
Code:
<macro>
       <id> 123</id>
              <string> DOG </string>
              <string>dogs/dog/dog's</string>
              <string>Cross-language reference</string>
              <string>English dog: dogs/dog/dog's</string>
      (..........)
</macro>

target file:
Code:
<macro>
       <id> 123</id>
              <string> CHIEN </string>
              <string>chien/chiens</string>
              <string>Cross-language reference</string>
              <string></string>
      (..........)
</macro>

The French target file has an empty cross-language reference where I'd like to put in the information from the English source file whenever the 2 macros have the same ID.
I already wrote some code that replaces the string tag name with a unique tag name (cl) so that the cross-language reference can be identified. I also extract the ID and the cross-language reference from both files. Now I want to compare the 2 files and, if 2 macros have the same ID, replace the empty reference in the French file with the info from the English file.
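The ID matching described above could be sketched with minidom roughly as follows. This is only an illustration under assumptions: the sample data is made up, the `<macros>` root element is hypothetical (an XML document needs a single root), the `<cl>` tags are the renamed ones mentioned above, and the helper names `text_of` and `references_by_id` are invented for this sketch.

```python
from xml.dom.minidom import parseString

# Made-up sample data; the <macros> root element is an assumption,
# since an XML document needs exactly one root element.
SOURCE = """<macros>
<macro>
  <id> 123</id>
  <string> DOG </string>
  <string>dogs/dog/dog's</string>
  <cl>Cross-language reference</cl>
  <cl>English dog: dogs/dog/dog's</cl>
</macro>
</macros>"""

TARGET = """<macros>
<macro>
  <id> 123</id>
  <string> CHIEN </string>
  <string>chien/chiens</string>
  <cl>Cross-language reference</cl>
  <cl></cl>
</macro>
</macros>"""


def text_of(node):
    """Return the text directly inside an element ('' if it is empty)."""
    return ''.join(child.data for child in node.childNodes
                   if child.nodeType == child.TEXT_NODE)


def references_by_id(dom):
    """Map each macro's stripped <id> text to its second <cl> element."""
    result = {}
    for macro in dom.getElementsByTagName('macro'):
        ids = macro.getElementsByTagName('id')
        cls = macro.getElementsByTagName('cl')
        if ids and len(cls) > 1:
            result[text_of(ids[0]).strip()] = cls[1]
    return result


source_dom = parseString(SOURCE)
target_dom = parseString(TARGET)
source_refs = references_by_id(source_dom)

for macro_id, target_cl in references_by_id(target_dom).items():
    source_cl = source_refs.get(macro_id)
    if source_cl is not None and not text_of(target_cl).strip():
        # Fill the empty target element with the source reference text.
        target_cl.appendChild(target_dom.createTextNode(text_of(source_cl)))

# Serialized target document with the reference filled in.
result_xml = target_dom.toxml()
```

Reading the text through the child nodes rather than stripping tags from `toxml()` output also covers the empty case, because minidom serializes an element without children as `<cl/>`.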

Here is my code:

Code:
import re
#import easy to use xml parser called minidom:
from xml.dom.minidom import parseString
#all these imports are standard on most modern python implementations
 
#open the xml file for reading that contains the correct CL references:
file = open('PATH/english.xml','r+')
#open the xml file for reading that contains the missing CL references:
file_target =  open('PATH/french.xml','r+')
#convert to string:
data = file.read()
#replace xml tag with a unique name in order to identify it later on
data = re.sub(r"<string>Cross-language reference</string>(\s+)<string>(.*)</string>",r"<cl>Cross-language reference</cl>\1<cl>\2</cl>",data)
file.seek(0)
file.write(data)
#remove old data
file.truncate()
#close file because we don't need it anymore:
file.close()
#convert to string:
target = file_target.read()
#replace xml tag with a unique name in order to identify it later on
target = re.sub(r"<string>Cross-language reference</string>(\s+)<string>(.*)</string>",r"<cl>Cross-language reference</cl>\1<cl>\2</cl>",target)
file_target.seek(0)
file_target.write(target)
#remove old data
file_target.truncate()
#close file because we don't need it anymore:
file_target.close()


#extract CL-reference from source file
dom = parseString(data)
#retrieve the second <cl> tag (index 1), which holds the actual reference text:
xmlTag = dom.getElementsByTagName('cl')[1].toxml()
#strip off the tag (<tag>data</tag>  --->   data):
xmlData=xmlTag.replace('<cl>','').replace('</cl>','')
#print out the xml tag and data in this format: <tag>data</tag>
print (xmlTag)
#just print the data
print (xmlData)

#IdTag = dom.getElementsByTagName('id')[0].toxml()
#IdData = xmlData=xmlTag.replace('<tagName>','').replace('</tagName>','')

#extract CL-reference from target file
dom_2 = parseString(target)
#retrieve the second <cl> tag (index 1), which holds the (empty) reference text:
xmlTag_2 = dom_2.getElementsByTagName('cl')[1].toxml()
#strip off the tag (<tag>data</tag>  --->   data):
xmlData_2=xmlTag_2.replace('<cl>','').replace('</cl>','')
#print out the xml tag and data in this format: <tag>data</tag>
print (xmlTag_2)
#just print the data
print (xmlData_2)


#extract id from source file
xyz = parseString(data)
xmlTag_id_source = xyz.getElementsByTagName('id')[0].toxml()
xmlData_id_source = xmlTag_id_source.replace('<id>','').replace('</id>','')
print ("xmlTag_id_source: "+xmlTag_id_source)
print ("xmlData_id_source: "+xmlData_id_source)

#extract id from target file
abc = parseString(target)
xmlTag_id_target = abc.getElementsByTagName('id')[0].toxml()
xmlData_id_target = xmlTag_id_target.replace('<id>','').replace('</id>','')
print (xmlTag_id_target)
print (xmlData_id_target)


with open(file,'r')as sfile:
    with open(file_target,'w') as tfile:
        lines = sfile.readlines()
        if xmlTag_id_source==xmlTag_id_target:
         # do the replacement in the second line.
         # (remember that arrays are zero indexed)
             lines[1]=re.sub(xmlData_2,xmlData,lines[1])
             tfile.writelines(lines)

print ("DONE")

The tag replacement and the extraction of the ID and cross-language reference work, but the final part, where I try to write the replacement into the target file, returns this error:

Code:
Traceback (most recent call last):
  File "PATH\test.py", line 74, in <module>
    with open(file,'r') as sfile:
TypeError: invalid file: <_io.TextIOWrapper name='PATH\english.xml' mode='r+' encoding='cp1252'>

I didn't find any useful information on the web that helped me figure out what's wrong.
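The traceback points at `open(file,'r')`. At that point `file` is the (already closed) file object returned by the earlier `open('PATH/english.xml','r+')` call, whereas `open()` expects a path. A minimal sketch of the difference, using a temporary file created only for this demonstration (its location is an assumption, not the real PATH):

```python
import os
import tempfile

# Hypothetical stand-in for PATH/english.xml, created only for this demo.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'english.xml')
with open(path, 'w') as handle:
    handle.write('<macro><id> 123</id></macro>')

file = open(path, 'r+')   # 'file' is a file object, not a path
data = file.read()
file.close()

# Passing the file object to open() raises the TypeError from the traceback.
try:
    open(file, 'r')
    raised = False
except TypeError:
    raised = True

# Passing the path string (also available as file.name) works.
with open(file.name, 'r') as sfile:
    reread = sfile.read()
```

In the script above, that would mean passing the two path strings to `open()` again, or writing the already modified `target` string back instead of re-reading anything. Note also that opening the French file with mode `'w'` empties it immediately, so if the IDs do not match and nothing is written, the file is left empty.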
I know that my code above might look a bit messy, but I'm a beginner and things like "you should use a different parser" or "you should do it completely differently" won't really help me. I'd like to know how I can find a solution by using and adjusting my code above :-).

I am grateful for any suggestions! Thanks in advance and kind regards!
 
