Python XML ElementTree not reading node with & - python

I have an XML, with one of the nodes having '&' within a string:
<uid>JAMES&001</uid>
now, when I try to read the whole xml using the following code:
tree = et.parse(fileName)
root = tree.getroot()
ids = root.findall("uid")
I get the error on the link of the above-mentioned node:
xml.etree.ElelmentTree.ParseError: not well-formed (invalid token): line17, column 21
The code works fine on other instances where there is no '&'. I guess it's breaking the string.
Can it be fixed with encoding? How? I searched through other questions but couldn't find an answer.
TIA

You need to sanitize your xml first since it isn't well formed.
You need to replace the offending & - something like .replace("&", "&")
One way to use it:
with open(fileName, 'r+') as f:
read_data = f.read()
doc = ET.fromstring(read_data.replace("&", "&"))
print(doc.find('./uid').text)
Output, given your sample, should be
JAMES&001

Related

Problems isolating xml from a string

In my python script, I am struggling with xml files. I am using urllib to download xml files and convert them to a string. Next, Id like to parse the xml-file.
Sample link of a typical file
import urllib
data = urllib.request.urlopen(link).read()
data = str(data)
data2 = data.replace('\n', '')
I wanted to strip data of \n, but data2 is not stripped of \n characters, sample output looks like this for data2:
SwapInvolved>\n </transactionCoding>\n <transactionTimeliness>\n <value></value>\n
Why?
Also, since the file I pull is xml, I would like to go though ElementTree to parse it but I get an error.
e = xml.etree.ElementTree.parse(data).getroot()
OSError: [Errno 36] File name too long:
In the end, I want the xml from the link and parse it. I am doing it wrong though.
Your first problem is that you need to escape the '\n' in string.replace() because your string contains literal sequences of \ and n. Your code is looking for linefeeds but your data contains string representations of linefeeds.
Do this instead: data2 = data.replace(r"\n","")
Your second problem is that xml.etree.ElementTree.parse() is expecting a filename, not a string. Use xml.etree.ElementTree.fromstring() instead.

Problems extracting XML code from Word with Python

I am attempting to extract the XML code from a Word document with Python. Here's the code I tried:
def getXml(docxFilename):
zip = zipfile.ZipFile(open(docxFilename,"rb"))
xmlString= str(zip.read("word/document.xml"))
return xmlString
I created a test document and ran the function getXML on it. Here's the result:
b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\r\n<w:document xmlns:ve="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml"><w:body><w:p w:rsidR="00971B91" w:rsidRPr="00971B91" w:rsidRDefault="00B52719"><w:pPr><w:rPr><w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman" w:cs="Times New Roman"/><w:sz w:val="24"/><w:szCs w:val="24"/></w:rPr></w:pPr><w:r><w:t>Test</w:t></w:r></w:p><w:sectPr w:rsidR="00971B91" w:rsidRPr="00971B91" w:rsidSect="009C4305"><w:pgSz w:w="12240" w:h="15840"/><w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:document>'
There are some obvious issues. One is that the XML code begins with an "b' " and ends with an apostrophe. Second, there is a "\r\n" right after the first set of angle brackets.
My ultimate goal is to modify the XML code to create a new Word document -- see this question -- but the anomalies with the extracted XML are preventing me from doing this.
Does anyone know why the extracted XML has these strange features and how I can remove them?
EDIT: I tried using the lxml module to parse this code but I only got different errors.
I created a new function getXmlTree:
from lxml import etree
def getXmlTree(xmlString):
return etree.fromstring(xmlString)
I then ran the code etree.tostring(getXmlTree(getXml("test.docx")),pretty_print=True) and received much more sensible XML code.
The problems arise when I tried to create a new Word document. I created the following function to convert XML code into a Word document (shamelessly stolen from here):
import zipfile
from lxml import etree
import os
import tempfile
import shutil
def createNewDocx(originalDocx,xmlContent,newFilename):
tmpDir = tempfile.mkdtemp()
zip = zipfile.ZipFile(open(originalDocx,"rb"))
zip.extractall(tmpDir)
with open(os.path.join(tmpDir,"word/document.xml"),"w") as f:
xmlString = etree.tostring(xmlContent,pretty_print=True)
f.write(xmlString)
filenames = zip.namelist()
zipCopyFilename = newFilename
with zipfile.ZipFile(zipCopyFilename,"w") as docx:
for filename in filenames:
docx.write(os.path.join(tmpDir,filename),filename)
shutil.rmtree(tmpDir)
Before trying to create a new Word document, I wanted to see if I could create a copy of my original test document by substituting xmlContent = getXmlTree(getXml("test.docx")) as an argument in the above function. When I ran the code, however, I received an error message:
f.write(xmlString)
TypeError: must be str, not bytes
Using f.write(str(xmlString)) instead didn't help; it created a new word document, but Word would crash if I tried to open it.
EDIT2: tried running the above code with f.write(xmlString.decode("utf-8")) instead, but it didn't help; Word still crashed.
My guess is that the XML is not being encoded properly. First, write the document file as binary using "wb" as the mode. Second, tell etree.tostring() what the encoding is and to include the XML declaration.
with open(os.path.join(tmpDir, "word/document.xml"), "wb") as f:
xmlBytes = etree.tostring(xmlContent, encoding="UTF-8", xml_declaration=True, pretty_print=True)
f.write(xmlBytes)

creating xml documents with whitespace with xml.etree.cElementTree

I'm working on a project to store various bits of text in xml files, but because people besides me are going to look at it and use it, it has to be properly indented and such. I looked at a question on how to generate xml files using cElement Tree here, and the guy says something about putting in info about making things pretty if people ask, but there isn't anything there (I guess because no one asked). So basically, is there a way to properly indent and whitespace using cElementTree, or should i just throw up my hands and go learn how to use lxml.
You can use minidom to prettify our xml string:
from xml.etree import ElementTree as ET
from xml.dom import minidom
# Return a pretty-printed XML string for the Element.
def prettify(xmlStr):
INDENT = " "
rough_string = ET.tostring(xmlStr, 'utf-8')
reparsed = minidom.parseString(rough_string)
return reparsed.toprettyxml(indent=INDENT)
# name of root tag
root = ET.Element("root")
child = ET.SubElement(root, 'child')
child.text = 'This is text of child'
prettified_xmlStr = prettify(root)
output_file = open("Output.xml", "w")
output_file.write(prettified_xmlStr)
output_file.close()
print("Done!")
Answering myself here:
Not with ElementTree. The best option would be to download and install the module for lxml, then simply enable the option
prettyprint = True
when generating new XML files.

lxml adds urlencoding in xml?

I'll preface this by indicating I'm using Python 2.7.3 (x64) on Windows 7, with lxml 2.3.6.
I have a little, odd, problem I'm hoping somebody can help with. I haven't find a solution online, perhaps I'm not searching for the right thing.
Anyway, I have a problem where I'm programmatically building some XML with lxml, then outputting this to a text file, the problem is lxml is converting carriage returns to the text 
, almost like urlencoding - but I'm not using HTML I'm using XML.
For example, I have a simple text file created in Notepad, like this:
This
is
my
text
I then build some xml and add this text into the xml:
from lxml import etree
textstr = ""
fh = open("mytext.txt", "rb")
for line in fh:
textstr += line
root = etree.Element("root")
a = etree.SubElement(root, "some_element")
a.text = textstr
print etree.tostring(root)
The problem here is the output of the print looks like this:
<root><some_element>This
is
my
text</some_element></root>
For my purposes the line breaks are fine, but the 
 elements are not.
What I have been able to figure out is that this is happening because I'm opening the text file in binary mode "rb" (which I actually need to do as my app is indexing a large text file). If I don't open the file in binary mode "r", then the output does not contain 
 (but of course, then my indexing doesn't work).
I've also tried changing the etree.tostring to:
print etree.tostring(root, method="xml")
However there is no difference in the output.
Now, I CAN dump the xml text to a string then do a replace of the $#13; artifacts, however, I was hoping for a more elegant solution - because the text files I parse are not under my control and I'm worried that other elements of the text file might be converted to url style encoding without my knowledge.
Does anyone know a way of preventing this encoding from happening?
Windows uses \r\n to represent a line ending, Unix uses \n.
This will remove the \r at the end of the line, if there is one there (so the code will work with unix text files too.) It will remove at most one \r, so if there is an \r somewhere else in the line it will be preserved.
import re
textstr = ""
with open("mytext.txt", "rb") as fh:
for line in fh:
textstr += re.sub(r'\r$', '', line)
print(repr(textstr))

How to update/modify an XML file in python?

I have an XML document that I would like to update after it already contains data.
I thought about opening the XML file in "a" (append) mode. The problem is that the new data will be written after the root closing tag.
How can I delete the last line of a file, then start writing data from that point, and then close the root tag?
Of course I could read the whole file and do some string manipulations, but I don't think that's the best idea..
Using ElementTree:
import xml.etree.ElementTree
# Open original file
et = xml.etree.ElementTree.parse('file.xml')
# Append new tag: <a x='1' y='abc'>body text</a>
new_tag = xml.etree.ElementTree.SubElement(et.getroot(), 'a')
new_tag.text = 'body text'
new_tag.attrib['x'] = '1' # must be str; cannot be an int
new_tag.attrib['y'] = 'abc'
# Write back to file
#et.write('file.xml')
et.write('file_new.xml')
note: output written to file_new.xml for you to experiment, writing back to file.xml will replace the old content.
IMPORTANT: the ElementTree library stores attributes in a dict, as such, the order in which these attributes are listed in the xml text will NOT be preserved. Instead, they will be output in alphabetical order.
(also, comments are removed. I'm finding this rather annoying)
ie: the xml input text <b y='xxx' x='2'>some body</b> will be output as <b x='2' y='xxx'>some body</b>(after alphabetising the order parameters are defined)
This means when committing the original, and changed files to a revision control system (such as SVN, CSV, ClearCase, etc), a diff between the 2 files may not look pretty.
Useful Python XML parsers:
Minidom - functional but limited
ElementTree - decent performance, more functionality
lxml - high-performance in most cases, high functionality including real xpath support
Any of those is better than trying to update the XML file as strings of text.
What that means to you:
Open your file with an XML parser of your choice, find the node you're interested in, replace the value, serialize the file back out.
The quick and easy way, which you definitely should not do (see below), is to read the whole file into a list of strings using readlines(). I write this in case the quick and easy solution is what you're looking for.
Just open the file using open(), then call the readlines() method. What you'll get is a list of all the strings in the file. Now, you can easily add strings before the last element (just add to the list one element before the last). Finally, you can write these back to the file using writelines().
An example might help:
my_file = open(filename, "r")
lines_of_file = my_file.readlines()
lines_of_file.insert(-1, "This line is added one before the last line")
my_file.writelines(lines_of_file)
The reason you shouldn't be doing this is because, unless you are doing something very quick n' dirty, you should be using an XML parser. This is a library that allows you to work with XML intelligently, using concepts like DOM, trees, and nodes. This is not only the proper way to work with XML, it is also the standard way, making your code both more portable, and easier for other programmers to understand.
Tim's answer mentioned checking out xml.dom.minidom for this purpose, which I think would be a great idea.
While I agree with Tim and Oben Sonne that you should use an XML library, there are ways to still manipulate it as a simple string object.
I likely would not try to use a single file pointer for what you are describing, and instead read the file into memory, edit it, then write it out.:
inFile = open('file.xml', 'r')
data = inFile.readlines()
inFile.close()
# some manipulation on `data`
outFile = open('file.xml', 'w')
outFile.writelines(data)
outFile.close()
For the modification, you could use tag.text from xml. Here is snippet:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
for rank in root.iter('rank'):
new_rank = int(rank.text) + 1
rank.text = str(new_rank)
tree.write('output.xml')
The rank in the code is example of tag, which depending on your XML file contents.
What you really want to do is use an XML parser and append the new elements with the API provided.
Then simply overwrite the file.
The easiest to use would probably be a DOM parser like the one below:
http://docs.python.org/library/xml.dom.minidom.html
To make this process more robust, you could consider using the SAX parser (that way you don't have to hold the whole file in memory), read & write till the end of tree and then start appending.
You should read the XML file using specific XML modules. That way you can edit the XML document in memory and rewrite your changed XML document into the file.
Here is a quick start: http://docs.python.org/library/xml.dom.minidom.html
There are a lot of other XML utilities, which one is best depends on the nature of your XML file and in which way you want to edit it.
As Edan Maor explained, the quick and dirty way to do it (for [utc-16] encoded .xml files), which you should not do for the resons Edam Maor explained, can done with the following python 2.7 code in case time constraints do not allow you to learn (propper) XML parses.
Assuming you want to:
Delete the last line in the original xml file.
Add a line
substitute a line
Close the root tag.
It worked in python 2.7 modifying an .xml file named "b.xml" located in folder "a", where "a" was located in the "working folder" of python. It outputs the new modified file as "c.xml" in folder "a", without yielding encoding errors (for me) in further use outside of python 2.7.
pattern = '<Author>'
subst = ' <Author>' + domain + '\\' + user_name + '</Author>'
line_index =0 #set line count to 0 before starting
file = io.open('a/b.xml', 'r', encoding='utf-16')
lines = file.readlines()
outFile = open('a/c.xml', 'w')
for line in lines[0:len(lines)]:
line_index =line_index +1
if line_index == len(lines):
#1. & 2. delete last line and adding another line in its place not writing it
outFile.writelines("Write extra line here" + '\n')
# 4. Close root tag:
outFile.writelines("</phonebook>") # as in:
#http://tizag.com/xmlTutorial/xmldocument.php
else:
#3. Substitue a line if it finds the following substring in a line:
pattern = '<Author>'
subst = ' <Author>' + domain + '\\' + user_name + '</Author>'
if pattern in line:
line = subst
print line
outFile.writelines(line)#just writing/copying all the lines from the original xml except for the last.

Categories

Resources