I'm trying to write a plugin that reads data from an XML file.
Inside test.xml there is:
<data>
<items>
<item test1="Arabic Words"></item>
<item test2="English Words"></item>
</items>
</data>
And the code is:
# coding: utf-8
from xml.dom import minidom
xmldoc = minidom.parse('test.xml')
itemlist = xmldoc.getElementsByTagName('item')
test1 = itemlist[0].attributes['test1'].value
test2 = itemlist[1].attributes['test2'].value
print(test1)
print(test2)
But I have a problem with the encoding: I can't set it to UTF-8.
How can I make minidom interpret files with UTF-8 encoding?
Typically, valid XML begins with an XML pseudotag (the XML declaration) that contains the encoding:
<?xml version="1.0" encoding="UTF-8"?>
...
minidom should respect that; if your file has such a tag but isn't interpreted as UTF-8, you should file a bug against minidom; but I'd generally expect that your files simply don't contain this line.
You can use
minidom.parseString("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" + open("file.xml","r").read())
to work around that (but I recommend fixing your XML files).
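As a minimal sketch of the expected behavior (Python 3 assumed; the temporary file and the Arabic sample text are made up for illustration), a file that starts with the declaration and is saved as UTF-8 can be passed to minidom.parse() by name:

```python
import os
import tempfile
from xml.dom import minidom

# Hypothetical sample data; the Arabic word is written with escapes.
arabic = "\u0639\u0631\u0628\u064a"
xml_text = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<data><items>"
    '<item test1="' + arabic + '"></item>'
    '<item test2="English Words"></item>'
    "</items></data>"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "test.xml")
    # Write the bytes exactly as declared in the XML declaration.
    with open(path, "wb") as f:
        f.write(xml_text.encode("utf-8"))
    # Passing a file name lets the parser read the raw bytes itself.
    xmldoc = minidom.parse(path)

items = xmldoc.getElementsByTagName("item")
test1 = items[0].attributes["test1"].value
test2 = items[1].attributes["test2"].value
```

Both values come back as ordinary str objects, so no manual decoding is needed afterwards.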
Either use the encode/decode functions or import the codecs module.
Example:
x = 'abcd'
y = x.encode('utf-8')
y.decode('utf-8')
Just use encoding/decoding and use minidom to parse a string instead of passing a file name.
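A sketch of that suggestion in Python 3 terms: the raw bytes are decoded first and the resulting string is handed to minidom.parseString(). The sample document, and the assumption that the bytes really are UTF-8, are illustrative only:

```python
from xml.dom import minidom

# Stand-in for bytes read from a file opened in binary mode (assumed UTF-8).
raw = '<items><item name="fum\u00e8"/></items>'.encode("utf-8")

# Decode explicitly, then parse the string instead of passing a file name.
text = raw.decode("utf-8")
doc = minidom.parseString(text)

name = doc.getElementsByTagName("item")[0].getAttribute("name")
```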
Related
I am using python to do some conditional changes to an XML document. The incoming document has <?xml version="1.0" ?> at the top.
I'm using xml.etree.ElementTree.
How I'm serializing the changed XML:
filter_update_body = ET.tostring(root, encoding="utf8", method="xml")
The output has this at the top:
<?xml version='1.0' encoding='utf8'?>
The client wants the encoding attribute removed, but if I drop the encoding argument, the output either omits the declaration line entirely or contains encoding='us-ascii'.
Can this be done so the output matches: <?xml version="1.0" ?>?
(I don't know why it matters honestly but that's what I was told needed to happen)
As pointed out in this answer, there is no way to make ElementTree omit the encoding attribute. However, as @James suggested in a comment, it can be stripped from the resulting output like this:
filter_update_body = ET.tostring(root, encoding="utf8", method="xml")
filter_update_body = filter_update_body.replace(b"encoding='utf8'", b"", 1)
The b prefixes are required because ET.tostring() will return a bytes object if encoding != "unicode". In turn, we need to call bytes.replace().
With encoding = "unicode" (note that this is the literal string "unicode"), it will return a regular str. In this case, the bs can be omitted. We use good old str.replace().
It's worth noting that the choice between bytes and str also affects how the XML will eventually be written to a file. A bytes object should be written in binary mode, a str in text mode.
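The bytes and str variants can be compared side by side in a small sketch. The element names are hypothetical, and the output keeps ElementTree's single quotes, so it is close to but not character-for-character the declaration the client asked for:

```python
import xml.etree.ElementTree as ET

# Hypothetical document, just to have something to serialize.
root = ET.Element("filter")
ET.SubElement(root, "name").text = "fum\u00e8"

# bytes variant: the declaration is emitted, then the attribute is stripped.
as_bytes = ET.tostring(root, encoding="utf8", method="xml")
as_bytes = as_bytes.replace(b"encoding='utf8'", b"", 1)

# str variant: encoding="unicode" returns str and emits no declaration.
as_text = ET.tostring(root, encoding="unicode", method="xml")
```

When saving, as_bytes would go to a file opened with "wb", while as_text would go to a file opened in text mode with an explicit encoding.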
I have this char in an xml file:
<data>
<products>
<color>fumè</color>
</products>
</data>
I try to generate an instance of ElementTree with the following code:
string_data = open('file.xml')
x = ElementTree.fromstring(unicode(string_data.encode('utf-8')))
and I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 185: ordinal not in range(128)
(NOTE: The position is not exact, I sampled the xml from a larger one).
How can I solve it? Thanks.
If you ran into this problem while using Requests (HTTP for Humans): response.text decodes the response by default. Use response.content instead to get the undecoded bytes, so that ElementTree can decode them itself. Just remember to use the correct encoding.
More info: http://docs.python-requests.org/en/latest/user/quickstart/#response-content
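To illustrate why the undecoded bytes are the better input, here is a sketch without any network access. The bytes literal is a stand-in for what response.content would hold, and the ISO-8859-1 declaration is an assumption chosen to make the point visible:

```python
import xml.etree.ElementTree as ET

# Stand-in for response.content: raw bytes, declared as ISO-8859-1.
content = b"<?xml version='1.0' encoding='ISO-8859-1'?><color>fum\xe8</color>"

# ElementTree reads the declaration and decodes the bytes on its own.
root = ET.fromstring(content)
color = root.text
```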
You need to decode utf-8 strings into a unicode object. So
string_data.encode('utf-8')
should be
string_data.decode('utf-8')
assuming string_data is actually a UTF-8 string.
So to summarize: to get a UTF-8 string from a unicode object, you encode the unicode object (using the UTF-8 encoding), and to turn a string into a unicode object, you decode the string using the respective encoding.
For more details on the concepts I suggest reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (not Python specific).
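In Python 3 vocabulary (an assumption here, since the answer above uses Python 2 terms) the unicode object is str and the encoded string is bytes. A short sketch of both directions, including what happens when the wrong codec is used:

```python
text = "fum\u00e8"            # str: a sequence of Unicode code points

data = text.encode("utf-8")   # str -> bytes
back = data.decode("utf-8")   # bytes -> str

# Decoding with a codec that cannot represent the bytes fails loudly.
try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
```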
You do not need to decode XML for ElementTree to work. XML carries its own encoding information (defaulting to UTF-8) and ElementTree does the work for you, outputting unicode:
>>> data = '''\
... <data>
... <products>
... <color>fumè</color>
... </products>
... </data>
... '''
>>> x = ElementTree.fromstring(data)
>>> x[0][0].text
u'fum\xe8'
If your data is contained in a file (or file-like) object, just pass the filename or file object directly to the ElementTree.parse() function:
x = ElementTree.parse('file.xml')
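A Python 3 counterpart of the session above might look like the following sketch. It assumes the data is UTF-8 without a declaration, and it uses io.BytesIO as a stand-in for a file opened in binary mode:

```python
import io
import xml.etree.ElementTree as ET

# UTF-8 bytes with no XML declaration; UTF-8 is the default in that case.
raw = "<data><products><color>fum\u00e8</color></products></data>".encode("utf-8")

# From bytes held in memory:
x = ET.fromstring(raw)
from_bytes = x[0][0].text

# From a binary file-like object (stand-in for open('file.xml', 'rb')):
tree = ET.parse(io.BytesIO(raw))
from_file = tree.getroot()[0][0].text
```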
Have you tried using the parse function instead of opening the file yourself? (Opening it would, by the way, also require a .read() before .fromstring() could work.)
import xml.etree.ElementTree as ET
tree = ET.parse('file.xml')
root = tree.getroot()
# etc...
Most likely your file is not UTF-8. The è character could come from some other encoding, Latin-1 for example.
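A sketch of how that situation shows up in Python 3, assuming the file is Latin-1 and has no XML declaration: the parser falls back to UTF-8 and rejects the byte, while decoding with the actual encoding first works.

```python
import xml.etree.ElementTree as ET

# Latin-1 bytes without a declaration: 0xE8 is "è" in Latin-1,
# but on its own it is not a valid UTF-8 sequence.
raw = b"<color>fum\xe8</color>"

try:
    ET.fromstring(raw)
    parsed_as_utf8 = True
except ET.ParseError:
    parsed_as_utf8 = False

# Decoding with the real encoding first gives the parser clean text.
color = ET.fromstring(raw.decode("latin-1")).text
```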
The open() function does not return a string; it returns a file object.
Use open('file.xml').read() instead.
I have a Python script that parses XML from a website, which means I can't change the original XML. It looks like this:
<?xml version='1.0' encoding='UTF-8'?>
<list>
<orderDate>09/06/2017</orderDate>
<orderObject>RC CAR</orderObject>
<orderName>2ª versione</orderName>
<orderShipped>true</orderShipped>
</list>
I run into a problem when the server answers with XML like the above: when "orderName" contains a number with an ordinal indicator, in this case "ª", parsing fails with:
xml.parsers.expat.ExpatError: not well-formed (invalid token):
On the Python side I'm using minidom as the parser, with this code:
xmldoc = minidom.parse(order_data)
I should add that when the XML does not contain an ordinal indicator, everything works perfectly. Thanks to whoever can help.
I want to parse a file with minidom:
with codecs.open(fname, encoding="utf-8") as xml:
    dom = parse(xml)
This raises a UnicodeEncodeError. The XML file is encoded as UTF-8 without a BOM and has
<?xml version="1.0" encoding="utf-8"?>
in the first line.
If I first read the file, call .encode("utf-8") on it and pass the result to parseString, it works. Is there a way to parse a UTF-8 XML file directly with minidom.parse?
Leave the decoding to the XML parser; it'll detect what codec to use. Open the file without converting to unicode:
with open(fname, "rb") as xml:  # binary mode, so the parser sees the raw bytes
    dom = parse(xml)
Note the use of the standard function open() instead of codecs.open().
This applies to any XML parser; it is the job of the parser to determine from the XML prologue what codec to use for parsing the document. If no prologue is present then UTF-8 is the default.
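Both cases can be checked with a small sketch. io.BytesIO stands in for a file opened in binary mode, and the sample element is made up for illustration:

```python
import io
from xml.dom.minidom import parse

sample = "2\u00aa versione"

# Case 1: the prologue declares ISO-8859-1 and the bytes match it.
declared = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><name>' + sample + "</name>"
).encode("latin-1")
from_declared = parse(io.BytesIO(declared)).documentElement.firstChild.data

# Case 2: no prologue at all, so the parser assumes UTF-8.
bare = ("<name>" + sample + "</name>").encode("utf-8")
from_bare = parse(io.BytesIO(bare)).documentElement.firstChild.data
```

In both cases the text node comes back as the same str, because the parser picked the codec from the document rather than from the way the file was opened.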
A big set of XML files have the wrong encoding defined. It should be utf-8 but the content has latin-1 characters all over the place. What's the best way to parse this content?
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
Edit: this is happening with Adobe InDesign IDML files. It seems the "Content" text is Latin-1 while the rest could be UTF-8. I'm leaning toward parsing normally as UTF-8, then re-encoding the Unicode text chunks in Content as UTF-8 and re-parsing them as Latin-1. What a mess.
ಠ_ಠ
You can override the encoding specified in the XML when you parse it:
class xml.etree.ElementTree.XMLParser(html=0, target=None, encoding=None)
Element structure builder for XML source data, based on the expat parser. html are predefined HTML entities. This flag is not supported by the current implementation. target is the target object. If omitted, the builder uses an instance of the standard TreeBuilder class. encoding is optional. If given, the value overrides the encoding specified in the XML file.
docs
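A sketch of the override in current Python 3, where the html argument no longer exists and the remaining parameters are keyword-only. The sample bytes are an assumption: a document that declares UTF-8 but actually contains a Latin-1 byte.

```python
import xml.etree.ElementTree as ET

# Declares UTF-8, but 0xE8 is a Latin-1 "è" and not valid UTF-8.
raw = b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><Content>fum\xe8</Content>'

# Trusting the declaration fails.
try:
    ET.fromstring(raw)
    parsed_as_declared = True
except ET.ParseError:
    parsed_as_declared = False

# Overriding the declared encoding lets the parser read the bytes as Latin-1.
parser = ET.XMLParser(encoding="ISO-8859-1")
root = ET.fromstring(raw, parser=parser)
content = root.text
```

Note that the override applies to the whole document, so it only helps when the entire file is in the overriding encoding.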
Don't try to deal with encoding problems during parse, but pre-process the offending file(s).
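One possible reading of "pre-process" is sketched below. The helper name to_utf8 is hypothetical, and the fallback to Latin-1 is an assumption about the files in question; it will not repair a file that mixes both encodings.

```python
import xml.etree.ElementTree as ET

def to_utf8(raw: bytes) -> bytes:
    """Return raw unchanged if it is valid UTF-8, else re-encode from Latin-1."""
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Latin-1.
        return raw.decode("latin-1").encode("utf-8")

# Declares UTF-8 but contains a Latin-1 byte.
broken = b'<?xml version="1.0" encoding="UTF-8"?><Content>fum\xe8</Content>'

fixed = to_utf8(broken)
content = ET.fromstring(fixed).text
```

After this step the bytes match the UTF-8 declaration, so the normal parse needs no special handling.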