I have Python script to parse XML files into a more friendly format for another platform.
Every so often one of the data files contains no data - only the encoding info and no other tags, which is causing ElementTree to throw a ParseError when it finds them.
<?xml version="1.0" encoding="utf-8"?>
Is there a way of testing for the empty file before calling ElementTree?
Ta.
You should ask for forgiveness not permission here.
Handle the exception by wrapping the code in a try/except block.
import xml.etree.ElementTree as ET
...
try:
tree = ET.parse(fooxml)
except ET.ParseError:
# log error
pass
Of course have several ways, use:
try:
pass # delete this and add your parse code
except:
pass # write your exception when empty
or use if statement:
if (some code to evalue if xml is not empty):
# your code
elif (some code to check if .xml is empty):
# your code
let me know how it was!
Of course you could catch the exception that lxml throws. If you want to avoid parsing, you could check if the file contains only one < symbol:
with open("input.xml","rb") as f:
contents = f.read()
if contents.count(b"<")<=1:
# empty or only header: skip
pass
else:
x = etree.XML(contents)
of course this heuristic method doesn't protect from other parsing errors. So it's best to just protect the parsing by a try/except block.
But this method has the advantage of being extremely fast if you have lots of corrupt 1-line "header only" file.
Related
I want to read a XML string, edit it and save it as a XML file.
However I get the mentioned error in the title when I do .write()
I found out that when you read an XML string using ElementTree.fromstring(string) it will create an ElementTree.Element and not an ElementTree itself. An Element has no write method but the ElementTree does.
How can I write an Element to a XML file? Or how can I create an ElementTree and add my Element to that and then use the .write method?
I found out that when you read a xml string using ElementTree.fromstring(string) it will actually create an ElementTree.Element and not a ElementTree itself.
Yes, you get the top-level element back (also called the "document element").
An Element has no write method but the ElementTree does.
The ElementTree constructor signature goes like this:
class xml.etree.ElementTree.ElementTree(element=None, file=None)
Therefore it's completely straightforward:
import xml.etree.ElementTree as ET
doc = ET.fromstring("<test>test öäü</test>")
tree = ET.ElementTree(doc)
tree.write("test.xml", encoding="utf-8")
You always should specify the encoding when writing an XML file. Most of the time, UTF-8 is the best choice.
In case this helps anyone who gets this unclear error message when trying to use ElementTree to write an xml file, and spends way too long on it (like I did):
File "/usr/lib/python3.5/xml/etree/ElementTree.py", line 788, in _get_writer
write = file_or_filename.write
AttributeError: 'str' object has no attribute 'write'
... in my case, it was simply because the path to the directory I was trying to write my xml file to did not exist! For example:
tree.write("/FolderDidNotExist/test.xml", encoding="utf-8")
a simple mkdir /FolderDidNotExist did the trick. No more error. (Of course, this error message could use some "love" so I'm posting this here in case I forget what it means again [which I've done] and need to google this again)
Below is what I'm doing currently, just wondering if there is a better way.
with open("sample.json", "r") as fp:
json_dict = json.load(fp)
json_string = json.dumps(json_dict)
with open("sample.json") as f:
json_string = f.read()
No need to parse and unparse it.
If you need to raise an exception on invalid JSON, you can parse the string and skip the work of unparsing it:
with open("sample.json") as f:
json_string = f.read()
json.loads(json_string) # Raises an exception if the JSON is invalid.
Json file is just a regular file. You open() it and read() it. It will give you a str. If you want to make sure it contains valid JSON, put the load part of the above code in a try/except block.
I don't know if it's Pythonic or just pointless but you could also do this if validation is part of your requirements:
import json
# I'm fully aware of the missing "ẁith" or "close" in the line below
json_string = json.dumps(json.load(open('sample.json')))
Otherwise, user2357112 already said it: "No need to parse and unparse it."
You are doing it right. You probably can find some libraries with different implementations for performance or memory optimization specifics. The python standard is reliable, cover most of the cases, is assured to compatible to other platforms and is simple. It cannot get more pythonic than that.
The way you're doing it is probably fine if you just need to have it raise an Exception for invalid JSON. However, if you want to make sure that you're not changing the file at all you could try something like this:
import json
with open("sample.json") as fp:
json_string = fp.read()
json.loads(json_string)
It will still raise a ValueError if it is invalid JSON, and you know that you haven't changed the data at all. If you're wondering what may change, off the top of my head: the order of dict items, and whitespace, Not to mention if there are duplicate keys in the JSON.
I am trying to read xml behind an spss file, I would like to move from etree to objectify.
How can I convert this function below to return an objectify object? I would like to do this because objectify xml object would be easier for me (as a newbie) to work with as it is more pythonic.
def get_etree(path_file):
from lxml import etree
with open(path_file, 'r+') as f:
xml_text = f.read()
recovering_parser = etree.XMLParser(recover=True)
xml = etree.parse(StringIO(xml_text), parser=recovering_parser)
return xml
my failed attempt:
def get_etree(path_file):
from lxml import etree, objectify
with open(path_file, 'r+') as f:
xml_text = objectify.fromstring(xml)
return xml
but I get this error:
lxml.etree.XMLSyntaxError: xmlns:mdm: 'http://www.spss.com/mr/dm/metadatamodel/Arc 3/2000-02-04' is not a valid URI
The first, biggest mistake is to read a file into a string and feed that string to an XML parser.
Python will read the file as whatever your default file encoding is (unless you specify the encoding when you call read()), and that step will very likely break anything other than plain ASCII files.
XML files come in many encodings, you cannot predict them, and you really shouldn't make assumptions about them. XML files solve that problem with the XML declaration.
<?xml version="1.0" encoding="Windows-1252"?>
An XML parser will read that bit of information and configure itself correctly before reading the rest of the file. Make use of that facility. Never use open() and read() for XML files.
Luckily lxml makes it very easy:
from lxml import etree, objectify
def get_etree(path_file):
return etree.parse(path_file, parser=etree.XMLParser(recover=True))
def get_objectify(path_file):
return objectify.parse(path_file)
and
path = r"/path/to/your.xml"
xml1 = get_etree(path)
xml2 = get_objectify(path)
print xml1 # -> <lxml.etree._ElementTree object at 0x02A7B918>
print xml2 # -> <lxml.etree._ElementTree object at 0x02A7B878>
P.S.: Think hard if you really, positively must use a recovering parser. An XML file is a data structure. If it is broken (syntactically invalid, incomplete, wrongly decoded, you name it), would you really want to trust the (by definition undefined) result of an attempt to read it anyway or would you much rather reject it and display an error message?
I would do the latter. Using a recovering parser may cause nasty run-time errors later.
I have two types of txt files, one which is saved in some arbitrary format on the form
Header
key1 value1
key2 value2
and the other file formart is a simple json dump stored as
with open(filename,"w") as outfile:
json.dump(json_data,outfile)
From a dialog window, the user can load either of these two files, but my loader need to be able to distinguish between type1 and type2 and send the file to the correct load routine.
#Pseudocode
def load(filename):
if filename is json-loadable:
json_loader(filename)
else:
other_loader(filename)
The easiest way I can think of is to use a try/except block as
def load(filename):
try:
data = json.load(open(filename))
process_data(data)
except:
other_loader(filename)
but I do not like this approach since there is like a 50/50 risk of fail in the try/except block, and as far as I know try/except is slow if you fail.
So is there a simpler and more convenient way of checking if its json-format or not?
You can do something like this:
def convert(tup):
"""
Convert to python dict.
"""
try:
tup_json = json.loads(tup)
return tup_json
except ValueError, error: # includes JSONDecodeError
logger.error(error)
return None
converted = convert(<string_taht_neeeds_to_be_converted_to_json>):
if converted:
<do_your_logic>
else:
<if_string_is_not_converteble>
If the top-level data you're dumping is an object, you could check if the first character is {, or [ if it's an array. That's only valid if the header for the other format will never start with those characters. It's also not foolproof because it doesn't guarantee that your data is well formed JSON.
On the other hand your existing solution is fine, much more clear and robust.
I have an XML document that I would like to update after it already contains data.
I thought about opening the XML file in "a" (append) mode. The problem is that the new data will be written after the root closing tag.
How can I delete the last line of a file, then start writing data from that point, and then close the root tag?
Of course I could read the whole file and do some string manipulations, but I don't think that's the best idea..
Using ElementTree:
import xml.etree.ElementTree
# Open original file
et = xml.etree.ElementTree.parse('file.xml')
# Append new tag: <a x='1' y='abc'>body text</a>
new_tag = xml.etree.ElementTree.SubElement(et.getroot(), 'a')
new_tag.text = 'body text'
new_tag.attrib['x'] = '1' # must be str; cannot be an int
new_tag.attrib['y'] = 'abc'
# Write back to file
#et.write('file.xml')
et.write('file_new.xml')
note: output written to file_new.xml for you to experiment, writing back to file.xml will replace the old content.
IMPORTANT: the ElementTree library stores attributes in a dict, as such, the order in which these attributes are listed in the xml text will NOT be preserved. Instead, they will be output in alphabetical order.
(also, comments are removed. I'm finding this rather annoying)
ie: the xml input text <b y='xxx' x='2'>some body</b> will be output as <b x='2' y='xxx'>some body</b>(after alphabetising the order parameters are defined)
This means when committing the original, and changed files to a revision control system (such as SVN, CSV, ClearCase, etc), a diff between the 2 files may not look pretty.
Useful Python XML parsers:
Minidom - functional but limited
ElementTree - decent performance, more functionality
lxml - high-performance in most cases, high functionality including real xpath support
Any of those is better than trying to update the XML file as strings of text.
What that means to you:
Open your file with an XML parser of your choice, find the node you're interested in, replace the value, serialize the file back out.
The quick and easy way, which you definitely should not do (see below), is to read the whole file into a list of strings using readlines(). I write this in case the quick and easy solution is what you're looking for.
Just open the file using open(), then call the readlines() method. What you'll get is a list of all the strings in the file. Now, you can easily add strings before the last element (just add to the list one element before the last). Finally, you can write these back to the file using writelines().
An example might help:
my_file = open(filename, "r")
lines_of_file = my_file.readlines()
lines_of_file.insert(-1, "This line is added one before the last line")
my_file.writelines(lines_of_file)
The reason you shouldn't be doing this is because, unless you are doing something very quick n' dirty, you should be using an XML parser. This is a library that allows you to work with XML intelligently, using concepts like DOM, trees, and nodes. This is not only the proper way to work with XML, it is also the standard way, making your code both more portable, and easier for other programmers to understand.
Tim's answer mentioned checking out xml.dom.minidom for this purpose, which I think would be a great idea.
While I agree with Tim and Oben Sonne that you should use an XML library, there are ways to still manipulate it as a simple string object.
I likely would not try to use a single file pointer for what you are describing, and instead read the file into memory, edit it, then write it out.:
inFile = open('file.xml', 'r')
data = inFile.readlines()
inFile.close()
# some manipulation on `data`
outFile = open('file.xml', 'w')
outFile.writelines(data)
outFile.close()
For the modification, you could use tag.text from xml. Here is snippet:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
for rank in root.iter('rank'):
new_rank = int(rank.text) + 1
rank.text = str(new_rank)
tree.write('output.xml')
The rank in the code is example of tag, which depending on your XML file contents.
What you really want to do is use an XML parser and append the new elements with the API provided.
Then simply overwrite the file.
The easiest to use would probably be a DOM parser like the one below:
http://docs.python.org/library/xml.dom.minidom.html
To make this process more robust, you could consider using the SAX parser (that way you don't have to hold the whole file in memory), read & write till the end of tree and then start appending.
You should read the XML file using specific XML modules. That way you can edit the XML document in memory and rewrite your changed XML document into the file.
Here is a quick start: http://docs.python.org/library/xml.dom.minidom.html
There are a lot of other XML utilities, which one is best depends on the nature of your XML file and in which way you want to edit it.
As Edan Maor explained, the quick and dirty way to do it (for [utc-16] encoded .xml files), which you should not do for the resons Edam Maor explained, can done with the following python 2.7 code in case time constraints do not allow you to learn (propper) XML parses.
Assuming you want to:
Delete the last line in the original xml file.
Add a line
substitute a line
Close the root tag.
It worked in python 2.7 modifying an .xml file named "b.xml" located in folder "a", where "a" was located in the "working folder" of python. It outputs the new modified file as "c.xml" in folder "a", without yielding encoding errors (for me) in further use outside of python 2.7.
pattern = '<Author>'
subst = ' <Author>' + domain + '\\' + user_name + '</Author>'
line_index =0 #set line count to 0 before starting
file = io.open('a/b.xml', 'r', encoding='utf-16')
lines = file.readlines()
outFile = open('a/c.xml', 'w')
for line in lines[0:len(lines)]:
line_index =line_index +1
if line_index == len(lines):
#1. & 2. delete last line and adding another line in its place not writing it
outFile.writelines("Write extra line here" + '\n')
# 4. Close root tag:
outFile.writelines("</phonebook>") # as in:
#http://tizag.com/xmlTutorial/xmldocument.php
else:
#3. Substitue a line if it finds the following substring in a line:
pattern = '<Author>'
subst = ' <Author>' + domain + '\\' + user_name + '</Author>'
if pattern in line:
line = subst
print line
outFile.writelines(line)#just writing/copying all the lines from the original xml except for the last.