Modify XML with Python - python

I have an XML file which I want to modify and save to futher use. I only want to change text of some elements, but when I do, it also deletes all the line breaks in element's attributes:
<smth.xml
attr1="name1"
attr2="name2"
smth="21315423"
Debug="false" >
becomes
<smth.xml attr1="name1" attr2="name2" smth="21315423" Debug="false">
I am currently using lxml lib with parser = etree.XMLParser(encoding='utf-8')

but when I do, it also deletes all the line breaks in element's attributes
That's normal. Line-breaks between attributes in the XML source code are not part of the XML data model.
You are allowed to type them when you edit an XML file by hand, but XML parsers are not required by the spec to pay any attention to them, because they carry no meaning. So practically no XML parser does (I don't know a single XML parser that does).
I can only recommend one thing: Get over it. Don't try to retain them, that's fruitless, pointless, and it won't work.
Load the XML, read data from it, modify it, save it. Don't get too worked up about irrelevant details.

Related

Search for a word, and modify the whole line in Python text processing

This is my carDatabase.txt
CarID:c01 ModelName:honda VehicleType:city Price:20
CarID:c02 ModelName:honda VehicleType:x Price:30
I want to search for the carID and be only able to modify the whole line without interrupting others
my current code is here:
# Converting txt data into a string and modify
carsDatabaseFile = open('carsDatabase.txt', 'r')
allDataFromDatabase = [line.split(',') for line in carsDatabaseFile.readlines()]
Note:
Your question has a couple of issues: your sample from carDatabase.txt looks like it is tab-delimited, but your current code looks like it is splitting the line around the ',' character. This also looks like a place where a list comprehension might be hurting you more than it is helping you. Break that up into a for-loop if you're trying to add some logic to manipulate a single line.
For looking at CSV files, I would highly recommend using pandas for general manipulation of data in comma ceparated as well as a number of other formats.
That said, if you are truly restricted to only using built-in packages, or you are looking at this as a learning exercise, and your goal is to directly manipulate just one line of that file, what you are looking for is the seek method. You can use this in combination with the tell method ( documented just blow seek in the above link ) to find where you are in the file.
Write a for loop to identify which line in the file you are looking for
From there, you can get the output of tell() to find the specific place in the file you are trying to manipulate
Using the output from the above two steps, you can set the file pointer to a specific location using the seek() method (by byte: files are really stored as one dimensional).
You can now use the write() method to directly update the file at the location you determined above.

xml.etree.ElementTree.Element' object has no attribute 'write'

I want to read a XML string, edit it and save it as a XML file.
However I get the mentioned error in the title when I do .write()
I found out that when you read an XML string using ElementTree.fromstring(string) it will create an ElementTree.Element and not an ElementTree itself. An Element has no write method but the ElementTree does.
How can I write an Element to a XML file? Or how can I create an ElementTree and add my Element to that and then use the .write method?
I found out that when you read a xml string using ElementTree.fromstring(string) it will actually create an ElementTree.Element and not a ElementTree itself.
Yes, you get the top-level element back (also called the "document element").
An Element has no write method but the ElementTree does.
The ElementTree constructor signature goes like this:
class xml.etree.ElementTree.ElementTree(element=None, file=None)
Therefore it's completely straightforward:
import xml.etree.ElementTree as ET
doc = ET.fromstring("<test>test öäü</test>")
tree = ET.ElementTree(doc)
tree.write("test.xml", encoding="utf-8")
You always should specify the encoding when writing an XML file. Most of the time, UTF-8 is the best choice.
In case this helps anyone who gets this unclear error message when trying to use ElementTree to write an xml file, and spends way too long on it (like I did):
File "/usr/lib/python3.5/xml/etree/ElementTree.py", line 788, in _get_writer
write = file_or_filename.write
AttributeError: 'str' object has no attribute 'write'
... in my case, it was simply because the path to the directory I was trying to write my xml file to did not exist! For example:
tree.write("/FolderDidNotExist/test.xml", encoding="utf-8")
a simple mkdir /FolderDidNotExist did the trick. No more error. (Of course, this error message could use some "love" so I'm posting this here in case I forget what it means again [which I've done] and need to google this again)

How to overwrite an XML attribute value without reading the whole file in python

How do you overwrite a single attribute value when you are only reading one element at a time?
Specifically I am using xml.etree.cElementTree.iterparse() to read each individual element. Then I am changing an attribute value.
What I need to do then is to overwrite the original element with the changed element.
Here is the example code so far:
osm_file = open(sample.osm, 'r+')
for event, elem in ET.iterparse(osm_file events=("start",)):
# Making some changes
elem.attrib['v'] = 'new_value'
# Some how write the elem back to the XML file
The one thing that I can not do is to read the whole XML file into python because the file is too big.
from usr2564301 as posted in the comments explained why this is not possible.
That cannot possibly work. The XML handling is unaware that the data came from a file and so it cannot "write back" the changed value at the exact same position in the file. Even if it could: it is physically impossible to replace a text in a file with a shorter or longer text without rewriting the entire file. (The very only exceptions being "exactly the same length text" and "the data is at the very end".) – usr2564301
iterparse still processes the whole tree. You can't avoid that:
http://effbot.org/zone/element-iterparse.htm#incremental-parsing
Incremental Parsing # Note that iterparse still builds a tree, just
like parse, but you can safely rearrange or remove parts of the tree
while parsing. For example, to parse large files, you can get rid of
elements as soon as you’ve processed them:
for event, elem in iterparse(source):
if elem.tag == "record":
... process record elements ...
elem.clear()
If your XML file is too big to handle in your program then you need to consider another data storage format like a database.
Otherwise, you could do some file manipulation magic with the text file with sed and awk or some other tool.
I've been working with huge files recently too, and couldn't fit them in memory. To fix this, I put together a simple package bigread (pip install bigread) that streams n lines of a file into RAM at once:
from bigread import Reader
# this will be the output file
with open('updated.xml', 'w') as out:
# read the input file
for i in Reader(file='input.xml', block_size=1):
# check if this is a line you need to operate on
if i.lstrip()[:5] == '<tag ':
# replace the target attribute
i = i.replace(' attr="cats" ', ' attr="dogs" ')
# write the new line to disk
out.write(i + '\n')

File Reading Options Enquiry (Python)

I am a programming student for the semester. In class we have been learning about file opening, reading and writing.
We have used a_reader to achieve such tasks for file opening. I have been reading our associated text/s and I have noticed that there is a CSV reader option which I have been using.
I wanted to know if there were anymore possible ways to open/read a file as I am trying to grow my knowledge base in python and its associated contents.
EDIT:
I was referring to CSV more specifically as that is the type of files we use at the moment. We have learnt about CSV Reader and a_reader and an example from one of our lectures is shown below.
def main():
a_reader = open('IDCJAC0016_009225_1800_Data.csv', 'rU')
file_data = a_reader.read()
a_reader.close()
print file_data
main()
It may seem overly broad but I have no knowledge which is why I am asking is there more than just the 2 ways above. If there is can someone who knows provide the types so I can read up on and research on them.
If you're asking about places to store things, the first interfaces you'll meet are files and sockets (pretend a network connection is like a file, see http://docs.python.org/2/library/socket.html).
If you mean file formats (like csv), there are many! Probably you can think of many yourself, but besides csv there are html files, pictures (png, jpg, gif), archive formats (tar, zip), text files (.txt!), python files (.py). The list goes on.
There are many ways to read files in different ways.
Just plain open will take a filename and open it as a sequence of lines. Or, you can just call read() on it, and it will read the whole file at once into one giant string.
codecs.open will take a filename and a character set, and decode each line to Unicode automatically. Or, again, you can just call read() on it, and it will read and decode the whole file at once into one giant Unicode string.
csv.reader will take a file or file-like object, and read it as a sequence of CSV rows. There's no direct equivalent of read()—but you can turn any sequence into a list by just calling list on it, so list(my_reader) will give you a list of rows (each of which is, itself, a list).
zipfile.ZipFile will take a filename, or a file or file-like object, and read it as a ZIP archive. This doesn't go line by line, of course, but you can go archived file by archived file. Or you can do fancier things, like search for archived files by name.
There are modules for reading JSON and XML documents, different ways of handling binary files, and so on. Some of them work differently—for example, you can search an XML document as a tree with one module, or go element by element with a different one.
Python has a pretty extensive standard library, and you can find the documentation online. Every module that seems like it should be able to work on files, probably can.
And, beyond what comes in the standard library, PyPI, the Python Package Index has thousands of additional modules. Looking for a way to read YAML documents? Search PyPI for yaml and you'll find it.
Finally, Python makes it very easy to add things like this on your own. The skeleton of a function like csv.reader is as simple as this:
def reader(fileobj):
for line in fileobj:
yield parse_one_csv_line(line)
You can replace that parse_one_csv_line with anything you want, and you've got a custom reader. For example, here's an uppercase_reader:
def uppercase_reader(fileobj):
for line in fileobj:
yield line.upper()
In fact, you can even write the whole thing in one line:
shouts = (line.upper() for line in fileobj)
And the best thing is that, as long as your reader only yields one line at a time, your reader is itself a file-like object, so you can pass uppercase_reader(fileobj) to csv.reader and it works just fine.

How to update/modify an XML file in python?

I have an XML document that I would like to update after it already contains data.
I thought about opening the XML file in "a" (append) mode. The problem is that the new data will be written after the root closing tag.
How can I delete the last line of a file, then start writing data from that point, and then close the root tag?
Of course I could read the whole file and do some string manipulations, but I don't think that's the best idea..
Using ElementTree:
import xml.etree.ElementTree
# Open original file
et = xml.etree.ElementTree.parse('file.xml')
# Append new tag: <a x='1' y='abc'>body text</a>
new_tag = xml.etree.ElementTree.SubElement(et.getroot(), 'a')
new_tag.text = 'body text'
new_tag.attrib['x'] = '1' # must be str; cannot be an int
new_tag.attrib['y'] = 'abc'
# Write back to file
#et.write('file.xml')
et.write('file_new.xml')
note: output written to file_new.xml for you to experiment, writing back to file.xml will replace the old content.
IMPORTANT: the ElementTree library stores attributes in a dict, as such, the order in which these attributes are listed in the xml text will NOT be preserved. Instead, they will be output in alphabetical order.
(also, comments are removed. I'm finding this rather annoying)
ie: the xml input text <b y='xxx' x='2'>some body</b> will be output as <b x='2' y='xxx'>some body</b>(after alphabetising the order parameters are defined)
This means when committing the original, and changed files to a revision control system (such as SVN, CSV, ClearCase, etc), a diff between the 2 files may not look pretty.
Useful Python XML parsers:
Minidom - functional but limited
ElementTree - decent performance, more functionality
lxml - high-performance in most cases, high functionality including real xpath support
Any of those is better than trying to update the XML file as strings of text.
What that means to you:
Open your file with an XML parser of your choice, find the node you're interested in, replace the value, serialize the file back out.
The quick and easy way, which you definitely should not do (see below), is to read the whole file into a list of strings using readlines(). I write this in case the quick and easy solution is what you're looking for.
Just open the file using open(), then call the readlines() method. What you'll get is a list of all the strings in the file. Now, you can easily add strings before the last element (just add to the list one element before the last). Finally, you can write these back to the file using writelines().
An example might help:
my_file = open(filename, "r")
lines_of_file = my_file.readlines()
lines_of_file.insert(-1, "This line is added one before the last line")
my_file.writelines(lines_of_file)
The reason you shouldn't be doing this is because, unless you are doing something very quick n' dirty, you should be using an XML parser. This is a library that allows you to work with XML intelligently, using concepts like DOM, trees, and nodes. This is not only the proper way to work with XML, it is also the standard way, making your code both more portable, and easier for other programmers to understand.
Tim's answer mentioned checking out xml.dom.minidom for this purpose, which I think would be a great idea.
While I agree with Tim and Oben Sonne that you should use an XML library, there are ways to still manipulate it as a simple string object.
I likely would not try to use a single file pointer for what you are describing, and instead read the file into memory, edit it, then write it out.:
inFile = open('file.xml', 'r')
data = inFile.readlines()
inFile.close()
# some manipulation on `data`
outFile = open('file.xml', 'w')
outFile.writelines(data)
outFile.close()
For the modification, you could use tag.text from xml. Here is snippet:
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
for rank in root.iter('rank'):
new_rank = int(rank.text) + 1
rank.text = str(new_rank)
tree.write('output.xml')
The rank in the code is example of tag, which depending on your XML file contents.
What you really want to do is use an XML parser and append the new elements with the API provided.
Then simply overwrite the file.
The easiest to use would probably be a DOM parser like the one below:
http://docs.python.org/library/xml.dom.minidom.html
To make this process more robust, you could consider using the SAX parser (that way you don't have to hold the whole file in memory), read & write till the end of tree and then start appending.
You should read the XML file using specific XML modules. That way you can edit the XML document in memory and rewrite your changed XML document into the file.
Here is a quick start: http://docs.python.org/library/xml.dom.minidom.html
There are a lot of other XML utilities, which one is best depends on the nature of your XML file and in which way you want to edit it.
As Edan Maor explained, the quick and dirty way to do it (for [utc-16] encoded .xml files), which you should not do for the resons Edam Maor explained, can done with the following python 2.7 code in case time constraints do not allow you to learn (propper) XML parses.
Assuming you want to:
Delete the last line in the original xml file.
Add a line
substitute a line
Close the root tag.
It worked in python 2.7 modifying an .xml file named "b.xml" located in folder "a", where "a" was located in the "working folder" of python. It outputs the new modified file as "c.xml" in folder "a", without yielding encoding errors (for me) in further use outside of python 2.7.
pattern = '<Author>'
subst = ' <Author>' + domain + '\\' + user_name + '</Author>'
line_index =0 #set line count to 0 before starting
file = io.open('a/b.xml', 'r', encoding='utf-16')
lines = file.readlines()
outFile = open('a/c.xml', 'w')
for line in lines[0:len(lines)]:
line_index =line_index +1
if line_index == len(lines):
#1. & 2. delete last line and adding another line in its place not writing it
outFile.writelines("Write extra line here" + '\n')
# 4. Close root tag:
outFile.writelines("</phonebook>") # as in:
#http://tizag.com/xmlTutorial/xmldocument.php
else:
#3. Substitue a line if it finds the following substring in a line:
pattern = '<Author>'
subst = ' <Author>' + domain + '\\' + user_name + '</Author>'
if pattern in line:
line = subst
print line
outFile.writelines(line)#just writing/copying all the lines from the original xml except for the last.

Categories

Resources