What's the best way to handle XML in Python? - python

I am working on an XML file that is very large (I think that since about it is 45 GB) I have to search through the document of OpenStreetMap data and find a particular path that satisfies a certain criteria. I am currently using XML element tree however I think it is a little slow since I need to search the XML for specific Geographic coordinates. Is there a better way to do this ?
I have also seen a little bit of lxml however, I'd like to know if there's a better choice between the two ? Thank you!

The sax package is well suited for parsing huge xml files that can't be loaded into memory all at once. The parser will go "line-by-line" and notify you when it encountered the beginning or ending of an element, instead of loading the file into memory, parsing it, and giving you the whole tree.

Related

saving back to a docx xml file with python

I am trying to preform find and replace on docx files, whilst still maintaining formatting.
From reading up on this, the best way seems to be preforming the find/replace on the xml file of the document.
I can load in the xml file and find/replace on it, but unsure how to write it back.
docx:
Hello {text}!
python:
import zipfile
zip = zipfile.ZipFile(open('test.docx', 'rb'))
xmlString = zip.read('word/document.xml').decode('utf-8')
xmlString.replace('{text}', 'world')
What you are trying is dangerous, because you are processing a high lever docx file at a lower level. If you really want to do it, just use the hints from overwriting file in ziparchive as suggested by #shahvishal.
But unless you fully know all the details of docx format, my advice is : do not do that. Suppose there is for any reason in an internal field or attribute the string {text}. You are likely to change the file in an unexpected way leading immediately or even worse later to the destruction of the file (Word being no longer able to process it).
If you do your processing on a Windows machine with an installed word, you certainly could try to use automation to process the file with Microsoft Word. Unfortunately, I only did that long time ago and cannot give useful links. You will need:
general knowledge on pywin30 module
sufficient knowledge on the Automation interface of MS/WORD. Hopefully, its documentation is nice with many examples provided you have a full installation of Microsoft Office including macro help
You're really going to want to use a library for reading/writing docx files rather than trying to just deal with them as raw XML. A cursory search came up with the pypi module docx but I haven't used this module so I can't endorse it:
https://pypi.python.org/pypi/docx/0.2.4
I've had the (unfortunate) experience of dealing with the manipulation of MS Office documents from other programming languages, and spending the time to find good libraries really paid off.
The old saying goes "don't reinvent the wheel" and I think that's definitely true when manipulating non-trivial file formats. If a somewhat mature library exists to do the job, use it!
You would need to replace the file in the zip archive. There is no "simple" way of achieving this. The following is a question that should help:
overwriting file in ziparchive

memory efficient way to change and parse a large XML file in python

I want to parse a large XML file (25 GB) in python, and change some of its elements.
I tried ElementTree from xml.etree but it takes too much time at the first step (ElementTree.parse).
I read somewhere that SAX is fast and do not load the entire file into the memory but it just for parsing not modifying.
'iterparse' should also be just for parsing not modifying.
Is there any other option which is fast and memory efficient?
What is important for you here is that you need a streaming parser, which is what sax is. (There is a built in sax implementation in python and lxml provides one.) The problem is that since you are trying to modify the xml file, you will have to rewrite the xml file as you read it.
An XML file is a text file, You can't go and change some data in the middle of the text file without rewriting the entire text file (unless the data is the exact same size which is unlikely)
You can use SAX to read in each element and register an event to write back each element after it is been read and modified. If your changes are really simple it may be even faster to not even bother with the XML parsing and just match text for what you are looking for.
If you are doing any signinficant work with this large of an XML file, then I would say you shouldn't be using an XML file, you should be using a database.
The problem you have run into here is the same issue that Cobol programmers on mainframes had when they were working with File based data

Map xml file with extracted information

I am trying to map one xml file to another, based on a configuration file (that too can be an xml file).
Input
<ia>
<ib>...</ib>
<ic>...</ic>
</ia>
Output
<oa>
<ob>...</ob>
<oc>...</oc>
</oa>
Config
<config>
<conf>
<input>ia</input>
<output>oa</output>
</conf>
<conf>
<input>ib</input>
<output>ob</output>
</conf>
.....
</config>
So, the intention is to parse an xml file and retrieve the information interesting to me, and write into another xml file where the mapping information is specified in the config file.
Due to the scripting nature (and extending with plugins lateron), and the support for xml processing I was considering python. I just learned the syntax and basics of language, and came to know about lxml
One way of doing this
parse the config file (where , tag can have xpath to the node I am interested in)
read the input file
write into output, using etbuilder based on the config file
Being new to python, and not seeing xpath support for etbuilder I wonder is this the best approach. Also not sure all the exceptional cases. Is there an easier way, or native support in any other libraries. If possible, I do not want to spend too much time on this task as I could focus on the core task.
thanks ins advance.
If you wish to transform an XML file into another XML file then XSLT was made for this purpose. You have to define a .xslt file that describes the transformation of XML content and what the eventual output should look like. That's one way to do it.
You can also read the XML file using lxml and generate the output XML with lxml.etree.ElementTree. I'm not familiar with etbuilder but I don't think generating the desired output is that difficult. Once you have parsed the input files, you can build the config XML and write it to a file.
XPath is primarily for reading XML content, you don't need it for constructing XML files. In fact, if you use a proper XML parser then you don't need XPath either to read the file contents, although XPath could make life a bit easier.

Confused by which XML processing option to use

I'm fairly new to Python, and I've just started working with XML parsing. I am getting a bit overwhelmed by all the options for working with XML, and I'm hoping an experienced person can give me some advice (and perhaps a code sample??) for the simple problem I'm working on.
I am working on a simple Python contact management application that does not involve a database - each contact's information is stored in a separate text file using XML. For example, assume the following is the contents of the file "1234.xml"
<contact>
<id>1234</id>
<name>Johnny Appleseed</name>
<phone>8145551212</phone>
<address>
<street>1234 Main Street</street>
<city>Hometown</city>
<state>OH</state>
</address>
<address>
<street>1313 Mockingbird Lane</street>
<city>White Plains</city>
<state>NY</state>
</address>
</contact>
For sake of example, let's assume there can be only one phone number, but multiple address blocks.
For what I'm doing here, I need to be able to parse the XML from the file, make changes to the data, and then update the XML and save it back to the file. Let's assume there are three types of data changes that might occur:
changing the data for one or more items, such as updating a phone number
adding a new address block (and the corresponding data for the street/city/state of the new address)
deleting an existing address block
Given what I'm trying to do here, can you recommend a particular way of doing this? (SAX, DOM, minidom, ElementTree, something else?) Code samples for whatever you suggest would be greatly appreciated.
Thank you!
Ron
The SAX and DOM APIs are older; they were pretty much translated from the Java world into Python. The ElementTree API was designed specifically to be Pythonic, i.e. to fit with the Python way of problem solving, so prefer that.
The richest and fastest implementation of ElementTree that I know of is lxml. It's XPath functionality is very useful. Untested example:
from lxml import etree
contacts = etree.parse(open("1234.xml"))
for c in contacts.xpath('//contact'):
if c.xpath('/name')[0].text == 'Johnny Appleseed':
c.xpath('/phone')[0].text = NEW_PHONE_NUMBER
print >> open("1234.xml", "w"), etree.tostring(contacts)
The best solution is to use ElementTree and parse this into a set of classes and manipulate the classes and then serialize them back to XML. You can do this by hand if the XML is really as simple as your example or you can use some tool or library to generate the bindings.
Working directly with XML in most cases always ends in tears, or at least hair pulling. It also isn't very maintainable, when the XML changes it usually will break your hand coded parsing.
Using a binding solution will be more robust to changes and easier to modify when you do need to manually intervene.

splitting an exml file into smaller files

the xml file contains information about movies. how do i split the xml file into smaller files? ( so each small file is a separate movie)
Without knowing the details, here is a broad outline of a possible approach:
Parse the XML using a suitable library (BeautifulSoup, lxml etc.)
Find the element corresponding to each movie. This can be done using a plain findAll or may require using an XPATH expression.
Pretty print the subtree starting corresponding to each movie element into separate files.
Of course a more detailed answer is not possible unless you post some sample XML and provide more details.

Categories

Resources