As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
I'm searching for an easy-to-use, native Python module that creates a Python object representation from XML.
I found several modules via Google (one of them is XMLObject) but didn't want to try out all of them.
What do you think is the best way to do such things?
EDIT: I forgot to mention that the XML I'd like to read is not generated by me. It's an existing XML file whose structure I have no control over.
You say you want an object representation, which I would interpret to mean that nodes become objects, and the attributes and children of the node are represented as attributes of the object (possibly according to some Schema). This is what XMLObject does, I believe.
There are some packages that I know of. 4Suite includes some tools to do this, and I believe Amara specifically implements this (built on top of 4Suite). You can also use lxml.objectify, which was inspired by Amara and gnosis.xml.objectify.
Of course a third option is, given a concrete representation of the XML (using ElementTree or lxml) you can build your own custom model around that. lxml.html is an example of that, extending the base interface of lxml with some HTML-specific functionality.
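As a rough sketch of that third option, a thin wrapper can expose an ElementTree element's XML attributes and children as Python attributes. The class name `Node` and the sample document are made up for illustration, and this is not how XMLObject or lxml.objectify are implemented:

```python
import xml.etree.ElementTree as ET


class Node(object):
    """Hypothetical wrapper exposing an Element's attributes and children."""

    def __init__(self, element):
        self._element = element

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails.
        element = self._element
        # In this sketch XML attributes take precedence over child elements.
        if name in element.attrib:
            return element.attrib[name]
        children = [Node(child) for child in element.findall(name)]
        if not children:
            raise AttributeError(name)
        # A single child is returned directly, several as a list.
        return children[0] if len(children) == 1 else children

    @property
    def text(self):
        return (self._element.text or "").strip()


SAMPLE = """
<library name="city">
  <book id="1"><title>Dune</title></book>
  <book id="2"><title>Emma</title></book>
</library>
"""

library = Node(ET.fromstring(SAMPLE))
titles = [book.title.text for book in library.book]
```

The "one child gives an object, several give a list" rule is a design shortcut; with XML you don't control, it may be safer to always return a list.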
I second the suggestion of xml.etree.ElementTree, mostly because it's now in the stdlib.
There is also a faster implementation available, xml.etree.cElementTree.
If you really need performance, I would suggest lxml:
http://www.ibm.com/developerworks//xml/library/x-hiperfparse/
I've heard the easiest is ElementTree, though I rarely work with XML and I can't say anything from experience.
There's also the excellent 3rd party library pyxser for Python.
pyxser stands for Python XML Serialization and is a Python object to XML serializer and deserializer. In other words, it can convert a Python object into XML and also, convert that XML back into the original Python object.
I use (and like) PyRXP, which creates a tuple built from the XML document.
The main issue with a straight XML -> Python object structure is that there is no Python analog for an attributed list - that is, a list with elements that also happens to have attributes. If you like, it is both a list and a dictionary at the same time.
I parse the result from PyRXP, and create the list/dictionary depending upon the structure - the XML I am dealing with is either list or attribute-based, never both. (I am consuming data from a known source).
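To make the "attributed list" idea concrete, here is a minimal sketch of one way to approximate it in Python. `AttrList` is a made-up name, not something PyRXP provides:

```python
class AttrList(list):
    """Hypothetical list subclass that also carries an attribute dictionary."""

    def __init__(self, items=(), **attrs):
        super(AttrList, self).__init__(items)
        self.attrs = dict(attrs)


# Roughly models <row id="7"><cell>a</cell><cell>b</cell></row>
row = AttrList(["a", "b"], id="7")
```

ElementTree's Element takes a similar approach: it behaves like a sequence of child elements and keeps the XML attributes in a separate `attrib` dictionary.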
Python has the pickle and cPickle modules for Python object serialization. Both modules can serialize a Python object hierarchy to a byte stream and deserialize it back:
https://docs.python.org/library/pickle.html
The following recipe provides a similar interface, pickle() and unpickle(), for serialization to and from XML:
http://code.activestate.com/recipes/355487/
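For reference, the round trip that the standard pickle module offers looks roughly like this. The sample data is made up, and the XML recipe's own pickle()/unpickle() functions are not shown here:

```python
import pickle

# A small object hierarchy built from standard types.
record = {"name": "example", "tags": ["a", "b"], "point": (1, 2)}

data = pickle.dumps(record)     # object hierarchy -> byte string
restored = pickle.loads(data)   # byte string -> new, equal object
```

Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.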
I've had pretty good luck with Wai Yip Tung's xml2obj function available here:
http://code.activestate.com/recipes/534109-xml-to-python-data-structure/
It's ~84 lines of code. It's pure Python, using only the xml.sax and re (regular expression) libraries. You just pass it XML and get back your object.
Related
I'm currently working on a project where I need to transfer objects from Ruby to Python and back again, so serialization is the obvious way to go. I've looked at things like YAML, but decided to write my own format because I didn't want to deal with library dependencies when it came time to distribute. I've written up how this serialization format works here.
My question: since this format is intended to work across languages, between Ruby and Python, how should I serialize Ruby's symbols? I'm not aware of an object that works the same way in Python. Should a dump containing a symbol fail? Should I just serialize it as a string? What would be best?
Doesn't that depend on what your project needs? If symbols are important, you'll need some way to deal with them.
I'm not a Ruby programmer, but from what I've just read, I think converting them to strings is probably easiest. The standard Python interpreter will reuse memory for identical short strings, which seems to be a key reason suggested for using symbols.
EDIT: If it needs to work for other programmers, passing values back and forth shouldn't change them. So you either have to handle symbols properly, or throw an error straight away. It should be simple enough in Python:
class Symbol(str):
    pass

# In serialising code:
if isinstance(x, Symbol):
    serialise_as_symbol(x)
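Here is a slightly fuller sketch of that idea, so that symbols survive a round trip instead of turning into plain strings. JSON stands in for the custom format purely for illustration, and the `__symbol__` tag and the `encode`/`decode` helpers are made up:

```python
import json


class Symbol(str):
    """Marker type: a string that should be treated as a Ruby symbol."""


def encode(value):
    # Tag symbols so that they can be told apart from plain strings.
    if isinstance(value, Symbol):
        return {"__symbol__": str(value)}
    if isinstance(value, list):
        return [encode(item) for item in value]
    return value


def decode(value):
    # Reverse of encode(): turn tagged dictionaries back into Symbol objects.
    if isinstance(value, dict) and list(value) == ["__symbol__"]:
        return Symbol(value["__symbol__"])
    if isinstance(value, list):
        return [decode(item) for item in value]
    return value


wire = json.dumps(encode([Symbol("status"), "status", 3]))
restored = decode(json.loads(wire))
```

Without the explicit tag, `json.dumps` would write a `Symbol` as an ordinary string, and the distinction would be lost on the way back.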
Any reason you're not using a standard data interchange format like JSON or XML? They seem to be acceptable to countless applications, services, and programmers.
If symbols are a stumbling block then you have three choices: don't allow them, convert them to strings on the fly, or figure out a way to make them universal and/or innocuous in other languages.
For a while I've been using a package called "gnosis-utils" which provides an XML pickling service for Python. This class works reasonably well, however it seems to have been neglected by its developer for the last four years.
At the time we originally selected Gnosis it was the only XML serialization tool for Python. The advantage of Gnosis was that it provided a set of classes whose function was very similar to the built-in Python XML pickler. It produced XML which Python developers found easy to read, but non-Python developers found confusing.
Now that the project has grown we have a new requirement: We need to be able to exchange XML with our colleagues who prefer Java or .Net. These non-Python developers will not be using Python - they intend to produce XML directly, hence we have a need to simplify the format of the XML.
So are there any alternatives to Gnosis? Our requirements:
Must work on Python 2.4 / Windows x86 32bit
Output must be XML, as simple as possible
API must resemble Pickle as closely as possible
Performance is not hugely important
Of course we could simply adapt Gnosis, however we'd prefer to simply use a component which already provides the functions we require (assuming that it exists).
So what you're looking for is a python library that spits out arbitrary XML for your objects? You don't need to control the format, so you can't be bothered to actually write something that iterates over the relevant properties of your data and generates the XML using one of the existing tools?
This seems like a bad idea. Arbitrary XML serialization doesn't sound like a good way to move forward. Any format that includes all of pickle's features is going to be ugly, verbose, and very nasty to use. It will not be simple. It will not translate well into Java.
What does your data look like?
If you tell us precisely what aspects of pickle you need (and why lxml.objectify doesn't fulfill those), we will be better able to help you.
Have you considered using JSON for your serialization? It's easy to parse, natively supports python-like data structures, and has wide-reaching support. As an added bonus, it doesn't open your code to all kinds of evil exploits the way the native pickle module does.
Honestly, you need to bite the bullet and define a format, and build a serializer using the standard XML tools, if you absolutely must use XML. Consider JSON.
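To show what "define a format and build a serializer" can look like, here is a minimal sketch using the standard library's ElementTree. The `Customer` class and the element names are invented, and the code is written for current Python 3 rather than the 2.4 mentioned in the question:

```python
import xml.etree.ElementTree as ET


class Customer(object):
    def __init__(self, name, emails):
        self.name = name
        self.emails = emails


def customer_to_xml(customer):
    # The format is decided here, explicitly, one element per field.
    root = ET.Element("customer")
    ET.SubElement(root, "name").text = customer.name
    for address in customer.emails:
        ET.SubElement(root, "email").text = address
    return ET.tostring(root, encoding="unicode")


def customer_from_xml(text):
    root = ET.fromstring(text)
    emails = [node.text for node in root.findall("email")]
    return Customer(root.findtext("name"), emails)


xml_text = customer_to_xml(Customer("Ada", ["ada@example.org"]))
restored = customer_from_xml(xml_text)
```

The resulting XML carries no Python-specific details, so someone producing it by hand from Java or .NET only has to follow the element layout.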
There is xml_marshaller which provides a simple way of dumping arbitrary Python objects to XML:
>>> from xml_marshaller import xml_marshaller
>>> class Foo(object): pass
>>> foo = Foo()
>>> foo.bar = 'baz'
>>> dump_str = xml_marshaller.dumps(foo)
Pretty printing the above with lxml (which is a dependency of xml_marshaller anyway):
>>> from lxml.etree import fromstring, tostring
>>> print tostring(fromstring(dump_str), pretty_print=True)
You get output like this:
<marshal>
  <object id="i2" module="__main__" class="Foo">
    <tuple/>
    <dictionary id="i3">
      <string>bar</string>
      <string>baz</string>
    </dictionary>
  </object>
</marshal>
I did not check for Python 2.4 compatibility since this question was asked long ago, but a solution for xml dumping arbitrary Python objects remains relevant.
Closed 12 years ago.
EDIT
The use of the phrase "bad at XML" in this question has been a point of contention, so I'd like to start out by providing a very clear definition of what I mean by this term in this context: if support for standard XML APIs is poor, and forces one to use a language-specific API, in which namespaces seem to be an afterthought, then I would be inclined to characterize that language as being not as well suited to using XML as other mainstream languages that do not have these issues. "Bad at XML" is just a shorthand for these conditions, and I think it is a fair way to characterize it. As I will describe, my initial experience with Python has raised concerns about whether it fulfils these conditions; but, because in general my experience with Python has been quite positive, it seems likely that I'm missing something, thus motivating this question.
I'm trying to do some very simple XML processing with Python. I had initially hoped to be able to reuse my knowledge of standard W3C DOM APIs, and happily found that the xml.dom and xml.dom.minidom modules did a good job of supporting these APIs. Unfortunately, however, serialization proved to be problematic, for the following reasons:
xml.dom does not come with a serializer
the PyXML library, which includes a serializer for xml.dom, is no longer maintained, AND
minidom does not support serialization of namespaces, even though namespaces are supported in the API
I looked through the list of other W3C-like libraries here:
http://wiki.python.org/moin/PythonXml#W3CDOM-likelibraries
I found that many other libraries, such as 4Suite and libxml2dom, are also not maintained.
On the other hand, itools at first glance appears to be maintained, but there does not appear to be an Ubuntu/Debian package available, and so would be difficult to deploy and maintain.
At this point, it seemed like trying to use W3C DOM APIs in my Python application was going to be a dead end, and I began to look at the ElementTree API. But the way the eTree API supports namespaces I think is horribly ugly, requiring one to use string concatenation every time an element in a particular namespace is created:
http://lxml.de/tutorial.html#namespaces
So, my question is, have I overlooked something, or is support for XML (in particular W3C DOM) actually quite bad in Python?
EDIT
Here follows a list of more precise questions, the answers to which would really help me:
Is there reasonable support for W3C DOM in Python?
If not xml.dom, do you use e.g. etree instead of W3C DOM?
If so, which library is best, and how do you overcome the issues regarding namespacing in the API?
If you use W3C DOM instead, are you aware of a library that implements serialization with support for namespaces?
I would say python handles XML pretty well. The number of different libraries available speaks to that - you have lots of options. And if there are features missing from libraries that you would like to use, feel free to contribute some patches!
I personally use the DOM and lxml.etree (etree is really fast). However, I feel your pain about the namespace thing. I wrote a quick helper function to deal with it:
import re

DEFAULT_NS = "http://www.domain.org/path/to/xml"

def add_xml_namespace(path, namespace=DEFAULT_NS):
    """Adds namespaces to an XPath-ish expression path for etree

    Test simple expression:
    >>> add_xml_namespace('image/namingData/fileBaseName')
    '{http://www.domain.org/path/to/xml}image/{http://www.domain.org/path/to/xml}namingData/{http://www.domain.org/path/to/xml}fileBaseName'

    More complicated expression
    >>> add_xml_namespace('.//image/*')
    './/{http://www.domain.org/path/to/xml}image/*'
    >>> add_xml_namespace('.//image/text()')
    './/{http://www.domain.org/path/to/xml}image/text()'
    """
    # Only plain element names are qualified; steps such as '.', '*'
    # and 'text()' do not match the pattern and are left untouched.
    pattern = re.compile(r'^[A-Za-z0-9-]+$')
    tags = path.split('/')
    for i, tag in enumerate(tags):
        if pattern.match(tag):
            tags[i] = "{%s}%s" % (namespace, tag)
    return '/'.join(tags)
I use it like so:
from lxml import etree
from utilities import add_xml_namespace as ns

tree = etree.parse('file.xml')
node = tree.getroot().find(ns('root/group/subgroup'))
# etc.
If you don't know the namespace ahead of time, you can extract it from a root node:
tree = etree.parse('file.xml')
root = tree.getroot().tag
namespace = root[1:root.index('}')]
ns = lambda path: add_xml_namespace(path, namespace)
...
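An alternative that avoids rewriting the path is to pass a prefix map to `find`. The sketch below uses the standard library's ElementTree so it runs without lxml. The sample document and the prefix `d` are made up, and only the URI has to match the document:

```python
import xml.etree.ElementTree as ET

DOC = """
<root xmlns="http://www.domain.org/path/to/xml">
  <group><subgroup>value</subgroup></group>
</root>
"""

# The prefix "d" is arbitrary and local to this lookup.
NAMESPACES = {"d": "http://www.domain.org/path/to/xml"}

root = ET.fromstring(DOC)
node = root.find("d:group/d:subgroup", NAMESPACES)
```

The matched element still reports its tag in the expanded `{uri}name` form, so the helper and the prefix map can be mixed freely.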
Additional comment: There is a little work involved here, but work is necessary when dealing with XML. That's not a python issue, it's an XML issue.
Python is great at handling XML. I consider lxml to be the best XML library I have ever worked with; it is powerful and significantly simpler than the DOM. The namespace handling took some getting used to, but I think it is another great way lxml keeps things simple.
EDIT
After rereading the question, it is unclear whether the author meant the serialization of Python objects or just the DOM tree. The portion of my answer below assumed the former.
XML serialization is a completely different issue. Personally, I don't think it is very important. Most XML serializers produce output that is pretty specific to the language or runtime, which defeats the purpose of having such an open format. I realize that there are some generic XML serialization schemas, but Python provides two solutions that are superior for 95% of situations: pickling and JSON.
If your application doesn't have to share objects with non-python systems, Pickling is the fastest and most powerful serialization solution you will find. JSON is significantly faster to parse and generate, and much easier to work with than XML. JSON has plenty of limitations, but it is frequently easier to work around them, than deal with the headaches of XML.
There are plenty of other serialization formats that, depending on the application, I would recommend ahead of XML (E.G.: Google Protocol Buffers, or YAML.)
Also, don't forget about SAX. Event driven parsers are only useful for reading XML, but I have found that it is still the best solution for some problems.
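For anyone who hasn't used SAX, here is a minimal sketch of the event-driven style with the standard library's xml.sax. The handler class and the sample document are made up:

```python
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Counts how often each element name occurs, without building a tree."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once for every opening tag as the parser streams through.
        self.counts[name] = self.counts.get(name, 0) + 1


handler = TagCounter()
xml.sax.parseString(b"<log><entry/><entry/><entry/></log>", handler)
```

Because nothing is kept in memory except what the handler chooses to store, this style suits very large documents.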
But the way the eTree API supports namespaces I think is horribly ugly, requiring one to use string concatenation every time an element in a particular namespace is created
Here's how you create an element with a namespace in .NET's System.Xml.Linq DOM:
XNamespace ns = "my-namespace";
XElement elm = new XElement(ns + "foo");
Here's how you create an element in a namespace in lxml:
ns = "{my-namespace}"
elm = etree.Element(ns + "foo")
I'm not seeing horrible ugliness here. In fact, the developers of the .NET API have bent over backwards, creating base classes that support operator overloading, to make it possible for their API to handle namespaces as intuitively as lxml's does.
How this is uglier than the W3C DOM requiring you to use different methods to create elements with and without namespaces is beyond me.
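If the concatenation still bothers you, the standard library's ElementTree has a `QName` helper that builds the `{uri}name` form for you, and `register_namespace` lets you pick the prefix used on output. A small sketch, with a made-up namespace URI and prefix:

```python
import xml.etree.ElementTree as ET

NS = "urn:example:my-namespace"   # hypothetical namespace URI

# Optional: choose the output prefix instead of the generated ns0, ns1, ...
ET.register_namespace("ex", NS)

# QName builds the "{uri}local" form without manual concatenation.
elm = ET.Element(ET.QName(NS, "foo").text)
ET.SubElement(elm, ET.QName(NS, "bar").text)

serialized = ET.tostring(elm, encoding="unicode")
```

The namespace declaration is written once on the root element when the tree is serialized.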
This question may be seen as subjective, but I'd like to ask SO users which common structured textual data format is best supported in Python.
My initial choices are:
XML
JSON
and YAML
Which of these three is easiest to work with in Python (i.e. has the best library support / performance), or is there another format that I haven't mentioned that is better supported in Python?
I cannot use a Python only format (e.g. Pickling) since interop is quite important, but the majority of the code that handles these files will be written in Python, so I'm keen to use a format that has the strongest support in Python.
CSV or fixed column text may also be viable for most use cases, however I'd prefer the flexibility of a more scalable format.
Thank you
Note
Regarding interop I will be generating these files initially from Ruby, using Builder, however Ruby will not be consuming these files again.
I would go with JSON. YAML is awesome, but interop with it is not that great.
XML is just an ugly mess to look at and has too much fat.
Python has had a built-in json module since version 2.6.
JSON has great python support and it is much more compact than XML (and the API is generally more convenient if you're just trying to dump and load objects). There's no out of the box support for YAML that I know of, although I haven't really checked. In the abstract I would suggest using JSON due to the low overhead of the format and the wide range of language support, but it does depend a bit on your application - if you're working in a space that already has established applications, the formats they use might be preferable, even if they're technically deficient.
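As a quick illustration of how little code the dump-and-load case needs with the standard json module (the sample data is made up):

```python
import json

config = {"name": "example", "retries": 3, "hosts": ["a", "b"], "debug": False}

text = json.dumps(config, sort_keys=True)   # Python objects -> JSON text
restored = json.loads(text)                 # JSON text -> Python objects
```

Dictionaries, lists, strings, numbers, booleans and None map directly onto JSON; anything else needs an explicit conversion.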
I think it depends a lot on what you need to do with the data. If you're going to be building a complex database and doing processing and transformations on it, I suspect you'd be better off with XML. I've found the lxml module pretty useful in this regard. It has full support for standards like xpath and xslt, and this support is implemented in native code so you'll get good performance.
But if you're doing something more simple, then likely you'd be better off to use a simpler format like yaml or json. I've heard tell of "json transforms" but don't know how mature the technology is or how developed Python's access to it is.
It's pretty much all the same, out of those three. Use whichever is easier to inter-operate with.
Closed 9 years ago.
A colleague is looking to generate UML class diagrams from heaps of Python source code.
He's primarily interested in the inheritance relationships, and mildly interested in compositional relationships, and doesn't care much about class attributes that are just Python primitives.
The source code is pretty straightforward and not tremendously evil--it doesn't do any fancy metaclass magic, for example. (It's mostly from the days of Python 1.5.2, with some sprinklings of "modern" 2.3ish stuff.)
What's the best existing solution to recommend?
You may have heard of Pylint, which helps statically check Python code. Few people know that it comes with a tool named Pyreverse that draws UML diagrams from the Python code it reads. Pyreverse uses Graphviz as a backend.
It is used like this:
pyreverse -o png -p yourpackage .
where the . can also be a single file.
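To give a feel for the kind of static analysis involved, here is a toy sketch that reads inheritance relationships out of source text with the standard ast module and writes them as Graphviz DOT. It is only an illustration, not how Pyreverse is implemented, and the sample classes are made up:

```python
import ast

SOURCE = """
class Shape(object):
    pass

class Circle(Shape):
    pass

class Square(Shape):
    pass
"""


def inheritance_edges(source):
    """Return (subclass, base) name pairs found by parsing, not importing."""
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef):
            for base in node.bases:
                # Only plain names are handled; dotted bases are skipped.
                if isinstance(base, ast.Name):
                    edges.append((node.name, base.id))
    return edges


def to_dot(edges):
    # One "subclass -> base" arrow per edge, in Graphviz DOT syntax.
    lines = ["digraph classes {"]
    lines += ['    "%s" -> "%s";' % edge for edge in edges]
    lines.append("}")
    return "\n".join(lines)


edges = inheritance_edges(SOURCE)
dot_text = to_dot(edges)
```

Since the code is parsed rather than imported, old or partly broken modules can still be inspected as long as they are syntactically valid for the Python version running the script.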
Epydoc is a tool to generate API documentation from Python source code. It also generates UML class diagrams, using Graphviz in fancy ways. Here is an example of diagram generated from the source code of Epydoc itself.
Because Epydoc performs both object introspection and source parsing, it can gather more information than static code analysers such as Doxygen: it can inspect a fair amount of dynamically generated classes and functions, but can also use comments or unassigned strings as a documentation source, e.g. for variables and class public attributes.
Certain classes of well-behaved programs may be diagrammable, but in the general case, it can't be done. Python objects can be extended at run time, and objects of any type can be assigned to any instance variable. Figuring out what classes an object can contain pointers to (composition) would require a full understanding of the runtime behavior of the program.
Python's metaclass capabilities mean that reasoning about the inheritance structure would also require a full understanding of the runtime behavior of the program.
To prove that these are impossible, you argue that if such a UML diagrammer existed, then you could take an arbitrary program, convert "halt" statements into statements that would impact the UML diagram, and use the UML diagrammer to solve the halting problem, which as we know is impossible.
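A small made-up example of the kind of code that defeats a purely static diagrammer: the base class and the type of the attribute both depend on a runtime value.

```python
def make_container(use_mapping):
    # The base class is chosen from a runtime value, so it cannot be read
    # off the source text without knowing what the caller passes in.
    base = dict if use_mapping else list
    return type("Container", (base,), {})


class Holder(object):
    pass


holder = Holder()
# The attribute is attached after creation; its type depends on the call above.
holder.payload = make_container(True)()
```

Code like this is rare in straightforward programs, which is why tools can still produce useful diagrams for the well-behaved cases.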
It is worth mentioning Gaphor, a Python modelling/UML tool.
If you use Eclipse, maybe PyUML. Haven't used it, though.
vipera is a small application designer, and uml is included. You can see it in:
vipera
Best regards.
The SPE IDE has a built-in UML creator. Just open the files in SPE and click on the UML tab.
I don't know how comprehensive it is for your needs, but it doesn't require any additional downloads or configurations to use.
Sparx's Enterprise Architect performs round-tripping of Python source. They have a free time-limited trial edition.
Umbrello does that too. In the menu go to Code -> import project and then point to the root directory of your project. It then reverse-engineers the code for you.