Map xml file with extracted information - python

I am trying to map one XML file to another, based on a configuration file (which can itself be an XML file).
Input
<ia>
<ib>...</ib>
<ic>...</ic>
</ia>
Output
<oa>
<ob>...</ob>
<oc>...</oc>
</oa>
Config
<config>
<conf>
<input>ia</input>
<output>oa</output>
</conf>
<conf>
<input>ib</input>
<output>ob</output>
</conf>
.....
</config>
So the intention is to parse an XML file, retrieve the information interesting to me, and write it into another XML file, where the mapping is specified in the config file.
Due to the scripting nature (and extending with plugins later on), and the support for XML processing, I was considering Python. I have just learned the syntax and basics of the language, and came to know about lxml.
One way of doing this:
parse the config file (where a tag can hold an XPath to the node I am interested in)
read the input file
write the output, using etbuilder, based on the config file
Being new to Python, and not seeing XPath support in etbuilder, I wonder whether this is the best approach. I am also not sure about all the exceptional cases. Is there an easier way, or native support in other libraries? If possible, I do not want to spend too much time on this task so I can focus on the core task.
Thanks in advance.

If you wish to transform one XML file into another, XSLT was made for this purpose. You define a .xslt file that describes the transformation of the XML content and what the eventual output should look like. That's one way to do it.
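For the sample documents in the question, such a stylesheet could look like this minimal sketch (element names are taken from the question; a plain tag-for-tag rename is assumed):

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Rename each input tag to its output counterpart, keeping the content. -->
  <xsl:template match="ia"><oa><xsl:apply-templates/></oa></xsl:template>
  <xsl:template match="ib"><ob><xsl:apply-templates/></ob></xsl:template>
  <xsl:template match="ic"><oc><xsl:apply-templates/></oc></xsl:template>
</xsl:stylesheet>
```

lxml can apply a stylesheet like this from Python via etree.XSLT, so the whole transformation can stay inside your script.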
You can also read the XML file using lxml and generate the output XML with lxml.etree.ElementTree. I'm not familiar with etbuilder, but generating the desired output should not be difficult: once you have parsed the input and config files, you can build the output XML and write it to a file.
XPath is primarily for reading XML content; you don't need it for constructing XML files. In fact, if you use a proper XML parser you don't strictly need XPath to read the file contents either, although it can make life a bit easier.
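For comparison, here is a minimal sketch of the config-driven approach using the standard library's xml.etree.ElementTree (lxml's API is very close); the tag names come from the question, and a flat tag-to-tag rename is assumed:

```python
import xml.etree.ElementTree as ET

def build_mapping(config_root):
    """Read <conf><input>..</input><output>..</output></conf> pairs into a dict."""
    return {conf.findtext("input"): conf.findtext("output")
            for conf in config_root.findall("conf")}

def transform(elem, mapping):
    """Recursively copy elem, renaming tags according to the mapping."""
    out = ET.Element(mapping.get(elem.tag, elem.tag), elem.attrib)
    out.text = elem.text
    for child in elem:
        new_child = transform(child, mapping)
        new_child.tail = child.tail
        out.append(new_child)
    return out

config = ET.fromstring(
    "<config>"
    "<conf><input>ia</input><output>oa</output></conf>"
    "<conf><input>ib</input><output>ob</output></conf>"
    "<conf><input>ic</input><output>oc</output></conf>"
    "</config>")
src = ET.fromstring("<ia><ib>1</ib><ic>2</ic></ia>")
dst = transform(src, build_mapping(config))
print(ET.tostring(dst, encoding="unicode"))  # <oa><ob>1</ob><oc>2</oc></oa>
```

The mapping dict and the recursive copy are the whole mechanism; XPath only enters the picture if the config needs to address nodes more precisely than by tag name.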

Related

How can I get line and column numbers of elements when parsing xml with Python?

What I'm trying to do is write a GUI program in Python that displays the contents of XML files with some basic features like syntax highlighting (this isn't the only thing it needs to do, but it is one of the things).
To do this, I figured I should use an XML parsing package such as lxml or ElementTree. When I render the XML, I would like to be able to use the data structure produced by the parser to do things like syntax highlighting (or whatever). lxml has a "sourceline" property, but nothing for column numbers as far as I can tell.
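As a possible workaround (an assumption on my part, not something the question confirms): the standard library's expat parser exposes both CurrentLineNumber and CurrentColumnNumber, which you can record from a start-element handler:

```python
import xml.parsers.expat

positions = []  # (tag name, line, column) recorded at each start tag

def start_element(name, attrs):
    # expat reports 1-based line numbers and 0-based column numbers
    positions.append((name, parser.CurrentLineNumber, parser.CurrentColumnNumber))

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.Parse(b"<root>\n  <child/>\n</root>", True)
print(positions)
```

This gives per-token positions without writing your own parser, at the cost of working below the ElementTree level of abstraction.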
Am I going about this the right way? Is there a better way to accomplish what I want? Otherwise I think I will have to write my own XML parser, which I'm not enthusiastic about.

Parse a python file to a XML template in Apache Velocity

I have a Python source file which I would like to convert, according to a certain template written in Apache Velocity, to an XML file.
Is it possible to read the contents of a Python code file and extract the necessary information in native Velocity language? Or do I need to write a Python script (or some other language) to parse the Python file into the XML template?
As you have identified, you need to have two distinct phases:
Extract information from the Python source file (aka parsing)
Publish this information back to an XML file
Apache Velocity can help you accomplish step 2, but it knows nothing about parsing.
There are several ways to achieve the parsing step:
you could parse it with whatever tool you like, and publish the result in a format Velocity understands easily, like a Properties file that you put in the context. If you need hierarchical properties, have a look at the ValueParser tool, which can return submaps from sets of properties like foo.bar=woogie and foo.schmoo=wiggie.
you could give Stillness a try, a parsing tool that uses a Velocity-like syntax (disclaimer: I wrote it).
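As a sketch of the first option (the property-key layout here is my own invention, not a Velocity convention): Python's own ast module can extract structure from a source file, which you can then flatten into a properties-style file for the Velocity context:

```python
import ast

# A small stand-in for the real source file.
source = '''
class Movie:
    def play(self):
        pass

def main():
    pass
'''

tree = ast.parse(source)
props = []
for node in ast.walk(tree):
    if isinstance(node, ast.ClassDef):
        props.append(f"class.{node.name}.lineno={node.lineno}")
    elif isinstance(node, ast.FunctionDef):
        props.append(f"function.{node.name}.lineno={node.lineno}")
print("\n".join(sorted(props)))
```

The resulting key=value lines can be loaded into the Velocity context as properties, and ValueParser can then expose the dotted keys as submaps.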

memory efficient way to change and parse a large XML file in python

I want to parse a large XML file (25 GB) in python, and change some of its elements.
I tried ElementTree from xml.etree but it takes too much time at the first step (ElementTree.parse).
I read somewhere that SAX is fast and does not load the entire file into memory, but it is just for parsing, not modifying.
'iterparse' should also be just for parsing, not modifying.
Is there any other option which is fast and memory efficient?
What is important for you here is that you need a streaming parser, which is what SAX is. (There is a built-in SAX implementation in Python, and lxml provides one as well.) The problem is that since you are trying to modify the XML file, you will have to rewrite it as you read it.
An XML file is a text file. You can't change some data in the middle of a text file without rewriting the entire file, unless the new data is exactly the same size, which is unlikely.
You can use SAX to read each element and register an event to write each element back out after it has been read and modified. If your changes are really simple, it may even be faster to skip XML parsing altogether and just do text matching for what you are looking for.
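A minimal sketch of that read-modify-write loop, using the standard library's SAX parser and XMLGenerator; the <title> element and the upper-casing change are illustrative assumptions, not from the question:

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

class RewriteHandler(xml.sax.ContentHandler):
    """Stream elements back out, upper-casing the text of <title> (illustrative)."""
    def __init__(self, out):
        super().__init__()
        self.gen = XMLGenerator(out, encoding="utf-8")
        self.in_title = False
    def startDocument(self):
        self.gen.startDocument()
    def startElement(self, name, attrs):
        self.in_title = (name == "title")
        self.gen.startElement(name, attrs)
    def characters(self, content):
        # The modification happens here, element by element, in constant memory.
        self.gen.characters(content.upper() if self.in_title else content)
    def endElement(self, name):
        self.in_title = False
        self.gen.endElement(name)

out = io.StringIO()  # with a 25 GB file this would be the output file handle
xml.sax.parseString(b"<movies><title>alien</title></movies>", RewriteHandler(out))
print(out.getvalue())
```

For a real 25 GB file you would pass an open output file instead of StringIO and feed the parser from the input file, so neither document is ever fully in memory.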
If you are doing any significant work with an XML file this large, then I would say you shouldn't be using an XML file; you should be using a database.
The problem you have run into here is the same issue that COBOL programmers on mainframes had when they were working with file-based data.

Need to read XML files as a stream using BeautifulSoup in Python

I have a dilemma.
I need to read very large XML files from all kinds of sources, so the files are often invalid or malformed XML. I still must be able to read the files and extract some information from them. I do need tag information, so I need an XML parser.
Is it possible to use Beautiful Soup to read the data as a stream instead of the whole file into memory?
I tried to use ElementTree, but I cannot because it chokes on any malformed XML.
If Python is not the best language to use for this project please add your recommendations.
Beautiful Soup has no streaming API that I know of. You have, however, alternatives.
The classic approach to parsing large XML streams is an event-oriented parser, namely SAX; in Python, xml.sax.xmlreader. It will not choke on malformed XML: you can skip erroneous portions of the file and extract information from the rest.
SAX, however, is low-level and a bit rough around the edges. In the context of python, it feels terrible.
The xml.etree.cElementTree implementation, on the other hand, has a much nicer interface, is pretty fast, and can handle streaming through the iterparse() method.
ElementTree is superior, if you can find a way to manage the errors.
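A sketch of the iterparse() pattern (the element names are invented for illustration; note that in modern Python, xml.etree.ElementTree uses the C accelerator automatically, so importing cElementTree is no longer necessary):

```python
import io
import xml.etree.ElementTree as ET

# A small document standing in for a huge stream; iterparse reads it incrementally.
data = io.BytesIO(b"<movies><movie><title>Alien</title></movie>"
                  b"<movie><title>Brazil</title></movie></movies>")

titles = []
for event, elem in ET.iterparse(data, events=("end",)):
    if elem.tag == "movie":
        titles.append(elem.findtext("title"))
        elem.clear()  # free already-processed subtrees to keep memory flat
print(titles)  # ['Alien', 'Brazil']
```

The elem.clear() call is what keeps memory bounded: without it, iterparse still builds the whole tree behind the scenes.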

splitting an XML file into smaller files

The XML file contains information about movies. How do I split the XML file into smaller files, so that each small file is a separate movie?
Without knowing the details, here is a broad outline of a possible approach:
Parse the XML using a suitable library (BeautifulSoup, lxml etc.)
Find the element corresponding to each movie. This can be done with a plain findAll or may require an XPath expression.
Pretty-print the subtree corresponding to each movie element into a separate file.
Of course a more detailed answer is not possible unless you post some sample XML and provide more details.
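The outline above can be sketched with the standard library alone (the <movies>/<movie> structure is an assumption, since no sample XML was posted):

```python
import xml.etree.ElementTree as ET

# Illustrative structure; the real file's element names may differ.
doc = ET.fromstring(
    "<movies>"
    "<movie><title>Alien</title></movie>"
    "<movie><title>Brazil</title></movie>"
    "</movies>")

chunks = []
for i, movie in enumerate(doc.findall("movie")):
    xml_text = ET.tostring(movie, encoding="unicode")
    chunks.append(xml_text)
    # In practice, write each chunk to its own file:
    # with open(f"movie_{i}.xml", "w") as f:
    #     f.write(xml_text)
print(chunks[0])  # <movie><title>Alien</title></movie>
```

If the source file is too large to parse whole, the same per-movie serialization works inside an iterparse() loop instead of findall().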
