lxml not adding newlines when inserting a new element into existing xml - python

I have a large set of existing xml files, and I am trying to add one element to all of them (they are pom.xml for a number of maven projects, and I am trying to add a parent element to all of them). The following is my exact code.
The problem is that the final xml output in pom2.xml has the complete parent element in a single line. Though, when I print the element by itself, it writes it out in 4 lines as usual. How do I print out the complete xml with proper formatting for the parent element?
from lxml import etree
parentPom = etree.Element('parent')
groupId = etree.Element('groupId')
groupId.text = 'org.myorg'
parentPom.append(groupId)
artifactId = etree.Element('artifactId')
artifactId.text = 'myorg-master-pom'
parentPom.append(artifactId)
version = etree.Element('version')
version.text = '1.0.0'
parentPom.append(version)
print etree.tostring(parentPom, pretty_print=True)
pom = etree.parse("pom.xml")
projectElement = pom.getroot()
projectElement.insert(0, parentPom)
file = open("pom2.xml", 'wb')
file.write(etree.tostring(projectElement, pretty_print=True))
file.close()
Output of print:
<parent>
<groupId>org.myorg</groupId>
<artifactId>myorg-master-pom</artifactId>
<version>1.0.0</version>
</parent>
Output of same element in pom2.xml:
<parent><groupId>com.inmobi</groupId><artifactId>inmobi-master-pom</artifactId><version>1.0.1</version></parent><modelVersion>4.0.0</modelVersion>

This might be of interest to you:
http://lxml.de/FAQ.html#why-doesn-t-the-pretty-print-option-reformat-my-xml-output
In short for future reference:
parser = etree.XMLParser(remove_blank_text=True)
pom = etree.parse("pom.xml",parser)
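To see the effect of remove_blank_text end to end, here is a minimal sketch. It assumes a small in-memory POM without the Maven namespace; SAMPLE_POM and add_parent are hypothetical names used only for illustration. Real pom.xml files usually declare a default namespace, in which case the new elements would have to be created in that namespace as well.

```python
from io import BytesIO
from lxml import etree

# Hypothetical stand-in for pom.xml (no namespace, for brevity).
SAMPLE_POM = b"""<project>
  <modelVersion>4.0.0</modelVersion>
  <artifactId>demo</artifactId>
</project>"""


def add_parent(pom_bytes):
    """Insert a <parent> element as the first child and pretty-print the result."""
    # Dropping whitespace-only text nodes lets pretty_print re-indent the whole tree.
    parser = etree.XMLParser(remove_blank_text=True)
    root = etree.parse(BytesIO(pom_bytes), parser).getroot()

    parent = etree.Element("parent")
    for tag, text in (("groupId", "org.myorg"),
                      ("artifactId", "myorg-master-pom"),
                      ("version", "1.0.0")):
        etree.SubElement(parent, tag).text = text

    root.insert(0, parent)
    return etree.tostring(root, pretty_print=True).decode("utf-8")


result = add_parent(SAMPLE_POM)
print(result)
```

Without the custom parser, the whitespace already present between the existing elements is kept as text, and the serializer leaves the newly inserted element on one line.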

Related

Python -lxml xpath returns empty list

I am reading an xliff file and planning to retrieve specific element. I tried to print all the elements using
from lxml import etree
with open('path\to\file\.xliff', 'r',encoding = 'utf-8') as xml_file:
    tree = etree.parse(xml_file)
root = tree.getroot()
for element in root.iter():
    print("child", element)
The output was
child <Element {urn:oasis:names:tc:xliff:document:2.0}segment at 0x6b8f9c8>
child <Element {urn:oasis:names:tc:xliff:document:2.0}source at 0x6b8f908>
When I tried to get the specific element (with the help of many posts here) - source tag
segment = tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
it returns an empty list. Can someone tell me how to retrieve it properly.
Input :
<?xml version='1.0' encoding='UTF-8'?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0">
<segment id = 1>
<source>
Hello world
</source>
</segment>
<segment id = 2 >
<source>
2nd statement
</source>
</segment>
</xliff>
I want to get the values of segment and its corresponding source
This code,
tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
is not accepted by lxml ("lxml.etree.XPathEvalError: Invalid expression"). You need to use findall().
The following works (in the XML sample, the segment elements are children of xliff):
from lxml import etree
tree = etree.parse("test.xliff") # XML in the question; ill-formed attributes corrected
segment = tree.findall('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
However, the real XML is apparently more complex (segment is not a direct child of xliff). Then you need to add .// to search the whole tree:
segment = tree.findall('.//{urn:oasis:names:tc:xliff:document:2.0}segment')
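As a rough sketch of getting each segment together with its source text, the example below uses a prefix-to-URI mapping instead of spelling out the namespace in every path. It runs on the standard-library xml.etree.ElementTree, whose findall() takes the same path syntax; lxml's findall() accepts a namespaces mapping as well. The prefix "x" and the names XLIFF, NS and pairs are arbitrary choices for this illustration, and the sample is the XML from the question with the attribute values quoted.

```python
import xml.etree.ElementTree as ET

# The question's sample, with the ill-formed attributes quoted.
XLIFF = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0">
  <segment id="1">
    <source>
      Hello world
    </source>
  </segment>
  <segment id="2">
    <source>
      2nd statement
    </source>
  </segment>
</xliff>"""

# Arbitrary prefix mapped to the document's default namespace.
NS = {"x": "urn:oasis:names:tc:xliff:document:2.0"}

root = ET.fromstring(XLIFF)

# ".//" searches the whole tree, so this also works when segment is nested deeper.
pairs = [
    (segment.get("id"), segment.find("x:source", NS).text.strip())
    for segment in root.findall(".//x:segment", NS)
]
print(pairs)
```

With lxml, the same mapping can also be passed to xpath(), for example tree.xpath('//x:segment', namespaces=NS); XPath itself does not understand the {uri}tag form, which is why the original call failed.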

Error while parsing xml file in python

This is the xml file I am trying to parse. This file does not have a root tag.
<data txt="some0" txt1 = "some1" txt2 = "some2" >
<data2>
< bank = "SBI" bank2 = "SBI2" >
<data2>
<data3>
<branch = "bang1" branch = bang"2" >
<data3>
<data>
My script contains below lines. The below can be used to get the specific data after parsing it.
data = re.findall("<data txt=.*?</data>", re.DOTALL)
tree = ElementTree.fromstringlist(data)
I am unable to parse this file because it does not have a root tag. Please help: how can I parse a file that has no root tag?
As pointed out in a comment already, you can just parse the whole thing. If the missing root element is the problem, you can grab the contents of the file as a string and then add an arbitrary root tag at the beginning and the end.
stringdata = "<myroot>%s</myroot>" % stringdata
and then parse the string.
EDIT:
In response to comment.
If you have one string, you'll want fromstring, but you'll almost certainly get the same error. Something else is going on. Try this ...
from xml.etree import ElementTree
stringdata = "<myroot>%s</myroot>" % stringdata
tree = ElementTree.fromstring(stringdata)
Then get what you need from tree.
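A minimal, self-contained sketch of the wrapping approach follows. It assumes the fragments themselves are well-formed and that the text does not start with an XML declaration (a declaration inside the wrapper would make the parse fail). The function name parse_fragments and the sample data are hypothetical.

```python
from xml.etree import ElementTree


def parse_fragments(text, root_tag="myroot"):
    """Wrap several top-level elements in one synthetic root and parse them."""
    wrapped = "<%s>%s</%s>" % (root_tag, text, root_tag)
    return ElementTree.fromstring(wrapped)


# Hypothetical input: two sibling elements with no enclosing root.
fragments = '<data txt="some0"><bank name="SBI"/></data><data txt="some1"/>'

root = parse_fragments(fragments)
txt_values = [data.get("txt") for data in root.findall("data")]
print(txt_values)
```

If the parser still reports an error after wrapping, the fragments are probably not well-formed themselves (unquoted attribute values, unclosed tags), and no wrapper will fix that.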

How can i do replace a child element(s) in ElementTree

I want to replace child elements in one tree with elements from another, based on some criteria. Can I do this using a comprehension? And how do you replace an element in ElementTree?
You can't replace an element through the ElementTree object itself; you can only work with Element objects.
Even when you call ElementTree.find() it's just a shortcut for getroot().find().
So you really need to:
extract the parent element
use comprehension (or whatever you like) on that parent element
The extraction of the parent element can be easy if your target is a root sub-element (just call getroot()) otherwise you'll have to find it.
Unlike the DOM, etree has no explicit multi-document functions. However, you should be able to just move elements freely from one document to another. You may want to call _setroot after doing so.
By calling insert and then remove, you can replace a node in a document.
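The insert-then-remove idea can be sketched with public Element methods only. This is an illustration under the assumption that the parent element is already known; replace_child is a hypothetical helper, not part of ElementTree.

```python
import xml.etree.ElementTree as ET


def replace_child(parent, old, new):
    """Put `new` where `old` sits among the children of `parent`."""
    index = list(parent).index(old)  # position of the element being replaced
    new.tail = old.tail              # keep the whitespace/text that followed it
    parent.insert(index, new)
    parent.remove(old)


root = ET.fromstring(
    '<root><import ref="input2.xml" /><name awesome="true">Chuck</name></root>'
)
replacement = ET.fromstring('<foo><bar>blah blah</bar></foo>')

replace_child(root, root.find('import'), replacement)
tags = [child.tag for child in root]
print(tags)
```

Assigning by index (parent[index] = new) should achieve the same replacement in one step; the two-call version simply mirrors the description above.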
I'm new to python, but I've found a dodgy way to do this:
Input file input1.xml:
<?xml version="1.0" encoding="UTF-8"?>
<root>
<import ref="input2.xml" />
<name awesome="true">Chuck</name>
</root>
Input file input2.xml:
<?xml version="1.0" encoding="UTF-8"?>
<foo>
<bar>blah blah</bar>
</foo>
Python code: (note, messy and hacky)
import os
import xml.etree.ElementTree as ElementTree
def getElementTree(xmlFile):
    print "-- Processing file: '%s' in: '%s'" %(xmlFile, os.getcwd())
    xmlFH = open(xmlFile, 'r')
    xmlStr = xmlFH.read()
    et = ElementTree.fromstring(xmlStr)
    parent_map = dict((c, p) for p in et.getiterator() for c in p)
    # ref: https://stackoverflow.com/questions/2170610/access-elementtree-node-parent-node/2170994
    importList = et.findall('.//import[@ref]')
    for importPlaceholder in importList:
        old_dir = os.getcwd()
        new_dir = os.path.dirname(importPlaceholder.attrib['ref'])
        shallPushd = os.path.exists(new_dir)
        if shallPushd:
            print " pushd: %s" %(new_dir)
            os.chdir(new_dir) # pushd (for relative linking)
        # Recursing to import element from file reference
        importedElement = getElementTree(os.path.basename(importPlaceholder.attrib['ref']))
        # element replacement
        parent = parent_map[importPlaceholder]
        index = parent._children.index(importPlaceholder)
        parent._children[index] = importedElement
        if shallPushd:
            print " popd: %s" %(old_dir)
            os.chdir(old_dir) # popd
    return et
xmlET = getElementTree("input1.xml")
print ElementTree.tostring(xmlET)
gives the output:
-- Processing file: 'input1.xml' in: 'C:\temp\testing'
-- Processing file: 'input2.xml' in: 'C:\temp\testing'
<root>
<foo>
<bar>blah blah</bar>
</foo><name awesome="true">Chuck</name>
</root>
this was concluded with information from:
stackoverflow answer: access ElementTree node parent node
accessing parents from effbot.org

How to comment out an XML Element (using minidom DOM implementation)

I would like to comment out a specific XML element in an xml file. I could just remove the element, but I would prefer to leave it commented out, in case it's needed later.
The code I use at the moment that removes the element looks like this:
from xml.dom import minidom
doc = minidom.parse(myXmlFile)
for element in doc.getElementsByTagName('MyElementName'):
    if element.getAttribute('name') in ['AttribName1', 'AttribName2']:
        element.parentNode.removeChild(element)
f = open(myXmlFile, "w")
f.write(doc.toxml())
f.close()
I would like to modify this so that it comments the element out rather than deleting it.
The following solution does exactly what I want.
from xml.dom import minidom
doc = minidom.parse(myXmlFile)
for element in doc.getElementsByTagName('MyElementName'):
    if element.getAttribute('name') in ['AttrName1', 'AttrName2']:
        parentNode = element.parentNode
        parentNode.insertBefore(doc.createComment(element.toxml()), element)
        parentNode.removeChild(element)
f = open(myXmlFile, "w")
f.write(doc.toxml())
f.close()
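For trying this out without touching a file, here is a small sketch that applies the same steps to an in-memory string. The function comment_out and the sample document are hypothetical. One caveat worth keeping in mind: XML does not allow "--" inside a comment, so an element whose serialized form contains a double hyphen cannot be commented out this way.

```python
from xml.dom import minidom


def comment_out(xml_text, tag_name, names):
    """Replace matching elements with a comment holding their serialized XML."""
    doc = minidom.parseString(xml_text)
    # Iterate over a copy, since the tree is modified inside the loop.
    for element in list(doc.getElementsByTagName(tag_name)):
        if element.getAttribute('name') in names:
            parent = element.parentNode
            parent.insertBefore(doc.createComment(element.toxml()), element)
            parent.removeChild(element)
    return doc.documentElement.toxml()


sample = ('<root><MyElementName name="AttrName1"/>'
          '<MyElementName name="keep"/></root>')
result = comment_out(sample, 'MyElementName', ['AttrName1', 'AttrName2'])
print(result)
```

Parsing the result again shows only the untouched element; the commented one survives as text inside the comment and can be restored by hand later.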
You can do it with BeautifulSoup: read the target tag, create an appropriate comment tag, and replace the target tag with it.
For example, creating comment tag:
from BeautifulSoup import BeautifulSoup
hello = "<!--Comment tag-->"
commentSoup = BeautifulSoup(hello)

Python and ElementTree: return "inner XML" excluding parent element

In Python 2.6 using ElementTree, what's a good way to fetch the XML (as a string) inside a particular element, like what you can do in HTML and javascript with innerHTML?
Here's a simplified sample of the XML node I am starting with:
<label attr="foo" attr2="bar">This is some text and a link in embedded HTML</label>
I'd like to end up with this string:
This is some text and a link in embedded HTML
I've tried iterating over the parent node and concatenating the tostring() of the children, but that gave me only the subnodes:
# returns only subnodes (e.g. and a link)
''.join([et.tostring(sub, encoding="utf-8") for sub in node])
I can hack up a solution using regular expressions, but was hoping there'd be something less hacky than this:
re.sub("</\w+?>\s*?$", "", re.sub("^\s*?<\w*?>", "", et.tostring(node, encoding="utf-8")))
How about:
from xml.etree import ElementTree as ET
xml = '<root>start here<child1>some text<sub1/>here</child1>and<child2>here as well<sub2/><sub3/></child2>end here</root>'
root = ET.fromstring(xml)
def content(tag):
    return tag.text + ''.join(ET.tostring(e) for e in tag)
print content(root)
print content(root.find('child2'))
Resulting in:
start here<child1>some text<sub1 />here</child1>and<child2>here as well<sub2 /><sub3 /></child2>end here
here as well<sub2 /><sub3 />
This is based on the other solutions, but the other solutions did not work in my case (resulted in exceptions) and this one worked:
from xml.etree import ElementTree
from xml.etree.ElementTree import Element
def inner_xml(element: Element):
    return (element.text or '') + ''.join(ElementTree.tostring(e, 'unicode') for e in element)
Use it the same way as in Mark Tolonen's answer.
The following worked for me:
from xml.etree import ElementTree as etree
xml = '<root>start here<child1>some text<sub1/>here</child1>and<child2>here as well<sub2/><sub3/></child2>end here</root>'
dom = etree.XML(xml)
(dom.text or '') + ''.join(map(etree.tostring, dom)) + (dom.tail or '')
# 'start here<child1>some text<sub1 />here</child1>and<child2>here as well<sub2 /><sub3 /></child2>end here'
dom.text or '' is used to get the text at the start of the root element. If there is no text dom.text is None.
Note that the result is not a well-formed XML document on its own: a well-formed document has exactly one root element.
Have a look at the ElementTree docs about mixed content.
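As a rough illustration of the mixed-content model, the sketch below prints where each piece of text from the sample string ends up. It uses only the standard-library ElementTree; the dictionary name pieces is an arbitrary choice.

```python
from xml.etree import ElementTree as ET

xml = ('<root>start here<child1>some text<sub1/>here</child1>and'
       '<child2>here as well<sub2/><sub3/></child2>end here</root>')
root = ET.fromstring(xml)
child1 = root.find('child1')
sub1 = child1.find('sub1')

# .text is the text before an element's first child; .tail is the text that
# follows the element's closing tag, up to the next sibling or the parent's end.
pieces = {
    'root.text': root.text,                   # 'start here'
    'child1.text': child1.text,               # 'some text'
    'sub1.tail': sub1.tail,                   # 'here'
    'child1.tail': child1.tail,               # 'and'
    'child2.tail': root.find('child2').tail,  # 'end here'
    'root.tail': root.tail,                   # None: nothing follows the root
}
for name, value in pieces.items():
    print(name, repr(value))
```

Because tostring() includes an element's tail by default, joining the serialized children after the parent's text is enough to rebuild the inner XML; the root's own tail is None here, which is why the "or ''" guard matters.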
Using Python 2.6.5, Ubuntu 10.04
