Modify XML node using Python - python

I am taking an XML file as input and have to search it for a keyword, e.g. GENTEST05.
If it is found, I need to pick up its parent node (in this example, <ScriptElement>) and then replace the complete node <ScriptElement>blahblah</ScriptElement> with new content.
...
...
<ScriptElement>
<ScriptElement>
<ScriptElement>
<ElementData xsi:type="anyData">
<DeltaTime>
<Area>
<Datatype>USER PROMPT [GENTEST05]</Datatype>
<Description />
<Multipartmessage>False</Multipartmessage>
<Comment>false</Comment>
</ElementData>
</ScriptElement>
<ScriptElement>
<ScriptElement>
...
...
...
I am trying to do this using BeautifulSoup. This is what I've done so far, but I can't find a proper way to proceed. Suggestions using ElementTree or anything other than BeautifulSoup are welcome.
import sys
from BeautifulSoup import BeautifulStoneSoup as bs
xmlsoup = bs(open('file_xml' , 'r'))
a = raw_input('Enter Text')
paraText = xmlsoup.findAll(text=a)
print paraText
print paraText.findParent()

OK, here is some sample code to get you started. I used ElementTree because it's a built-in module and quite suitable for this type of task.
Here is the XML file I used:
<?xml version="1.0" ?>
<Script>
<ScriptElement/>
<ScriptElement/>
<ScriptElement>
<ElementData>
<DeltaTime/>
<Area/>
<Datatype>USER PROMPT [GENTEST05]</Datatype>
<Description/>
<Multipartmessage>False</Multipartmessage>
<Comment>false</Comment>
</ElementData>
</ScriptElement>
<ScriptElement/>
<ScriptElement/>
</Script>
Here is the Python program:
import sys
import xml.etree.ElementTree as ElementTree
tree = ElementTree.parse("test.xml")
root = tree.getroot()
# The keyword to find and remove
keyword = "GENTEST05"
# Iterate over a copy of the children, because root is modified in the loop
for element in list(root):
    for sub in element.iter():
        if sub.text and keyword in sub.text:
            root.remove(element)
            print(ElementTree.tostring(root, encoding="unicode"))
            sys.exit()
I have kept the program simple so that you can improve on it. Since your XML has one root node, I am assuming you want to remove the whole branch containing the keyword-matched element, right up to (but not including) the root. In ElementTree, you can call root.remove() to remove the <ScriptElement> element that is the ancestor of the keyword-matched element.
This is just to get you started: it will only remove the first matching element, then print the resulting tree and quit.
Output:
<Script>
<ScriptElement />
<ScriptElement />
<ScriptElement />
<ScriptElement />
</Script>
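The question asks for the matched node to be replaced rather than removed. Here is a minimal sketch of that variation, assuming the same kind of input as above (shortened here). The helper name replace_matching and the <Replaced>new content</Replaced> payload are made up for illustration; substitute your own replacement XML.

```python
import xml.etree.ElementTree as ET

# Shortened version of the test file above.
XML = """<Script>
<ScriptElement/>
<ScriptElement>
<ElementData>
<Datatype>USER PROMPT [GENTEST05]</Datatype>
<Comment>false</Comment>
</ElementData>
</ScriptElement>
<ScriptElement/>
</Script>"""


def replace_matching(root, keyword, new_element_xml):
    """Replace each direct child of root whose subtree text contains keyword.

    Hypothetical helper; returns the number of children replaced.
    """
    replaced = 0
    # Enumerate a copy of the children, because root is modified in the loop.
    for index, child in enumerate(list(root)):
        if any(sub.text and keyword in sub.text for sub in child.iter()):
            new_child = ET.fromstring(new_element_xml)
            new_child.tail = child.tail  # keep the surrounding whitespace
            root[index] = new_child      # swap in place, position preserved
            replaced += 1
    return replaced


root = ET.fromstring(XML)
count = replace_matching(
    root,
    "GENTEST05",
    "<ScriptElement><Replaced>new content</Replaced></ScriptElement>",
)
print(count)
print(ET.tostring(root, encoding="unicode"))
```

Assigning to root[index] keeps the new element at the same position as the old one, which a remove() followed by append() would not.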

Related

Facing an error while modifying XML file with python

I am parsing an XML file and trying to delete an empty node, but I am receiving the following error:
ValueError: list.remove(x): x not in list
The XML file is as follows:
<toc>
<topic filename="GUID-5B8DE7B7-879F-45A4-88E0-732155904029.xml" docid="GUID-5B8DE7B7-879F-45A4-88E0-732155904029" TopicTitle="Notes, cautions, and warnings" />
<topic filename="GUID-89943A8D-00D3-4263-9306-CDC944609F2B.xml" docid="GUID-89943A8D-00D3-4263-9306-CDC944609F2B" TopicTitle="HCI Deployment with Windows Server">
<childTopics>
<topic filename="GUID-A3E5EA96-2110-46FF-9251-2291DF755F50.xml" docid="GUID-A3E5EA96-2110-46FF-9251-2291DF755F50" TopicTitle="Installing the OMIMSWAC license" />
<topic filename="GUID-7C4D616D-0D9A-4AE1-BE0F-EC6FC9DAC87E.xml" docid="GUID-7C4D616D-0D9A-4AE1-BE0F-EC6FC9DAC87E" TopicTitle="Managing Microsoft HCI-based clusters">
<childTopics>
</childTopics>
</topic>
</childTopics>
</topic>
</toc>
Kindly note that this is just an example of the format of my XML file. In this file, I want to remove the empty <childTopics> tag, but I am getting an error. My current code is:
import xml.etree.ElementTree as ET
tree = ET.parse("toc2 - Copy.xml")
root = tree.getroot()
node_to_remove = root.findall('.//childTopics//childTopics')
for node in node_to_remove:
    root.remove(node)
You need to call remove on the node's immediate parent, not on root. This is tricky using xml.etree, but if instead you use lxml.etree you can write:
import lxml.etree as ET
tree = ET.parse("data.xml")
root = tree.getroot()
node_to_remove = root.findall('.//childTopics//childTopics')
for node in node_to_remove:
    node.getparent().remove(node)
print(ET.tostring(tree).decode())
Nodes in xml.etree do not have a getparent() method. If you're unable to use lxml, you'll need to look into other solutions for finding the parent of a node; this question has some discussion on that topic.
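If you have to stay with the standard library, one common workaround is to build a child-to-parent dictionary once and look the parent up from there. The sketch below assumes a shortened version of the XML from the question (the TopicTitle values are placeholders), and parent_of is just a name chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the toc file in the question.
XML = """<toc>
<topic TopicTitle="A" />
<topic TopicTitle="B">
<childTopics>
<topic TopicTitle="C" />
<topic TopicTitle="D">
<childTopics>
</childTopics>
</topic>
</childTopics>
</topic>
</toc>"""

root = ET.fromstring(XML)

# xml.etree elements do not know their parent, so record it ourselves.
parent_of = {child: parent for parent in root.iter() for child in parent}

removed = 0
for node in root.findall(".//childTopics//childTopics"):
    # Remove from the immediate parent, not from root.
    parent_of[node].remove(node)
    removed += 1

print(removed)
print(ET.tostring(root, encoding="unicode"))
```

The map is a snapshot: if you add or move elements afterwards, rebuild it before relying on it again.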

Reading xml with lxml lib getting strange string from xmlns tag

I am writing a program to work on an XML file and change it. But when I try to access any part of it, I get an extra string in front of the tag name.
My xml file:
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<members>sbaa__ApprovalCondition__c.ExternalID__c</members>
<members>sbaa__ApprovalRule__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>
And I have my code:
from lxml import etree
import sys
tree = etree.parse('package.xml')
root = tree.getroot()
print( root[0][0].tag )
As output I expect to see members but I get something like this:
{http://soap.sforce.com/2006/04/metadata}members
Why do I see that url and how to stop it from showing up?
You have defined a default namespace (Wikipedia, lxml tutorial). When defined, it is a part of every child tag.
If you want to print the tag without the namespace, it's easy:
tag = root[0][0].tag
print(tag[tag.find('}')+1:])
If you want to remove the namespace from XML, see this question.
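The same idea can be wrapped in a small helper. This is only a sketch: it uses the standard library's xml.etree.ElementTree instead of lxml (both expose namespaced tags in the same {uri}name form), the XML is a shortened copy of the file in the question, and local_name and the md prefix are names chosen for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of package.xml from the question.
XML = """<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>"""


def local_name(tag):
    """Return the tag without its '{namespace}' part, if it has one."""
    return tag.rpartition("}")[2]


root = ET.fromstring(XML)
first = root[0][0]
print(first.tag)              # full name, including the namespace URI
print(local_name(first.tag))  # just the element name

# When searching, map a prefix of your choice to the namespace URI.
NS = {"md": "http://soap.sforce.com/2006/04/metadata"}
names = [local_name(el.tag) for el in root.find("md:types", NS)]
print(names)
```

Stripping the namespace is fine for display, but lookups still need the full name or a prefix map as shown in the last lines.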

Python -lxml xpath returns empty list

I am reading an xliff file and planning to retrieve a specific element. I tried to print all the elements using
from lxml import etree
# Raw string, so that \t and \f in the Windows path are not treated as escapes
with open(r'path\to\file\.xliff', 'r', encoding='utf-8') as xml_file:
    tree = etree.parse(xml_file)
    root = tree.getroot()
    for element in root.iter():
        print("child", element)
The output was
child <Element {urn:oasis:names:tc:xliff:document:2.0}segment at 0x6b8f9c8>
child <Element {urn:oasis:names:tc:xliff:document:2.0}source at 0x6b8f908>
When I tried to get the specific element (with the help of many posts here) - source tag
segment = tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
it returns an empty list. Can someone tell me how to retrieve it properly?
Input :
<?xml version='1.0' encoding='UTF-8'?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0">
<segment id = 1>
<source>
Hello world
</source>
</segment>
<segment id = 2 >
<source>
2nd statement
</source>
</segment>
</xliff>
I want to get the values of segment and its corresponding source.
This code,
tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
is not accepted by lxml ("lxml.etree.XPathEvalError: Invalid expression"). You need to use findall().
The following works (in the XML sample, the segment elements are children of xliff):
from lxml import etree
tree = etree.parse("test.xliff") # XML in the question; ill-formed attributes corrected
segment = tree.findall('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
However, the real XML is apparently more complex (segment is not a direct child of xliff). Then you need to add .// to search the whole tree:
segment = tree.findall('.//{urn:oasis:names:tc:xliff:document:2.0}segment')
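To get each segment together with its source text, a prefix map is usually more readable than repeating the full URI. Below is a sketch using the standard library's xml.etree.ElementTree (lxml's findall() accepts the same kind of mapping, and lxml's xpath() takes one via its namespaces= keyword). The <file> and <unit> wrappers, the quoted id attributes and the x prefix are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Assumed structure: segments nested below the root, attributes quoted.
XML = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0">
<file id="f1">
<unit id="u1">
<segment id="1">
<source>
Hello world
</source>
</segment>
<segment id="2">
<source>
2nd statement
</source>
</segment>
</unit>
</file>
</xliff>"""

# The prefix "x" is arbitrary; only the URI has to match the document.
NS = {"x": "urn:oasis:names:tc:xliff:document:2.0"}

root = ET.fromstring(XML)
pairs = []
for segment in root.findall(".//x:segment", NS):
    source = segment.find("x:source", NS)
    text = source.text.strip() if source is not None and source.text else ""
    pairs.append((segment.get("id"), text))

print(pairs)
```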

Python ElementTree won't update new file after parsing

I am using ElementTree to parse an attribute's value in an XML file and write a new XML file. The script prints to the console and writes a new file, but the new file does not contain any of the changes. Please help me understand what I am doing wrong. Here are the XML and Python code:
XML
<?xml version="1.0"?>
<!--
-->
<req action="get" msg="1" rank="1" rnklst="1" runuf="0" status="1" subtype="list" type="60" univ="IL" version="fhf.12.000.00" lang="ENU" chunklimit="1000" Times="1">
<flds>
<f i="bond(long) hff" aggregationtype="WeightedAverage" end="2016-02-29" freq="m" sid="fgg" start="2016-02-29"/>
<f i="bond(short) ggg" aggregationtype="WeightedAverage" end="2016-02-29" freq="m" sid="fhf" start="2016-02-29"/>
</flds>
<dat>
<r i="hello" CalculationType="3" Calculate="1" />
</dat>
</req>
Python
import xml.etree.ElementTree as ET
with open('test.xml', 'rt') as f:
    tree = ET.parse(f)
for node in tree.iter('r'):
    port_id = node.attrib.get('i')
    new_port_id = port_id.replace(port_id, "new")
    print node
tree.write('./new_test.xml')
When you get the attribute i, and assign it to port_id, you just have a regular Python string. Calling replace on it is just the Python string .replace() method.
You want to use the .set() method of the etree node:
for node in tree.iter('r'):
    node.set('i', "new")
    print(node)
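To convince yourself that the change reaches the written output, here is a small self-contained sketch. It assumes a cut-down version of the XML from the question and writes to an in-memory buffer instead of ./new_test.xml so that it runs without touching the disk.

```python
import io
import xml.etree.ElementTree as ET

# Cut-down version of the XML in the question.
XML = """<req action="get">
<dat>
<r i="hello" CalculationType="3" Calculate="1" />
</dat>
</req>"""

tree = ET.parse(io.StringIO(XML))

for node in tree.iter("r"):
    # set() changes the attribute on the element itself,
    # unlike str.replace(), which only returns a new string.
    node.set("i", "new")

# Write to memory here; pass a file name to write to disk instead.
buffer = io.BytesIO()
tree.write(buffer)
written = buffer.getvalue().decode()
print(written)

# Parse the output again to check the attribute was really saved.
reparsed = ET.fromstring(written)
print(reparsed.find("dat/r").get("i"))
```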

List only one category Python xml

I am trying to write a Python program that uses DOM to read an XML file and print another XML structure that lists only the nodes whose category attribute has a particular value, "fun".
<?xml version="1.0" encoding="ISO-8859-1"?>
<website>
<url category="fun">
<title>Fun world</title>
<author>Jack</author>
<year>2010</year>
<price>100.00</price>
</url>
<url category="entertainment">
<title>Fun world</title>
<author>Jack</author>
<year>2010</year>
<price>100.00</price>
</url>
</website>
I couldn't select the url elements having category="fun".
I tried this code:
for n in dom.getElementsByTagName('url'):
    s = n.attribute['category']
    if (s.value == "fun"):
        print n.toxml()
Can you help me debug my code?
nb: One of your tags opens "Website" and attempts to close "website" - so you'll want to fix that one...
You've mentioned lxml.
from lxml import etree as et
root = et.fromstring(xml)
# XPath selects attributes with @
fun = root.xpath('/Website/url[@category="fun"]')
for node in fun:
    print(et.tostring(node).decode())
Use getAttribute:
for n in dom.getElementsByTagName('url'):
    if (n.getAttribute('category') == "fun"):
        print(n.toxml())
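Since the goal was to print a new XML structure containing only the matching nodes, here is a sketch that goes one step further with xml.dom.minidom. It assumes a shortened version of the XML from the question (the "Other world" title is made up so that the two entries differ) and that the output should reuse <website> as its root.

```python
from xml.dom import minidom

# Shortened stand-in for the XML in the question.
XML = """<website>
<url category="fun"><title>Fun world</title></url>
<url category="entertainment"><title>Other world</title></url>
</website>"""

dom = minidom.parseString(XML)

# getAttribute() returns "" when the attribute is missing, so this is safe.
fun_nodes = [
    n for n in dom.getElementsByTagName("url")
    if n.getAttribute("category") == "fun"
]

# Build a new document that holds only the selected nodes.
out = minidom.Document()
out_root = out.createElement("website")
out.appendChild(out_root)
for n in fun_nodes:
    # importNode() makes a deep copy owned by the new document.
    out_root.appendChild(out.importNode(n, True))

result = out.toxml()
print(result)
```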
