Facing an error while modifying XML file with python

I am parsing an XML file and trying to delete an empty node, but I am receiving the following error:
ValueError: list.remove(x): x not in list
The XML file is as follows:
<toc>
<topic filename="GUID-5B8DE7B7-879F-45A4-88E0-732155904029.xml" docid="GUID-5B8DE7B7-879F-45A4-88E0-732155904029" TopicTitle="Notes, cautions, and warnings" />
<topic filename="GUID-89943A8D-00D3-4263-9306-CDC944609F2B.xml" docid="GUID-89943A8D-00D3-4263-9306-CDC944609F2B" TopicTitle="HCI Deployment with Windows Server">
<childTopics>
<topic filename="GUID-A3E5EA96-2110-46FF-9251-2291DF755F50.xml" docid="GUID-A3E5EA96-2110-46FF-9251-2291DF755F50" TopicTitle="Installing the OMIMSWAC license" />
<topic filename="GUID-7C4D616D-0D9A-4AE1-BE0F-EC6FC9DAC87E.xml" docid="GUID-7C4D616D-0D9A-4AE1-BE0F-EC6FC9DAC87E" TopicTitle="Managing Microsoft HCI-based clusters">
<childTopics>
</childTopics>
</topic>
</childTopics>
</topic>
</toc>
Note that this is just an example of the format of my XML file. In this file, I want to remove the empty <childTopics> tag, but I am getting an error. My current code is:
import xml.etree.ElementTree as ET
tree = ET.parse("toc2 - Copy.xml")
root = tree.getroot()
node_to_remove = root.findall('.//childTopics//childTopics')
for node in node_to_remove:
    root.remove(node)

You need to call remove on the node's immediate parent, not on root. This is tricky using xml.etree, but if instead you use lxml.etree you can write:
import lxml.etree as ET
tree = ET.parse("data.xml")
root = tree.getroot()
node_to_remove = root.findall('.//childTopics//childTopics')
for node in node_to_remove:
    node.getparent().remove(node)
print(ET.tostring(tree).decode())
Nodes in xml.etree do not have a getparent() method. If you're unable to use lxml, you'll need to look into other solutions for finding the parent of a node; this question has some discussion on that topic.
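If lxml is not an option, one common workaround is to build a child-to-parent lookup yourself. The following is a minimal sketch, not the original poster's code: it assumes a trimmed-down copy of the XML from the question (attributes shortened), and the name parent_map is just an illustrative choice.

```python
import xml.etree.ElementTree as ET

# Trimmed-down version of the XML from the question (attributes shortened).
XML = """<toc>
  <topic TopicTitle="HCI Deployment with Windows Server">
    <childTopics>
      <topic TopicTitle="Installing the OMIMSWAC license" />
      <topic TopicTitle="Managing Microsoft HCI-based clusters">
        <childTopics>
        </childTopics>
      </topic>
    </childTopics>
  </topic>
</toc>"""

root = ET.fromstring(XML)

# xml.etree elements have no getparent(), so map each child to its parent.
parent_map = {child: parent for parent in root.iter() for child in parent}

# Remove each nested childTopics element from its immediate parent.
for node in root.findall('.//childTopics//childTopics'):
    parent_map[node].remove(node)

result = ET.tostring(root, encoding="unicode")
print(result)
```

The map is built once before any removal, so it stays valid for the whole loop as long as only the matched nodes are removed.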

Related

Parsing XML Attributes with Python

I am trying to parse out specific attributes from my XML (some sensitive values have been redacted). I have a bunch of XML files, all with similar formats. I already know how to loop through them individually, but I am having trouble parsing out the specific attributes.
I need the text in these attributes:
name="text1" from
<project logLevel="verbose" version="2.0" mainModule="Main" name="text1">
destinationDir="/text2" from
<put label="Put Files" destinationDir="/Trigger/FPDMMT_INBOUND">
destDir="/text3" from
<copy disabled="false" version="1.0" label="Archive Files" destDir="/text3" suffix="">
I am using
import csv
import os
import re
import xml.etree.ElementTree as ET
tree = ET.parse(XMLfile_path)
item = tree.getroot()[0]
root = tree.getroot()
print (item.get("name"))
print (root.get("name"))
This outputs:
Main
text1
item.get reads from the element at index [0] under the root, which is <module
root.get reads from the root element itself, which is <project
I know there's a way to search for exactly the right part of the root/tree with something like:
test = root.find('./project/module/ftp/put')
print (test.get("destinationDir"))
I need to be able to jump directly to the thing I need and output the attributes I need.
Any help would be appreciated
Thanks.

Simplified copy of your XML:
xml = '''<project logLevel="verbose" version="2.0" mainModule="Main" name="hidden">
<module name="Main">
<createWorkspace version="1.0"/>
<ftp version="1.0" label="FTP connection to PRD">
<put label="Put Files" destinationDir="destination1">
</put>
</ftp>
<ftp version="1.0" label="FTP connection to PRD">
<put label="Put Files" destinationDir="destination2">
</put>
</ftp>
<copy disabled="false" destDir="destination3">
</copy>
</module>
</project>
'''
# solution using ETree
from xml.etree import ElementTree as ET
root = ET.fromstring(xml)
name = root.get('name')
ftp_destination_dir1 = root.findall('./module/ftp/put')[0].get('destinationDir')
ftp_destination_dir2 = root.findall('./module/ftp/put')[1].get('destinationDir')
copy_destination_dir = root.find('./module/copy').get('destDir')
print(name)
print(ftp_destination_dir1)
print(ftp_destination_dir2)
print(copy_destination_dir)
# solution using lxml
from lxml import etree as et
root = et.fromstring(xml)
name = root.get('name')
ftp_destination_dirs = root.xpath('./module/ftp/put/@destinationDir')
copy_destination_dir = root.xpath('./module/copy/@destDir')[0]
print(name)
print(ftp_destination_dirs[0])
print(ftp_destination_dirs[1])
print(copy_destination_dir)
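If the exact path to an element is not known in advance, or differs between files, the elements can also be located by tag name anywhere in the tree. This is a rough sketch under that assumption, using a further shortened copy of the XML above; the variable names put_dirs and copy_dirs are illustrative only.

```python
from xml.etree import ElementTree as ET

# Shortened copy of the simplified XML above.
xml = '''<project name="hidden">
<module name="Main">
<ftp><put destinationDir="destination1"/></ftp>
<ftp><put destinationDir="destination2"/></ftp>
<copy destDir="destination3"/>
</module>
</project>'''

root = ET.fromstring(xml)

# iter('put') walks the whole tree and yields every <put>, at any depth.
put_dirs = [put.get('destinationDir') for put in root.iter('put')]

# './/copy[@destDir]' matches every <copy> that actually has a destDir attribute.
copy_dirs = [c.get('destDir') for c in root.findall('.//copy[@destDir]')]

print(put_dirs)
print(copy_dirs)
```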

Reading xml with lxml lib getting strange string from xmlns tag

I am writing a program that reads an XML file and changes it. But when I try to access any part of it, I get some extra text in the tag name.
My xml file:
<?xml version="1.0" encoding="UTF-8"?>
<Package xmlns="http://soap.sforce.com/2006/04/metadata">
<types>
<members>sbaa__ApprovalChain__c.ExternalID__c</members>
<members>sbaa__ApprovalCondition__c.ExternalID__c</members>
<members>sbaa__ApprovalRule__c.ExternalID__c</members>
<name>CustomField</name>
</types>
<version>40.0</version>
</Package>
And I have my code:
from lxml import etree
import sys
tree = etree.parse('package.xml')
root = tree.getroot()
print( root[0][0].tag )
As output I expect to see members but I get something like this:
{http://soap.sforce.com/2006/04/metadata}members
Why do I see that URL, and how do I stop it from showing up?

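Your root element declares a default namespace (the xmlns="..." attribute on Package). Once declared, it becomes part of the tag name of that element and of every descendant that does not override it.
If you want to print the tag without the namespace, you can slice off everything up to the closing brace: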
tag = root[0][0].tag
print(tag[tag.find('}')+1:])
If you want to remove the namespace from XML, see this question.
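The slicing approach can be wrapped in a small helper that also copes with tags that carry no namespace. This is only a sketch: local_name is a hypothetical helper name, and the example uses the standard library's xml.etree.ElementTree (which reports tags in the same {namespace}tag form) with a shortened copy of the XML so that it runs without lxml installed.

```python
import xml.etree.ElementTree as ET

# Shortened copy of package.xml from the question.
XML = """<Package xmlns="http://soap.sforce.com/2006/04/metadata">
  <types>
    <members>sbaa__ApprovalChain__c.ExternalID__c</members>
    <name>CustomField</name>
  </types>
  <version>40.0</version>
</Package>"""


def local_name(tag):
    """Return the tag without its '{namespace}' prefix, if it has one."""
    return tag.rpartition('}')[2]


root = ET.fromstring(XML)
qualified = root[0][0].tag

print(qualified)              # tag including the namespace
print(local_name(qualified))  # tag without the namespace
```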

Python - lxml xpath returns empty list

I am reading an XLIFF file and want to retrieve a specific element. I tried to print all the elements using
from lxml import etree
with open('path\to\file\.xliff', 'r',encoding = 'utf-8') as xml_file:
    tree = etree.parse(xml_file)
    root = tree.getroot()
    for element in root.iter():
        print("child", element)
The output was
child <Element {urn:oasis:names:tc:xliff:document:2.0}segment at 0x6b8f9c8>
child <Element {urn:oasis:names:tc:xliff:document:2.0}source at 0x6b8f908>
When I tried to get a specific element, the segment tag (with the help of many posts here),
segment = tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
it returns an empty list. Can someone tell me how to retrieve it properly?
Input :
<?xml version='1.0' encoding='UTF-8'?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0">
<segment id = 1>
<source>
Hello world
</source>
</segment>
<segment id = 2 >
<source>
2nd statement
</source>
</segment>
</xliff>
I want to get the values of segment and its corresponding source

This code,
tree.xpath('{urn:oasis:names:tc:xliff:document:2.0}segment')
is not accepted by lxml ("lxml.etree.XPathEvalError: Invalid expression"), because the {namespace}tag notation is not XPath syntax. You need to use findall(), which does understand it.
The following works (in the XML sample, the segment elements are children of xliff):
from lxml import etree
tree = etree.parse("test.xliff") # XML in the question; ill-formed attributes corrected
segment = tree.findall('{urn:oasis:names:tc:xliff:document:2.0}segment')
print(segment)
However, the real XML is apparently more complex (segment is not a direct child of xliff). Then you need to add .// to search the whole tree:
segment = tree.findall('.//{urn:oasis:names:tc:xliff:document:2.0}segment')
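An alternative to spelling out the full namespace URI in every path is to pass a prefix-to-URI mapping to findall(). The sketch below is an illustration rather than the answerer's code: it uses the standard library's xml.etree.ElementTree so it runs without lxml (lxml's findall() accepts the same kind of mapping), the prefix x is an arbitrary choice, and the XML is the sample from the question with the ill-formed id attributes quoted.

```python
import xml.etree.ElementTree as ET

# Sample from the question, with the id attributes quoted so it is well-formed.
XML = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0">
  <segment id="1">
    <source>Hello world</source>
  </segment>
  <segment id="2">
    <source>2nd statement</source>
  </segment>
</xliff>"""

# 'x' is an arbitrary prefix chosen for this example; only the URI must match.
NS = {'x': 'urn:oasis:names:tc:xliff:document:2.0'}

root = ET.fromstring(XML)

# Collect each segment id together with the text of its <source> child.
pairs = []
for segment in root.findall('.//x:segment', NS):
    source = segment.find('x:source', NS)
    pairs.append((segment.get('id'), source.text.strip()))

print(pairs)
```

With lxml, the same mapping can also be passed to xpath() through its namespaces keyword, for example tree.xpath('//x:segment', namespaces=NS).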

Python ElementTree won't update new file after parsing

I am using ElementTree to change an attribute's value in an XML file and write the result to a new XML file. The script prints the updated value to the console and writes a new file, but the new file does not contain any of the changes. Please help me understand what I am doing wrong. Here are the XML and the Python code:
XML
<?xml version="1.0"?>
<!--
-->
<req action="get" msg="1" rank="1" rnklst="1" runuf="0" status="1" subtype="list" type="60" univ="IL" version="fhf.12.000.00" lang="ENU" chunklimit="1000" Times="1">
<flds>
<f i="bond(long) hff" aggregationtype="WeightedAverage" end="2016-02-29" freq="m" sid="fgg" start="2016-02-29"/>
<f i="bond(short) ggg" aggregationtype="WeightedAverage" end="2016-02-29" freq="m" sid="fhf" start="2016-02-29"/>
</flds>
<dat>
<r i="hello" CalculationType="3" Calculate="1" />
</dat>
</req>
Python
import xml.etree.ElementTree as ET
with open('test.xml', 'rt') as f:
    tree = ET.parse(f)
for node in tree.iter('r'):
    port_id = node.attrib.get('i')
    new_port_id = port_id.replace(port_id, "new")
    print(node)
tree.write('./new_test.xml')

When you get the attribute i and assign it to port_id, you just have a regular Python string. Calling replace on it is just the Python string .replace() method, which returns a new string and never touches the XML node, so the tree that gets written is unchanged.
You want to use the .set() method of the etree node:
for node in tree.iter('r'):
    node.set('i', "new")
    print(node)
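To confirm that the change really ends up in the serialized output, here is a small self-contained sketch. It assumes a cut-down copy of the XML from the question and serializes to a string with ET.tostring() instead of writing a file, purely to keep the example easy to run.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the XML from the question.
XML = """<req action="get">
  <dat>
    <r i="hello" CalculationType="3" Calculate="1" />
  </dat>
</req>"""

root = ET.fromstring(XML)

# set() modifies the element itself, so the change survives serialization.
for node in root.iter('r'):
    node.set('i', 'new')

updated = ET.tostring(root, encoding='unicode')
print(updated)
```

When working with a parsed file as in the question, calling tree.write('./new_test.xml') after the loop should persist the same change.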

Modify XML node using Python

I am taking an XML file as input and have to search it for a keyword, e.g. GENTEST05.
If it is found, I need to pick up its parent node (in this example I want to pick up <ScriptElement>) and then replace the complete node <ScriptElement>blahblah</ScriptElement> with new content.
...
...
<ScriptElement>
<ScriptElement>
<ScriptElement>
<ElementData xsi:type="anyData">
<DeltaTime>
<Area>
<Datatype>USER PROMPT [GENTEST05]</Datatype>
<Description />
<Multipartmessage>False<Multipartmessage>
<Comment>false</Comment>
</ElementData>
</ScriptElement>
<ScriptElement>
<ScriptElement>
...
...
...
I am trying to do this using BeautifulSoup. This is what I've done so far, but I can't find a proper way to proceed. Suggestions using ElementTree or anything other than BeautifulSoup are also welcome.
import sys
from BeautifulSoup import BeautifulStoneSoup as bs
xmlsoup = bs(open('file_xml' , 'r'))
a = raw_input('Enter Text')
paraText = xmlsoup.findAll(text=a)
print paraText
print paraText.findParent()

Ok here is some sample code to get you started. I used ElementTree because it's a builtin module and quite suitable for this type of task.
Here is the XML file I used:
<?xml version="1.0" ?>
<Script>
<ScriptElement/>
<ScriptElement/>
<ScriptElement>
<ElementData>
<DeltaTime/>
<Area/>
<Datatype>USER PROMPT [GENTEST05]</Datatype>
<Description/>
<Multipartmessage>False</Multipartmessage>
<Comment>false</Comment>
</ElementData>
</ScriptElement>
<ScriptElement/>
<ScriptElement/>
</Script>
Here is the python program:
import sys
import xml.etree.ElementTree as ElementTree
tree = ElementTree.parse("test.xml")
root = tree.getroot()
# The keyword to find and remove
keyword = "GENTEST05"
# Iterate over a copy of the children, since we remove from root inside the loop
for element in list(root):
    for sub in element.iter():
        if sub.text and keyword in sub.text:
            root.remove(element)
            print(ElementTree.tostring(root).decode())
            sys.exit()
I have kept the program simple so that you can improve on it. Since your XML has one root node, I am assuming you want to remove the top-level element (the direct child of the root) that contains the keyword-matched element. In ElementTree, you can call root.remove() to remove the <ScriptElement> element that is the ancestor of the keyword-matched element.
This is just to get you started: it will only remove the first matching element, then print the resulting tree and quit.
Output:
<Script>
<ScriptElement />
<ScriptElement />
<ScriptElement />
<ScriptElement />
</Script>
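The question asked for the matched <ScriptElement> to be replaced rather than removed. One possible way to extend the program above is sketched here; the replacement content (a <Comment> child with the text "replaced") is a made-up placeholder, and the XML is a shortened copy of the test file.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the test XML above.
XML = """<Script>
  <ScriptElement/>
  <ScriptElement>
    <ElementData>
      <Datatype>USER PROMPT [GENTEST05]</Datatype>
    </ElementData>
  </ScriptElement>
  <ScriptElement/>
</Script>"""

keyword = "GENTEST05"
root = ET.fromstring(XML)

for index, element in enumerate(list(root)):
    # Does any node inside this top-level element contain the keyword?
    if any(sub.text and keyword in sub.text for sub in element.iter()):
        # Placeholder replacement content; substitute the real new node here.
        replacement = ET.Element("ScriptElement")
        ET.SubElement(replacement, "Comment").text = "replaced"
        replacement.tail = element.tail  # keep the surrounding whitespace
        root[index] = replacement        # swap the element in place

result = ET.tostring(root, encoding="unicode")
print(result)
```

Assigning to root[index] keeps the position of the element among its siblings, which a remove() followed by append() would not.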
