Large XML parsing in Python


I am a novice in Python and have the following task at hand.
I have a large XML file like the one below:
<Configuration>
<Parameters>
<Component Name='ABC'>
<Group Name='DEF'>
<Parameter Name='GHI'>
<Description>
Some Text
</Description>
<Type>Integer</Type>
<Restriction>
<Level>5</Level>
</Restriction>
<Value>
<Item Value='5'/>
</Value>
</Parameter>
<Parameter Name='JKL'>
<Description>
Some Text
</Description>
<Type>Integer</Type>
<Restriction>
<Level>5</Level>
</Restriction>
<Value>
<Item Value='5'/>
</Value>
</Parameter>
</Group>
<Group Name='MNO'>
<Parameter Name='PQR'>
<Description>
Some Text
</Description>
<Type>Integer</Type>
<Restriction>
<Level>5</Level>
</Restriction>
<Value>
<Item Value='5'/>
</Value>
</Parameter>
<Parameter Name='TUV'>
<Description>
Some Text
</Description>
<Type>Integer</Type>
<Restriction>
<Level>5</Level>
</Restriction>
<Value>
<Item Value='5'/>
</Value>
</Parameter>
</Group>
</Component>
</Parameters>
</Configuration>
In this XML file I have to go to the component "ABC", then to the group "MNO", then to the parameter "TUV", and under it change the Item value to 10.
I have tried using xml.etree.cElementTree, but to no avail. lxml is not available on the server because it runs a very old version of Python, and I have no permission to upgrade it.
I have been using the following code to parse and edit a relatively small XML file:
def fnXMLModification(ArgStr):
    argList = ArgStr.split()
    strXMLPath = argList[0]
    if not os.path.exists(strXMLPath):
        fnlogs("XML File: " + strXMLPath + " does not exist.\n")
        return False
    try:
        import xml.etree.cElementTree as ET
    except ImportError:
        import xml.etree.ElementTree as ET
    f=open(strXMLPath, 'rt')
    tree = ET.parse(f)
    ValueSetFlag = False
    AttrSetFlag = False
    for strXPath in argList[1:]:
        strXPathList = strXPath.split("[")
        sxPath = strXPathList[0]
        if len(strXPathList)==3:
            # both present
            AttrSetFlag = True
            ValueSetFlag = True
            valToBeSet = strXPathList[1].strip("]")
            sAttr = strXPathList[2].strip("]")
            attrList = sAttr.split(",")
        elif len(strXPathList) == 2:
            #anyone present
            if "=" in strXPathList[1]:
                AttrSetFlag = True
                sAttr = strXPathList[1].strip("]")
                attrList = sAttr.split(",")
            else:
                ValueSetFlag = True
                valToBeSet = strXPathList[1].strip("]")
        node = tree.find(sxPath)
        if AttrSetFlag:
            for att in attrList:
                slist = att.split("=")
                node.set(slist[0].strip(),slist[1].strip())
        if ValueSetFlag:
            node.text = valToBeSet
    tree.write(strXMLPath)
    fnlogs("XML File: " + strXMLPath + " has been modified successfully.\n")
    return True
Using this function I am not able to traverse the current XML, as it has a lot of nested child elements and sub-groups.

Import statement:
import xml.etree.cElementTree as ET
Parse the content with the fromstring method:
root = ET.fromstring(data)
Iterate as required to reach the target Item tag and change the value of its Value attribute:
for component_tag in root.iter("Component"):
    if "Name" in component_tag.attrib and component_tag.attrib['Name']=='ABC':
        for group_tag in component_tag.iter("Group"):
            if "Name" in group_tag.attrib and group_tag.attrib['Name']=='MNO':
                #for value_tag in group_tag.iter("Value"):
                for item_tag in group_tag.findall("Parameter[@Name='TUV']/Value/Item"):
                    item_tag.attrib["Value"] = "10"
Alternatively, we can use a single XPath expression to get the target Item tag:
for item_tag in root.findall("Parameters/Component[@Name='ABC']/Group[@Name='MNO']/Parameter[@Name='TUV']/Value/Item"):
    item_tag.attrib["Value"] = "10"
Use the tostring method to get the content back:
data = ET.tostring(root)
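Since the question mentions a very old Python, the attribute predicates used above (such as [@Name='TUV']) may not be available there; as far as I know they arrived with ElementTree 1.3. The following is a minimal sketch that filters on the Name attribute by hand instead. The helper name child_by_name and the cut-down SAMPLE document are assumptions made for illustration, not part of the original post.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the document from the question (assumed for illustration).
SAMPLE = """<Configuration><Parameters>
<Component Name='ABC'>
<Group Name='DEF'>
<Parameter Name='GHI'><Value><Item Value='5'/></Value></Parameter>
</Group>
<Group Name='MNO'>
<Parameter Name='PQR'><Value><Item Value='5'/></Value></Parameter>
<Parameter Name='TUV'><Value><Item Value='5'/></Value></Parameter>
</Group>
</Component>
</Parameters></Configuration>"""


def child_by_name(parent, tag, name):
    """Return the first direct child <tag> whose Name attribute equals name."""
    for child in parent.findall(tag):
        if child.get("Name") == name:
            return child
    return None


root = ET.fromstring(SAMPLE)
# Walk down one level at a time: Component -> Group -> Parameter -> Value/Item.
component = child_by_name(root.find("Parameters"), "Component", "ABC")
group = child_by_name(component, "Group", "MNO")
parameter = child_by_name(group, "Parameter", "TUV")
item = parameter.find("Value/Item")
item.set("Value", "10")
updated = ET.tostring(root)
```

Each step only uses find, findall, get and set, so the same walk should carry over to a file parsed with ET.parse and saved with tree.write.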

Related

Get items from xml Python

I have an XML document in Python and need to obtain the elements of the "items" tag as an iterable list.
For example, I need a list like this from the XML:
Item 1: Bicycle, value $250, iva_tax: 50.30
Item 2: Skateboard, value $120, iva_tax: 25.0
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<data>
<info>Listado de items</info>
<detalle>
<![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<tienda id="tiendaProd" version="1.1.0">
<items>
<item>
<nombre>Bicycle</nombre>
<valor>250</valor>
<data>
<tax name="iva" value="50.30"></tax>
</data>
</item>
<item>
<nombre>Skateboard</nombre>
<valor>120</valor>
<data>
<tax name="iva" value="25.0"></tax>
</data>
</item>
<item>
<nombre>Motorcycle</nombre>
<valor>900</valor>
<data>
<tax name="iva" value="120.50"></tax>
</data>
</item>
</items>
</tienda>]]>
</detalle>
</data>
I am working with
import xml.etree.ElementTree as ET
for example
import xml.etree.ElementTree as ET
xml = ET.fromstring(stringBase64)
ite = xml.find('.//detalle').text
tixml = ET.fromstring(ite)

You can use BeautifulSoup4 (BS4) to do this.
from bs4 import BeautifulSoup
#Read XML file
with open("example.xml", "r") as f:
    contents = f.read()  # read() returns one string; readlines() would return a list
#Create Soup object
soup = BeautifulSoup(contents, 'xml')
#find all the item tags
#(markup wrapped in a CDATA section is plain text to an XML parser, so this
# assumes the <item> elements are regular markup)
item_tags = soup.find_all("item") #returns everything in the <item> tags
#find the nombre and valor tags within each item
results = {}
for item in item_tags:
    num = item.find("nombre").text
    val = item.find("valor").text
    results[str(num)] = val
#Prints dictionary with key value pairs from the xml
print(results)
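In the question's data the item list sits inside a CDATA section, so the outer parser only sees it as text. A minimal sketch with the standard library, assuming a shortened copy of that document (the OUTER string and the dictionary keys are chosen here for illustration), parses the outer document first and then parses the text of detalle as a second document:

```python
import xml.etree.ElementTree as ET

# Shortened copy of the document from the question (assumed for illustration).
OUTER = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<data>
<info>Listado de items</info>
<detalle>
<![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<tienda id="tiendaProd" version="1.1.0">
<items>
<item><nombre>Bicycle</nombre><valor>250</valor>
<data><tax name="iva" value="50.30"></tax></data></item>
<item><nombre>Skateboard</nombre><valor>120</valor>
<data><tax name="iva" value="25.0"></tax></data></item>
</items>
</tienda>]]>
</detalle>
</data>"""

outer = ET.fromstring(OUTER.encode("utf-8"))
# The CDATA payload arrives as plain text. Strip the surrounding newlines so
# the inner XML declaration sits at the very start of the string.
inner_text = outer.find("detalle").text.strip()
inner = ET.fromstring(inner_text.encode("utf-8"))

items = []
for item in inner.findall("items/item"):
    items.append({
        "nombre": item.findtext("nombre"),
        "valor": item.findtext("valor"),
        "iva_tax": item.find("data/tax").get("value"),
    })

for number, entry in enumerate(items, start=1):
    print("Item %d: %s, value $%s, iva_tax: %s"
          % (number, entry["nombre"], entry["valor"], entry["iva_tax"]))
```

The strip() call matters: without it the inner XML declaration is preceded by a newline, which the parser rejects.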

Parsing XML: Python ElementTree, find elements and its parent elements without other elements in same parent

I am using python's ElementTree library to parse an XML file which has the following structure. I am trying to get the xml string corresponding to entity with id = 192 with all its parents (folders) but without other entities
<catalog>
<folder name="entities">
<entity id="102">
</entity>
<folder name="newEntities">
<entity id="192">
</entity>
<entity id="2982">
</entity>
</folder>
</folder>
</catalog>
The required result should be
<catalog>
<folder name="entities">
<folder name="newEntities">
<entity id="192">
</entity>
</folder>
</folder>
</catalog>
Assuming the first XML string is stored in a variable called xml_string:
tree = ET.fromstring(xml_string)
id = 192
required_element = tree.find(".//entity[@id='" + str(id) + "']")
This gets the XML element for the required entity but not the parent folders. Is there a quick fix for this?

The challenge here is that ET elements carry no information about their parent. The solution is to build a parent_map:
import copy
import xml.etree.ElementTree as ET
import xml.dom.minidom as minidom
xml = '''<catalog>
<folder name="entities">
<entity id="102">
</entity>
<folder name="newEntities">
<entity id="192">
</entity>
<entity id="2982">
</entity>
</folder>
</folder>
</catalog>'''
def prettify(elem):
    """Return a pretty-printed XML string for the Element.
    """
    rough_string = ET.tostring(elem, 'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent="\t")
root = ET.fromstring(xml)
# map every child element to its parent
parent_map = {c: p for p in root.iter() for c in p}
_id = 192
required_element = root.find(".//entity[@id='" + str(_id) + "']")
# collect copies of the target and of each ancestor up to the root
_path = [copy.deepcopy(required_element)]
while True:
    parent = parent_map.get(required_element)
    if parent is not None:
        _path.append(copy.deepcopy(parent))
        required_element = parent
    else:
        break
# empty each ancestor copy and re-attach only the next element on the path
idx = len(_path) - 1
while idx >= 1:
    _path[idx].clear()
    _path[idx].append(_path[idx-1])
    idx -= 1
print(prettify(_path[-1]))
Output (note that clear() also drops each ancestor's attributes, so the folder names are missing):
<?xml version="1.0" ?>
<catalog>
<folder>
<folder>
<entity id="192">
</entity>
</folder>
</folder>
</catalog>
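If the folder names from the required result should survive, one option is to build fresh wrapper elements that copy each ancestor's tag and attributes rather than clearing a deep copy. This is only a sketch of that variation; the function name extract_with_ancestors is hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

XML = """<catalog>
<folder name="entities">
<entity id="102"></entity>
<folder name="newEntities">
<entity id="192"></entity>
<entity id="2982"></entity>
</folder>
</folder>
</catalog>"""


def extract_with_ancestors(root, target):
    """Return a new tree holding a copy of target wrapped in its ancestors."""
    # ElementTree keeps no parent pointers, so map each child to its parent.
    parent_map = {child: parent for parent in root.iter() for child in parent}
    node = copy.deepcopy(target)
    current = target
    while current in parent_map:
        current = parent_map[current]
        # New element with the ancestor's tag and attributes, but no siblings.
        wrapper = ET.Element(current.tag, dict(current.attrib))
        wrapper.append(node)
        node = wrapper
    return node


root = ET.fromstring(XML)
entity = root.find(".//entity[@id='192']")
pruned = extract_with_ancestors(root, entity)
result = ET.tostring(pruned, encoding="unicode")
```

The original tree is left untouched, because only the target is copied and the wrappers are newly created elements.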

Wildcard search at any nested depth using xml.etree.ElementTree

I have a group of XML files which contain entries like
<group name="XXX common string">
<value val="12" description="a dozen">
<text>one less than a baker's dozen</text>
</value>
<value val="13" description="a baker's dozen">
<text>One more than a dozen</text>
</value>
</group>
<group name="YYY common string">
<value val="42" description="the answer">
<text>What do you get if you multiple 6 by 9?</text>
</value>
</group>
Is there any simple way, using import xml.etree.ElementTree as ET and
parser = ET.XMLParser()
parser.parser.UseForeignDTD(True)
if (args.info) or (args.diagnostics):
    print('Parsing input file : ' + inputFileName)
tree = ET.parse(inputFileName, parser=parser)
root = tree.getroot()
to search for only the <group> elements whose name contains "common string", for a particular value val?
Important: these groups are nested at different depths in different files.

This was a little difficult, because your own code won't work with the
example data you posted in your question (e.g., nothing there contains
the string error, there are no id attributes, and your code
doesn't appear to search for "a particular value val", which seemed
to be one of your requirements). But here are a few ideas...
For finding all group elements that contain common string in the name attribute, you could do something like this (tree.xpath() is part of lxml; the standard library's ElementTree has no xpath() method):
>>> matching_groups = []
>>> for group in tree.xpath('//group[contains(@name, "common string")]'):
...     matching_groups.append(group)
...
Which given your sample data would result in:
>>> print '\n'.join([etree.tostring(x) for x in matching_groups])
<group name="XXX common string">
<value val="12" description="a dozen">
<text>one less than a baker's dozen</text>
</value>
<value val="13" description="a baker's dozen">
<text>One more than a dozen</text>
</value>
</group>
<group name="YYY common string">
<value val="42" description="the answer">
<text>What do you get if you multiple 6 by 9?</text>
</value>
</group>
If you wanted to limit the results to only group elements that
contain value element with attribute val == 42, you could try:
>>> matching_groups = []
>>> for group in tree.xpath('//group[contains(@name, "common string")][value/@val = "42"]'):
...     matching_groups.append(group)
...
Which would yield:
>>> print '\n'.join([etree.tostring(x) for x in matching_groups])
<group name="YYY common string">
<value val="42" description="the answer">
<text>What do you get if you multiple 6 by 9?</text>
</value>
</group>

The problems were 1) wildcard searching of the group name, and 2) the fact that the groups were nested at different levels in different files.
I implemented this brute-force approach to build a dictionary of all such error entries in any group whose name contains "error", anywhere in the file.
I leave it here for posterity and invite more elegant solutions.
import xml.etree.ElementTree as ET
parser = ET.XMLParser()
parser.parser.UseForeignDTD(True)
tree = ET.parse(inputFileName, parser=parser)
root = tree.getroot()
args.errorDefinitions = {}
for element in tree.iter():
    if element.tag == 'group':
        if 'error' in element.get('name', '').lower():
            # list(element) is the public way to get the children;
            # the private _children attribute is not always available
            children = list(element)
            if children:
                for errorMessage in children[0]:
                    args.errorDefinitions[errorMessage.get('name')] = {
                        'id': errorMessage.get('id'),
                        'description': children[0].text}
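For the search as originally asked (groups whose name contains "common string" and that hold a given val, at any depth), a sketch with the standard library alone could look like the following. The function name find_groups and the wrapping root and section elements in the sample are assumptions added to show the nesting.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, wrapped in assumed <root>/<section> elements
# so that the groups sit at different depths.
XML = """<root>
<group name="XXX common string">
<value val="12" description="a dozen"><text>one less than a baker's dozen</text></value>
<value val="13" description="a baker's dozen"><text>One more than a dozen</text></value>
</group>
<section>
<group name="YYY common string">
<value val="42" description="the answer"><text>six by nine</text></value>
</group>
<group name="unrelated">
<value val="42" description="ignored"/>
</group>
</section>
</root>"""


def find_groups(root, name_part, val):
    """Return <group> elements at any depth matching both conditions."""
    matches = []
    # iter() walks the whole tree, so nesting depth does not matter.
    for group in root.iter("group"):
        if name_part not in group.get("name", ""):
            continue
        if any(v.get("val") == val for v in group.findall("value")):
            matches.append(group)
    return matches


root = ET.fromstring(XML)
names = [g.get("name") for g in find_groups(root, "common string", "42")]
```

The substring test is done in Python because ElementTree's limited XPath support has no contains() function.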

replace only first occurrence of field/word on a file

I have some zip files (700+), each containing a file with exactly the following structure:
<?xml version="1.0" encoding="UTF-8"?>
<Values version="2.0">
<record name="trigger">
<value name="uniqueId">6xjUCpDlrTVHRsEVmxx0Ews6ni8=</value>
<value name="processingSuspended">false</value>
<value name="retrievalSuspended">false</value>
</record>
<record name="trigger">
<value name="uniqueId">6xjUCpDlrTVHRsEVmxx0Ews6ni8=</value>
<value name="processingSuspended">false</value>
<value name="retrievalSuspended">false</value>
</record>
</Values>
What I would like to achieve is to set the fields processingSuspended and retrievalSuspended to false, regardless of whether their current value is true or false, but only for the first occurrence of each.
EDIT:
By request I'm adding what I have so far. It gets the fields that I want, but I believe there is a simpler way to do it:
import os
import zipfile
import glob
import time
import re
def main():
    rList = []
    for z in glob.glob("*.zip"):
        root = zipfile.ZipFile(z)
        for filename in root.namelist():
            if filename.find("node.ndf") >= 0:
                for line in root.read(filename).split("\n"):
                    if line.find("broker-trigger") >= 0:
                        for iline in root.read(filename).split("\n"):
                            Values = dict()
                            #match Processing state
                            if iline.find("processingSuspended") >= 0:
                                mpr = re.search(r'(.*>)(.*?)(<.*)',
                                                iline, re.M|re.I)
                            #match Retrieval state
                            if iline.find("retrievalSuspended") >= 0:
                                mr = re.search(r'(.*>)(.*?)(<.*)',
                                               iline, re.M|re.I)
                                Values['processingSuspended'] = mpr.group(2)
                                Values['retrievalSuspended'] = mr.group(2)
                                #print mr.group(2)
                                rList.append(Values)
    print rList
if __name__== "__main__":
    main()
Thanks in advance.

Try using lxml:
>>> xml = '''\
<?xml version="1.0" encoding="UTF-8"?>
<Values version="2.0">
<record name="trigger">
<value name="uniqueId">6xjUCpDlrTVHRsEVmxx0Ews6ni8=</value>
<value name="processingSuspended">true</value>
<value name="retrievalSuspended">true</value>
</record>
<record name="trigger">
<value name="uniqueId">6xjUCpDlrTVHRsEVmxx0Ews6ni8=</value>
<value name="processingSuspended">true</value>
<value name="retrievalSuspended">true</value>
</record>
</Values>\
'''
>>> from lxml import etree
>>> tree = etree.fromstring(xml)
>>> tree.xpath('//value[@name="processingSuspended"]')[0].text = 'false'
>>> tree.xpath('//value[@name="retrievalSuspended"]')[0].text = 'false'
The XPath expression '//value[@name="processingSuspended"]' finds all value tags whose name attribute equals "processingSuspended". We then take the first one with [0] and change the tag's text to 'false'.
Output:
>>> print(etree.tostring(tree, pretty_print=True))
<Values version="2.0">
<record name="trigger">
<value name="uniqueId">6xjUCpDlrTVHRsEVmxx0Ews6ni8=</value>
<value name="processingSuspended">false</value>
<value name="retrievalSuspended">false</value>
</record>
<record name="trigger">
<value name="uniqueId">6xjUCpDlrTVHRsEVmxx0Ews6ni8=</value>
<value name="processingSuspended">true</value>
<value name="retrievalSuspended">true</value>
</record>
</Values>
>>>

You can read the zip archives and update the xml formatted data in the file they contain with Python's built-in modules. There's even a tutorial in the documentation for xml.etree.ElementTree.
import glob
import xml.etree.ElementTree as ET
import zipfile
def main():
    for z in glob.glob("*.zip"):
        print('processing file: {!r}'.format(z))
        zfile = zipfile.ZipFile(z)
        for filename in zfile.namelist():
            print('processing archive member: {!r} in {}'.format(filename, z))
            contents = zfile.open(filename).read()
            print('Before changes:')
            print(contents)
            root = ET.fromstring(contents)
            if root.tag != "Values" or root.attrib["version"] != "2.0":
                print('unsupported xml file')
                break
            # root[0] is the first <record>; its 2nd and 3rd children are the fields
            if(root[0][1].tag == "value" and
               root[0][1].attrib["name"] == "processingSuspended"):
                root[0][1].text = "false"
            else:
                print('expected "processingSuspended" value field not found')
                break
            if(root[0][2].tag == "value" and
               root[0][2].attrib["name"] == "retrievalSuspended"):
                root[0][2].text = "false"
            else:
                print('expected "retrievalSuspended" value field not found')
                break
            print('After changes:')
            updated_contents = ET.tostring(root)
            print(updated_contents)
if __name__== "__main__":
    main()
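The positional indexing above depends on the fields staying in a fixed order. As a hedged alternative, ElementTree's find() returns the first match in document order, which fits the "first occurrence only" requirement directly. The helper name reset_first is hypothetical, and the sample uses true values so that the change is visible.

```python
import xml.etree.ElementTree as ET

XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<Values version="2.0">
<record name="trigger">
<value name="uniqueId">6xjUCpDlrTVHRsEVmxx0Ews6ni8=</value>
<value name="processingSuspended">true</value>
<value name="retrievalSuspended">true</value>
</record>
<record name="trigger">
<value name="uniqueId">6xjUCpDlrTVHRsEVmxx0Ews6ni8=</value>
<value name="processingSuspended">true</value>
<value name="retrievalSuspended">true</value>
</record>
</Values>"""


def reset_first(root, field_names, new_text="false"):
    """Set the text of the first <value name=...> found for each field name."""
    for field in field_names:
        # find() stops at the first match in document order.
        first = root.find(".//value[@name='%s']" % field)
        if first is not None:
            first.text = new_text


root = ET.fromstring(XML)
reset_first(root, ["processingSuspended", "retrievalSuspended"])
states = [(v.get("name"), v.text)
          for v in root.iter("value") if v.get("name") != "uniqueId"]
```

After the call, only the fields in the first record read false; the second record keeps its original values.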

Editing the XML texts from a XML file using Python

I have an XML file which contains some data as given.
<?xml version="1.0" encoding="UTF-8" ?>
<ParameterData>
<CreationInfo date="10/28/2009 03:05:14 PM" user="manoj" />
<ParameterList count="85">
<Parameter name="Spec 2 Included" type="boolean" mode="both">
<Value>n/a</Value>
<Result>n/a</Result>
</Parameter>
<Parameter name="Spec 2 Label" type="string" mode="both">
<Value>n/a</Value>
<Result>n/a</Result>
</Parameter>
<Parameter name="Spec 3 Included" type="boolean" mode="both">
<Value>n/a</Value>
<Result>n/a</Result>
</Parameter>
<Parameter name="Spec 3 Label" type="string" mode="both">
<Value>n/a</Value>
<Result>n/a</Result>
</Parameter>
</ParameterList>
</ParameterData>
I have one text file with lines as
Spec 2 Included : TRUE
Spec 2 Label: 19-Flat2-HS3
Spec 3 Included : FALSE
Spec 3 Label: 4-1-Bead1-HS3
Now I want to edit the XML text, i.e. replace each n/a field
with the corresponding value from the text file,
so that the file looks like this:
<?xml version="1.0" encoding="UTF-8" ?>
<ParameterData>
<CreationInfo date="10/28/2009 03:05:14 PM" user="manoj" />
<ParameterList count="85">
<Parameter name="Spec 2 Included" type="boolean" mode="both">
<Value>TRUE</Value>
<Result>TRUE</Result>
</Parameter>
<Parameter name="Spec 2 Label" type="string" mode="both">
<Value>19-Flat2-HS3</Value>
<Result>19-Flat2-HS3</Result>
</Parameter>
<Parameter name="Spec 3 Included" type="boolean" mode="both">
<Value>FALSE</Value>
<Result>FALSE</Result>
</Parameter>
<Parameter name="Spec 3 Label" type="string" mode="both">
<Value>4-1-Bead1-HS3</Value>
<Result>4-1-Bead1-HS3</Result>
</Parameter>
</ParameterList>
</ParameterData>
I am new to Python XML coding and have no idea how to edit the text fields in an XML file.
I am trying to use the elementtree.ElementTree module,
but I don't know which modules need to be imported to read the lines of the XML file and extract the attributes.
Please help.
Thanks and regards.

You can convert your text data into a Python dictionary with a regular expression:
data="""Spec 2 Included : TRUE
Spec 2 Label: 19-Flat2-HS3
Spec 3 Included : FALSE
Spec 3 Label: 4-1-Bead1-HS3"""
#data=open("data.txt").read()
import re
data=dict(re.findall(r'(Spec \d+ (?:Included|Label))\s*:\s*(\S+)',data))
data will be as follows:
{'Spec 3 Included': 'FALSE', 'Spec 2 Included': 'TRUE', 'Spec 3 Label': '4-1-Bead1-HS3', 'Spec 2 Label': '19-Flat2-HS3'}
Then you can apply it with any of your favorite XML parsers; I will use minidom here (xml_text is assumed to hold the XML document as a string).
from xml.dom import minidom
dom = minidom.parseString(xml_text)
params=dom.getElementsByTagName("Parameter")
for param in params:
    name=param.getAttribute("name")
    if name in data:
        for item in param.getElementsByTagName("*"): # You may change to "Result" or "Value" only
            item.firstChild.replaceWholeText(data[name])
print(dom.toxml())
#write to file
with open("output.xml","w") as out_file:
    out_file.write(dom.toxml())
Results
<?xml version="1.0" ?><ParameterData>
<CreationInfo date="10/28/2009 03:05:14 PM" user="manoj"/>
<ParameterList count="85">
<Parameter mode="both" name="Spec 2 Included" type="boolean">
<Value>TRUE</Value>
<Result>TRUE</Result>
</Parameter>
<Parameter mode="both" name="Spec 2 Label" type="string">
<Value>19-Flat2-HS3</Value>
<Result>19-Flat2-HS3</Result>
</Parameter>
<Parameter mode="both" name="Spec 3 Included" type="boolean">
<Value>FALSE</Value>
<Result>FALSE</Result>
</Parameter>
<Parameter mode="both" name="Spec 3 Label" type="string">
<Value>4-1-Bead1-HS3</Value>
<Result>4-1-Bead1-HS3</Result>
</Parameter>
</ParameterList>
</ParameterData>

Well, you could start with
import xml.etree.ElementTree as ET
tree = ET.parse("blah.xml")
Find the elements you want to modify.
To replace the contents of an element, just do
element.text = "TRUE"
The import statement above works in Python 2.5 or later. If you have an older version of Python you'll need to install ElementTree as an extension, and then the import statement is different: import elementtree.ElementTree as ET.

Unfortunately, the XPath supported by ElementTree isn't complete. Python 2.6 ships an older version of ElementTree, in which finding elements by attribute does not work. So Python's own documentation should be your first stop: xml.etree.ElementTree
import xml.etree.ElementTree as ET
original = ET.parse("original.xml")
parameters = original.findall(".//Parameter")
changes = {}
# read changes
with open("changes.txt", "r") as in_file:
    for change in in_file:
        change = change.rstrip() # remove line endings
        name, value = change.split(":", 1) # split on the first colon only
        changes[name.strip()] = value.strip() # remove whitespace
# find parameter element and apply changes
for parameter in parameters:
    parameter_name = parameter.get("name")
    if parameter_name in changes:
        value = parameter.find("./Value")
        value.text = changes[parameter_name]
        result = parameter.find("./Result")
        result.text = changes[parameter_name]
original.write("new.xml")

Here is how you could do it using Amara (XML and TEXT are assumed to hold the XML document and the text file contents):
from amara import bindery
doc = bindery.parse(XML)
def cleanup_for_dict(key, value):
    return key.strip(), value.strip()
params = dict(( cleanup_for_dict(*line.split(':', 1))
                for line in TEXT.splitlines()))
for param in doc.ParameterData.ParameterList.Parameter:
    if param.name in params:
        param.Value = params[param.name]
        param.Result = params[param.name]
doc.xml_write()
