How parse XML into list of dicts?

How parse XML into list of dicts? - python

Given the sample xml below:
<_Document>
<_Data1> 'foo'
<_SubData1> 'bar1' </_SubData1>
<_SubData2> 'bar2' </_SubData2>
<_SubData3> 'bar3' </_SubData3>
</_Data1>
</_Document>
I want to capture each SubData value and update it with the Data1 value in a dictionary and then append that value to a list. Such that the output would look something like:
[{Data1: 'foo', SubData1: 'bar1'}, {Data1: 'foo', SubData2: 'bar2'}, {Data1: 'foo', SubData3: 'bar3'}]
My code is:
from lxml import etree
import re
new_records = []
for child in root.iter('_Document'): #finding all children with each 'Document' string
for top_data in child.iter(): #iterating through the entirety of each 'Document' sections tags and text.
if "Data" in top_data.tag:
for data in top_data:
rec = {}
if data.text is not None and data.text.isspace() is False: #avoiding NoneTypes and empty data.
g = data.tag.strip("_") #cleaning up the tag
rec[g] = data.text.replace("\n", " ") #cleaning up the value
for b in re.finditer(r'^_SubData', data.tag): #searching through each 'SubData' contained in a given tag.
for subdata in data:
subdict = {}
if subdata.text is not None: #again preventing NoneTypes
z = subdata.tag.strip("_") #tag cleaning
subdict[z] = subdata.text.replace("\n", " ") #text cleaning
rec.update(subdict) #update the data record dictionary with the subdata
new_records.append(rec) #appending to the list
This, unfortunately, outputs:
[{Data1: 'foo', SubData3: 'bar3'}]
As it only updates and appends the final update of the dictionary.
I've tried different varieties of this including initializing a list after the first 'if' statement in the second for loop to append after each loop pass, but that required quite a bit of clean up at the end to get through the nesting it would cause.
I've also tried initializing empty dictionaries outside of the loops to update to preserve the previous updates and append that way.
I'm curious if there is some functionality of lxml that I've missed or a more pythonic approach to get the desired output.

I offered what I think of as a declarative approach in another solution. If you're more comfortable explicitly defining the structure with loops, here's an imperative approach:
from xml.etree import ElementTree as ET
import pprint
new_records = []
document = ET.parse('input.xml').getroot()
for elem in document:
if elem.tag.startswith('_Data'):
data = elem
data_name = data.tag[1:] # skip leading '_'
data_val = data.text.strip()
for elem in data:
if elem.tag.startswith('_SubData'):
subdata = elem
subdata_name = subdata.tag[1:]
subdata_val = subdata.text.strip()
new_records.append(
{data_name: data_val, subdata_name: subdata_val}
)
pprint.pprint(new_records)
The input and output is the same as in my other solution.

You can do this with Python's built-in ElementTree class and its iterparse() method which walks an XML tree and produces a pair of event and element for every step through the tree. We listen for when it starts parsing an element, and if its _Data... or _SubData... we act.
This is a declarative approach, and relies on the fact that _SubData is only a child of _Data, that is, that your very small and simple sample is exactly representative of what you're actually dealing with.
You'll need to manage a little state for the _Data elements, but that's it:
from xml.etree import ElementTree as ET
import pprint
new_records = []
data_name = None
data_val = None
for event, elem in ET.iterparse('input.xml', ['start']):
tag_name = elem.tag[1:] # skip possible leading '_'
if event == 'start' and tag_name.startswith('Data'):
data_name = tag_name
data_val = elem.text.strip()
if event == 'start' and tag_name.startswith('SubData'):
subdata_name = tag_name
subdata_val = elem.text.strip()
record = {
data_name: data_val, subdata_name: subdata_val
}
new_records.append(record)
pprint.pprint(new_records)
I modified your sample, my input.xml:
<_Document>
<_Data1>foo
<_SubData1>bar1</_SubData1>
<_SubData2>bar2</_SubData2>
<_SubData3>bar3</_SubData3>
</_Data1>
<_Data2>FOO
<_SubData1>BAR1</_SubData1>
<_SubData2>BAR2</_SubData2>
<_SubData3>BAR3</_SubData3>
</_Data2>
</_Document>
When I run my script on that input, I get:
[{'Data1': 'foo', 'SubData1': 'bar1'},
{'Data1': 'foo', 'SubData2': 'bar2'},
{'Data1': 'foo', 'SubData3': 'bar3'},
{'Data2': 'FOO', 'SubData1': 'BAR1'},
{'Data2': 'FOO', 'SubData2': 'BAR2'},
{'Data2': 'FOO', 'SubData3': 'BAR3'}]

Consider dictionary comprehension using dictionary merge:
new_records = [
{
**{doc.tag.replace('_', ''): doc.text.strip().replace("'", "")},
**{data.tag.replace('_', ''): data.text.strip().replace("'", "")}
}
for doc in root.iterfind('*')
for data in doc.iterfind('*')
]
new_records
[{'Data1': 'foo', 'SubData1': 'bar1'},
{'Data1': 'foo', 'SubData2': 'bar2'},
{'Data1': 'foo', 'SubData3': 'bar3'}]

Related

I want to remove the unwanted sub-level duplicate tags using lxml etree

This is the input sample text. I want to do in object based cleanup to avoid hierarchy issues
<p><b><b><i><b><i><b>
<i>sample text</i>
</b></i></b></i></b></b></p>
Required Output
<p><b><i>sample text</i></b></p>

I written this Object based cleanup using lxml for sublevel duplicate tags. It may help others.
import lxml.etree as ET
textcont = '<p><b><b><i><b><i><b><i>sample text</i></b></i></b></i></b></b></p>'
soup = ET.fromstring(textcont)
for tname in ['i','b']:
for tagn in soup.iter(tname):
if tagn.getparent().getparent() != None and tagn.getparent().getparent().tag == tname:
iparOfParent = tagn.getparent().getparent()
iParent = tagn.getparent()
if iparOfParent.text == None:
iparOfParent.addnext(iParent)
iparOfParent.getparent().remove(iparOfParent)
elif tagn.getparent() != None and tagn.getparent().tag == tname:
iParent = tagn.getparent()
if iParent.text == None:
iParent.addnext(tagn)
iParent.getparent().remove(iParent)
print(ET.tostring(soup))
output:
b'<p><b><i>sample text</i></b></p>'

Markdown, itself, provides structural to extract elements inside
Using re in python, you may extract elements and recombine them.
For example:
import re
html = """<p><b><b><i><b><i><b>
<i>sample text</i>
</b></i></b></i></b></b></p>"""
regex_object = re.compile("\<(.*?)\>")
html_objects = regex_object.findall(html)
set_html = []
for obj in html_objects:
if obj[0] != "/" and obj not in set_html:
set_html.append(obj)
regex_text = re.compile("\>(.*?)\<")
text = [result for result in regex_text.findall(html) if result][0]
# Recombine
result = ""
for obj in set_html:
result += f"<{obj}>"
result += text
for obj in set_html[::-1]:
result += f"</{obj}>"
# result = '<p><b><i>sample text</i></b></p>'

You can use the regex library re to create a function to search for the matching opening tag and closing tag pair and everything else in between. Storing tags in a dictionary will remove duplicate tags and maintain the order they were found in (if order isn't important then just use a set). Once all pairs of tags are found, wrap what's left with the keys of the dictionary in reverse order.
import re
def remove_duplicates(string):
tags = {}
while (match := re.findall(r'\<(.+)\>([\w\W]*)\<\/\1\>', string)):
tag, string = match[0][0], match[0][1] # match is [(group0, group1)]
tags.update({tag: None})
for tag in reversed(tags):
string = f'<{tag}>{string}</{tag}>'
return string
Note: I've used [\w\W]* as a cheat to match everything.

Parsing and creating a dict python

I would like to create a dict by parsing a string
<brns ret = "Herld" other = "very">
<brna name = "ame1">
I would like to create a dict that has the following key-value pairs:
dict = {'brnsret': 'Herld',
'brnsother':'very',
'brnaname':'ame1'}
I have a working script that can handle this:
<brns ret = "Herld">
<brna name = "ame1">
my Code to generate the dict:
match_tag = re.search('<(\w+)\s(\w+) = \"(\w+)\">', each_par_line)
if match_tag is not None:
dict_tag[match_tag.group(1)+match_tag.group(2)] = match_tag.group(3)
But how should I tweak my script to handle more than one attribute pair in a tag?
Thanks

An alternative option and, probably, just for educational reasons - you can pass this kind of string into a lenient HTML parser like BeautifulSoup:
from bs4 import BeautifulSoup
data = """
<brns ret = "Herld" other = "very">
<brna name = "ame1">
"""
d = {tag.name + attr: value
for tag in BeautifulSoup(data, "html.parser")()
for attr, value in tag.attrs.items()}
print(d)
Prints:
{'brnaname': 'ame1', 'brnsother': 'very', 'brnsret': 'Herld'}

Creating nested Json structure with multiple key values in Python from Json

My code is as follows:
import json
def reformat(importscompanies):
#print importscompanies
container={}
child=[]
item_dict={}
for name, imports in importscompanies.iteritems():
item_dict['name'] = imports
item_dict['size'] = '500'
child.append(dict(item_dict))
container['name'] = name
container['children'] = child
if __name__ == '__main__':
raw_data = json.load(open('data/bricsinvestorsfirst.json'))
run(raw_data)
def run(raw_data):
raw_data2 = raw_data[0]
the_output = reformat(raw_data2)
My issue is, the code isn't going through the whole file. It's only outputting one entry. Why is this? Am I rewriting something and do I need another dict that appends with every loop?
Also, it seems as though the for loop is going through the iteritems for each dict key. Is there a way to make it pass only once?
The issue is indeed
raw_data2 = raw_data[0]
I ended up creating an iterator to access the dict values.
Thanks.
Lastly, I'm hoping my final Json file looks this way, using the data I provided above:
{'name': u'name', 'children': [{'name': u'500 Startups', 'size': '500'}, {'name': u'AffinityChina', 'size': '500'}]}

Try this. Though your sample input and output data don't really give many clues as to where the "name" fields should come from. I've assumed you wanted the name of the original item in your list.
original_json = json.load(open('data/bricsinvestorsfirst.json'),'r')
response_json = {}
response_json["name"] = "analytics"
# where your children list will go
children = []
size = 500 # or whatever else you want
# For each item in your original list
for item in original_json:
children.append({"name" : item["name"],
"size" : size})
response_json["children"] = children
print json.dumps(response_json,indent=2)

"It's only outputting one entry" because you only select the first dictionary in the JSON file when you say raw_data2 = raw_data[0]
Try something like this as a starting point (I haven't tested/ran it):
import json
def run():
with open('data/bricsinvestorsfirst.json') as input_file:
raw_data = json.load(input_file)
children = []
for item in raw_data:
children.append({
'name': item['name'],
'size': '500'
})
container = {}
container['name'] = 'name'
container['children'] = children
return json.dumps(container)
if __name__ == '__main__':
print run()

XML Parsing in Python using document builder factory

I am working in STAF and STAX. Here python is used for coding . I am new to python.
Basically my task is to parse a XML file in python using Document Factory Parser.
The XML file I am trying to parse is :
<?xml version="1.0" encoding="utf-8"?>
<operating_system>
<unix_80sp1>
<tests type="quick_sanity_test">
<prerequisitescript>preparequicksanityscript</prerequisitescript>
<acbuildpath>acbuildpath</acbuildpath>
<testsuitscript>test quick sanity script</testsuitscript>
<testdir>quick sanity dir</testdir>
</tests>
<machine_name>u80sp1_L004</machine_name>
<machine_name>u80sp1_L005</machine_name>
<machine_name>xyz.pxy.dxe.cde</machine_name>
<vmware id="155.35.3.55">144.35.3.90</vmware>
<vmware id="155.35.3.56">144.35.3.91</vmware>
</unix_80sp1>
</operating_system>
I need to read all the tags .
For the tags machine_name i need to read them into a list
say all machine names should be in a list machname.
so machname should be [u80sp1_L004,u80sp1_L005,xyz.pxy.dxe.cde] after reading the tags.
I also need all the vmware tags:
all attributes should be vmware_attr =[155.35.3.55,155.35.3.56]
all vmware values should be vmware_value = [ 144.35.3.90,155.35.3.56]
I am able to read all tags properly except vmware tags and machine name tags:
I am using the following code:(i am new to xml and vmware).Help required.
The below code needs to be modified.
factory = DocumentBuilderFactory.newInstance();
factory.setValidating(1)
factory.setIgnoringElementContentWhitespace(0)
builder = factory.newDocumentBuilder()
document = builder.parse(xmlFileName)
vmware_value = None
vmware_attr = None
machname = None
# Get the text value for the element with tag name "vmware"
nodeList = document.getElementsByTagName("vmware")
for i in range(nodeList.getLength()):
node = nodeList.item(i)
if node.getNodeType() == Node.ELEMENT_NODE:
children = node.getChildNodes()
for j in range(children.getLength()):
thisChild = children.item(j)
if (thisChild.getNodeType() == Node.TEXT_NODE):
vmware_value = thisChild.getNodeValue()
vmware_attr ==??? what method to use ?
# Get the text value for the element with tag name "machine_name"
nodeList = document.getElementsByTagName("machine_name")
for i in range(nodeList.getLength()):
node = nodeList.item(i)
if node.getNodeType() == Node.ELEMENT_NODE:
children = node.getChildNodes()
for j in range(children.getLength()):
thisChild = children.item(j)
if (thisChild.getNodeType() == Node.TEXT_NODE):
machname = thisChild.getNodeValue()
Also how to check if a tag exists or not at all. I need to code the parsing properly.

You are need to instantiate vmware_value, vmware_attr and machname as lists not as strings, so instead of this:
vmware_value = None
vmware_attr = None
machname = None
do this:
vmware_value = []
vmware_attr = []
machname = []
Then, to add items to the list, use the append method on your lists. E.g.:
factory = DocumentBuilderFactory.newInstance();
factory.setValidating(1)
factory.setIgnoringElementContentWhitespace(0)
builder = factory.newDocumentBuilder()
document = builder.parse(xmlFileName)
vmware_value = []
vmware_attr = []
machname = []
# Get the text value for the element with tag name "vmware"
nodeList = document.getElementsByTagName("vmware")
for i in range(nodeList.getLength()):
node = nodeList.item(i)
vmware_attr.append(node.attributes["id"].value)
if node.getNodeType() == Node.ELEMENT_NODE:
children = node.getChildNodes()
for j in range(children.getLength()):
thisChild = children.item(j)
if (thisChild.getNodeType() == Node.TEXT_NODE):
vmware_value.append(thisChild.getNodeValue())
I've also edited the code to something I think should work to append the correct values to vmware_attr and vmware_value.
I had to make the assumption that STAX uses xml.dom syntax, so if that isn't the case, you will have to edit my suggestion appropriately.

Editing XML as a dictionary in python?

I'm trying to generate customized xml files from a template xml file in python.
Conceptually, I want to read in the template xml, remove some elements, change some text attributes, and write the new xml out to a file. I wanted it to work something like this:
conf_base = ConvertXmlToDict('config-template.xml')
conf_base_dict = conf_base.UnWrap()
del conf_base_dict['root-name']['level1-name']['leaf1']
del conf_base_dict['root-name']['level1-name']['leaf2']
conf_new = ConvertDictToXml(conf_base_dict)
now I want to write to file, but I don't see how to get to
ElementTree.ElementTree.write()
conf_new.write('config-new.xml')
Is there some way to do this, or can someone suggest doing this a different way?

This'll get you a dict minus attributes. I don't know, if this is useful to anyone. I was looking for an xml to dict solution myself, when I came up with this.
import xml.etree.ElementTree as etree
tree = etree.parse('test.xml')
root = tree.getroot()
def xml_to_dict(el):
d={}
if el.text:
d[el.tag] = el.text
else:
d[el.tag] = {}
children = el.getchildren()
if children:
d[el.tag] = map(xml_to_dict, children)
return d
This: http://www.w3schools.com/XML/note.xml
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
Would equal this:
{'note': [{'to': 'Tove'},
{'from': 'Jani'},
{'heading': 'Reminder'},
{'body': "Don't forget me this weekend!"}]}

I'm not sure if converting the info set to nested dicts first is easier. Using ElementTree, you can do this:
import xml.etree.ElementTree as ET
doc = ET.parse("template.xml")
lvl1 = doc.findall("level1-name")[0]
lvl1.remove(lvl1.find("leaf1")
lvl1.remove(lvl1.find("leaf2")
# or use del lvl1[idx]
doc.write("config-new.xml")
ElementTree was designed so that you don't have to convert your XML trees to lists and attributes first, since it uses exactly that internally.
It also support as small subset of XPath.

For easy manipulation of XML in python, I like the Beautiful Soup library. It works something like this:
Sample XML File:
<root>
<level1>leaf1</level1>
<level2>leaf2</level2>
</root>
Python code:
from BeautifulSoup import BeautifulStoneSoup, Tag, NavigableString
soup = BeautifulStoneSoup('config-template.xml') # get the parser for the xml file
soup.contents[0].name
# u'root'
You can use the node names as methods:
soup.root.contents[0].name
# u'level1'
It is also possible to use regexes:
import re
tags_starting_with_level = soup.findAll(re.compile('^level'))
for tag in tags_starting_with_level: print tag.name
# level1
# level2
Adding and inserting new nodes is pretty straightforward:
# build and insert a new level with a new leaf
level3 = Tag(soup, 'level3')
level3.insert(0, NavigableString('leaf3')
soup.root.insert(2, level3)
print soup.prettify()
# <root>
# <level1>
# leaf1
# </level1>
# <level2>
# leaf2
# </level2>
# <level3>
# leaf3
# </level3>
# </root>

My modification of Daniel's answer, to give a marginally neater dictionary:
def xml_to_dictionary(element):
l = len(namespace)
dictionary={}
tag = element.tag[l:]
if element.text:
if (element.text == ' '):
dictionary[tag] = {}
else:
dictionary[tag] = element.text
children = element.getchildren()
if children:
subdictionary = {}
for child in children:
for k,v in xml_to_dictionary(child).items():
if k in subdictionary:
if ( isinstance(subdictionary[k], list)):
subdictionary[k].append(v)
else:
subdictionary[k] = [subdictionary[k], v]
else:
subdictionary[k] = v
if (dictionary[tag] == {}):
dictionary[tag] = subdictionary
else:
dictionary[tag] = [dictionary[tag], subdictionary]
if element.attrib:
attribs = {}
for k,v in element.attrib.items():
attribs[k] = v
if (dictionary[tag] == {}):
dictionary[tag] = attribs
else:
dictionary[tag] = [dictionary[tag], attribs]
return dictionary
namespace is the xmlns string, including braces, that ElementTree prepends to all tags, so here I've cleared it as there is one namespace for the entire document
NB that I adjusted the raw xml too, so that 'empty' tags would produce at most a ' ' text property in the ElementTree representation
spacepattern = re.compile(r'\s+')
mydictionary = xml_to_dictionary(ElementTree.XML(spacepattern.sub(' ', content)))
would give for instance
{'note': {'to': 'Tove',
'from': 'Jani',
'heading': 'Reminder',
'body': "Don't forget me this weekend!"}}
it's designed for specific xml that is basically equivalent to json, should handle element attributes such as
<elementName attributeName='attributeContent'>elementContent</elementName>
too
there's the possibility of merging the attribute dictionary / subtag dictionary similarly to how repeat subtags are merged, although nesting the lists seems kind of appropriate :-)

Adding this line
d.update(('#' + k, v) for k, v in el.attrib.iteritems())
in the user247686's code you can have node attributes too.
Found it in this post https://stackoverflow.com/a/7684581/1395962
Example:
import xml.etree.ElementTree as etree
from urllib import urlopen
xml_file = "http://your_xml_url"
tree = etree.parse(urlopen(xml_file))
root = tree.getroot()
def xml_to_dict(el):
d={}
if el.text:
d[el.tag] = el.text
else:
d[el.tag] = {}
children = el.getchildren()
if children:
d[el.tag] = map(xml_to_dict, children)
d.update(('#' + k, v) for k, v in el.attrib.iteritems())
return d
Call as
xml_to_dict(root)

Have you tried this?
print xml.etree.ElementTree.tostring( conf_new )

most direct way to me :
root = ET.parse(xh)
data = root.getroot()
xdic = {}
if data > None:
for part in data.getchildren():
xdic[part.tag] = part.text

XML has a rich infoset, and it takes some special tricks to represent that in a Python dictionary. Elements are ordered, attributes are distinguished from element bodies, etc.
One project to handle round-trips between XML and Python dictionaries, with some configuration options to handle the tradeoffs in different ways is XML Support in Pickling Tools. Version 1.3 and newer is required. It isn't pure Python (and in fact is designed to make C++ / Python interaction easier), but it might be appropriate for various use cases.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How parse XML into list of dicts? - python

Related

I want to remove the unwanted sub-level duplicate tags using lxml etree

Parsing and creating a dict python

Creating nested Json structure with multiple key values in Python from Json

XML Parsing in Python using document builder factory

Editing XML as a dictionary in python?

Categories

Resources