I'm using Python 3 with beautifulsoup4, pandas, and collections.Counter to convert an XML file to CSV.
There are several thousand products in this XML, and I am stuck on one particular problem.
Many of the products are children of a parent product, but the parent product itself is not in the XML.
Each child product has a parent tag holding the parent id, and all children of the same parent share the same value, so we can tell they belong together:
<parent>x</parent>
The XML structure is as follows:
<product>
<id>1</id>
<parent>x</parent>
</product>
<product>
<id>2</id>
<parent>x</parent>
</product>
<product>
<id>3</id>
<parent>x</parent>
</product>
<product>
<id>4</id>
</product>
<product>
<id>5</id>
<parent>y</parent>
</product>
<product>
<id>6</id>
<parent>y</parent>
</product>
You can see that the product with id 4 has no parent tag, so it is not a child. The first three products share one parent with value x, the last two products share another parent with value y, and so on for more than 7000 products.
For my purpose I need to replace the text of each <parent> tag with the first id that has the same parent value. My desired outcome:
<product>
<id>1</id>
<parent>1</parent>
</product>
<product>
<id>2</id>
<parent>1</parent>
</product>
<product>
<id>3</id>
<parent>1</parent>
</product>
<product>
<id>4</id>
</product>
<product>
<id>5</id>
<parent>5</parent>
</product>
<product>
<id>6</id>
<parent>5</parent>
</product>
Here is what I have done so far. I also need to write each value to its respective column and row in the CSV.
#test.py
def parse_xml(xml_data):
# Initializing soup variable
soup = BeautifulSoup(xml_data, 'xml')
# Creating column for table
df = pd.DataFrame(columns=['id', 'parent'])
# Here I get all duplacates of the same tag
lst = soup.select('parent')
d = Counter(lst)
resultparent = [k for k, v in d.items() if v > 1]
#I spleat on seperate text to get all duplicate x and y as one
def a():
for index, i in enumerate(resultparent):
a = i.text
return a
# I get also id or every x an y
def b():
for index, i in enumerate(resultparent):
c = i.find_previous('id').text
return c
# now I start writing csv
all_products = soup.findAll('product')
product_length = len(all_products)
for index, product in enumerate(all_products):
parent = product.find('parent')
if parent is None:
parent = ""
else:
parent = parent.text
# here I wanted to check if I could find duplicate values with existing, I hope that if
# there will be let say parent tag is x will replace with 1 (don't work)
if parent == def a():
parent = def b()
product_id = product.find('id').text
# then I write all in csv
row = [{
'id': product_id,
'parent': parent}]
df = df.append(row, ignore_index=True)
print(f'Appending row %s of %s' % (index+1, product_length))
return df
df = parse_xml(xml_data)
df.to_csv('test.csv')
The code above doesn't work correctly: it replaces only the first value (x x x becomes 1 1 1), but it does not replace the y value or the rest when the CSV is written. Thank you for your help.
Given the following input.xml file:
<products>
<product>
<id>1</id>
<parent>x</parent>
</product>
<product>
<id>2</id>
<parent>x</parent>
</product>
<product>
<id>3</id>
<parent>x</parent>
</product>
<product>
<id>4</id>
</product>
<product>
<id>5</id>
<parent>y</parent>
</product>
<product>
<id>6</id>
<parent>y</parent>
</product>
</products>
Here is one way to get the correct matches:
import pandas as pd
df = pd.read_xml("input.xml")
pairs = (
df.groupby("parent")
.agg(list)
.pipe(lambda df_: df_["id"].apply(lambda x: str(x[0])))
.to_dict()
)
print(pairs) # {'x': '1', 'y': '5'}
And then, using Python standard library XML module:
import xml.etree.ElementTree as ET
tree = ET.parse("input.xml")
for product in tree.getroot():
    for child in product:
        if child.tag == "parent":
            # Replace the shared parent value with the first id of its group
            child.text = pairs[child.text]
tree.write("output.xml")
In output.xml:
<products>
<product>
<id>1</id>
<parent>1</parent>
</product>
<product>
<id>2</id>
<parent>1</parent>
</product>
<product>
<id>3</id>
<parent>1</parent>
</product>
<product>
<id>4</id>
</product>
<product>
<id>5</id>
<parent>5</parent>
</product>
<product>
<id>6</id>
<parent>5</parent>
</product>
</products>
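Since the question ultimately wants a CSV, the same mapping can also be built in a single pass with only the standard library. This is a minimal sketch, not the answer's method: the helper name products_to_rows and the inline SAMPLE string are assumptions made for illustration, and it assumes every product has an id element.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical inline copy of input.xml, used only for this illustration.
SAMPLE = """<products>
<product><id>1</id><parent>x</parent></product>
<product><id>2</id><parent>x</parent></product>
<product><id>3</id><parent>x</parent></product>
<product><id>4</id></product>
<product><id>5</id><parent>y</parent></product>
<product><id>6</id><parent>y</parent></product>
</products>"""


def products_to_rows(xml_text):
    """Return one dict per product, with parent replaced by the first id of its group."""
    root = ET.fromstring(xml_text)
    first_id = {}  # parent value -> id of the first product seen with that value
    rows = []
    for product in root.iter("product"):
        product_id = product.findtext("id")
        parent = product.findtext("parent")
        if parent is None:
            # No parent tag: leave the column empty
            rows.append({"id": product_id, "parent": ""})
            continue
        first_id.setdefault(parent, product_id)
        rows.append({"id": product_id, "parent": first_id[parent]})
    return rows


rows = products_to_rows(SAMPLE)

# Write the rows as CSV into an in-memory buffer (swap in open("test.csv", "w", newline="") for a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "parent"])
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
```

Because the mapping is filled while iterating in document order, "first id" here means the first product that appears in the file with that parent value.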
I have the following XML and I would like to group its contents by Table Id.
xml = """
<Tables Count="19">
<Table Id="1" >
<Data>
<Cell>
<Brush/>
<Text>AA</Text>
<Text>BB</Text>
</Cell>
</Data>
</Table>
<Table Id="2" >
<Data>
<Cell>
<Brush/>
<Text>CC</Text>
<Text>DD</Text>
</Cell>
</Data>
</Table>
</Tables>
"""
I would like to parse it and get something like this.
I have tried the code below but couldn't figure it out.
from lxml import etree
tree = etree.fromstring(xml)
users = {}
for user in tree.xpath("//Tables"):
    name = user.xpath("Table")[0].text
    users[name] = []
    for group in user.xpath("Data/Cell/Text"):
        users[name].append(group.text)
print (users)
Is it possible to get the above result? If so, could anyone help me do this? I really appreciate your effort.
You need to change your XPath queries to:
from lxml import etree
tree = etree.fromstring(xml)
users = {}
for user in tree.xpath("//Tables/Table"):  # select each Table, not the Tables root
    name = user.attrib['Id']
    users[name] = []
    for group in user.xpath(".//Data/Cell/Text"):  # the leading "." keeps the search inside this Table
        users[name].append(group.text)
print(users)
...and use the attrib dictionary.
This yields for your string:
{'1': ['AA', 'BB'], '2': ['CC', 'DD']}
If you're into "one-liners", you could even do:
users = {name: [group.text for group in user.xpath(".//Data/Cell/Text")]
for user in tree.xpath("//Tables/Table")
for name in [user.attrib["Id"]]}
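If installing lxml is not an option, the same grouping can be sketched with the standard library's ElementTree, which supports a limited XPath subset. This is an illustrative variant under that assumption, not part of the answer above; the XML string is copied from the question.

```python
import xml.etree.ElementTree as ET

xml = """
<Tables Count="19">
  <Table Id="1">
    <Data>
      <Cell>
        <Brush/>
        <Text>AA</Text>
        <Text>BB</Text>
      </Cell>
    </Data>
  </Table>
  <Table Id="2">
    <Data>
      <Cell>
        <Brush/>
        <Text>CC</Text>
        <Text>DD</Text>
      </Cell>
    </Data>
  </Table>
</Tables>
"""

root = ET.fromstring(xml)  # root is the <Tables> element

# Map each Table's Id attribute to the text of the Text elements below it.
users = {
    table.get("Id"): [text.text for text in table.findall("./Data/Cell/Text")]
    for table in root.findall("Table")
}
print(users)
```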
Hello, I am making a requests call to return order data from an online store. My issue is that once I have parsed the data into a root variable, iterating over it does not return the results I expect: it displays multiple tags of the same name rather than one, and it does not show the data within each tag.
I thought this was because the XML was not correctly formatted, so I formatted it by saving it to a file with pretty_print, but that hasn't fixed the problem.
How do I fix this? Thanks in advance.
Code:
import requests, xml.etree.ElementTree as ET, lxml.etree as etree
url="http://publicapi.ekmpowershop24.com/v1.1/publicapi.asmx"
headers = {'content-type': 'application/soap+xml'}
body = """<?xml version="1.0" encoding="utf-8"?>
<soap12:Envelope xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:soap12="http://www.w3.org/2003/05/soap-envelope">
<soap12:Body>
<GetOrders xmlns="http://publicapi.ekmpowershop.com/">
<GetOrdersRequest>
<APIKey>my_api_key</APIKey>
<FromDate>01/07/2018</FromDate>
<ToDate>04/07/2018</ToDate>
</GetOrdersRequest>
</GetOrders>
</soap12:Body>
</soap12:Envelope>"""
#send request to ekm
r = requests.post(url,data=body,headers=headers)
#save output to file
file = open("C:/Users/Mark/Desktop/test.xml", "w")
file.write(r.text)
file.close()
#take the file and format the xml
x = etree.parse("C:/Users/Mark/Desktop/test.xml")
newString = etree.tostring(x, pretty_print=True)
file = open("C:/Users/Mark/Desktop/test.xml", "w")
file.write(newString.decode('utf-8'))
file.close()
#parse the file to get the roots
tree = ET.parse("C:/Users/Mark/Desktop/test.xml")
root = tree.getroot()
#access elements names in the data
for child in root.iter('*'):
print(child.tag)
#show orders elements attributes
tree = ET.parse("C:/Users/Mark/Desktop/test.xml")
root = tree.getroot()
for order in root.iter('{http://publicapi.ekmpowershop.com/}Order'):
out = {}
for child in order:
if child.tag in ('OrderID'):
out[child.tag] = child.text
print(out)
Elements output:
{http://publicapi.ekmpowershop.com/}Orders
{http://publicapi.ekmpowershop.com/}Order
{http://publicapi.ekmpowershop.com/}OrderID
{http://publicapi.ekmpowershop.com/}OrderNumber
{http://publicapi.ekmpowershop.com/}CustomerID
{http://publicapi.ekmpowershop.com/}CustomerUserID
{http://publicapi.ekmpowershop.com/}Order
{http://publicapi.ekmpowershop.com/}OrderID
{http://publicapi.ekmpowershop.com/}OrderNumber
{http://publicapi.ekmpowershop.com/}CustomerID
{http://publicapi.ekmpowershop.com/}CustomerUserID
Orders Output:
{http://publicapi.ekmpowershop.com/}Order {}
{http://publicapi.ekmpowershop.com/}Order {}
XML structure after formatting:
<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<GetOrdersResponse xmlns="http://publicapi.ekmpowershop.com/">
<GetOrdersResult>
<Status>Success</Status>
<Errors/>
<Date>2018-07-10T13:47:00.1682029+01:00</Date>
<TotalOrders>10</TotalOrders>
<TotalCost>100</TotalCost>
<Orders>
<Order>
<OrderID>100</OrderID>
<OrderNumber>102/040718/67</OrderNumber>
<CustomerID>6910</CustomerID>
<CustomerUserID>204</CustomerUserID>
<FirstName>TestFirst</FirstName>
<LastName>TestLast</LastName>
<CompanyName>Test Company</CompanyName>
<EmailAddress>test@Test.com</EmailAddress>
<OrderStatus>Dispatched</OrderStatus>
<OrderStatusColour>#00CC00</OrderStatusColour>
<TotalCost>85.8</TotalCost>
<OrderDate>10/07/2018 14:30:43</OrderDate>
<OrderDateISO>2018-07-10T14:30:43</OrderDateISO>
<AbandonedOrder>false</AbandonedOrder>
<EkmStatus>SUCCESS</EkmStatus>
</Order>
</Orders>
<Currency>GBP</Currency>
</GetOrdersResult>
</GetOrdersResponse>
</soap:Body>
</soap:Envelope>
You need to consider the namespace when checking for tags.
>>> # Include the namespace part of the tag in the tag values that we check.
>>> tags = ('{http://publicapi.ekmpowershop.com/}OrderID', '{http://publicapi.ekmpowershop.com/}OrderNumber')
>>> for order in root.iter('{http://publicapi.ekmpowershop.com/}Order'):
...     out = {}
...     for child in order:
...         if child.tag in tags:
...             out[child.tag] = child.text
...     print(out)
...
{'{http://publicapi.ekmpowershop.com/}OrderID': '100', '{http://publicapi.ekmpowershop.com/}OrderNumber': '102/040718/67'}
If you don't want the namespace prefixes in the output, you can strip them by only including that part of the tag after the } character.
>>> for order in root.iter('{http://publicapi.ekmpowershop.com/}Order'):
...     out = {}
...     for child in order:
...         if child.tag in tags:
...             out[child.tag[child.tag.index('}')+1:]] = child.text
...     print(out)
...
{'OrderID': '100', 'OrderNumber': '102/040718/67'}
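Another way to handle the namespace, sketched here as an alternative, is to pass a prefix-to-URI mapping to ElementTree's find methods so the full URI does not have to be repeated in every tag. The prefix ekm and the trimmed SAMPLE response are assumptions made for illustration; only the namespace URI comes from the question.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, hypothetical copy of the response shown in the question.
SAMPLE = """<soap:Envelope xmlns:soap="http://www.w3.org/2003/05/soap-envelope">
  <soap:Body>
    <GetOrdersResponse xmlns="http://publicapi.ekmpowershop.com/">
      <GetOrdersResult>
        <Orders>
          <Order>
            <OrderID>100</OrderID>
            <OrderNumber>102/040718/67</OrderNumber>
          </Order>
        </Orders>
      </GetOrdersResult>
    </GetOrdersResponse>
  </soap:Body>
</soap:Envelope>"""

# The prefix is arbitrary; it only has to match the one used in the paths below.
NS = {"ekm": "http://publicapi.ekmpowershop.com/"}

root = ET.fromstring(SAMPLE)
orders = []
for order in root.iterfind(".//ekm:Order", NS):
    orders.append({
        "OrderID": order.findtext("ekm:OrderID", namespaces=NS),
        "OrderNumber": order.findtext("ekm:OrderNumber", namespaces=NS),
    })
print(orders)
```

The resulting keys are plain strings chosen in the code, so no namespace stripping is needed afterwards.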
Here is a little xml example:
<?xml version="1.0" encoding="UTF-8"?>
<list>
<person id="1">
<name>Smith</name>
<city>New York</city>
</person>
<person id="2">
<name>Pitt</name>
</person>
...
...
</list>
Now I need all persons that have both a name and a city.
I tried:
#!/usr/bin/python
# coding: utf8
import xml.dom.minidom as dom
tree = dom.parse("test.xml")
for listItems in tree.firstChild.childNodes:
for personItems in listItems.childNodes:
if personItems.nodeName == "name" and personItems.nextSibling == "city":
print personItems.firstChild.data.strip()
But the output is empty. Without the "and" condition I get all names. How can I check that the next tag after "name" is "city"?
You can do this in minidom:
import xml.dom.minidom as minidom
def getChild(n, v):
    # Yield the direct children of node n whose local name is v
    for child in n.childNodes:
        if child.localName == v:
            yield child
xmldoc = minidom.parse('test.xml')
person = getChild(xmldoc, 'list')
for p in person:
    for v in getChild(p, 'person'):
        attr = v.getAttributeNode('id')
        if attr:
            print(attr.nodeValue.strip())
This prints the id of each person node:
1
2
Use ElementTree:
import xml.etree.ElementTree as ET
tree = ET.parse('a.xml')
root = tree.getroot()
for person in root.findall('person'):
    name = person.find('name').text
    try:
        city = person.find('city').text
    except AttributeError:
        # find() returned None, so this person has no city
        continue
    print(name, city)
You can get the id with person.get('id').
Output: Smith New York
Using lxml, you can use xpath to get in one step what you need:
from lxml import etree
xmlstr = """
<list>
<person id="1">
<name>Smith</name>
<city>New York</city>
</person>
<person id="2">
<name>Pitt</name>
</person>
</list>
"""
xml = etree.fromstring(xmlstr)
xp = "//person[city]"
for person in xml.xpath(xp):
    print(etree.tostring(person))
lxml is an external Python package, but it is so useful that I find it always worth installing.
The XPath expression selects any person element (//) that has a city subelement (the condition inside []).
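The same child-existence predicate is also part of the limited XPath subset in the standard library's ElementTree, so a sketch without lxml could look like this. It is an illustrative variant rather than the approach above, and the tuple layout of people is just a choice made for the example.

```python
import xml.etree.ElementTree as ET

xmlstr = """
<list>
    <person id="1">
        <name>Smith</name>
        <city>New York</city>
    </person>
    <person id="2">
        <name>Pitt</name>
    </person>
</list>
"""

root = ET.fromstring(xmlstr)

# "[city]" keeps only person elements that have a child named city.
people = [
    (person.get("id"), person.findtext("name"), person.findtext("city"))
    for person in root.findall(".//person[city]")
]
print(people)
```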
I'm looking for an XML to dictionary parser using ElementTree. I already found some, but they exclude the attributes, and in my case I have a lot of attributes.
The following XML-to-Python-dict snippet parses entities as well as attributes following this XML-to-JSON "specification":
from collections import defaultdict
def etree_to_dict(t):
    d = {t.tag: {} if t.attrib else None}
    children = list(t)
    if children:
        # Group converted children by tag; repeated tags become lists
        dd = defaultdict(list)
        for dc in map(etree_to_dict, children):
            for k, v in dc.items():
                dd[k].append(v)
        d = {t.tag: {k: v[0] if len(v) == 1 else v
                     for k, v in dd.items()}}
    if t.attrib:
        # Attributes are stored under '@'-prefixed keys
        d[t.tag].update(('@' + k, v)
                        for k, v in t.attrib.items())
    if t.text:
        text = t.text.strip()
        if children or t.attrib:
            if text:
                d[t.tag]['#text'] = text
        else:
            d[t.tag] = text
    return d
It is used like this:
from xml.etree import ElementTree as ET
e = ET.XML('''
<root>
<e />
<e>text</e>
<e name="value" />
<e name="value">text</e>
<e> <a>text</a> <b>text</b> </e>
<e> <a>text</a> <a>text</a> </e>
<e> text <a>text</a> </e>
</root>
''')
from pprint import pprint
d = etree_to_dict(e)
pprint(d)
The output of this example (as per above-linked "specification") should be:
{'root': {'e': [None,
                'text',
                {'@name': 'value'},
                {'#text': 'text', '@name': 'value'},
                {'a': 'text', 'b': 'text'},
                {'a': ['text', 'text']},
                {'#text': 'text', 'a': 'text'}]}}
Not necessarily pretty, but it is unambiguous, and simpler XML inputs result in simpler JSON. :)
Update
If you want to do the reverse, emit an XML string from a JSON/dict, you can use:
def dict_to_etree(d):
    def _to_etree(d, root):
        if not d:
            pass
        elif isinstance(d, str):
            root.text = d
        elif isinstance(d, dict):
            for k, v in d.items():
                assert isinstance(k, str)
                if k.startswith('#'):
                    # '#text' holds the element's text content
                    assert k == '#text' and isinstance(v, str)
                    root.text = v
                elif k.startswith('@'):
                    # '@name' keys become attributes
                    assert isinstance(v, str)
                    root.set(k[1:], v)
                elif isinstance(v, list):
                    for e in v:
                        _to_etree(e, ET.SubElement(root, k))
                else:
                    _to_etree(v, ET.SubElement(root, k))
        else:
            assert d == 'invalid type', (type(d), d)
    assert isinstance(d, dict) and len(d) == 1
    tag, body = next(iter(d.items()))
    node = ET.Element(tag)
    _to_etree(body, node)
    return node
print(ET.tostring(dict_to_etree(d)))
from lxml import etree
def etree_to_dict(t):
    d = {t.tag: list(map(etree_to_dict, t.iterchildren()))}
    d.update(('@' + k, v) for k, v in t.attrib.items())
    d['text'] = t.text
    return d
Call as
tree = etree.parse("some_file.xml")
etree_to_dict(tree.getroot())
This works as long as you don't actually have an attribute text; if you do, then change the third line in the function body to use a different key. Also, you can't handle mixed content with this.
(Tested on LXML.)
For transforming XML from/to python dictionaries, xmltodict has worked great for me:
import xmltodict
xml = '''
<root>
<e />
<e>text</e>
<e name="value" />
<e name="value">text</e>
<e> <a>text</a> <b>text</b> </e>
<e> <a>text</a> <a>text</a> </e>
<e> text <a>text</a> </e>
</root>
'''
xdict = xmltodict.parse(xml)
xdict will now look like
OrderedDict([('root',
OrderedDict([('e',
[None,
'text',
OrderedDict([('@name', 'value')]),
OrderedDict([('@name', 'value'),
('#text', 'text')]),
OrderedDict([('a', 'text'), ('b', 'text')]),
OrderedDict([('a', ['text', 'text'])]),
OrderedDict([('a', 'text'),
('#text', 'text')])])]))])
If your XML data is not in raw string/bytes form but in some ElementTree object, you just need to serialize it to a string and use xmltodict.parse again. For instance, if you are using lxml to process the XML documents, then
from lxml import etree
e = etree.XML(xml)
xmltodict.parse(etree.tostring(e))
will produce the same dictionary as above.
Based on @larsmans' answer, if you don't need attributes, this will give you a tighter dictionary:
def etree_to_dict(t):
    # list() is needed on Python 3, where a bare map object is always truthy
    return {t.tag: list(map(etree_to_dict, t.iterchildren())) or t.text}
Several answers already, but here's one compact solution that maps attributes, text value and children using dict-comprehension:
import xml.etree.ElementTree as ET
def etree_to_dict(t):
    if type(t) is ET.ElementTree:
        return etree_to_dict(t.getroot())
    return {
        **t.attrib,
        'text': t.text,
        **{e.tag: etree_to_dict(e) for e in t}
    }
The lxml documentation gives an example of how to map an XML tree into a dict of dicts:
def recursive_dict(element):
    return element.tag, dict(map(recursive_dict, element)) or element.text
Note that this beautiful quick-and-dirty converter expects children to have unique tag names and will silently overwrite any data that was contained in preceding siblings with the same name. For any real-world application of xml-to-dict conversion, you would better write your own, longer version of this.
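To make the overwrite behaviour concrete, here is a small sketch that runs the same converter on two sibling Person elements. It uses the standard library's ElementTree instead of lxml (an assumption for the sake of a self-contained example; the function only relies on iteration, tag and text), and the sample document is made up for illustration.

```python
import xml.etree.ElementTree as ET


def recursive_dict(element):
    # Same converter as quoted from the lxml documentation
    return element.tag, dict(map(recursive_dict, element)) or element.text


root = ET.fromstring(
    "<Data>"
    "<Person><First>John</First></Person>"
    "<Person><First>Jane</First></Person>"
    "</Data>"
)

tag, data = recursive_dict(root)
print(tag, data)  # only the last Person survives in data
```

Both Person elements produce the key 'Person', so the second one replaces the first when the pairs are turned into a dict.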
You could create a custom dictionary to deal with preceding siblings with the same name being overwritten:
from collections import UserDict, namedtuple
from lxml.etree import QName
class XmlDict(UserDict):
    """Custom dict to avoid preceding siblings with the same name being overwritten."""
    __ROOTELM = namedtuple('RootElm', ['tag', 'node'])
    def __setitem__(self, key, value):
        if key in self:
            # Repeated key: collect the values in a list instead of overwriting
            if type(self.data[key]) is list:
                self.data[key].append(value)
            else:
                self.data[key] = [self.data[key], value]
        else:
            self.data[key] = value
    @staticmethod
    def xml2dict(element):
        """Converts an ElementTree Element to a dictionary."""
        elm = XmlDict.__ROOTELM(
            tag=QName(element).localname,
            node=XmlDict(map(XmlDict.xml2dict, element)) or element.text,
        )
        return elm
Usage
from lxml import etree
from pprint import pprint
xml_f = b"""<?xml version="1.0" encoding="UTF-8"?>
<Data>
<Person>
<First>John</First>
<Last>Smith</Last>
</Person>
<Person>
<First>Jane</First>
<Last>Doe</Last>
</Person>
</Data>"""
elm = etree.fromstring(xml_f)
d = XmlDict.xml2dict(elm)
Output
In [3]: pprint(d)
RootElm(tag='Data', node={'Person': [{'First': 'John', 'Last': 'Smith'}, {'First': 'Jane', 'Last': 'Doe'}]})
In [4]: pprint(d.node)
{'Person': [{'First': 'John', 'Last': 'Smith'},
{'First': 'Jane', 'Last': 'Doe'}]}
Here is a simple data structure in xml (save as file.xml):
<?xml version="1.0" encoding="UTF-8"?>
<Data>
<Person>
<First>John</First>
<Last>Smith</Last>
</Person>
<Person>
<First>Jane</First>
<Last>Doe</Last>
</Person>
</Data>
Here is the code to create a list of dictionary objects from it.
from lxml import etree
tree = etree.parse('file.xml')
root = tree.getroot()
datadict = []
for item in root:
    d = {}
    for elem in item:
        d[elem.tag] = elem.text
    datadict.append(d)
datadict now contains:
[{'First': 'John', 'Last': 'Smith'},{'First': 'Jane', 'Last': 'Doe'}]
and can be accessed like so:
datadict[0]['First']
'John'
datadict[1]['Last']
'Doe'
You can use this snippet to convert the XML directly to a dictionary:
import xml.etree.ElementTree as ET
xml = ('<xml>' +
'<first_name>Dean Christian</first_name>' +
'<middle_name>Christian</middle_name>' +
'<last_name>Armada</last_name>' +
'</xml>')
root = ET.fromstring(xml)
x = {child.tag: child.text for child in root}  # iterate the element itself; the private _children attribute is not available in Python 3
# returns {'first_name': 'Dean Christian', 'last_name': 'Armada', 'middle_name': 'Christian'}
I enhanced the accepted answer for Python 3 so that it uses a JSON-style list when all children have the same tag. It also provides an option to wrap the dict in the root tag or not.
from collections import OrderedDict
from typing import Union
from xml.etree.ElementTree import ElementTree, Element, parse
def etree_to_dict(root: Union[ElementTree, Element], include_root_tag=False):
    root = root.getroot() if isinstance(root, ElementTree) else root
    result = OrderedDict()
    if len(root) > 1 and len({child.tag for child in root}) == 1:
        # All children share one tag: emit them as a list under that tag
        result[next(iter(root)).tag] = [etree_to_dict(child) for child in root]
    else:
        for child in root:
            result[child.tag] = etree_to_dict(child) if len(list(child)) > 0 else (child.text or "")
    result.update(('@' + k, v) for k, v in root.attrib.items())
    return {root.tag: result} if include_root_tag else result
d = etree_to_dict(parse('data.xml'), True)
from lxml import etree, objectify
def formatXML(parent):
    """
    Recursive operation which returns a tree formatted
    as dicts and lists.
    Decision to add a list is to find the 'List' word
    in the actual parent tag.
    """
    ret = {}
    if parent.items(): ret.update(dict(parent.items()))
    if parent.text: ret['__content__'] = parent.text
    if ('List' in parent.tag):
        ret['__list__'] = []
        for element in parent:
            ret['__list__'].append(formatXML(element))
    else:
        for element in parent:
            ret[element.tag] = formatXML(element)
    return ret
Building on @larsmans' answer, if the resulting keys contain XML namespace info, you can remove that before writing to the dict. Set a variable xmlns equal to the namespace and strip it from the front of each tag.
xmlns = '{http://foo.namespaceinfo.com}'
def etree_to_dict(t):
    if t.tag.startswith(xmlns):
        # Slice the prefix off; str.lstrip would strip a set of characters, not a prefix
        t.tag = t.tag[len(xmlns):]
    d = {t.tag: list(map(etree_to_dict, t.iterchildren()))}
    d.update(('@' + k, v) for k, v in t.attrib.items())
    d['text'] = t.text
    return d
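One detail worth checking when stripping a namespace is that str.lstrip takes a set of characters rather than a prefix, so it can remove more than the namespace. The sketch below shows that pitfall next to a small helper that keeps only the local name. The helper name strip_ns and the tag foo are hypothetical; the namespace value is the placeholder from the answer above.

```python
def strip_ns(tag):
    """Return the local part of a '{namespace}local' tag; plain tags pass through."""
    return tag.rpartition('}')[2]


xmlns = '{http://foo.namespaceinfo.com}'
qualified = xmlns + 'foo'

local = strip_ns(qualified)          # keeps everything after the closing brace
lstripped = qualified.lstrip(xmlns)  # strips every leading character found in xmlns

print(repr(local), repr(lstripped))
```

Here every letter of the local name also occurs in the namespace string, so lstrip removes the whole tag, while the helper leaves the local name intact.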
If you have a schema, the xmlschema package already implements multiple XML-to-dict converters that honor the schema and attribute types. Quoting the following from the docs
Available converters
The library includes some converters. The default converter
xmlschema.XMLSchemaConverter is the base class of other converter
types. Each derived converter type implements a well know convention,
related to the conversion from XML to JSON data format:
xmlschema.ParkerConverter: Parker convention
xmlschema.BadgerFishConverter: BadgerFish convention
xmlschema.AbderaConverter: Apache Abdera project convention
xmlschema.JsonMLConverter: JsonML (JSON Mark-up Language) convention
Documentation of these different conventions is available here: http://wiki.open311.org/JSON_and_XML_Conversion/
Usage of the converters is straightforward, e.g.:
from xmlschema import ParkerConverter, XMLSchema, to_dict
xml = '...'
schema = XMLSchema('...')
to_dict(xml, schema=schema, converter=ParkerConverter)