I have an absolute path for the values of XML files I want to retrieve. The absolute path is in the format of "A/B/C". How can I do this in Python?
Another method.
from simplified_scrapy import SimplifiedDoc, utils, req
# Basic
xml = '''<ROOT><A><B><C>The Value</C></B></A></ROOT>'''
doc = SimplifiedDoc(xml)
print (doc.select('A>B>C'))
# Multiple
xml = '''<ROOT><A><B><C>The Value 1</C></B></A><A><B><C>The Value 2</C></B></A></ROOT>'''
doc = SimplifiedDoc(xml)
# print (doc.selects('A').select('B').select('C'))
print (doc.selects('A').select('B>C'))
# Mixed structure
xml = '''<ROOT><A><other>no B</other></A><A><other></other><B>no C</B></A><A><B><C>The Value</C></B></A></ROOT>'''
doc = SimplifiedDoc(xml)
nodes = doc.selects('A').selects('B').select('C')
for node in nodes:
    for c in node:
        if c:
            print (c)
Result:
{'tag': 'C', 'html': 'The Value'}
[{'tag': 'C', 'html': 'The Value 1'}, {'tag': 'C', 'html': 'The Value 2'}]
{'tag': 'C', 'html': 'The Value'}
Using ElementTree library (Note that my answer uses core python library while the other answers are using external libraries.)
import xml.etree.ElementTree as ET
xml = '''<ROOT><A><B><C>The Value</C></B></A></ROOT>'''
root = ET.fromstring(xml)
print(root.find('./A/B/C').text)
output
The Value
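If several elements match the same "A/B/C" path, findall returns all of them. The sketch below is only an illustration: the helper name values_at and the sample XML string are assumptions, not part of the question.

```python
import xml.etree.ElementTree as ET

# Sample document (hypothetical) with two elements at the path A/B/C
xml_text = (
    "<ROOT>"
    "<A><B><C>The Value 1</C></B></A>"
    "<A><B><C>The Value 2</C></B></A>"
    "</ROOT>"
)

def values_at(root, path):
    """Return the text of every element matching a slash-separated path below root."""
    return [elem.text for elem in root.findall("./" + path)]

root = ET.fromstring(xml_text)
values = values_at(root, "A/B/C")
print(values)
```

A path with no match simply yields an empty list, so the caller does not need to guard against None as with find.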
You can use lxml which you can install via pip install lxml.
See also https://lxml.de/xpathxslt.html
from io import StringIO
from lxml import etree
data = '''\
<prestashop>
<combination>
<id>a</id>
<id_product>b</id_product>
<location>c</location>
<ean13>d</ean13>
<isbn>e</isbn>
<upc>f</upc>
<mpn>g</mpn>
</combination>
</prestashop>
'''
xpath = '/prestashop/combination/ean13'
f = StringIO(data)
tree = etree.parse(f)
matches = tree.xpath(xpath)
for e in matches:
    print(e.text)
There is a jsx file with contents
<import name="abcd" color="green" age="25" />
<View color={dsdssd}>
<IBG
color={[color.imagecolor, color.image125]}
imageStyle={[styles.imageStyle, styles.image125]}
source={{ uri: contents.aimeecard }} >
<View color={styles.titleContainer}>
<Text color={[{green: 45}, styles.mainTileText]}</Text>
<View color={[abcde.text]} />
</View>
</View>
I need to fetch the details of first line using python script:
Expected output
name="abcd"
color="green"
age="25"
Also the path of jsx file is passed through list
ex: [abcd/file1.jsx , dcef/file2.jsx]
Python code tried for fetching jsx file through the list
for file in jsx_path:
data = md.parse("file")
print( file.firstChild.tagName )
Values are not fetched and getting error.
Can anyone help me in resolving this?
Assuming jsx_path is the list containing the paths to all the jsx files, you can iterate over it and use a context manager so you don't have to close each file explicitly:
data = ""
for file in jsx_path:
    with open(file) as f:
        # Drop the leading '<import ' (8 chars) and the trailing ' />\n' (4 chars)
        data += f.readline()[8:-4] + "\n"
print(data) # name="abcd" color="green" age="25"
Following your comment, if you want to output it as a dict, you can tweak the previous code:
import re
data = []
for file in jsx_path:
    with open(file) as f:
        # Raw string so that \W is not treated as a string escape
        data.append(re.split(r'\W+|=', f.readline()[8:-4]))
data_dict = []
for d in data:
    # Even positions are keys, odd positions are values
    data_dict.append({key: value for (key, value) in zip(d[::2], d[1::2])})
print(data_dict) # [{'name': 'abcd', 'color': 'green', 'age': '25'}]
Note that this is a hack. I only read the JSX file sequentially because your use case is simple enough to do so. You can also use a dedicated parser by extending the stdlib class HTMLParser:
from html.parser import HTMLParser
class JSXImportParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Remember the attributes of the most recent <import> tag
        if tag == "import":
            self._import_attrs = {key: value for (key, value) in attrs}
    @property
    def import_attrs(self):
        return self._import_attrs
parser = JSXImportParser()
data = []
for file in jsx_path:
    with open(file) as f:
        parser.feed(f.read())
        data.append(parser.import_attrs)
print(data) # [{'name': 'abcd', 'color': 'green', 'age': '25'}]
Note that this only keeps the details of the last import tag in each file. You can alter this behavior by changing how handle_starttag stores _import_attrs, for example by appending to a list.
Edit: Following your additional comment about the requirement to use an XML parser library, the same thing can be achieved using ElementTree by sampling the file to extract only what's interesting for you (the import tag):
import xml.etree.ElementTree as ET
data = []
for file in jsx_path:
    with open(file) as f:
        import_statement = ET.XML(f.readline())
        data.append(import_statement.attrib)
print(data) # [{'name': 'abcd', 'color': 'green', 'age': '25'}]
Of course, this only works if the import statement is on the first line. If it is not, you'll have to locate it before calling ET.XML.
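One way to locate the import line first is to scan the file line by line and hand only the matching line to ET.XML. This is a rough sketch, assuming the import tag is self-closing and sits on a single line; the helper name first_import_attrs and the sample lines are made up for illustration.

```python
import xml.etree.ElementTree as ET

def first_import_attrs(lines):
    """Return the attributes of the first self-closing <import ... /> line, or None."""
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("<import ") and stripped.endswith("/>"):
            # Only this one line is parsed as XML; the rest of the JSX is ignored
            return ET.XML(stripped).attrib
    return None

# Hypothetical file content where the import tag is not on the first line
sample = [
    "// header comment\n",
    '<import name="abcd" color="green" age="25" />\n',
    "<View color={dsdssd}>\n",
]
attrs = first_import_attrs(sample)
print(attrs)
```

Because iterating over an open file also yields lines, the same function can be called as first_import_attrs(f) inside the with open(file) as f: block.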
I am trying to parse an XML file in Python, and it seems my XML is structured differently from the usual examples.
Below is my XML snippet:
<records>
<record>
<parameter>
<name>Server</name>
<value>Application_server_01</value>
</parameter>
</record>
</records>
I am trying to get the name and value of each "parameter" element, but I only seem to get empty values.
I checked the online documentation and almost all XML seems to be in the below format
<neighbor name="Switzerland" direction="W"/>
I am able to parse this fine. How can I get the values from my XML elements without changing the formatting?
working code
import xml.etree.ElementTree as ET
tree = ET.parse('country_data.xml')
root = tree.getroot()
for neighbor in root.iter('neighbor'):
    print(neighbor.attrib)
output
C:/Users/xxxxxx/PycharmProjects/default/parse.py
{'direction': 'E', 'name': 'Austria'}
{'direction': 'W', 'name': 'Switzerland'}
{'direction': 'N', 'name': 'Malaysia'}
{'direction': 'W', 'name': 'Costa Rica'}
{'direction': 'E', 'name': 'Colombia'}
PS: I will be using the XML to fire an API call and doubt if the downstream application would like the second way of formatting.
Below is my python code
import xml.etree.ElementTree as ET
tree = ET.parse('at.xml')
root = tree.getroot()
for name in root.iter('name'):
    print(name.attrib)
Output for the above code
C:/Users/xxxxxx/PycharmProjects/default/learning.py
{}
{}
{}
{}
{}
{}
{}
{}
Use lxml and XPath:
from lxml import etree as et
tree = et.parse(open("/tmp/so.xml"))
name = tree.xpath("/records/record/parameter/name/text()")[0]
value = tree.xpath("/records/record/parameter/value/text()")[0]
print(name, value)
Output:
Server Application_server_01
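The empty dictionaries in the question come from reading .attrib on elements that carry no attributes. In this XML the data is element text, so it has to be read with .text or findtext. A minimal sketch with the standard library only, assuming the snippet from the question with the closing parameter tag completed:

```python
import xml.etree.ElementTree as ET

# XML from the question, inlined here as a string for illustration
xml_text = """<records>
  <record>
    <parameter>
      <name>Server</name>
      <value>Application_server_01</value>
    </parameter>
  </record>
</records>"""

root = ET.fromstring(xml_text)

# Map each parameter's <name> text to its <value> text
params = {
    p.findtext("name"): p.findtext("value")
    for p in root.iter("parameter")
}
print(params)
```

No change to the XML layout is needed; only the way the values are read differs from the attribute-based country_data.xml example.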
I'm looking for an XML to dictionary parser using ElementTree, I already found some but they are excluding the attributes, and in my case I have a lot of attributes.
The following XML-to-Python-dict snippet parses entities as well as attributes following this XML-to-JSON "specification":
from collections import defaultdict
def etree_to_dict(t):
    d = {t.tag: {} if t.attrib else None}
    children = list(t)
    if children:
        # Group children by tag; repeated tags become lists
        dd = defaultdict(list)
        for dc in map(etree_to_dict, children):
            for k, v in dc.items():
                dd[k].append(v)
        d = {t.tag: {k: v[0] if len(v) == 1 else v
                     for k, v in dd.items()}}
    if t.attrib:
        # Attributes are stored under keys prefixed with '@'
        d[t.tag].update(('@' + k, v)
                        for k, v in t.attrib.items())
    if t.text:
        text = t.text.strip()
        if children or t.attrib:
            if text:
                d[t.tag]['#text'] = text
        else:
            d[t.tag] = text
    return d
It is used like this:
from xml.etree import ElementTree as ET
e = ET.XML('''
<root>
  <e />
  <e>text</e>
  <e name="value" />
  <e name="value">text</e>
  <e> <a>text</a> <b>text</b> </e>
  <e> <a>text</a> <a>text</a> </e>
  <e> text <a>text</a> </e>
</root>
''')
from pprint import pprint
d = etree_to_dict(e)
pprint(d)
The output of this example (as per above-linked "specification") should be:
{'root': {'e': [None,
                'text',
                {'@name': 'value'},
                {'#text': 'text', '@name': 'value'},
                {'a': 'text', 'b': 'text'},
                {'a': ['text', 'text']},
                {'#text': 'text', 'a': 'text'}]}}
Not necessarily pretty, but it is unambiguous, and simpler XML inputs result in simpler JSON. :)
Update
If you want to do the reverse, emit an XML string from a JSON/dict, you can use:
try:
basestring
except NameError: # python3
basestring = str
def dict_to_etree(d):
    def _to_etree(d, root):
        if not d:
            pass
        elif isinstance(d, str):
            root.text = d
        elif isinstance(d, dict):
            for k, v in d.items():
                assert isinstance(k, str)
                if k.startswith('#'):
                    # '#text' carries the element's text content
                    assert k == '#text' and isinstance(v, str)
                    root.text = v
                elif k.startswith('@'):
                    # '@name' keys become XML attributes
                    assert isinstance(v, str)
                    root.set(k[1:], v)
                elif isinstance(v, list):
                    for e in v:
                        _to_etree(e, ET.SubElement(root, k))
                else:
                    _to_etree(v, ET.SubElement(root, k))
        else:
            assert d == 'invalid type', (type(d), d)
    assert isinstance(d, dict) and len(d) == 1
    tag, body = next(iter(d.items()))
    node = ET.Element(tag)
    _to_etree(body, node)
    return node
print(ET.tostring(dict_to_etree(d)))
def etree_to_dict(t):
    d = {t.tag: list(map(etree_to_dict, t.iterchildren()))}
    d.update(('@' + k, v) for k, v in t.attrib.items())
    d['text'] = t.text
    return d
Call as
tree = etree.parse("some_file.xml")
etree_to_dict(tree.getroot())
This works as long as you don't actually have an attribute text; if you do, then change the third line in the function body to use a different key. Also, you can't handle mixed content with this.
(Tested on LXML.)
For transforming XML from/to python dictionaries, xmltodict has worked great for me:
import xmltodict
xml = '''
<root>
<e />
<e>text</e>
<e name="value" />
<e name="value">text</e>
<e> <a>text</a> <b>text</b> </e>
<e> <a>text</a> <a>text</a> </e>
<e> text <a>text</a> </e>
</root>
'''
xdict = xmltodict.parse(xml)
xdict will now look like
OrderedDict([('root',
              OrderedDict([('e',
                            [None,
                             'text',
                             OrderedDict([('@name', 'value')]),
                             OrderedDict([('@name', 'value'),
                                          ('#text', 'text')]),
                             OrderedDict([('a', 'text'), ('b', 'text')]),
                             OrderedDict([('a', ['text', 'text'])]),
                             OrderedDict([('a', 'text'),
                                          ('#text', 'text')])])]))])
If your XML data is not in raw string/bytes form but in some ElementTree object, you just need to serialize it to a string and call xmltodict.parse again. For instance, if you are using lxml to process the XML documents, then
from lxml import etree
e = etree.XML(xml)
xmltodict.parse(etree.tostring(e))
will produce the same dictionary as above.
Based on @larsmans's answer, if you don't need attributes, this will give you a tighter dictionary:
def etree_to_dict(t):
    # list(...) is needed on Python 3: a bare map object is always truthy
    return {t.tag: list(map(etree_to_dict, t.iterchildren())) or t.text}
Several answers already, but here's one compact solution that maps attributes, text value and children using dict-comprehension:
def etree_to_dict(t):
    if type(t) is ET.ElementTree: return etree_to_dict(t.getroot())
    return {
        **t.attrib,
        'text': t.text,
        **{e.tag: etree_to_dict(e) for e in t}
    }
The lxml documentation brings an example of how to map an XML tree into a dict of dicts:
def recursive_dict(element):
    return element.tag, dict(map(recursive_dict, element)) or element.text
Note that this beautiful quick-and-dirty converter expects children to have unique tag names and will silently overwrite any data that was contained in preceding siblings with the same name. For any real-world application of xml-to-dict conversion, you would better write your own, longer version of this.
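To make the overwriting concrete, here is a small sketch that runs the same converter on two sibling elements with the same tag. It uses the standard library ElementTree instead of lxml (an assumption made only to keep the example dependency-free), and the sample XML is invented.

```python
import xml.etree.ElementTree as ET

def recursive_dict(element):
    # Same one-liner as in the lxml documentation
    return element.tag, dict(map(recursive_dict, element)) or element.text

# Two <Person> siblings share a tag, so the second replaces the first
root = ET.fromstring(
    "<Data>"
    "<Person><First>John</First></Person>"
    "<Person><First>Jane</First></Person>"
    "</Data>"
)
tag, tree = recursive_dict(root)
print(tag, tree)
```

Only the last Person survives in the result, which is exactly the silent data loss the note warns about.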
You could create a custom dictionary to deal with preceding siblings with the same name being overwritten:
from collections import UserDict, namedtuple
from lxml.etree import QName
class XmlDict(UserDict):
    """Custom dict to avoid preceding siblings with the same name being overwritten."""
    __ROOTELM = namedtuple('RootElm', ['tag', 'node'])
    def __setitem__(self, key, value):
        if key in self:
            # Repeated key: collect the values in a list instead of overwriting
            if type(self.data[key]) is list:
                self.data[key].append(value)
            else:
                self.data[key] = [self.data[key], value]
        else:
            self.data[key] = value
    @staticmethod
    def xml2dict(element):
        """Converts an ElementTree Element to a dictionary."""
        elm = XmlDict.__ROOTELM(
            tag=QName(element).localname,
            node=XmlDict(map(XmlDict.xml2dict, element)) or element.text,
        )
        return elm
Usage
from lxml import etree
from pprint import pprint
xml_f = b"""<?xml version="1.0" encoding="UTF-8"?>
<Data>
<Person>
<First>John</First>
<Last>Smith</Last>
</Person>
<Person>
<First>Jane</First>
<Last>Doe</Last>
</Person>
</Data>"""
elm = etree.fromstring(xml_f)
d = XmlDict.xml2dict(elm)
Output
In [3]: pprint(d)
RootElm(tag='Data', node={'Person': [{'First': 'John', 'Last': 'Smith'}, {'First': 'Jane', 'Last': 'Doe'}]})
In [4]: pprint(d.node)
{'Person': [{'First': 'John', 'Last': 'Smith'},
{'First': 'Jane', 'Last': 'Doe'}]}
Here is a simple data structure in xml (save as file.xml):
<?xml version="1.0" encoding="UTF-8"?>
<Data>
<Person>
<First>John</First>
<Last>Smith</Last>
</Person>
<Person>
<First>Jane</First>
<Last>Doe</Last>
</Person>
</Data>
Here is the code to create a list of dictionary objects from it.
from lxml import etree
tree = etree.parse('file.xml')
root = tree.getroot()
datadict = []
for item in root:
    d = {}
    for elem in item:
        d[elem.tag] = elem.text
    datadict.append(d)
datadict now contains:
[{'First': 'John', 'Last': 'Smith'},{'First': 'Jane', 'Last': 'Doe'}]
and can be accessed like so:
datadict[0]['First']
'John'
datadict[1]['Last']
'Doe'
You can use this snippet that directly converts it from xml to dictionary
import xml.etree.ElementTree as ET
xml = ('<xml>' +
'<first_name>Dean Christian</first_name>' +
'<middle_name>Christian</middle_name>' +
'<last_name>Armada</last_name>' +
'</xml>')
root = ET.fromstring(xml)
# Iterate over the children directly; the private root._children does not exist on Python 3
x = {child.tag: child.text for child in root}
# returns {'first_name': 'Dean Christian', 'middle_name': 'Christian', 'last_name': 'Armada'}
I enhanced the accepted answer for Python 3 and made it use a JSON list when all children have the same tag. I also added an option to choose whether to wrap the dict with the root tag or not.
from collections import OrderedDict
from typing import Union
from xml.etree.ElementTree import ElementTree, Element, parse
def etree_to_dict(root: Union[ElementTree, Element], include_root_tag=False):
    root = root.getroot() if isinstance(root, ElementTree) else root
    result = OrderedDict()
    if len(root) > 1 and len({child.tag for child in root}) == 1:
        # All children share one tag: emit them as a list
        result[next(iter(root)).tag] = [etree_to_dict(child) for child in root]
    else:
        for child in root:
            result[child.tag] = etree_to_dict(child) if len(list(child)) > 0 else (child.text or "")
    result.update(('@' + k, v) for k, v in root.attrib.items())
    return {root.tag: result} if include_root_tag else result
d = etree_to_dict(parse('data.xml'), True)
from lxml import etree, objectify
def formatXML(parent):
    """
    Recursive operation which returns a tree formatted
    as dicts and lists.
    Decision to add a list is to find the 'List' word
    in the actual parent tag.
    """
    ret = {}
    if parent.items(): ret.update(dict(parent.items()))
    if parent.text: ret['__content__'] = parent.text
    if ('List' in parent.tag):
        ret['__list__'] = []
        for element in parent:
            ret['__list__'].append(formatXML(element))
    else:
        for element in parent:
            ret[element.tag] = formatXML(element)
    return ret
Building on @larsmans's answer, if the resulting keys contain XML namespace info, you can remove that before writing to the dict. Set a variable xmlns equal to the namespace and strip it from the front of each tag.
xmlns = '{http://foo.namespaceinfo.com}'
def etree_to_dict(t):
    if t.tag.startswith(xmlns):
        # Slice off the prefix; str.lstrip would strip a set of characters instead
        t.tag = t.tag[len(xmlns):]
    d = {t.tag: list(map(etree_to_dict, t.iterchildren()))}
    d.update(('@' + k, v) for k, v in t.attrib.items())
    d['text'] = t.text
    return d
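One thing to watch when stripping the namespace: str.lstrip removes any leading characters found in its argument rather than a literal prefix, so it can eat into the tag name itself. The sketch below contrasts it with prefix slicing; the helper name local_name and the tag "name" are illustrative assumptions.

```python
# Namespace in ElementTree's {uri} notation, as in the answer above
NS = "{http://foo.namespaceinfo.com}"

def local_name(tag, ns=NS):
    """Remove ns from the front of tag only if it is really a prefix."""
    return tag[len(ns):] if tag.startswith(ns) else tag

qualified = NS + "name"

# Slicing keeps the local name intact
clean = local_name(qualified)

# lstrip treats NS as a character set; 'n', 'a', 'm', 'e' all occur in it
stripped_wrong = qualified.lstrip(NS)

print(repr(clean), repr(stripped_wrong))
```

On Python 3.9 and later, str.removeprefix does the same job as the slicing helper.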
If you have a schema, the xmlschema package already implements multiple XML-to-dict converters that honor the schema and attribute types. Quoting the following from the docs
Available converters
The library includes some converters. The default converter
xmlschema.XMLSchemaConverter is the base class of other converter
types. Each derived converter type implements a well know convention,
related to the conversion from XML to JSON data format:
xmlschema.ParkerConverter: Parker convention
xmlschema.BadgerFishConverter: BadgerFish convention
xmlschema.AbderaConverter: Apache Abdera project convention
xmlschema.JsonMLConverter: JsonML (JSON Mark-up Language) convention
Documentation of these different conventions is available here: http://wiki.open311.org/JSON_and_XML_Conversion/
Usage of the converters is straightforward, e.g.:
from xmlschema import ParkerConverter, XMLSchema, to_dict
xml = '...'
schema = XMLSchema('...')
to_dict(xml, schema=schema, converter=ParkerConverter)
I am trying to use the ShrinkTheWeb service for site thumbnails. They have an API that returns XML telling you whether the site thumbnail can be created. I am trying to use ElementTree to parse the XML, but I am not sure how to get to the information I need. Here is an example of the XML response:
<?xml version="1.0" encoding="UTF-8"?>
<stw:ThumbnailResponse xmlns:stw="http://www.shrinktheweb.com/doc/stwresponse.xsd">
<stw:Response>
<stw:ThumbnailResult>
<stw:Thumbnail Exists="false"></stw:Thumbnail>
<stw:Thumbnail Verified="false">fix_and_retry</stw:Thumbnail>
</stw:ThumbnailResult>
<stw:ResponseStatus>
<stw:StatusCode>Blank Detected</stw:StatusCode>
</stw:ResponseStatus>
<stw:ResponseTimestamp>
<stw:StatusCode></stw:StatusCode>
</stw:ResponseTimestamp>
<stw:ResponseCode>
<stw:StatusCode></stw:StatusCode>
</stw:ResponseCode>
<stw:CategoryCode>
<stw:StatusCode>none</stw:StatusCode>
</stw:CategoryCode>
<stw:Quota_Remaining>
<stw:StatusCode>1</stw:StatusCode>
</stw:Quota_Remaining>
</stw:Response>
</stw:ThumbnailResponse>
I need to get the "stw:StatusCode". If I try to do a find on "stw:StatusCode" I get a "expected path separator" syntax error. Is there a way to just get the status code?
Grrr namespaces ....try this:
STW_PREFIX = "{http://www.shrinktheweb.com/doc/stwresponse.xsd}"
(see line 2 of your sample XML)
Then when you want a tag like stw:StatusCode, use STW_PREFIX + "StatusCode"
Update: That XML response isn't the most brilliant design. It's not possible to guess from your single example whether there can be more than 1 2nd-level node. Note that each 3rd-level node has a "StatusCode" child. Here is some rough-and-ready code that shows you (1) why you need that STW_PREFIX caper (2) an extract of the usable info.
import xml.etree.ElementTree as et
def showtag(elem):
    # Drop the '{namespace}' part and keep the local tag name
    return repr(elem.tag.rsplit("}")[1])
def showtext(elem):
    return None if elem.text is None else repr(elem.text.strip())
root = et.fromstring(xml_response) # xml_response is your input string
print(repr(root.tag)) # see exactly what tag is in the element
for child in root[0]:
    print(showtag(child), showtext(child))
    for gc in child:
        print("...", showtag(gc), showtext(gc), gc.attrib)
Result:
'{http://www.shrinktheweb.com/doc/stwresponse.xsd}ThumbnailResponse'
'ThumbnailResult' ''
... 'Thumbnail' None {'Exists': 'false'}
... 'Thumbnail' 'fix_and_retry' {'Verified': 'false'}
'ResponseStatus' ''
... 'StatusCode' 'Blank Detected' {}
'ResponseTimestamp' ''
... 'StatusCode' None {}
'ResponseCode' ''
... 'StatusCode' None {}
'CategoryCode' ''
... 'StatusCode' 'none' {}
'Quota_Remaining' ''
... 'StatusCode' '1' {}
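As an alternative to concatenating STW_PREFIX by hand, ElementTree's find methods accept a prefix-to-URI mapping, which lets the path keep the stw: prefix from the question. A minimal sketch, assuming a shortened version of the response above and that the status of interest is the one under ResponseStatus:

```python
import xml.etree.ElementTree as ET

# Shortened sample response (trimmed from the question's XML for brevity)
xml_response = """<stw:ThumbnailResponse xmlns:stw="http://www.shrinktheweb.com/doc/stwresponse.xsd">
  <stw:Response>
    <stw:ResponseStatus>
      <stw:StatusCode>Blank Detected</stw:StatusCode>
    </stw:ResponseStatus>
    <stw:Quota_Remaining>
      <stw:StatusCode>1</stw:StatusCode>
    </stw:Quota_Remaining>
  </stw:Response>
</stw:ThumbnailResponse>"""

# Map the prefix used in the path expressions to the namespace URI
NS = {"stw": "http://www.shrinktheweb.com/doc/stwresponse.xsd"}

root = ET.fromstring(xml_response)

# The status code that sits under ResponseStatus
status = root.findtext(".//stw:ResponseStatus/stw:StatusCode", namespaces=NS)

# Every StatusCode element in document order
all_codes = [e.text for e in root.iterfind(".//stw:StatusCode", NS)]

print(status, all_codes)
```

The prefix in the mapping only has to match the one used in the path string; it does not need to be the same prefix the document itself declares.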