I'm trying to query this XML with lxml:
<lista_tareas>
<tarea id="1" realizzato="False" data_limite="12/10/2012" priorita="1">
<description>XML TEST</description>
</tarea>
<tarea id="2" realizzato="False" data_limite="12/10/2012" priorita="1">
<description>XML TEST2</description>
</tarea>
</lista_tareas>
I wrote this code:
from lxml import etree
doc = etree.parse(file_path)
root = etree.Element("lista_tareas")
for x in root:
    z = x.Element("tarea")
    for y in z:
        element_text = y.Element("description").text
        print element_text
It doesn't print anything. Could you suggest how to do this?
You do not want to use minidom; use the ElementTree API instead. The DOM API is verbose and constrained, whereas the ElementTree API plays to Python's strengths.
The minidom module doesn't offer a query API like the one you are looking for.
You can use the bundled xml.etree.ElementTree module, or you could install lxml, which offers more powerful XPath and other query options.
import xml.etree.ElementTree as ET
root = ET.parse('document.xml').getroot()
for c in root.findall("./Root_Node[@id='1']/sub_node"):
    # Do something with c
    pass
Using lxml:
from lxml import etree
doc = etree.parse(source)
for c in doc.xpath("//Root_Node[@id='1']"):
    subnode = c.find("sub_node")
    # ... etc ...
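Applied to the task list from the question, the same idea might look like the sketch below. It uses the bundled xml.etree.ElementTree, parses an inline string instead of file_path, and assumes the document is closed with </lista_tareas>; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, with the closing root tag added (assumed).
xml_text = """<lista_tareas>
<tarea id="1" realizzato="False" data_limite="12/10/2012" priorita="1">
<description>XML TEST</description>
</tarea>
<tarea id="2" realizzato="False" data_limite="12/10/2012" priorita="1">
<description>XML TEST2</description>
</tarea>
</lista_tareas>"""

# fromstring() returns the root of the parsed document. By contrast,
# Element("lista_tareas") creates a new, empty element that has nothing
# to do with the parsed data, so looping over it yields nothing.
root = ET.fromstring(xml_text)

descriptions = []
for tarea in root.findall("tarea"):
    # findtext() returns the text of the first matching child, or None.
    descriptions.append(tarea.findtext("description"))

print(descriptions)
```

When reading from a file, ET.parse(file_path).getroot() gives the same kind of root element.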
I have this XML file, and I only need to get the value of steamID64 (76561198875082603).
<profile>
<steamID64>76561198875082603</steamID64>
<steamID>...</steamID>
<onlineState>online</onlineState>
<stateMessage>...</stateMessage>
<privacyState>public</privacyState>
<visibilityState>3</visibilityState>
<avatarIcon>...</avatarIcon>
<avatarMedium>...</avatarMedium>
<avatarFull>...</avatarFull>
<vacBanned>0</vacBanned>
<tradeBanState>None</tradeBanState>
<isLimitedAccount>0</isLimitedAccount>
<customURL>...</customURL>
<memberSince>December 8, 2018</memberSince>
<steamRating/>
<hoursPlayed2Wk>0.0</hoursPlayed2Wk>
<headline>...</headline>
<location>...</location>
<realname>
<![CDATA[ THEMakci7m87 ]]>
</realname>
<summary>...</summary>
<mostPlayedGames>...</mostPlayedGames>
<groups>...</groups>
</profile>
So far I only have this code:
xml_url = f'{url}?xml=1'
and I don't know what to do next.
It's fairly simple with lxml:
import lxml.html as lh
steam = """your html above"""
doc = lh.fromstring(steam)
doc.xpath('//steamid64/text()')
Output:
['76561198875082603']
Edit:
With the actual url, it's clear that the underlying data is xml; so the better way to do it is:
import requests
from lxml import etree
url = 'https://steamcommunity.com/id/themakci7m87/?xml=1'
req = requests.get(url)
doc = etree.XML(req.text.encode())
doc.xpath('//steamID64/text()')
Same output.
You would be better off using the built-in XML library, ElementTree.
lxml is an external XML library that requires a separate installation.
See below:
import requests
import xml.etree.ElementTree as ET
r = requests.get('https://steamcommunity.com/id/themakci7m87/?xml=1')
if r.status_code == 200:
    root = ET.fromstring(r.text)
    steam_id_64 = root.find('./steamID64').text
    print(steam_id_64)
else:
    print('Failed to read data.')
output:
76561198875082603
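The lookup can also be tried offline. The following is a minimal sketch that parses a trimmed copy of the profile from the question instead of fetching it; the helper name profile_field is made up for illustration. It also shows that ElementTree returns the content of a CDATA section as ordinary text.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copy of the profile document from the question.
profile_xml = """<profile>
<steamID64>76561198875082603</steamID64>
<onlineState>online</onlineState>
<realname>
<![CDATA[ THEMakci7m87 ]]>
</realname>
</profile>"""


def profile_field(xml_text, tag):
    """Return the stripped text of the first child named tag, or None."""
    root = ET.fromstring(xml_text)
    value = root.findtext(tag)  # None when the child is missing
    return value.strip() if value is not None else None


steam_id_64 = profile_field(profile_xml, "steamID64")
real_name = profile_field(profile_xml, "realname")
print(steam_id_64, real_name)
```

Using findtext() rather than find(...).text avoids an AttributeError when the element is absent.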
I scanned Python source code using AppScan and it says that the code contains potential vulnerabilities (XML Injection). For example:
import xml.dom.minidom
...
dom = xml.dom.minidom.parse(filename)
...
document = xml.dom.minidom.parseString(xmlStr)
...
I installed defusedxml and replaced every parse call that used the standard Python xml package with parse/parseString from defusedxml.minidom and defusedxml.cElementTree:
import defusedxml.minidom
...
dom = defusedxml.minidom.parse(filename)
...
document = defusedxml.minidom.parseString(xmlStr)
...
These vulnerabilities are gone from the scan report. But AppScan still reports a vulnerability wherever any function or class is imported from the standard xml package, for example the ElementTree classes used to build or modify an XML tree:
from xml.etree.cElementTree import (  # vulnerability here
    SubElement, Element, ElementTree)
import defusedxml.cElementTree as et
...
template = et.parse(template_filename) # safe parsing
root = template.getroot()
email_list_el = root.find('emails').find('list')
for email_address in to_list:
    SubElement(email_list_el, 'string').text = email_address
root.find('subject')[0].text = subject
root.find('body')[0].text = body
...
Can this be considered a vulnerability if xml.dom.minidom is used only for writing XML?
ElementTree is not secured against maliciously constructed data. See list of vulnerabilities. Consider using defusedxml instead.
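To make the distinction in the question concrete: the attacks that defusedxml documents (entity expansion, external entity resolution and similar) are triggered while parsing input. The sketch below, which assumes a made-up message layout and a hypothetical build_message helper, imports Element and SubElement only to build and serialize a tree, and never calls a parser. Whether a scanner still flags such imports is a separate matter.

```python
from xml.etree.ElementTree import Element, SubElement, tostring


def build_message(to_list, subject, body):
    """Build a small message document without parsing any input."""
    root = Element("message")
    email_list_el = SubElement(SubElement(root, "emails"), "list")
    for email_address in to_list:
        SubElement(email_list_el, "string").text = email_address
    SubElement(root, "subject").text = subject
    SubElement(root, "body").text = body
    # Serialization escapes markup characters in text content.
    return tostring(root, encoding="unicode")


xml_out = build_message(["a@example.com"], "Hi", "1 < 2 & 3")
print(xml_out)
```

Any template that is read from disk or from a user would still go through a parser, and that is the step where defusedxml applies.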
I know that I can preserve CDATA sections during XML parsing, using the following:
from lxml import etree
parser = etree.XMLParser(strip_cdata=False)
root = etree.XML('<root><![CDATA[test]]></root>', parser)
See APIs specific to lxml.etree
But, is there a simple way to "restore" CDATA section during serialization?
For example, by specifying a list of tag names…
For instance, I want to turn:
<CONFIG>
<BODY>This is a &lt;message&gt;.</BODY>
</CONFIG>
to:
<CONFIG>
<BODY><![CDATA[This is a <message>.]]></BODY>
</CONFIG>
Just by declaring that BODY should contain CDATA…
Something like this?
from lxml import etree
parser = etree.XMLParser(strip_cdata=True)
root = etree.XML('<root><x><![CDATA[<test>]]></x></root>', parser)
print(etree.tostring(root, encoding='unicode'))
for elem in root.findall('x'):
    elem.text = etree.CDATA(elem.text)
print(etree.tostring(root, encoding='unicode'))
Produces:
<root><x>&lt;test&gt;</x></root>
<root><x><![CDATA[<test>]]></x></root>
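If lxml is not available, a similar "wrap these tags in CDATA" step can be sketched with the standard library's xml.dom.minidom, which can create CDATA sections. Note that this swaps in a different API from the answer above. The helper name wrap_text_in_cdata is hypothetical, and the sketch assumes the listed elements contain only text, with no child elements.

```python
from xml.dom import minidom


def wrap_text_in_cdata(xml_text, tag_names):
    """Serialize xml_text with the text of the named elements as CDATA."""
    doc = minidom.parseString(xml_text)
    for name in tag_names:
        for elem in doc.getElementsByTagName(name):
            text_nodes = [
                child for child in elem.childNodes
                if child.nodeType in (child.TEXT_NODE,
                                      child.CDATA_SECTION_NODE)
            ]
            # Join the element's text into one string.
            text = "".join(node.data for node in text_nodes)
            # Replace the old text nodes with a single CDATA section.
            for node in text_nodes:
                elem.removeChild(node)
            elem.appendChild(doc.createCDATASection(text))
    return doc.documentElement.toxml()


result = wrap_text_in_cdata(
    "<CONFIG><BODY>This is a &lt;message&gt;.</BODY></CONFIG>", ["BODY"])
print(result)
```

minidom raises ValueError on serialization if the text itself contains "]]>", so such content would need separate handling.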
Is there a quick way to take this block of XML and extract the value of "version"?
<xml>
<creator version='1.0'>
<program>BULK_EXTRACTOR</program>
<version>1.0.3</version>
<build_environment>
<compiler>GCC 4.2</compiler>
<compilation_date>2011-09-27T11:56:35</compilation_date>
<library name="afflib" version="3.6.12"></library>
<library name="libewf" version="20100226"></library>
</build_environment>
</creator>
</xml>
I know that I can do this with Python's Beautiful Soup, but I'm looking for a simple way to do it with the DOM.
Thanks!
Assuming you are looking for the version element, not the version attributes,
using lxml:
import lxml.etree as ET
content='''\
<xml>
<creator version='1.0'>
<program>BULK_EXTRACTOR</program>
<version>1.0.3</version>
<build_environment>
<compiler>GCC 4.2</compiler>
<compilation_date>2011-09-27T11:56:35</compilation_date>
<library name="afflib" version="3.6.12"></library>
<library name="libewf" version="20100226"></library>
</build_environment>
</creator>
</xml>
'''
doc=ET.fromstring(content)
version=doc.xpath('creator/version/text()')[0]
print(version)
# 1.0.3
To find the version attributes:
for elt in doc.xpath('//*[@version]'):
    print(elt.tag, elt.attrib.get('name'), elt.attrib.get('version'))
# creator None 1.0
# library afflib 3.6.12
# library libewf 20100226
If you don't have lxml installed, you can use ElementTree which is included in the standard library:
>>> import xml.etree.ElementTree
>>> doc = xml.etree.ElementTree.fromstring(content)
>>> doc.findtext('creator/version')
'1.0.3'
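The attribute lookup can be done with the standard library as well, since ElementTree's limited XPath support includes the [@attrib] predicate. A small sketch, using a shortened copy of the document from the question:

```python
import xml.etree.ElementTree as ET

content = """<xml>
<creator version='1.0'>
<program>BULK_EXTRACTOR</program>
<version>1.0.3</version>
<build_environment>
<library name="afflib" version="3.6.12"></library>
<library name="libewf" version="20100226"></library>
</build_environment>
</creator>
</xml>"""

doc = ET.fromstring(content)

# Every descendant element that carries a version attribute,
# in document order. get() returns None for a missing attribute.
versions = [(elt.tag, elt.get("name"), elt.get("version"))
            for elt in doc.findall(".//*[@version]")]

for entry in versions:
    print(entry)
```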
I have a set of super simple XML files to parse... but... they use custom defined entities. I don't need to map these to characters, but I do wish to parse and act on each one. For example:
<Style name="admin-5678">
<Rule>
<Filter>[admin_level]='5'</Filter>
&maxscale_zoom11;
</Rule>
</Style>
There is a tantalizing hint at http://effbot.org/elementtree/elementtree-xmlparser.htm that XMLParser has limited entity support, but I can't find the methods mentioned, everything gives errors:
#!/usr/bin/python
##
## Where's the entity support as documented at:
## http://effbot.org/elementtree/elementtree-xmlparser.htm
## In Python 2.7.1+ ?
##
from pprint import pprint
from xml.etree import ElementTree
from cStringIO import StringIO
parser = ElementTree.ElementTree()
#parser.entity["maxscale_zoom11"] = unichr(160)
testf = StringIO('<foo>&maxscale_zoom11;</foo>')
tree = parser.parse(testf)
#tree = parser.parse(testf,"XMLParser")
for node in tree.iter('foo'):
    print node.text
Which depending on how you adjust the comments gives:
xml.etree.ElementTree.ParseError: undefined entity: line 1, column 5
or
AttributeError: 'ElementTree' object has no attribute 'entity'
or
AttributeError: 'str' object has no attribute 'feed'
For those curious the XML is from the OpenStreetMap's mapnik project.
As @cnelson already pointed out in a comment, the chosen solution here won't work in Python 3.
I finally got it working. Quoted from this Q&A.
Inspired by this post, we can just prepend an XML DOCTYPE definition to the incoming raw HTML content, and then ElementTree works out of the box.
This works on Python 2.6, 2.7, 3.3, and 3.4.
import xml.etree.ElementTree as ET
html = '''<html>
<div>Some reasonably well-formed HTML content.</div>
<form action="login">
<input name="foo" value="bar"/>
<input name="username"/><input name="password"/>
<div>It is not unusual to see in an HTML page.</div>
</form></html>'''
magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!ENTITY nbsp ' '>
]>''' # You can define more entities here, if needed
et = ET.fromstring(magic + html)
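The same trick can be sketched for the Mapnik-style snippet from the question. The helper with_entities and the {entity:NAME} marker format are made up for illustration, and the sketch assumes the input has no XML declaration of its own (a DOCTYPE would have to come after it).

```python
import xml.etree.ElementTree as ET


def with_entities(root_tag, entity_names, xml_text):
    """Prepend a DOCTYPE whose internal subset declares each entity.

    Each entity expands to a marker string of the form {entity:NAME},
    so the calling code can spot it in the parsed text and act on it.
    """
    decls = "".join(
        '<!ENTITY {0} "{{entity:{0}}}">'.format(name)
        for name in entity_names
    )
    return "<!DOCTYPE {0} [{1}]>{2}".format(root_tag, decls, xml_text)


style_xml = ("<Style name=\"admin-5678\"><Rule>"
             "<Filter>[admin_level]='5'</Filter>&maxscale_zoom11;"
             "</Rule></Style>")

root = ET.fromstring(with_entities("Style", ["maxscale_zoom11"], style_xml))

# The expanded entity follows </Filter>, so it ends up in Filter's tail.
marker = root.find("Rule/Filter").tail
print(marker)
```

The entity names have to be known in advance with this approach; an undeclared entity still raises ParseError.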
I'm not sure if this is a bug in ElementTree or what, but you need to call UseForeignDTD(True) on the expat parser to make it behave the way it did in the past.
It's a bit hacky, but you can do this by creating your own instance of ElementTree.XMLParser, calling the method on its underlying xml.parsers.expat parser, and then passing it to ElementTree.parse():
from xml.etree import ElementTree
from cStringIO import StringIO
testf = StringIO('<foo>&moo_1;</foo>')
parser = ElementTree.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity['moo_1'] = 'MOOOOO'
etree = ElementTree.ElementTree()
tree = etree.parse(testf, parser=parser)
for node in tree.iter('foo'):
    print node.text
This outputs "MOOOOO"
Or using a mapping interface:
from xml.etree import ElementTree
from cStringIO import StringIO
class AllEntities:
    def __getitem__(self, key):
        # key is your entity, you can do whatever you want with it here
        return key
testf = StringIO('<foo>&moo_1;</foo>')
parser = ElementTree.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity = AllEntities()
etree = ElementTree.ElementTree()
tree = etree.parse(testf, parser=parser)
for node in tree.iter('foo'):
    print node.text
This outputs "moo_1"
A more complex fix would be to subclass ElementTree.XMLParser and fix it there.