I'm trying to parse Evernote Markup Language (ENML) with lxml in Python 2.7. ENML is a superset of XHTML.
from StringIO import StringIO
import lxml.etree as etree

if __name__ == '__main__':
    xml_str = StringIO('<?xml version="1.0" encoding="UTF-8"?>\r\n<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">\r\n\r\n<en-note style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">\nA really simple example. &nbsp;Another sentence.\n</en-note>')
    tree = etree.parse(xml_str)
The code above errors out with:
XMLSyntaxError: Entity 'nbsp' not defined, line 5, column 32
How do I successfully parse ENML?
The &nbsp; entity is understood by the HTML parser, not the XML parser:
from StringIO import StringIO
import lxml.html as LH

if __name__ == '__main__':
    xml_str = StringIO('<?xml version="1.0" encoding="UTF-8"?>\r\n<!DOCTYPE en-note SYSTEM "http://xml.evernote.com/pub/enml2.dtd">\r\n\r\n<en-note style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space;">\nA really simple example. &nbsp;Another sentence.\n</en-note>')
    tree = LH.parse(xml_str)
    print(LH.tostring(tree))
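To pull the plain text back out of the parsed note, a quick usage sketch continuing from the tree above (text_content() is lxml.html's own API):

root = tree.getroot()
print(root.text_content())  # the note text, with &nbsp; resolved to a non-breaking space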
You can try replacing the named entities with their numeric character references.
http://www.w3schools.com/tags/ref_entities.asp
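For instance, a minimal sketch of that approach; the replacements mapping and the shortened ENML string below are my own example, so extend the mapping for whatever entities your notes actually contain:

from StringIO import StringIO
import lxml.etree as etree

# Swap named entities for numeric character references before parsing.
# Only &nbsp; is mapped here (assumption); add others as needed.
replacements = {'&nbsp;': '&#160;'}

enml = ('<?xml version="1.0" encoding="UTF-8"?>'
        '<en-note>A really simple example. &nbsp;Another sentence.</en-note>')
for name, ref in replacements.items():
    enml = enml.replace(name, ref)

tree = etree.parse(StringIO(enml))
print(etree.tostring(tree))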
Related
I have this XML file, and I only need to get the value from steamID64 (76561198875082603).
<profile>
<steamID64>76561198875082603</steamID64>
<steamID>...</steamID>
<onlineState>online</onlineState>
<stateMessage>...</stateMessage>
<privacyState>public</privacyState>
<visibilityState>3</visibilityState>
<avatarIcon>...</avatarIcon>
<avatarMedium>...</avatarMedium>
<avatarFull>...</avatarFull>
<vacBanned>0</vacBanned>
<tradeBanState>None</tradeBanState>
<isLimitedAccount>0</isLimitedAccount>
<customURL>...</customURL>
<memberSince>December 8, 2018</memberSince>
<steamRating/>
<hoursPlayed2Wk>0.0</hoursPlayed2Wk>
<headline>...</headline>
<location>...</location>
<realname>
<![CDATA[ THEMakci7m87 ]]>
</realname>
<summary>...</summary>
<mostPlayedGames>...</mostPlayedGames>
<groups>...</groups>
</profile>
So far I only have this code:
xml_url = f'{url}?xml=1'
and I don't know what to do next.
It's fairly simple with lxml:
import lxml.html as lh
steam = """your html above"""
doc = lh.fromstring(steam)
doc.xpath('//steamid64/text()')
Output:
['76561198875082603']
(The XPath uses a lower-cased tag name because lxml's HTML parser normalizes element names to lower case.)
Edit:
With the actual URL, it's clear that the underlying data is XML, so the better way to do it is:
import requests
from lxml import etree
url = 'https://steamcommunity.com/id/themakci7m87/?xml=1'
req = requests.get(url)
doc = etree.XML(req.text.encode())
doc.xpath('//steamID64/text()')
Same output.
You'd be better off using the built-in XML library, ElementTree; lxml is an external XML library that requires a separate installation. See below:
import requests
import xml.etree.ElementTree as ET

r = requests.get('https://steamcommunity.com/id/themakci7m87/?xml=1')
if r.status_code == 200:
    root = ET.fromstring(r.text)
    steam_id_64 = root.find('./steamID64').text
    print(steam_id_64)
else:
    print('Failed to read data.')
Output:
76561198875082603
I know that I can preserve CDATA sections during XML parsing, using the following:
from lxml import etree
parser = etree.XMLParser(strip_cdata=False)
root = etree.XML('<root><![CDATA[test]]></root>', parser)
See APIs specific to lxml.etree
But is there a simple way to "restore" CDATA sections during serialization, for example by specifying a list of tag names? I want to turn:
<CONFIG>
<BODY>This is a <message>.</BODY>
</CONFIG>
to:
<CONFIG>
<BODY><![CDATA[This is a <message>.]]></BODY>
</CONFIG>
Just by specifying that BODY should contain CDATA…
Something like this?
from lxml import etree

parser = etree.XMLParser(strip_cdata=True)
root = etree.XML('<root><x><![CDATA[<test>]]></x></root>', parser)
print etree.tostring(root)

for elem in root.findall('x'):
    elem.text = etree.CDATA(elem.text)
print etree.tostring(root)
Produces:
<root><x><test></x></root>
<root><x><![CDATA[<test>]]></x></root>
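If you want to drive this by a list of tag names, as the question asks, a minimal sketch along the same lines could look like this; CDATA_TAGS and the escaped &lt;message&gt; in the input are my own assumptions:

from lxml import etree

CDATA_TAGS = ['BODY']  # hypothetical list of tags whose text should become CDATA

xml = '<CONFIG><BODY>This is a &lt;message&gt;.</BODY></CONFIG>'
root = etree.fromstring(xml)
for tag in CDATA_TAGS:
    for elem in root.iter(tag):
        if elem.text is not None:
            elem.text = etree.CDATA(elem.text)

print(etree.tostring(root))
# -> <CONFIG><BODY><![CDATA[This is a <message>.]]></BODY></CONFIG>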
I have a set of super simple XML files to parse... but... they use custom defined entities. I don't need to map these to characters, but I do wish to parse and act on each one. For example:
<Style name="admin-5678">
<Rule>
<Filter>[admin_level]='5'</Filter>
&maxscale_zoom11;
</Rule>
</Style>
There is a tantalizing hint at http://effbot.org/elementtree/elementtree-xmlparser.htm that XMLParser has limited entity support, but I can't find the methods mentioned; everything gives errors:
#!/usr/bin/python
##
## Where's the entity support as documented at:
## http://effbot.org/elementtree/elementtree-xmlparser.htm
## In Python 2.7.1+ ?
##
from pprint import pprint
from xml.etree import ElementTree
from cStringIO import StringIO
parser = ElementTree.ElementTree()
#parser.entity["maxscale_zoom11"] = unichr(160)
testf = StringIO('<foo>&maxscale_zoom11;</foo>')
tree = parser.parse(testf)
#tree = parser.parse(testf,"XMLParser")
for node in tree.iter('foo'):
    print node.text
Which, depending on how you adjust the comments, gives:
xml.etree.ElementTree.ParseError: undefined entity: line 1, column 5
or
AttributeError: 'ElementTree' object has no attribute 'entity'
or
AttributeError: 'str' object has no attribute 'feed'
For those curious the XML is from the OpenStreetMap's mapnik project.
As @cnelson already pointed out in a comment, the chosen solution here won't work in Python 3.
I finally got it working. Quoted from this Q&A.
Inspired by this post, we can just prepend some XML definitions to the incoming raw HTML content, and then ElementTree works out of the box.
This works on Python 2.6, 2.7, 3.3 and 3.4.
import xml.etree.ElementTree as ET
html = '''<html>
<div>Some reasonably well-formed HTML content.</div>
<form action="login">
<input name="foo" value="bar"/>
<input name="username"/><input name="password"/>
<div>It is not unusual to see &nbsp; in an HTML page.</div>
</form></html>'''
magic = '''<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" [
<!ENTITY nbsp ' '>
]>''' # You can define more entities here, if needed
et = ET.fromstring(magic + html)
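The same trick should work for the custom entities in the original question. A minimal sketch, where the entity's replacement text ("[maxscale_zoom11]") is just a placeholder I made up:

import xml.etree.ElementTree as ET

# Declare every custom entity in an internal DTD subset; the value is a placeholder.
magic = '''<!DOCTYPE foo [
<!ENTITY maxscale_zoom11 "[maxscale_zoom11]">
]>'''

xml = '<foo>&maxscale_zoom11;</foo>'
root = ET.fromstring(magic + xml)
print(root.text)  # [maxscale_zoom11]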
I'm not sure if this is a bug in ElementTree or what, but you need to call UseForeignDTD(True) on the expat parser to make it behave the way it did in the past.
It's a bit hacky, but you can do this by creating your own instance of ElementTree.XMLParser, calling the method on its instance of xml.parsers.expat, and then passing it to ElementTree.parse():
from xml.etree import ElementTree
from cStringIO import StringIO
testf = StringIO('<foo>&moo_1;</foo>')
parser = ElementTree.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity['moo_1'] = 'MOOOOO'
etree = ElementTree.ElementTree()
tree = etree.parse(testf, parser=parser)
for node in tree.iter('foo'):
    print node.text
This outputs "MOOOOO"
Or using a mapping interface:
from xml.etree import ElementTree
from cStringIO import StringIO
class AllEntities:
    def __getitem__(self, key):
        # key is your entity; you can do whatever you want with it here
        return key

testf = StringIO('<foo>&moo_1;</foo>')
parser = ElementTree.XMLParser()
parser.parser.UseForeignDTD(True)
parser.entity = AllEntities()
etree = ElementTree.ElementTree()
tree = etree.parse(testf, parser=parser)

for node in tree.iter('foo'):
    print node.text
This outputs "moo_1"
A more complex fix would be to subclass ElementTree.XMLParser and fix it there.
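That subclassing approach might look roughly like this; EntityTolerantParser is a name I made up, and it simply bundles the two tweaks shown above:

from xml.etree import ElementTree
from cStringIO import StringIO

class AllEntities:
    def __getitem__(self, key):
        return key  # resolve any entity to its own name

class EntityTolerantParser(ElementTree.XMLParser):
    # Hypothetical subclass: accept undeclared entities and map them to their names.
    def __init__(self, *args, **kwargs):
        ElementTree.XMLParser.__init__(self, *args, **kwargs)
        self.parser.UseForeignDTD(True)   # don't insist on declared entities
        self.entity = AllEntities()       # and resolve whatever shows up

tree = ElementTree.parse(StringIO('<foo>&moo_1;</foo>'),
                         parser=EntityTolerantParser())
for node in tree.iter('foo'):
    print node.text                       # prints "moo_1"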
Consider the following:
from lxml import etree
from StringIO import StringIO
x = """<?xml version="1.0" encoding="utf-8"?>\n<aa> â</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(StringIO(x), p)
This would fail with:
lxml.etree.XMLSyntaxError: Entity 'nbsp' not defined, line 2, column 11
This is because resolve_entities=False doesn't ignore them; it just doesn't resolve them.
If I use etree.HTMLParser instead, it creates html and body tags, plus it applies a lot of other HTML-specific handling.
What's the best way to end up with the resolved text (a non-breaking space followed by "â") as the text child of the aa tag with lxml?
You can't ignore entities, as they are part of the XML definition. Your document is not well-formed if it has no DTD (or declares standalone="yes") yet includes entity references that are not defined in a DTD. Lie and claim your document is HTML.
https://mailman-mail5.webfaction.com/pipermail/lxml/2008-February/003398.html
You can try lying and putting an XHTML DTD on your document. e.g.
from lxml import etree
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

x = """<?xml version="1.0" encoding="utf-8"?>\n<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd" >\n<aa>&nbsp;&acirc;</aa>"""
p = etree.XMLParser(remove_blank_text=True, resolve_entities=False)
r = etree.parse(StringIO(x), p)
etree.tostring(r)  # '<aa>&nbsp;&acirc;</aa>'
@Alex is right: your document is not well-formed XML, and so XML parsers will not parse it. One option is to pre-process the text of the document, replacing the bogus entities with their UTF-8 characters:
entities = [
    ('&nbsp;', u'\u00a0'),
    ('&acirc;', u'\u00e2'),
    ...
]

for before, after in entities:
    x = x.replace(before, after.encode('utf8'))
Of course, this can be broken by sufficiently weird "xml" also.
Your best bet is to fix your input XML documents to be well-formed XML.
When I was trying to do something similar, I just used x.replace('&', '&amp;') before parsing the string.
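In other words, escaping the ampersands keeps the parser happy, at the cost of the entities surviving only as literal text rather than being resolved; a minimal sketch of that approach:

from lxml import etree

x = '<aa>&nbsp;&acirc;</aa>'
safe = x.replace('&', '&amp;')   # escape every ampersand up front
root = etree.fromstring(safe)
print(root.text)                 # &nbsp;&acirc;  (plain text, not resolved characters)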
I've written a simple script to parse XML chat logs using the BeautifulSoup module. The standard soup.prettify() works OK, except the chat logs have a lot of fluff in them. You can see both the script code and some of the XML input I'm working with below:
Code
import sys
from BeautifulSoup import BeautifulSoup as Soup

def parseLog(file):
    file = sys.argv[1]
    handler = open(file).read()
    soup = Soup(handler)
    print soup.prettify()

if __name__ == "__main__":
    parseLog(sys.argv[1])
Test XML Input
<?xml version="1.0"?>
<?xml-stylesheet type='text/xsl' href='MessageLog.xsl'?>
<Log FirstSessionID="1" LastSessionID="2"><Message Date="10/31/2010" Time="3:43:48 PM" DateTime="2010-10-31T20:43:48.937Z" SessionID="1"><From><User FriendlyName="Jon"/></From> <To><User FriendlyName="Bill"/></To><Text Style="font-family:Segoe UI; color:#000000; ">hey, what's up?</Text></Message>
<Message Date="10/31/2010" Time="3:44:03 PM" DateTime="2010-10-15T20:44:03.421Z" SessionID="1"><From><User FriendlyName="Jon"/></From><To><User FriendlyName="Bill"/></To><Text Style="font-family:Segoe UI; color:#000000; ">Got your message</Text></Message>
<Message Date="10/31/2010" Time="3:44:31 PM" DateTime="2010-10-15T20:44:31.390Z" SessionID="2"><From><User FriendlyName="Bill"/></From><To><User FriendlyName="Jon"/></To><Text Style="font-family:Segoe UI; color:#000000; ">oh, great</Text></Message>
<Message Date="10/31/2010" Time="3:44:59 PM" DateTime="2010-10-15T20:44:59.281Z" SessionID="2"><From><User FriendlyName="Bill"/></From><To><User FriendlyName="Jon"/></To><Text Style="font-family:Segoe UI; color:#000000; ">hey, i gotta run</Text></Message>
I want to output this in a format like the following, or at least something more readable than raw XML:
Jon:
Hey, what's up? [10/31/10 # 3:43p]
Jon:
Got your message [10/31/10 # 3:44p]
Bill:
oh, great [10/31/10 # 3:44p]
etc. I've heard some decent things about the PyParsing module; maybe it's time to give it a shot.
BeautifulSoup makes getting at attributes and values in XML really simple. I tweaked your example function to use these features.
import sys
from BeautifulSoup import BeautifulSoup as Soup

def parseLog(file):
    file = sys.argv[1]
    handler = open(file).read()
    soup = Soup(handler)
    for message in soup.findAll('message'):
        msg_attrs = dict(message.attrs)
        f_user = message.find('from').user
        f_user_dict = dict(f_user.attrs)
        print "%s: %s [%s # %s]" % (f_user_dict[u'friendlyname'],
                                    message.find('text').decodeContents(),
                                    msg_attrs[u'date'],
                                    msg_attrs[u'time'])

if __name__ == "__main__":
    parseLog(sys.argv[1])
I'd recommend using the built-in ElementTree module. BeautifulSoup is meant to handle poorly formed markup like hacked-up HTML, whereas XML is well-formed and meant to be read by an XML library.
Update: some of my recent reading here suggests lxml as a library built on and enhancing the standard ElementTree.
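For completeness, a minimal sketch of the ElementTree route for the log format shown in the question; 'MessageLog.xml' is a made-up file name, and it prints the raw Date and Time attributes rather than the shortened form in the example output:

import xml.etree.ElementTree as ET

tree = ET.parse('MessageLog.xml')  # hypothetical file name for the log above
for message in tree.getroot().iter('Message'):
    sender = message.find('From/User').get('FriendlyName')
    text = message.findtext('Text')
    print("%s:\n    %s [%s # %s]" % (sender, text,
                                     message.get('Date'), message.get('Time')))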