My org is migrating from Python 2.7 to Python 3.7. There is some Python code that I did not write but that has to be migrated. The code uses xml.xpath to parse XML.
Given that xml.xpath is not available in Python 3.7, I am trying to port the code to use XPath from lxml.etree. The intention is to keep the amount of code change to a minimum.
I have pasted the current implementation as well as the code that I am porting it to. The ported code is not working since the XML has a default namespace.
Current code
>>> from xml import xpath
>>> from xml.dom.minidom import parseString
>>> Test_XML = '<ibml:validateTradeStatusRequest xmlns="http://www.fpml.org/2005/FpML-4-2" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:ecore="http://www.eclipse.org/emf/2002/Ecore" xmlns:ibml="http://ibml.jpmorgan.com/2005" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ibmlVersion="1-56" version="4-2" xsi:schemaLocation="http://ibml.jpmorgan.com/2005C:\\IBTIBML\\trunk\\src\\xsd\\IBML.xsd"><header><sentBy>test_value</sentBy><sendTo>test_value</sendTo><creationTimestamp>2012-06-06T08:23:20.613</creationTimestamp></header><tradeReference></tradeReference></ibml:validateTradeStatusRequest>'
>>> doc = parseString( Test_XML )
>>> ctx = xpath.Context.Context( doc, processorNss = {'ibml' : 'http://ibml.jpmorgan.com/2005'} )
>>> expr = xpath.Compile( '/ibml:validateTradeStatusRequest/header/sentBy' )
>>> node = expr.evaluate(ctx)
>>> node[0].childNodes[0].data
u'test_value'
>>>
My attempt to port it to lxml.etree is below, but it does not work because the XML has a default namespace.
>>> from lxml import etree as ET
>>> element = ET.fromstring( Test_XML )
>>> element.xpath( '/ibml:validateTradeStatusRequest/header/sentBy', namespaces = {'ibml' : 'http://ibml.jpmorgan.com/2005'} )
[]
However, if I remove the default namespace then the XPath evaluation works fine.
But that is not the desired solution, as the XML creation is outside the scope of the code. It also won't be possible to change the XPath query, as that is outside the scope of the code as well.
>>> Test_XML_No_Default_NS = '<ibml:validateTradeStatusRequest xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:ecore="http://www.eclipse.org/emf/2002/Ecore" xmlns:ibml="http://ibml.jpmorgan.com/2005" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ibmlVersion="1-56" version="4-2" xsi:schemaLocation="http://ibml.jpmorgan.com/2005C:\\IBTIBML\\trunk\\src\\xsd\\IBML.xsd"><header><sentBy>test_value</sentBy><sendTo>test_value</sendTo><creationTimestamp>2012-06-06T08:23:20.613</creationTimestamp></header><tradeReference></tradeReference></ibml:validateTradeStatusRequest>'
>>> ee = ET.fromstring( Test_XML_No_Default_NS )
>>> ee.xpath( '/ibml:validateTradeStatusRequest/header/sentBy', namespaces = {'ibml' : 'http://ibml.jpmorgan.com/2005'} )
[<Element sentBy at 0x1546b058>]
>>> node = _
>>> node[0].text
'test_value'
Any suggestions on how I can move forward?
I'm not sure this answers your question directly, but try this and see if it works (maybe with modifications):
from lxml import etree
Test_XML = '<?xml version="1.0" encoding="UTF-8"?><ibml:validateTradeStatusRequest xmlns="http://www.fpml.org/2005/FpML-4-2" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:ecore="http://www.eclipse.org/emf/2002/Ecore" xmlns:ibml="http://ibml.jpmorgan.com/2005" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ibmlVersion="1-56" version="4-2" xsi:schemaLocation="http://ibml.jpmorgan.com/2005C:\\IBTIBML\\trunk\\src\\xsd\\IBML.xsd"><header><sentBy>test_value1</sentBy><sendTo>test_value2</sendTo><creationTimestamp>2012-06-06T08:23:20.613</creationTimestamp></header><tradeReference></tradeReference></ibml:validateTradeStatusRequest>'
xml = bytes(bytearray(Test_XML, encoding='utf-8'))
ee = etree.XML(xml)
target = ee.xpath( '//ibml:validateTradeStatusRequest', namespaces = {'ibml' : 'http://ibml.jpmorgan.com/2005'} )
print(target[0].xpath('//text()')[0])
Output:
test_value1
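If the absolute path itself has to stay recognisable, one option is to give the default namespace a prefix of your own and use that prefix for the unprefixed steps. A minimal sketch, reusing etree and xml from the snippet above; the fpml prefix is an assumption introduced here, as is the assumption that the unprefixed elements live in the default FpML namespace declared on the root:
nsmap = {
    'ibml': 'http://ibml.jpmorgan.com/2005',
    # assumption: unprefixed <header>/<sentBy> sit in the default namespace declared on the root
    'fpml': 'http://www.fpml.org/2005/FpML-4-2',
}
ee = etree.XML(xml)
nodes = ee.xpath('/ibml:validateTradeStatusRequest/fpml:header/fpml:sentBy',
                 namespaces=nsmap)
print(nodes[0].text)  # -> test_value1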
I am a beginner in coding and have this question below. I would gladly appreciate any help.
I have this Python code below that requests information regarding an organization.
Note: the commented-out target variable is for future use, when I pass the user input from PHP to this Python script.
import requests, sys
#target = sys.argv[1]
target = "logitech"
request = requests.get('http://whois.arin.net/rest/nets;name={}'.format(target))
print(request.text)
The output is similar to this but the number of "netRef" tags may vary depending on the organization.
<?xml version='1.0'?><?xml-stylesheet type='text/xsl' href='http://whois.arin.net/xsl/website.xsl' ?><nets xmlns="http://www.arin.net/whoisrws/core/v1" xmlns:ns2="http://www.arin.net/whoisrws/rdns/v1" xmlns:ns3="http://www.arin.net/whoisrws/netref/v2" copyrightNotice="Copyright 1997-2020, American Registry for Internet Numbers, Ltd." inaccuracyReportUrl="https://www.arin.net/resources/registry/whois/inaccuracy_reporting/" termsOfUse="https://www.arin.net/resources/registry/whois/tou/"><limitExceeded limit="256">false</limitExceeded>
<netRef endAddress="173.8.217.111" startAddress="173.8.217.96" handle="NET-173-8-217-96-1" name="LOGITECH">https://whois.arin.net/rest/net/NET-173-8-217-96-1</netRef>
<netRef endAddress="50.193.49.47" startAddress="50.193.49.32" handle="NET-50-193-49-32-1" name="LOGITECH">https://whois.arin.net/rest/net/NET-50-193-49-32-1</netRef></nets>
I was wondering, is it possible to display only the endAddress and startAddress attributes in PHP?
I've tried using the xml.etree.ElementTree module, but because the request variable is a Response object instead of bytes, I can't parse the XML directly into an element.
My PHP code currently looks like this, as I am unsure of how to proceed. testapi.py refers to the Python code above.
<?php
$output1 = shell_exec('python testapi.py');
echo $output1;
?>
My desired output on the PHP side is as follows:
IP range: 173.8.217.96-173.8.217.111, 50.193.49.32-50.193.49.47
I would gladly appreciate any help. Thank you.
Python's ElementTree provides the fromstring method to parse an XML tree from text. From there, you can parse the content; just be sure to assign a prefix to the default namespace in the XML:
xmlns="http://www.arin.net/whoisrws/core/v1"
import requests as rq
import xml.etree.ElementTree as ET
request = rq.get('http://whois.arin.net/rest/nets;name=logitech')
tree = ET.fromstring(request.text)
nmsp = {"doc": "http://www.arin.net/whoisrws/core/v1"}
for elem in tree.findall(".//doc:netRef", nmsp):
print(f"endAddress: {elem.attrib['endAddress']}")
print(f"startAddress: {elem.attrib['startAddress']}")
print("---------------------------\n")
# endAddress: 173.8.217.111
# startAddress: 173.8.217.96
# ---------------------------
# endAddress: 50.193.49.47
# startAddress: 50.193.49.32
# ---------------------------
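If the goal is the single "IP range: ..." line the PHP side expects, here is a short follow-up sketch (reusing tree and nmsp from above) that collects each start/end pair into one string:
ranges = [
    f"{elem.attrib['startAddress']}-{elem.attrib['endAddress']}"
    for elem in tree.findall(".//doc:netRef", nmsp)
]
print("IP range: " + ", ".join(ranges))
# IP range: 173.8.217.96-173.8.217.111, 50.193.49.32-50.193.49.47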
I'm trying to create a function which counts words in a pptx document. The problem is that I can't figure out how to find only this kind of tag:
<a:t>Some Text</a:t>
When I try to: print xmlTree.findall('.//a:t'), it returns
SyntaxError: prefix 'a' not found in prefix map
Do you know what to do to make it work?
This is the function:
def get_pptx_word_count(filename):
    import xml.etree.ElementTree as ET
    import zipfile
    z = zipfile.ZipFile(filename)
    i = 0
    wordcount = 0
    while True:
        i += 1
        slidename = 'slide{}.xml'.format(i)
        try:
            slide = z.read("ppt/slides/{}".format(slidename))
        except KeyError:
            break
        xmlTree = ET.fromstring(slide)
        for elem in xmlTree.iter():
            if elem.tag == 'a:t':
                #text = elem.getText
                #num = len(text.split(' '))
                #wordcount += num
The way to specify the namespace inside ElementTree is:
{namespace}element
So, you should change your query to:
print xmlTree.findall('.//{a}t')
Edit:
As @mxjn pointed out, if a is a prefix and not the URI, you need to insert the URI instead of a:
print xmlTree.findall('.//{http://tempuri.org/name_space_of_a}t')
or you can supply a prefix map:
prefix_map = {"a": "http://tempuri.org/name_space_of_a"}
print xmlTree.findall('.//a:t', prefix_map)
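For the .pptx case specifically, here is a minimal sketch of a per-slide word count. The namespace URI used for the a prefix is an assumption (the DrawingML main namespace normally used in slide XML); check a slide's root element to confirm:
import xml.etree.ElementTree as ET

# Assumed URI for the 'a' prefix in pptx slide XML (DrawingML main namespace).
A_NS = {"a": "http://schemas.openxmlformats.org/drawingml/2006/main"}

def count_words_in_slide(slide_xml):
    # Sum whitespace-separated words over every <a:t> text run in one slide.
    root = ET.fromstring(slide_xml)
    return sum(len((t.text or "").split()) for t in root.findall(".//a:t", A_NS))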
You need to tell ElementTree about your XML namespaces.
References:
Official Documentation (Python 2.7): 19.7.1.6. Parsing XML with Namespaces
Existing answer on StackOverflow: Parsing XML with namespace in Python via 'ElementTree'
Article by the author of ElementTree: ElementTree: Working with Namespaces and Qualified Names
Hi
I am having problems parsing an RSS feed from Stack Exchange in Python.
When I try to get the summary nodes, an empty list is returned.
I have been trying to solve this, but can't get my head around it.
Can anyone help out?
Thanks
a
In [30]: import lxml.etree, urllib2
In [31]: url_cooking = 'http://cooking.stackexchange.com/feeds'
In [32]: cooking_content = urllib2.urlopen(url_cooking)
In [33]: cooking_parsed = lxml.etree.parse(cooking_content)
In [34]: cooking_texts = cooking_parsed.xpath('.//feed/entry/summary')
In [35]: cooking_texts
Out[35]: []
Take a look at these two versions:
import lxml.html, lxml.etree
url_cooking = 'http://cooking.stackexchange.com/feeds'
#lxml.etree version
data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
#lxml.html version
data = lxml.html.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
As you discovered, the first (lxml.etree) version returns no nodes, but the lxml.html version works fine. The etree version is not working because it expects namespaces, and the html version works because it ignores namespaces. Part way down http://lxml.de/lxmlhtml.html, it says "The HTML parser notably ignores namespaces and some other XMLisms."
Note when you print the root node of the etree version (print(data.getroot())), you get something like <Element {http://www.w3.org/2005/Atom}feed at 0x22d1620>. That means it's a feed element with a namespace of http://www.w3.org/2005/Atom. Here is a corrected version of the etree code.
import lxml.html, lxml.etree
url_cooking = 'http://cooking.stackexchange.com/feeds'
ns = 'http://www.w3.org/2005/Atom'
ns_map = {'ns': ns}
data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('//ns:feed/ns:entry/ns:summary', namespaces=ns_map)
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
The problem is namespaces.
Run this:
cooking_parsed.getroot().tag
And you'll see that the element is namespaced as
{http://www.w3.org/2005/Atom}feed
Similarly if you navigate to one of the feed entries.
This means the right XPath in lxml is:
print cooking_parsed.xpath(
"//a:feed/a:entry",
namespaces={ 'a':'http://www.w3.org/2005/Atom' })
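To get the summary text itself, the same mapping works for the full path; a short follow-up, reusing cooking_parsed from above:
summaries = cooking_parsed.xpath(
    "//a:feed/a:entry/a:summary/text()",
    namespaces={ 'a':'http://www.w3.org/2005/Atom' })
print summaries[0]  # text of the first entry's summary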
Try using BeautifulStoneSoup from the BeautifulSoup package.
It might do the trick.
I am looking for a Python module that will help me get rid of HTML tags but keep the text values. I tried BeautifulSoup before and couldn't figure out how to do this simple task. I tried searching for Python modules that could do this, but they all seem to depend on other libraries, which do not work well on App Engine.
Below is sample code using Ruby's sanitize library, and that's what I am after in Python:
require 'rubygems'
require 'sanitize'
html = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
Sanitize.clean(html) # => 'foo'
Thanks for your suggestions.
-e
>>> import BeautifulSoup
>>> html = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
>>> bs = BeautifulSoup.BeautifulSoup(html)
>>> bs.findAll(text=True)
[u'foo']
This gives you a list of (Unicode) strings. If you want to turn it into a single string, use ''.join(thatlist).
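For example, joining the list from the session above gives a single string:
>>> ''.join(bs.findAll(text=True))
u'foo'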
If you don't want to use a separate lib, you can import the standard Django utils. For example:
from django.utils.html import strip_tags
html = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
stripped = strip_tags(html)
print stripped
# you got: foo
Also, it's already included in Django templates, so you don't need anything else; just use the filter, like this:
{{ unsafehtml|striptags }}
Btw, this is one of the fastest ways.
Using lxml:
from lxml.html import fromstring

htmlstring = '<b>foo</b><img src="http://foo.com/bar.jpg" />'
mySearchTree = fromstring(htmlstring)
# text_content() returns the element's text with all markup stripped
print mySearchTree.text_content()
# foo
#!/usr/bin/python
from xml.dom.minidom import parseString
def getText(el):
    ret = ''
    for child in el.childNodes:
        if child.nodeType == 3:  # 3 == TEXT_NODE
            ret += child.nodeValue
        else:
            ret += getText(child)
    return ret

html = '<b>this is a link and some bold text </b> followed by <img src="http://foo.com/bar.jpg" /> an image'
dom = parseString('<root>' + html + '</root>')
print getText(dom.documentElement)
Prints:
this is a link and some bold text followed by an image
Late, but.
You can use jinja2.Markup():
http://jinja.pocoo.org/docs/api/#jinja2.Markup.striptags
from jinja2 import Markup
Markup("<div>About</div>").striptags()
u'About'
I'm building a simple web-based RSS reader in Python, but I'm having trouble parsing the XML. I started out by trying some stuff in the Python command line.
>>> from xml.dom import minidom
>>> import urllib2
>>> url ='http://www.digg.com/rss/index.xml'
>>> xmldoc = minidom.parse(urllib2.urlopen(url))
>>> channelnode = xmldoc.getElementsByTagName("channel")
>>> titlenode = channelnode[0].getElementsByTagName("title")
>>> print titlenode[0]
<DOM Element: title at 0xb37440>
>>> print titlenode[0].nodeValue
None
I played around with this for a while, but the nodeValue of everything seems to be None. Yet if you look at the XML, there definitely are values there. What am I doing wrong?
For RSS feeds, you should try the Universal Feed Parser library. It simplifies the handling of RSS feeds immensely.
import feedparser
d = feedparser.parse('http://www.digg.com/rss/index.xml')
title = d.channel.title
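The entries are available in the same way; a small sketch, assuming the feed exposes a title and summary for each item:
for entry in d.entries:
    print entry.title
    print entry.summary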
This is the syntax you are looking for:
>>> print titlenode[0].firstChild.nodeValue
digg.com: Stories / Popular
Note that the text is held in a child text node of the element; the element's own nodeValue is None, which is why your earlier attempts printed None.