All nodeValue fields are None when parsing XML - python

I'm building a simple web-based RSS reader in Python, but I'm having trouble parsing the XML. I started out by trying some stuff in the Python command line.
>>> from xml.dom import minidom
>>> import urllib2
>>> url ='http://www.digg.com/rss/index.xml'
>>> xmldoc = minidom.parse(urllib2.urlopen(url))
>>> channelnode = xmldoc.getElementsByTagName("channel")
>>> titlenode = channelnode[0].getElementsByTagName("title")
>>> print titlenode[0]
<DOM Element: title at 0xb37440>
>>> print titlenode[0].nodeValue
None
I played around with this for a while, but the nodeValue of everything seems to be None. Yet if you look at the XML, there definitely are values there. What am I doing wrong?

For RSS feeds you should try the Universal Feed Parser library. It simplifies the handling of RSS feeds immensely.
import feedparser
d = feedparser.parse('http://www.digg.com/rss/index.xml')
title = d.channel.title
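The feed entries are available the same way; a quick example using feedparser's standard entries list:
for entry in d.entries:
    print entry.title, entry.link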

This is the syntax you are looking for:
>>> print titlenode[0].firstChild.nodeValue
digg.com: Stories / Popular
Note that an element's text lives in a child text node of the element itself; the element's own nodeValue is always None, which is why you have to go through firstChild.
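If an element can hold more than one text node (text split by entities or CDATA sections, for example), joining all child text nodes is more robust than firstChild alone. A minimal sketch; the helper name get_text is illustrative only:
from xml.dom import minidom

def get_text(element):
    # concatenate the data of every child text node of a DOM element
    return ''.join(child.data for child in element.childNodes
                   if child.nodeType == child.TEXT_NODE)

doc = minidom.parseString('<title>digg.com: Stories / Popular</title>')
print get_text(doc.documentElement)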

Related

lxml xpath not working if default namespace is present

My org is migrating from Python 2.7 to Python 3.7. There is some Python code that I did not write but that has to be migrated; it uses xml.xpath to parse XML.
Since xml.xpath is not available in Python 3.7, I am trying to port the code to use XPath from lxml.etree. The intention is to keep the amount of code change to a minimum.
I have pasted the current implementation as well as the code I am porting it to. The ported code is not working because the XML has a default namespace.
Current code
>>> from xml import xpath
>>> from xml.dom.minidom import parseString
>>> Test_XML = '<ibml:validateTradeStatusRequest xmlns="http://www.fpml.org/2005/FpML-4-2" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:ecore="http://www.eclipse.org/emf/2002/Ecore" xmlns:ibml="http://ibml.jpmorgan.com/2005" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ibmlVersion="1-56" version="4-2" xsi:schemaLocation="http://ibml.jpmorgan.com/2005C:\\IBTIBML\\trunk\\src\\xsd\\IBML.xsd"><header><sentBy>test_value</sentBy><sendTo>test_value</sendTo><creationTimestamp>2012-06-06T08:23:20.613</creationTimestamp></header><tradeReference></tradeReference></ibml:validateTradeStatusRequest>'
>>> doc = parseString( Test_XML )
>>> ctx = xpath.Context.Context( doc, processorNss = {'ibml' : 'http://ibml.jpmorgan.com/2005'} )
>>> expr = xpath.Compile( '/ibml:validateTradeStatusRequest/header/sentBy' )
>>> node = expr.evaluate(ctx)
>>> node[0].childNodes[0].data
u'test_value'
>>>
My attempt to port it to lxml.etree, which does not work because the XML has a default namespace:
>>> from lxml import etree as ET
>>> element = ET.fromstring( Test_XML )
>>> element.xpath( '/ibml:validateTradeStatusRequest/header/sentBy', namespaces = {'ibml' : 'http://ibml.jpmorgan.com/2005'} )
[]
However, if I remove the default namespace, the XPath evaluation works fine.
But that is not the desired solution, since the XML is created outside the scope of this code. It is also not possible to change the XPath query, as that too is outside the scope of the code.
>>> Test_XML_No_Default_NS = '<ibml:validateTradeStatusRequest xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:ecore="http://www.eclipse.org/emf/2002/Ecore" xmlns:ibml="http://ibml.jpmorgan.com/2005" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ibmlVersion="1-56" version="4-2" xsi:schemaLocation="http://ibml.jpmorgan.com/2005C:\\IBTIBML\\trunk\\src\\xsd\\IBML.xsd"><header><sentBy>test_value</sentBy><sendTo>test_value</sendTo><creationTimestamp>2012-06-06T08:23:20.613</creationTimestamp></header><tradeReference></tradeReference></ibml:validateTradeStatusRequest>'
>>> ee = ET.fromstring( Test_XML_No_Default_NS )
>>> ee.xpath( '/ibml:validateTradeStatusRequest/header/sentBy', namespaces = {'ibml' : 'http://ibml.jpmorgan.com/2005'} )
[<Element sentBy at 0x1546b058>]
>>> node = _
>>> node[0].text
'test_value'
Any suggestions on how I can move forward?
I'm not sure this answers your question directly, but try this and see if it works (maybe with modifications):
from lxml import etree
Test_XML = '<?xml version="1.0" encoding="UTF-8"?><ibml:validateTradeStatusRequest xmlns="http://www.fpml.org/2005/FpML-4-2" xmlns:dsig="http://www.w3.org/2000/09/xmldsig#" xmlns:ecore="http://www.eclipse.org/emf/2002/Ecore" xmlns:ibml="http://ibml.jpmorgan.com/2005" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" ibmlVersion="1-56" version="4-2" xsi:schemaLocation="http://ibml.jpmorgan.com/2005C:\\IBTIBML\\trunk\\src\\xsd\\IBML.xsd"><header><sentBy>test_value1</sentBy><sendTo>test_value2</sendTo><creationTimestamp>2012-06-06T08:23:20.613</creationTimestamp></header><tradeReference></tradeReference></ibml:validateTradeStatusRequest>'
# etree.XML() rejects str input that carries an encoding declaration, so encode to bytes first
xml = bytes(bytearray(Test_XML, encoding='utf-8'))
ee = etree.XML(xml)
target = ee.xpath( '//ibml:validateTradeStatusRequest', namespaces = {'ibml' : 'http://ibml.jpmorgan.com/2005'} )
print(target[0].xpath('//text()')[0])
Output:
test_value1
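The root cause is that the unprefixed steps (header, sentBy) live in the document's default namespace, and XPath 1.0 can only match namespaced elements through a prefix. If neither the XML nor the query strings may change, one option is to map the default namespace to a reserved prefix and patch the unprefixed steps on the fly. A minimal sketch under those assumptions, reusing Test_XML from just above; the helper name xpath_with_default_ns and the prefix d are purely illustrative, and the rewrite only handles simple child-axis steps:
import re
from lxml import etree

def xpath_with_default_ns(element, query, default_ns, extra_ns=None):
    # map the default namespace to the reserved prefix 'd', then insert
    # that prefix in front of every unprefixed name step in the query
    ns_map = {'d': default_ns}
    if extra_ns:
        ns_map.update(extra_ns)
    patched = re.sub(r'(?<=/)(?![\w.-]+:)(?=[A-Za-z_])', 'd:', query)
    return element.xpath(patched, namespaces=ns_map)

ee = etree.fromstring(Test_XML.encode('utf-8'))
nodes = xpath_with_default_ns(
    ee, '/ibml:validateTradeStatusRequest/header/sentBy',
    default_ns='http://www.fpml.org/2005/FpML-4-2',
    extra_ns={'ibml': 'http://ibml.jpmorgan.com/2005'})
print(nodes[0].text)  # 'test_value1' with the Test_XML defined above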

Fetch News Data from right Scrollbar using Beautifulsoup

I am using the following webpage https://www.google.com/finance?q=NYSE%3AF&ei=LvflU_itN8zbkgW0i4GABQ
to get the data from the right-hand scroller.
I have attached a screenshot where a red arrow marks the segment.
I have used the following code:
import urllib2
from bs4 import BeautifulSoup

def parse():
    mainPage = urllib2.urlopen("https://www.google.com/finance?q=NYSE%3AF&ei=LvflU_itN8zbkgW0i4GABQ")
    lSoupPage = BeautifulSoup(mainPage)
    for index in lSoupPage.findAll("div", {"class": "jfk-scrollbar"}):
        for item in index.findAll("div", {"class": "news-item"}):
            print item.a.text.strip()
I am not able to fetch the news-url by doing this. Please help.
The sidebar is loaded over AJAX and is not part of the page itself.
The page has a content id:
cid = lSoupPage.find('link', rel='canonical')['href'].rpartition('=')[-1]
use this to get the news data:
newsdata = urllib2.urlopen('https://www.google.com/finance/kd?output=json&keydevs=1&recnews=0&cid=' + cid)
Unfortunately, the data returned is not valid JSON; the keys are not quoted. It is valid ECMAScript, just not valid JSON.
You can either 'repair' this with a regular expression, or use a lenient parser that accepts ECMAScript object notation.
The latter can be done with the external demjson library:
>>> import requests
>>> import demjson
>>> r = requests.get('https://www.google.com/finance/kd?output=json&keydevs=1&recnews=0&cid=' + cid)
>>> data = demjson.decode(r.content)
>>> data.keys()
[u'clusters', u'result_total_articles', u'results_per_page', u'result_end_num', u'result_start_num']
>>> data['clusters'][0]['a'][0]['t']
u'Former Ford Motor Co. CEO joins Google board'
Repairing with a regular expression:
import re
import json

broken_data = r.content  # the ECMAScript-style response fetched above
# quote every bare key that follows '{' or ',' so the stdlib json module can parse it
repaired_data = re.sub(r'(?<={|,)\s*(\w+)(?=:)', r'"\1"', broken_data)
data = json.loads(repaired_data)
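For illustration, here is the same substitution applied to a tiny ECMAScript-style literal (hypothetical sample data):
import re, json

broken = '{clusters: [{a: 1, b: 2}], result_total_articles: 30}'
repaired = re.sub(r'(?<={|,)\s*(\w+)(?=:)', r'"\1"', broken)
print json.loads(repaired)  # the repaired string now parses cleanly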

Less painful way to parse a RSS-Feed with lxml?

I need to display RSS feeds with Python, Atom for the most part. Coming from PHP, where I could get values quickly with $entry->link, I find lxml to be much more precise and faster, albeit complicated. After hours of probing I got this working with the Ars Technica feed:
import urllib
from lxml import etree

def GetRSSFeed(url):
    out = []
    feed = urllib.urlopen(url)
    feed = etree.parse(feed)
    feed = feed.getroot()
    for element in feed.iterfind(".//item"):
        meta = element.getchildren()
        title = meta[0].text
        link = meta[1].text
        for subel in element.iterfind(".//description"):
            desc = subel.text
        entry = [title, link, desc]
        out.append(entry)
    return out
Could this be done any easier? How can I access tags directly? Feedparser gets the job done with one line of code! Why?
Look at the feedparser library. It gives you a nicely formatted RSS object.
> import feedparser
> feed = feedparser.parse('http://feeds.marketwatch.com/marketwatch/marketpulse/')
> print feed.keys()
['feed',
'status',
'updated',
'updated_parsed',
'encoding',
'bozo',
'headers',
'etag',
'href',
'version',
'entries',
'namespaces']
> len(feed.entries)
30
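To answer the direct-access part of the question: when the items carry no namespace (plain RSS 2.0, as in the question's feed), lxml can look children up by tag name with findtext instead of by position. A minimal sketch under that assumption; get_rss_feed is an illustrative name:
import urllib
from lxml import etree

def get_rss_feed(url):
    root = etree.parse(urllib.urlopen(url)).getroot()
    # findtext() fetches a child by tag name, so the order of
    # title/link/description inside each item no longer matters
    return [[item.findtext('title'),
             item.findtext('link'),
             item.findtext('description')]
            for item in root.iterfind('.//item')]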
You can try speedparser, an implementation of Universal Feed Parser with lxml. Still in beta though.

lxml - difficulty parsing stackexchange rss feed

Hi,
I am having problems parsing an RSS feed from Stack Exchange in Python.
When I try to get the summary nodes, an empty list is returned.
I have been trying to solve this, but can't get my head around it.
Can anyone help out?
Thanks
a
In [30]: import lxml.etree, urllib2
In [31]: url_cooking = 'http://cooking.stackexchange.com/feeds'
In [32]: cooking_content = urllib2.urlopen(url_cooking)
In [33]: cooking_parsed = lxml.etree.parse(cooking_content)
In [34]: cooking_texts = cooking_parsed.xpath('.//feed/entry/summary')
In [35]: cooking_texts
Out[35]: []
Take a look at these two versions:
import lxml.html, lxml.etree
url_cooking = 'http://cooking.stackexchange.com/feeds'
#lxml.etree version
data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
#lxml.html version
data = lxml.html.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
As you discovered, the first (etree) version returns no nodes, but the lxml.html version works fine. The etree version fails because it expects namespaces, and the html version works because it ignores namespaces. Part way down http://lxml.de/lxmlhtml.html, it says "The HTML parser notably ignores namespaces and some other XMLisms."
Note when you print the root node of the etree version (print(data.getroot())), you get something like <Element {http://www.w3.org/2005/Atom}feed at 0x22d1620>. That means it's a feed element with a namespace of http://www.w3.org/2005/Atom. Here is a corrected version of the etree code.
import lxml.html, lxml.etree
url_cooking = 'http://cooking.stackexchange.com/feeds'
ns = 'http://www.w3.org/2005/Atom'
ns_map = {'ns': ns}
data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('//ns:feed/ns:entry/ns:summary', namespaces=ns_map)
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
The problem is namespaces.
Run this :
cooking_parsed.getroot().tag
And you'll see that the element is namespaced as
{http://www.w3.org/2005/Atom}feed
Similarly if you navigate to one of the feed entries.
This means the right xpath in lxml is:
print cooking_parsed.xpath(
    "//a:feed/a:entry",
    namespaces={'a': 'http://www.w3.org/2005/Atom'})
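Extending the same namespace-mapped query one level further pulls out the summary text itself (text() yields plain strings in lxml):
summaries = cooking_parsed.xpath(
    "//a:feed/a:entry/a:summary/text()",
    namespaces={'a': 'http://www.w3.org/2005/Atom'})
print summaries[0]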
Try using BeautifulStoneSoup from the BeautifulSoup package.
It might do the trick.
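For reference, a minimal sketch of that approach with BeautifulSoup 3, whose BeautifulStoneSoup XML parser does not interpret namespaces, which sidesteps the problem entirely:
import urllib2
from BeautifulSoup import BeautifulStoneSoup  # BeautifulSoup 3

xml = urllib2.urlopen('http://cooking.stackexchange.com/feeds').read()
soup = BeautifulStoneSoup(xml)
summaries = soup.findAll('summary')  # matched by plain tag name, no namespace map needed
print len(summaries)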

Simple scraping of youtube xml to get a Python list of videos

I have an xml feed, say:
http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/
I want to get the list of hrefs for the videos:
['http://www.youtube.com/watch?v=aJvVkBcbFFY', 'ht....', ... ]
from xml.etree import cElementTree as ET
import urllib

def get_bass_fishing_URLs():
    results = []
    data = urllib.urlopen(
        'http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/')
    tree = ET.parse(data)
    ns = '{http://www.w3.org/2005/Atom}'
    for entry in tree.findall(ns + 'entry'):
        for link in entry.findall(ns + 'link'):
            if link.get('rel') == 'alternate':
                results.append(link.get('href'))
    return results
It appears that the URLs you want are the so-called "alternate" links. The many small possible variations, if you want something slightly different, should I hope be clear from the above code (plus the standard Python library docs for ElementTree).
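Usage is then simply:
for href in get_bass_fishing_URLs():
    print href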
Have a look at Universal Feed Parser, which is an open source RSS and Atom feed parser for Python.
In such a simple case, this should be enough:
import re, urllib2
request = urllib2.urlopen("http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/")
text = request.read()
videos = re.findall("http:\/\/www\.youtube\.com\/watch\?v=[\w-]+", text)
If you want to do more complicated stuff, parsing the XML will be better suited than regular expressions:
import urllib
from xml.dom import minidom

xmldoc = minidom.parse(urllib.urlopen('http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/'))
links = xmldoc.getElementsByTagName('link')
hrefs = []
for link in links:
    if link.getAttribute('rel') == 'alternate':
        hrefs.append(link.getAttribute('href'))
print hrefs
