Hi,
I am having problems parsing an RSS feed from Stack Exchange in Python.
When I try to get the summary nodes, an empty list is returned.
I have been trying to solve this, but can't get my head around it.
Can anyone help out?
Thanks
In [30]: import lxml.etree, urllib2
In [31]: url_cooking = 'http://cooking.stackexchange.com/feeds'
In [32]: cooking_content = urllib2.urlopen(url_cooking)
In [33]: cooking_parsed = lxml.etree.parse(cooking_content)
In [34]: cooking_texts = cooking_parsed.xpath('.//feed/entry/summary')
In [35]: cooking_texts
Out[35]: []
Take a look at these two versions
import lxml.html, lxml.etree
url_cooking = 'http://cooking.stackexchange.com/feeds'
#lxml.etree version
data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
#lxml.html version
data = lxml.html.parse(url_cooking)
summary_nodes = data.xpath('.//feed/entry/summary')
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
As you discovered, the first (lxml.etree) version returns no nodes, but the lxml.html version works fine. The etree version fails because it expects namespaces, and the html version works because it ignores namespaces. Part way down http://lxml.de/lxmlhtml.html, it says "The HTML parser notably ignores namespaces and some other XMLisms."
Note that when you print the root node of the etree version (print(data.getroot())), you get something like <Element {http://www.w3.org/2005/Atom}feed at 0x22d1620>. That means it is a feed element in the http://www.w3.org/2005/Atom namespace. Here is a corrected version of the etree code.
import lxml.etree
url_cooking = 'http://cooking.stackexchange.com/feeds'
ns = 'http://www.w3.org/2005/Atom'
ns_map = {'ns': ns}
data = lxml.etree.parse(url_cooking)
summary_nodes = data.xpath('//ns:feed/ns:entry/ns:summary', namespaces=ns_map)
print('Found ' + str(len(summary_nodes)) + ' summary nodes')
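Since the original example depends on a live feed, here is a self-contained sketch of the same namespace behaviour using the standard library's ElementTree (the inline Atom fragment is made up for illustration):

```python
import xml.etree.ElementTree as ET

# Made-up Atom fragment using the same namespace as the real feed.
atom = '''<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><summary>First summary</summary></entry>
  <entry><summary>Second summary</summary></entry>
</feed>'''

root = ET.fromstring(atom)
ns_map = {'ns': 'http://www.w3.org/2005/Atom'}

# Without the namespace prefix, nothing matches...
print(len(root.findall('.//entry/summary')))                # 0
# ...with the prefix mapped, both summaries are found.
print(len(root.findall('.//ns:entry/ns:summary', ns_map)))  # 2
```

The same prefix-to-URI mapping idea is what the `namespaces=ns_map` argument does in the lxml xpath call above.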
The problem is namespaces.
Run this:
cooking_parsed.getroot().tag
And you'll see that the element is namespaced as
{http://www.w3.org/2005/Atom}feed
Similarly if you navigate to one of the feed entries.
This means the right xpath in lxml is:
print cooking_parsed.xpath(
    "//a:feed/a:entry",
    namespaces={'a': 'http://www.w3.org/2005/Atom'})
Try using BeautifulStoneSoup from the BeautifulSoup package.
It might do the trick.
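For reference, BeautifulStoneSoup was BeautifulSoup 3's XML mode; in today's bs4 the nearest equivalent trick is a lenient parser backend such as `html.parser`, which, like lxml.html, simply ignores namespaces. A sketch assuming bs4 is installed (the inline feed fragment is made up):

```python
from bs4 import BeautifulSoup

# Made-up Atom fragment standing in for the live feed.
atom = '''<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><summary>First summary</summary></entry>
</feed>'''

# html.parser ignores namespaces, so plain tag names match.
soup = BeautifulSoup(atom, 'html.parser')
summaries = soup.find_all('summary')
print([s.get_text() for s in summaries])  # ['First summary']
```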
So I'm brand new to the whole web scraping thing. I've been working on a project that requires me to get the word of the day from here. I have successfully grabbed the word; now I just need to get the definition, but when I do so I get this result:
Avuncular (Correct word of the day)
Definition:
[]
here's my code:
from lxml import html
import requests
page = requests.get('https://www.merriam-webster.com/word-of-the-day')
tree = html.fromstring(page.content)
word = tree.xpath('/html/body/div[1]/div/div[4]/main/article/div[1]/div[2]/div[1]/div/h1/text()')
WOTD = str(word)
WOTD = WOTD[2:]
WOTD = WOTD[:-2]
print(WOTD.capitalize())
print("Definition:")
wordDef = tree.xpath('/html/body/div[1]/div/div[4]/main/article/div[2]/div[1]/div/div[1]/p[1]/text()')
print(wordDef)
The [] is supposed to be the first definition, but the xpath won't match for some reason.
Any help would be greatly appreciated.
Your xpath is slightly off. Here's the correct one:
wordDef = tree.xpath('/html/body/div[1]/div/div[4]/main/article/div[3]/div[1]/div/div[1]/p[1]/text()')
Note div[3] after main/article instead of div[2]. Now when running you should get:
Avuncular
Definition:
[' suggestive of an uncle especially in kindliness or geniality']
If you wanted to avoid hardcoding index within xpath, the following would be an alternative to your current attempt:
import requests
from lxml.html import fromstring
page = requests.get('https://www.merriam-webster.com/word-of-the-day')
tree = fromstring(page.text)
word = tree.xpath("//*[@class='word-header']//h1")[0].text
wordDef = tree.xpath("//h2[contains(.,'Definition')]/following-sibling::p/strong")[0].tail.strip()
print(f'{word}\n{wordDef}')
If the wordDef fails to get the full portion then try replacing with the below one:
wordDef = tree.xpath("//h2[contains(.,'Definition')]/following-sibling::p")[0].text_content()
Output:
avuncular
suggestive of an uncle especially in kindliness or geniality
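The difference between those two variants comes down to how lxml exposes text around child elements: `.tail` is only the text that follows a child tag, while `.text_content()` flattens everything inside the element. A self-contained sketch with a made-up snippet, assuming lxml is installed:

```python
from lxml.html import fromstring

# Made-up markup mimicking a definition paragraph.
snippet = '<p><strong>1 :</strong> suggestive of an uncle</p>'
p = fromstring(snippet)

strong = p.xpath('//strong')[0]
print(strong.tail)       # only the text after </strong>
print(p.text_content())  # all text, including the <strong> part
```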
I need to get the value from the FLVPath from this link : http://www.testpage.com/v2/videoConfigXmlCode.php?pg=video_29746_no_0_extsite
import requests
from lxml import html

sub_r = requests.get("http://www.testpage.co/v2/videoConfigXmlCode.php?pg=video_%s_no_0_extsite" % list[6])
sub_root = html.fromstring(sub_r.content)

for sub_data in sub_root.xpath('//PLAYER_SETTINGS[@Name="FLVPath"]/@Value'):
    print sub_data
But no data returned
You're using lxml.html to parse the document, which causes lxml to lowercase all element and attribute names (since that doesn't matter in html), which means you'll have to use:
sub_root.xpath('//player_settings[@name="FLVPath"]/@value')
Or as you're parsing a xml file anyway, you could use lxml.etree.
You could try
print sub_data.attrib['Value']
import requests
import lxml.etree

url = "http://www.testpage.com/v2/videoConfigXmlCode.php?pg=video_29746_no_0_extsite"
response = requests.get(url)

# Use `lxml.etree` rather than `lxml.html`,
# and unicode `response.text` instead of `response.content`
doc = lxml.etree.fromstring(response.text)
for path in doc.xpath('//PLAYER_SETTINGS[@Name="FLVPath"]/@Value'):
    print path
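To see the case-sensitivity point without the live URL, here is a self-contained sketch using the standard library's ElementTree (which, like lxml.etree, preserves the original tag and attribute case; the config fragment is made up):

```python
import xml.etree.ElementTree as ET

# Made-up stand-in for the video config document.
xml_doc = '''<CONFIG>
  <PLAYER_SETTINGS Name="FLVPath" Value="http://example.com/video.flv"/>
  <PLAYER_SETTINGS Name="Width" Value="640"/>
</CONFIG>'''

root = ET.fromstring(xml_doc)

# An XML parser keeps the original case, so PLAYER_SETTINGS/@Name match as written.
flv_paths = [node.get('Value')
             for node in root.findall(".//PLAYER_SETTINGS[@Name='FLVPath']")]
print(flv_paths)  # ['http://example.com/video.flv']
```

With an HTML parser the same query would find nothing, because the element names would have been lowercased during parsing.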
I have a HTML file I got from Wikipedia and would like to find every link on the page such as /wiki/Absinthe and replace it with the current directory added to the front such as /home/fergus/wikiget/wiki/Absinthe so for:
<a href="/wiki/Absinthe">Absinthe</a>
becomes:
<a href="/home/fergus/wikiget/wiki/Absinthe">Absinthe</a>
and this is throughout the whole document.
Do you have any ideas? I'm happy to use BeautifulSoup or Regex!
If that's really all you have to do, you could do it with sed and its -i option to rewrite the file in-place:
sed -e 's,href="/wiki,href="/home/fergus/wikiget/wiki,' wiki-file.html
However, here's a Python solution using the lovely lxml API, in case you need to do anything more complex or you might have badly formed HTML, etc.:
from lxml import etree
import re

parser = etree.HTMLParser()
with open("wiki-file.html") as fp:
    tree = etree.parse(fp, parser)

for e in tree.xpath("//a[@href]"):
    link = e.attrib['href']
    if re.search('^/wiki', link):
        e.attrib['href'] = '/home/fergus/wikiget' + link

# Or you can just specify the same filename to overwrite it:
with open("wiki-file-rewritten.html", "w") as fp:
    fp.write(etree.tostring(tree))
Note that lxml is probably a better option than BeautifulSoup for this kind of task nowadays, for the reasons given by BeautifulSoup's author.
This is a solution using the re module:
#!/usr/bin/env python
import re

html = open('file.html').read()
html = re.sub('href="http://en.wikipedia.org', 'href="/home/fergus/wikiget', html)
open('output.html', 'w').write(html)
Here's another one without using re:
#!/usr/bin/env python
html = open('file.html').read()
open('output.html', 'w').write(html.replace('href="http://en.wikipedia.org', 'href="/home/fergus/wikiget'))
You can use a function with re.sub:
import re

def match(m):
    return '<a href="/home/fergus/wikiget' + m.group(1) + '">'

r = re.compile(r'<a\shref="([^"]+)">')
r.sub(match, yourtext)
An example:
>>> s = '<a href="/wiki/Absinthe">Absinthe</a>'
>>> r.sub(match, s)
'<a href="/home/fergus/wikiget/wiki/Absinthe">Absinthe</a>'
from lxml import html

el = html.fromstring('<a href="/wiki/word">word</a>')
# or `el = html.parse(file_or_url).getroot()`

def repl(link):
    if link.startswith('/'):
        link = '/home/fergus/wikiget' + link
    return link

print(html.tostring(el))
el.rewrite_links(repl)
print(html.tostring(el))
Output
<a href="/wiki/word">word</a>
<a href="/home/fergus/wikiget/wiki/word">word</a>
You could also use the function lxml.html.rewrite_links() directly:
from lxml import html

def repl(link):
    if link.startswith('/'):
        link = '/home/fergus/wikiget' + link
    return link

print html.rewrite_links(htmlstr, repl)
I would do
import re
ch = '<a href="/wiki/Absinthe">Absinthe</a>'
r = re.compile(r'(<a\s+href=")(/wiki/[^"]+">[^<]+</a>)')
print ch
print
print r.sub('\\1/home/fergus/wikiget\\2', ch)
EDIT:
This solution has been said not to capture tags with additional attributes. I thought only a narrow pattern of string was being targeted, such as <a href="/wiki/WORD">WORD</a>.
If not, well, no problem: a solution with a simpler RE is easy to write:
r = re.compile('(<a\s+href="/)([^>]+">)')
ch = '<a href="/wiki/Aide:Homonymie" title="Aide:Homonymie">'
print ch
print r.sub('\\1home/fergus/wikiget/\\2',ch)
or why not:
r = re.compile('(<a\s+href="/)')
ch = '<a href="/wiki/Aide:Homonymie" title="Aide:Homonymie">'
print ch
print r.sub('\\1home/fergus/wikiget/',ch)
I have an xml feed, say:
http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/
I want to get the list of hrefs for the videos:
['http://www.youtube.com/watch?v=aJvVkBcbFFY', 'ht....', ... ]
from xml.etree import cElementTree as ET
import urllib

def get_bass_fishing_URLs():
    results = []
    data = urllib.urlopen(
        'http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/')
    tree = ET.parse(data)
    ns = '{http://www.w3.org/2005/Atom}'
    for entry in tree.findall(ns + 'entry'):
        for link in entry.findall(ns + 'link'):
            if link.get('rel') == 'alternate':
                results.append(link.get('href'))
    return results
as it appears that what you get are the so-called "alternate" links. The many small possible variations, if you want something slightly different, should, I hope, be clear from the above code (plus the standard Python library docs for ElementTree).
Have a look at Universal Feed Parser, which is an open source RSS and Atom feed parser for Python.
In such a simple case, this should be enough:
import re, urllib2
request = urllib2.urlopen("http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/")
text = request.read()
videos = re.findall("http:\/\/www\.youtube\.com\/watch\?v=[\w-]+", text)
If you want to do more complicated stuff, parsing the XML will be better suited than regular expressions
import urllib
from xml.dom import minidom

xmldoc = minidom.parse(urllib.urlopen('http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/'))

links = xmldoc.getElementsByTagName('link')
hrefs = []
for link in links:
    if link.getAttribute('rel') == 'alternate':
        hrefs.append(link.getAttribute('href'))
hrefs
I'm building a simple web-based RSS reader in Python, but I'm having trouble parsing the XML. I started out by trying some stuff in the Python command line.
>>> from xml.dom import minidom
>>> import urllib2
>>> url ='http://www.digg.com/rss/index.xml'
>>> xmldoc = minidom.parse(urllib2.urlopen(url))
>>> channelnode = xmldoc.getElementsByTagName("channel")
>>> titlenode = channelnode[0].getElementsByTagName("title")
>>> print titlenode[0]
<DOM Element: title at 0xb37440>
>>> print titlenode[0].nodeValue
None
I played around with this for a while, but the nodeValue of everything seems to be None. Yet if you look at the XML, there definitely are values there. What am I doing wrong?
For RSS feeds you should try the Universal Feed Parser library. It simplifies the handling of RSS feeds immensely.
import feedparser
d = feedparser.parse('http://www.digg.com/rss/index.xml')
title = d.channel.title
This is the syntax you are looking for:
>>> print titlenode[0].firstChild.nodeValue
digg.com: Stories / Popular
Note that the string value is held in a text-node child of the element, not in the element itself.
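That parent/text-node relationship can be seen with a self-contained minidom example (the fragment below is made up to mirror the Digg feed):

```python
from xml.dom import minidom

# Minimal stand-in for the RSS channel element.
doc = minidom.parseString(
    '<channel><title>digg.com: Stories / Popular</title></channel>')
title = doc.getElementsByTagName('title')[0]

print(title.nodeValue)             # None -- element nodes have no value
print(title.firstChild.nodeValue)  # the text node child holds the string
```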