I need to display RSS-feeds with Python, Atom for the most part. Coming from PHP, where I could get values pretty fast with $entry->link i find lxml to be much more precise, faster, albeit complicated. After hours of probing I got this working with the arstechnica-feed:
def GetRSSFeed(url):
out = []
feed = urllib.urlopen(url)
feed = etree.parse(feed)
feed = feed.getroot()
for element in feed.iterfind(".//item"):
meta = element.getchildren()
title = meta[0].text
link = meta[1].text
for subel in element.iterfind(".//description"):
desc = subel.text
entry = [title,link,desc]
out.append(entry)
return out
Could this be done any easier? How can I access tags directly? Feedparser gets the job done with one line of code! Why?
Look at the feedparser library. It gives you a nicely formatted RSS object.
> import feedparser
> feed = feedparser.parse('http://feeds.marketwatch.com/marketwatch/marketpulse/')
> print feed.keys()
['feed',
'status',
'updated',
'updated_parsed',
'encoding',
'bozo',
'headers',
'etag',
'href',
'version',
'entries',
'namespaces']
> len(feed.entries)
30
You can try speedparser, an implementation of Universal Feed Parser with lxml. Still in beta though.
Related
So I'm brand new the whole web scraping thing. I've been working on a project that requires me to get the word of the day from here. I have successfully grabbed the word now I just need to get the definition, but when I do so I get this result:
Avuncular (Correct word of the day)
Definition:
[]
here's my code:
from lxml import html
import requests
page = requests.get('https://www.merriam-webster.com/word-of-the-day')
tree = html.fromstring(page.content)
word = tree.xpath('/html/body/div[1]/div/div[4]/main/article/div[1]/div[2]/div[1]/div/h1/text()')
WOTD = str(word)
WOTD = WOTD[2:]
WOTD = WOTD[:-2]
print(WOTD.capitalize())
print("Definition:")
wordDef = tree.xpath('/html/body/div[1]/div/div[4]/main/article/div[2]/div[1]/div/div[1]/p[1]/text()')
print(wordDef)
[] is supposed to be the first definition but won't work for some reason.
Any help would be greatly appreciated.
Your xpath is slightly off. Here's the correct one:
wordDef = tree.xpath('/html/body/div[1]/div/div[4]/main/article/div[3]/div[1]/div/div[1]/p[1]/text()')
Note div[3] after main/article instead of div[2]. Now when running you should get:
Avuncular
Definition:
[' suggestive of an uncle especially in kindliness or geniality']
If you wanted to avoid hardcoding index within xpath, the following would be an alternative to your current attempt:
import requests
from lxml.html import fromstring
page = requests.get('https://www.merriam-webster.com/word-of-the-day')
tree = fromstring(page.text)
word = tree.xpath("//*[#class='word-header']//h1")[0].text
wordDef = tree.xpath("//h2[contains(.,'Definition')]/following-sibling::p/strong")[0].tail.strip()
print(f'{word}\n{wordDef}')
If the wordDef fails to get the full portion then try replacing with the below one:
wordDef = tree.xpath("//h2[contains(.,'Definition')]/following-sibling::p")[0].text_content()
Output:
avuncular
suggestive of an uncle especially in kindliness or geniality
https://www.cpms.osd.mil/Content/AF%20Schedules/survey-sch/111/111R-03Apr2003.html
This is the page I am trying to parse. It's from a government site which in my experience are not known for keeping up their certificates, so you are are going to be warned about it not being safe by your browser. All I want is this part,http://imgur.com/a/BL14W.
edit: Sorry, for the lack of information. I started asking this question then I got called away at work. It's no excuse but when I came back it was time to go home so, I just kinda hit submit.
I have already tried doing it more "manually" but apparently not all of the documents came out exactly the same. Here is what I tried:
def table_parser(page):
file = open(page)
table = []
num = 0
for line in file:
if 'Grade' in line:
num += 1
if num > 0:
num += 1
if 3 <= num < 21:
line = line.rstrip()
if line != '':
split_line = line.split(' ')
split_line = [x for x in split_line if x != '']
strip_line = split_line[:16]
table.append(strip_line)
WG = []
WL = []
WS = []
for l in table:
WG.append((l[1:6]))
WL.append(l[6:11])
WS.append(l[11:16])
file.close()
# Return 3 lists for the 3 charts I want
return WG, WL, WS
This is what I used that got the about half of the 65k files I started with mostly right. I passed the returned lists into csv writers to store them till I can get them all cleaned up. I know there is probably a better way but I came up with this before I could wrap my head around BeautifulSoup. I don't necessarily want the code to do this, just pointers on where to start. I tried to find documentation on BeautifulSoup but I couldn't figure out where to start for what I need.
Your question is a little vague so I'll try my best to help you.
1. Install Beautiful Soup 4
To get a block of text from a webpage,you will need to use the external library BeautifulSoup4 (BS4). Once downloaded and installed to your computer, first import BS4 using the following from bs4 import BeautifulSoupand import urllib.request. Then simply setup BS4 using soup = BeautifulSoup("", "html.parser").
2. Download Webpage
Downloading a webpage is simple, just use site_download = urllib.request.urlopen(url). In your case, simply replace "url" with the url you provided here. Then we need to read what we've downloaded using site_read = site_download.read().decode('utf-8') followed by soup = BeautifulSoup(site_read, "html.parser").
3. Get Block of Text
You can get text in many different ways, so I'll show you a few examples.
To get the first instance of < P > tag (paragraph) text:
text = soup.find("p")
text = getText()
To get all instances of the < P > tag:
text = soup.findAll("p")
text = getText()
To get text from a specific class:
text = soup.find(attrs={"class": "class_name_here"})
text = getText()
4. Further Info
More information on how to get different types of tags and other things you can do with BS4 can be found HERE.
I'd like to switch CJK characters in Python 3.3. That is, I need to get 價(Korean) from 价(Chinese), and 価(Japanese) from 價. Is there a external module like that?
Unihan information
The Unihan page about 價 provide a simplified variant (vs. traditionnal), but doesn't seems to give Japanese/Korean one. So...
CJKlib
I would recommend to have a look at CJKlib, which has a feature section called Variants stating:
Z-variant forms, which only differ in typeface
[update] Z-variant
Your sample character 價 (U+50F9) doesn't have a z-variant. However 価 (U+4FA1) has a kZVariant to U+50F9 價. This seems weird.
Further reading
Package documentation is available on Python.org/pypi/cjklib ;
Z-variant form definition.
Here is a relatively complete conversion table. You can dump it to json for later use:
import requests
from bs4 import BeautifulSoup as BS
import json
def gen(soup):
for tr in soup.select('tr'):
tds = tr.select('td.tdR4')
if len(tds) == 6:
yield tds[2].string, tds[3].string
uri = 'http://www.kishugiken.co.jp/cn/code10d.html'
soup = BS(requests.get(uri).content, 'html5lib')
d = {}
for hanzi, kanji in gen(soup):
a = d.get(hanzi, [])
a.append(kanji)
d[hanzi] = a
print(json.dumps(d, indent=4))
The code and it's output are in this gist.
So I have a data retrieval/entry project and I want to extract a certain part of a webpage and store it in a text file. I have a text file of urls and the program is supposed to extract the same part of the page for each url.
Specifically, the program copies the legal statute following "Legal Authority:" on pages such as this. As you can see, there is only one statute listed. However, some of the urls also look like this, meaning that there are multiple separated statutes.
My code works for pages of the first kind:
from sys import argv
from urllib2 import urlopen
script, urlfile, legalfile = argv
input = open(urlfile, "r")
output = open(legalfile, "w")
def get_legal(page):
# this is where Legal Authority: starts in the code
start_link = page.find('Legal Authority:')
start_legal = page.find('">', start_link+1)
end_link = page.find('<', start_legal+1)
legal = page[start_legal+2: end_link]
return legal
for line in input:
pg = urlopen(line).read()
statute = get_legal(pg)
output.write(get_legal(pg))
Giving me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names. I've tried something like this:
def get_legal(page):
# this is where Legal Authority: starts in the code
end_link = ""
legal = ""
start_link = page.find('Legal Authority:')
while (end_link != '</a> '):
start_legal = page.find('">', start_link+1)
end_link = page.find('<', start_legal+1)
end2 = page.find('</a> ', end_link+1)
legal += page[start_legal+2: end_link]
if
break
return legal
Since every list of statutes ends with '</a> ' (inspect the source of either of the two links) I thought I could use that fact (having it as the end of the index) to loop through and collect all the statutes in one string. Any ideas?
I would suggest using BeautifulSoup to parse and search your html. This will be much easier than doing basic string searches.
Here's a sample that pulls all the <a> tags found within the <td> tag that contains the <b>Legal Authority:</b> tag. (Note that I'm using requests library to fetch page content here - this is just a recommended and very easy to use alternative to urlopen.)
import requests
from BeautifulSoup import BeautifulSoup
# fetch the content of the page with requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)
# parse the html
html = BeautifulSoup(response.content)
# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})
def fetch_parent_tag(tags):
# fetch the parent <td> tag of the first <a> tag
# whose "previous sibling" is the <b>Legal Authority:</b> tag.
for tag in tags:
sibling = tag.findPreviousSibling()
if not sibling:
continue
if sibling.getText() == 'Legal Authority:':
return tag.findParent()
# now, just find all the child <a> tags of the parent.
# i.e. finding the parent of one child, find all the children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')
for tag in tags_you_want:
print 'statute: ' + tag.getText()
If this isn't exactly what you needed to do, BeautifulSoup is still the tool you likely want to use for sifting through html.
They provide XML data over there, see my comment. If you think you can't download that many files (or the other end could dislike so many HTTP GET requests), I'd recommend asking their admins if they would kindly provide you with a different way of accessing the data.
I have done so twice in the past (with scientific databases). In one instance the sheer size of the dataset prohibited a download; they ran a SQL query of mine and e-mailed the results (but had previously offered to mail a DVD or hard disk). In another case, I could have done some million HTTP requests to a webservice (and they were ok) each fetching about 1k bytes. This would have taken long, and would have been quite inconvenient (requiring some error-handling, since some of these requests would always time out) (and non-atomic due to paging). I was mailed a DVD.
I'd imagine that the Office of Management and Budget could possibly be similar accomodating.
I have an xml feed, say:
http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/
I want to get the list of hrefs for the videos:
['http://www.youtube.com/watch?v=aJvVkBcbFFY', 'ht....', ... ]
from xml.etree import cElementTree as ET
import urllib
def get_bass_fishing_URLs():
results = []
data = urllib.urlopen(
'http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/')
tree = ET.parse(data)
ns = '{http://www.w3.org/2005/Atom}'
for entry in tree.findall(ns + 'entry'):
for link in entry.findall(ns + 'link'):
if link.get('rel') == 'alternate':
results.append(link.get('href'))
as it appears that what you get are the so-called "alternate" links. The many small, possible variations if you want something slightly different, I hope, should be clear from the above code (plus the standard Python library docs for ElementTree).
Have a look at Universal Feed Parser, which is an open source RSS and Atom feed parser for Python.
In such a simple case, this should be enough:
import re, urllib2
request = urllib2.urlopen("http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/")
text = request.read()
videos = re.findall("http:\/\/www\.youtube\.com\/watch\?v=[\w-]+", text)
If you want to do more complicated stuff, parsing the XML will be better suited than regular expressions
import urllib
from xml.dom import minidom
xmldoc = minidom.parse(urllib.urlopen('http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/'))
links = xmldoc.getElementsByTagName('link')
hrefs = []
for links in link:
if link.getAttribute('rel') == 'alternate':
hrefs.append( link.getAttribute('href') )
hrefs