Fetch News Data from Right Scrollbar using BeautifulSoup - Python

I am using the following webpage https://www.google.com/finance?q=NYSE%3AF&ei=LvflU_itN8zbkgW0i4GABQ
to get the data from the right-hand side scroller.
I have attached a screenshot where a red arrow marks the segment.
I have used the following code:
import urllib2
from bs4 import BeautifulSoup

def parse():
    mainPage = urllib2.urlopen("https://www.google.com/finance?q=NYSE%3AF&ei=LvflU_itN8zbkgW0i4GABQ")
    lSoupPage = BeautifulSoup(mainPage)
    for index in lSoupPage.findAll("div", {"class": "jfk-scrollbar"}):
        for item in index.findAll("div", {"class": "news-item"}):
            print item.a.text.strip()
I am not able to fetch the news URL this way. Please help.

The sidebar is loaded over AJAX and is not part of the page itself.
The page has a content id (cid), which you can extract from the canonical link:
cid = lSoupPage.find('link', rel='canonical')['href'].rpartition('=')[-1]
use this to get the news data:
newsdata = urllib2.urlopen('https://www.google.com/finance/kd?output=json&keydevs=1&recnews=0&cid=' + cid)
Unfortunately, the data returned is not valid JSON; the keys are not using quotes. It is valid ECMAScript, just not valid JSON.
You can either 'repair' this with a regular expression, or use a lenient parser that accepts ECMAScript object notation.
The latter can be done with the external demjson library:
>>> import demjson
>>> import requests
>>> r = requests.get('https://www.google.com/finance/kd?output=json&keydevs=1&recnews=0&cid=' + cid)
>>> data = demjson.decode(r.content)
>>> data.keys()
[u'clusters', u'result_total_articles', u'results_per_page', u'result_end_num', u'result_start_num']
>>> data['clusters'][0]['a'][0]['t']
u'Former Ford Motor Co. CEO joins Google board'
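Since the question asks for the news URLs, each article entry appears to carry its link under a short key alongside 't'. A hedged sketch; the 'u' key is an assumption, so print one raw entry to confirm the field names:
for cluster in data['clusters']:
    for article in cluster.get('a', []):
        print article['t']  # headline
        print article['u']  # article URL ('u' is an assumption; inspect the raw data)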
Repairing with a regular expression instead:
import re
import json

# broken_data is the raw response body from the /finance/kd endpoint above
repaired_data = re.sub(r'(?<={|,)\s*(\w+)(?=:)', r'"\1"', broken_data)
data = json.loads(repaired_data)

Regular Expression to read between HTML works in RegEx tester but not in my code

I'm fairly new to RegEx (and Python) in general and am trying to use it to read the temperature and description of weather via the HTML tags of a website.
I've attempted to rework examples of what I've been shown in class and read online to do this.
import urllib.request

url = 'https://weather.com/en-AU/weather/today/l/-27.47,153.02'
contents = urllib.request.urlopen(url).read().decode("utf-8")

start_of_div = contents.find('<div class="today_nowcard-phrase">')  # start of phrase line
end_of_div = start_of_div + contents[start_of_div:].find("</div>") + 6  # close of phrase line
phrase_area = contents[start_of_div:end_of_div]
print(phrase_area)
phrase = phrase_area.rfind(r'>(.*)<')  # regex tester says this works
print(phrase)
There's then another section that gets the degrees which uses the same kind of layout.
It should print a phrase like 'Sunny' or 'Light Rain' or whatever else the weather is, as well as the current temperature in degrees Celsius. Instead it prints:
<div class="today_nowcard-phrase">Sunny</div>
-1
<div class="today_nowcard-temp"><span class="">21<sup>
-1
Instead of -1 it should be 'Sunny' and '21' (at that point in time). The RegEx works when I put it into RegEx testing sites, but not in my actual program (probably because of some obvious error I can't see). Any help would be appreciated.
As mentioned in the comments, use an HTML parser. The elements all have nice distinctive class names you can use, e.g. .today_nowcard-temp (where the leading . is a CSS class selector that matches on element class name):
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://weather.com/en-AU/weather/today/l/-27.47,153.02')
soup = bs(r.content, 'html.parser')
temp = soup.select_one('.today_nowcard-temp').text    # e.g. '21'
desc = soup.select_one('.today_nowcard-phrase').text  # e.g. 'Sunny'
print(temp, desc)
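As for why the original code prints -1: str.rfind searches for the literal substring '>(.*)<' rather than treating it as a regular expression, and returns -1 when that substring is absent. If you did want to stay with regex, a minimal sketch of the fix uses the re module (the parser approach above is still the more robust option):
import re

match = re.search(r'>(.*)<', phrase_area)
if match:
    print(match.group(1))  # 'Sunny'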

Beautiful Soup turns "S&P" into "S&P;", "AT&T" into "AT&T;"?

I'm parsing some rather messy HTML documents using BeautifulSoup 4 (4.3.2) and am running into a problem where it turns a company name like S&P (Standard and Poors), M&S (Marks and Spencer) or AT&T into S&P;, M&S; and AT&T;. So it wants to complete the &[A-Z]+ pattern into an HTML entity, but doesn't actually use an HTML entity lookup table, since &P; is not an HTML entity.
How do I make it not do that, or do I just need to regex-match the invalid entities and change them back?
>>> import bs4
>>> soup = bs4.BeautifulSoup('AT&T announces new plans')
>>> soup.text
u'AT&T; announces new plans'
>>> import bs4
>>> soup = bs4.BeautifulSoup('AT&TOP announces new plans')
>>> soup.text
u'AT&TOP; announces new plans'
I've tried the above on OS X 10.8.5 with Python 2.7.5 and on Scientific Linux 6 with Python 2.7.5.
This appears to be a bug or feature in the way BeautifulSoup 4 handles unknown HTML entity references. As Ignacio says in the comment above, it would probably be better to pre-process the input and replace the bare '&' symbols with the HTML entity '&amp;'.
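A minimal sketch of that pre-processing, assuming the input is in a string named html (the negative lookahead leaves anything that already looks like an entity reference alone):
import re

# escape bare ampersands that are not already part of an entity reference
html = re.sub(r'&(?!#?\w+;)', '&amp;', html)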
But if you don't want to do that for some reason, the only way I could find to fix the problem was by "monkey-patching" the code. This script worked for me (Python 2.7.3 on Mac OS X):
import bs4

def my_handle_entityref(self, name):
    character = bs4.dammit.EntitySubstitution.HTML_ENTITY_TO_CHARACTER.get(name)
    if character is not None:
        data = character
    else:
        # the original code mishandles unknown entities (the following commented-out line)
        # data = "&%s;" % name
        data = "&%s" % name
    self.handle_data(data)

bs4.builder._htmlparser.BeautifulSoupHTMLParser.handle_entityref = my_handle_entityref

soup = bs4.BeautifulSoup('AT&T announces new plans')
print soup.text
soup = bs4.BeautifulSoup('AT&TOP announces new plans')
print soup.text
It produces the output:
AT&T announces new plans
AT&TOP announces new plans
You can see the method with the problem here:
http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/builder/_htmlparser.py#L81
And the line with the problem here:
http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/builder/_htmlparser.py#L86

Is there a module to convert Chinese character to Japanese (kanji) or Korean (hanja) in Python 3?

I'd like to convert CJK characters in Python 3.3. That is, I need to get 價 (Korean) from 价 (Chinese), and 価 (Japanese) from 價. Is there an external module for that?
Unihan information
The Unihan page about 價 provides a simplified variant (vs. traditional), but doesn't seem to give the Japanese/Korean ones. So...
CJKlib
I would recommend having a look at CJKlib, which has a feature section called Variants stating:
Z-variant forms, which only differ in typeface
[update] Z-variant
Your sample character 價 (U+50F9) doesn't have a z-variant. However 価 (U+4FA1) has a kZVariant to U+50F9 價. This seems weird.
Further reading
Package documentation is available at python.org/pypi/cjklib;
Z-variant form definition.
Here is a relatively complete conversion table. You can dump it to JSON for later use:
import requests
from bs4 import BeautifulSoup as BS
import json

def gen(soup):
    # each data row has six td.tdR4 cells; columns 2 and 3 hold the hanzi and the kanji
    for tr in soup.select('tr'):
        tds = tr.select('td.tdR4')
        if len(tds) == 6:
            yield tds[2].string, tds[3].string

uri = 'http://www.kishugiken.co.jp/cn/code10d.html'
soup = BS(requests.get(uri).content, 'html5lib')

d = {}
for hanzi, kanji in gen(soup):
    a = d.get(hanzi, [])
    a.append(kanji)
    d[hanzi] = a

print(json.dumps(d, indent=4))
The code and its output are in this gist.
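Once dumped, a lookup is plain dict access. A small sketch (the file name is hypothetical, and whether 价 maps to 価 depends on the pairs the scraped table actually contains):
import json

with open('hanzi_to_kanji.json', encoding='utf-8') as f:
    table = json.load(f)
print(table.get('价'))  # e.g. ['価'], if that pair is present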

Losing encoding when splitting a string

[EDITED]
I'm using Google App Engine, and I'm trying to parse HTML content in order to extract some info. The code I'm using is:
from google.appengine.ext import webapp
from google.appengine.ext.webapp import util
from google.appengine.api import urlfetch
import BeautifulSoup

class MainHandler(webapp.RequestHandler):
    def get(self):
        url = 'http://ascodevida.com/ultimos'
        result = urlfetch.fetch(url=url)
        # ADVs on this page.
        res = BeautifulSoup.BeautifulSoup(result.content).findAll('div', {'class': 'box story'})
        ADVList = []
        for i in res:
            story = i.find('a', {'class': 'advlink'}).string
            link = i.find('a', {'class': 'advlink'})['href']
            ADVData = {
                'adv': story,
                'link': link
            }
            ADVList.append(ADVData)
        self.response.headers['Content-Type'] = 'text/html; charset=UTF-8'
        self.response.out.write(ADVList)
This code produces a response with strange characters. I've tried the prettify() and renderContents() methods of the BeautifulSoup library, but neither helps.
Any solutions? Thanks again.
I'm a Java developer and I use jsoup for HTML parsing. I found a similar library for Python. This may help you and save your time.
http://www.crummy.com/software/BeautifulSoup/
Food for thought:
Python regular expression for HTML parsing (BeautifulSoup)
I think you are printing the list directly, which calls repr() on each item; the default representation escapes non-ASCII bytes (like \xe1).
you could try this:
>>> s = "Leer más"  # a UTF-8 byte string under Python 2
>>> repr(s)
"'Leer m\\xc3\\xa1s'"
but the print statement writes the raw bytes, which the terminal then decodes:
>>> print s
Leer más
If you want the correct result, avoid the default behavior of the list and handle every item yourself.
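For instance, in the handler from the question you could write each entry out explicitly instead of writing the list object. A minimal sketch (the markup is illustrative; the 'adv' and 'link' keys follow the ADVData dict above):
for item in ADVList:
    self.response.out.write(u'<a href="%s">%s</a><br/>' % (item['link'], item['adv']))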

Download prices with python

I have tried this before and I'm completely at a loss for ideas.
On this page there is a dialog box to get quotes:
http://www.schwab.com/public/schwab/non_navigable/marketing/email/get_quote.html?
I used SPY, XLV, IBM, MSFT.
The output is the same page with a table of quotes.
If you have an account the quotes are real-time, via a cookie.
How do I get the table into Python using 2.6, as a list or dictionary?
Use something like Beautiful Soup to parse the HTML response from the web site and load it into a dictionary: the symbol as the key and a tuple of whatever data you're interested in as the value. Iterate over all the symbols returned and add one entry per symbol; see the sketch below.
You can see examples of how to do this in Toby Segaran's "Programming Collective Intelligence". The samples are all in Python.
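A minimal sketch of that shape, assuming the quotes sit in table rows whose first three cells are symbol, last price and change (the cell layout is an assumption; adjust the indices to the real page):
from bs4 import BeautifulSoup

def quotes_from_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    quotes = {}
    for row in soup.findAll('tr'):
        cells = [td.get_text(strip=True) for td in row.findAll('td')]
        if len(cells) >= 3:
            # key: symbol, value: tuple of the fields you care about
            quotes[cells[0]] = (cells[1], cells[2])
    return quotes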
First problem: the data is actually in an iframe inside a frame; you need to be looking at https://www.schwab.wallst.com/public/research/stocks/summary.asp?user_id=schwabpublic&symbol=APC (where you substitute the appropriate symbol at the end of the URL).
Second problem: extracting the data from the page. I personally like lxml and XPath, but there are many packages which will do the job. I would probably expect some code like:
import urllib2
import lxml.html
import re

re_dollars = r'\$?\s*(\d+\.\d{2})'

def urlExtractData(url, defs):
    """
    Get html from url, parse according to defs, return as dictionary

    defs is a list of tuples ("name", "xpath", "regex", fn)
      name  becomes the key in the returned dictionary
      xpath is used to extract a string from the page
      regex further processes the string (skipped if None)
      fn    casts the string to the desired type (skipped if None)
    """
    page = urllib2.urlopen(url)  # can modify this to include your cookies
    tree = lxml.html.parse(page)
    res = {}
    for name, path, reg, fn in defs:
        txt = tree.xpath(path)[0]
        if reg != None:
            match = re.search(reg, txt)
            txt = match.group(1)
        if fn != None:
            txt = fn(txt)
        res[name] = txt
    return res

def getStockData(code):
    url = 'https://www.schwab.wallst.com/public/research/stocks/summary.asp?user_id=schwabpublic&symbol=' + code
    defs = [
        ("stock_name",   '//span[@class="header1"]/text()', None, str),
        ("stock_symbol", '//span[@class="header2"]/text()', None, str),
        ("last_price",   '//span[@class="neu"]/text()', re_dollars, float)
        # etc
    ]
    return urlExtractData(url, defs)
When called as
print repr(getStockData('MSFT'))
it returns
{'stock_name': 'Microsoft Corp', 'last_price': 25.690000000000001, 'stock_symbol': 'MSFT:NASDAQ'}
Third problem: the markup on this page is presentational, not structural, which says to me that code based on it will likely be fragile; i.e. any change to the structure of the page (or variation between pages) will require reworking your XPaths.
Hope that helps!
Have you thought of using Yahoo's quotes API?
See: http://developer.yahoo.com/yql/console/?q=show%20tables&env=store://datatables.org/alltableswithkeys#h=select%20*%20from%20yahoo.finance.quotes%20where%20symbol%20%3D%20%22YHOO%22
You will be able to dynamically generate a request to the website such as:
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.quotes%20where%20symbol%20%3D%20%22YHOO%22&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys
And just poll it with a standard HTTP GET request. The response is in XML format.
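A hedged sketch of that polling, building the same YQL query and walking the returned XML; the quote element and field names are assumptions taken from YQL's yahoo.finance.quotes table, so dump the raw response once to confirm them:
import urllib
import urllib2
import xml.etree.ElementTree as ET

symbol = 'YHOO'
params = urllib.urlencode({
    'q': 'select * from yahoo.finance.quotes where symbol = "%s"' % symbol,
    'env': 'store://datatables.org/alltableswithkeys',
})
url = 'http://query.yahooapis.com/v1/public/yql?' + params

tree = ET.parse(urllib2.urlopen(url))
for quote in tree.findall('.//quote'):
    # field names like LastTradePriceOnly are assumptions; inspect the XML
    print quote.findtext('LastTradePriceOnly')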
matplotlib has a module that gets historical quotes from Yahoo:
>>> from matplotlib.finance import quotes_historical_yahoo
>>> from datetime import date
>>> from pprint import pprint
>>> pprint(quotes_historical_yahoo('IBM', date(2010, 11, 12), date(2010, 11, 18)))
[(734088.0,
144.59,
143.74000000000001,
145.77000000000001,
143.55000000000001,
4731500.0),
(734091.0,
143.88999999999999,
143.63999999999999,
144.75,
143.27000000000001,
3827700.0),
(734092.0,
142.93000000000001,
142.24000000000001,
143.38,
141.18000000000001,
6342100.0),
(734093.0,
142.49000000000001,
141.94999999999999,
142.49000000000001,
141.38999999999999,
4785900.0)]
