Download prices with Python

I have tried this before. I'm completely at a loss for ideas.
This page has a dialog box to get quotes:
http://www.schwab.com/public/schwab/non_navigable/marketing/email/get_quote.html?
I used SPY, XLV, IBM, MSFT
The output is the page above with a table of quotes.
If you have an account, the quotes are real time (via a cookie).
How do I get the table into Python 2.6, with the data as a list or dictionary?

Use something like Beautiful Soup to parse the HTML response from the web site and load it into a dictionary. Use the symbol as the key and a tuple of whatever data you're interested in as the value. Iterate over all the symbols returned and add one entry per symbol.
You can see examples of how to do this in Toby Segaran's "Programming Collective Intelligence". The samples are all in Python.
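A minimal sketch of the one-entry-per-symbol idea. To stay runnable without third-party packages it swaps in the standard library's html.parser for Beautiful Soup; the sample table, its column layout, the prices, and the names QuoteTableParser and quotes_from_html are all assumptions for illustration, not Schwab's actual markup.

```python
from html.parser import HTMLParser

# Hypothetical quote table; the real page's markup and values will differ.
SAMPLE_HTML = """
<table>
  <tr><th>Symbol</th><th>Last</th><th>Change</th></tr>
  <tr><td>SPY</td><td>120.29</td><td>-0.45</td></tr>
  <tr><td>IBM</td><td>143.74</td><td>0.12</td></tr>
</table>
"""


class QuoteTableParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cell strings per <tr>
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


def quotes_from_html(html):
    """Return {symbol: (last_price, change)} for each data row."""
    parser = QuoteTableParser()
    parser.feed(html)
    quotes = {}
    for row in parser.rows:
        if len(row) == 3:  # the header row has only <th> cells, so it is empty
            symbol, last, change = row
            quotes[symbol] = (float(last), float(change))
    return quotes


quotes = quotes_from_html(SAMPLE_HTML)
print(quotes["IBM"])
```

With Beautiful Soup the parser class would collapse to a couple of find calls, but the dictionary-building loop would look much the same.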

First problem: the data is actually in an iframe in a frame; you need to be looking at https://www.schwab.wallst.com/public/research/stocks/summary.asp?user_id=schwabpublic&symbol=APC (where you substitute the appropriate symbol on the end of the URL).
Second problem: extracting the data from the page. I personally like lxml and xpath, but there are many packages which will do the job. I would probably expect some code like
import urllib2
import lxml.html
import re

re_dollars = r'\$?\s*(\d+\.\d{2})'

def urlExtractData(url, defs):
    """
    Get html from url, parse according to defs, return as dictionary

    defs is a list of tuples ("name", "xpath", "regex", fn)
      name becomes the key in the returned dictionary
      xpath is used to extract a string from the page
      regex further processes the string (skipped if None)
      fn casts the string to the desired type (skipped if None)
    """
    page = urllib2.urlopen(url)  # can modify this to include your cookies
    tree = lxml.html.parse(page)
    res = {}
    for name, path, reg, fn in defs:
        txt = tree.xpath(path)[0]
        if reg is not None:
            match = re.search(reg, txt)
            txt = match.group(1)
        if fn is not None:
            txt = fn(txt)
        res[name] = txt
    return res

def getStockData(code):
    url = 'https://www.schwab.wallst.com/public/research/stocks/summary.asp?user_id=schwabpublic&symbol=' + code
    defs = [
        ("stock_name", '//span[@class="header1"]/text()', None, str),
        ("stock_symbol", '//span[@class="header2"]/text()', None, str),
        ("last_price", '//span[@class="neu"]/text()', re_dollars, float)
        # etc
    ]
    return urlExtractData(url, defs)
When called as
print repr(getStockData('MSFT'))
it returns
{'stock_name': 'Microsoft Corp', 'last_price': 25.690000000000001, 'stock_symbol': 'MSFT:NASDAQ'}
Third problem: the markup on this page is presentational, not structural - which says to me that code based on it will likely be fragile, i.e., any change to the structure of the page (or variation between pages) will require reworking your xpaths.
Hope that helps!

Have you thought of using Yahoo's quotes API?
see: http://developer.yahoo.com/yql/console/?q=show%20tables&env=store://datatables.org/alltableswithkeys#h=select%20*%20from%20yahoo.finance.quotes%20where%20symbol%20%3D%20%22YHOO%22
You will be able to dynamically generate a request to the website such as:
http://query.yahooapis.com/v1/public/yql?q=select%20*%20from%20yahoo.finance.quotes%20where%20symbol%20%3D%20%22YHOO%22&diagnostics=true&env=store%3A%2F%2Fdatatables.org%2Falltableswithkeys
Then just poll it with a standard HTTP GET request. The response is in XML format.
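A rough sketch of generating such a request and reading the XML that comes back. The endpoint, query text, and env value are taken from the URL above; the helper names, the trimmed sample response, and the LastTradePriceOnly element are assumptions for illustration, so check them against what the service actually returns.

```python
from urllib.parse import urlencode, quote
import xml.etree.ElementTree as ET

YQL_ENDPOINT = "http://query.yahooapis.com/v1/public/yql"


def build_quote_url(symbol):
    """Build the YQL request URL for one ticker symbol."""
    query = 'select * from yahoo.finance.quotes where symbol = "%s"' % symbol
    params = {
        "q": query,
        "diagnostics": "true",
        "env": "store://datatables.org/alltableswithkeys",
    }
    # quote_via=quote encodes spaces as %20 rather than "+"
    return YQL_ENDPOINT + "?" + urlencode(params, quote_via=quote)


# Hypothetical, heavily trimmed response; element names are assumed.
SAMPLE_XML = """
<query>
  <results>
    <quote symbol="YHOO">
      <LastTradePriceOnly>16.40</LastTradePriceOnly>
    </quote>
  </results>
</query>
"""


def parse_quotes(xml_text):
    """Return {symbol: last_price} from a YQL-style XML document."""
    root = ET.fromstring(xml_text)
    return {
        q.get("symbol"): float(q.findtext("LastTradePriceOnly"))
        for q in root.iter("quote")
    }


url = build_quote_url("YHOO")
print(url)
print(parse_quotes(SAMPLE_XML))
```

The actual fetch would be a plain GET on the generated URL (urllib.request.urlopen in Python 3, urllib2.urlopen in 2.6), with the body passed to parse_quotes.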

matplotlib has a module that gets historical quotes from Yahoo:
>>> from matplotlib.finance import quotes_historical_yahoo
>>> from datetime import date
>>> from pprint import pprint
>>> pprint(quotes_historical_yahoo('IBM', date(2010, 11, 12), date(2010, 11, 18)))
[(734088.0,
  144.59,
  143.74000000000001,
  145.77000000000001,
  143.55000000000001,
  4731500.0),
 (734091.0,
  143.88999999999999,
  143.63999999999999,
  144.75,
  143.27000000000001,
  3827700.0),
 (734092.0,
  142.93000000000001,
  142.24000000000001,
  143.38,
  141.18000000000001,
  6342100.0),
 (734093.0,
  142.49000000000001,
  141.94999999999999,
  142.49000000000001,
  141.38999999999999,
  4785900.0)]
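Since the question asks for a list or dictionary, here is a small sketch that turns tuples like these into a dictionary keyed by date. It assumes the field order is (date number, open, close, high, low, volume) and that the date number is a proleptic Gregorian ordinal, which fits the output above (734088 is 2010-11-12, the requested start date). The names FIELDS and to_records are made up for the example.

```python
from datetime import date

# First two rows of the output above, with the float noise rounded away.
rows = [
    (734088.0, 144.59, 143.74, 145.77, 143.55, 4731500.0),
    (734091.0, 143.89, 143.64, 144.75, 143.27, 3827700.0),
]

# Assumed field order after the leading date number.
FIELDS = ("open", "close", "high", "low", "volume")


def to_records(quotes):
    """Map each quote tuple to {date: {field: value}}."""
    records = {}
    for row in quotes:
        # The date number is treated as a proleptic Gregorian ordinal.
        day = date.fromordinal(int(row[0]))
        records[day] = dict(zip(FIELDS, row[1:]))
    return records


records = to_records(rows)
print(records[date(2010, 11, 12)]["close"])
```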

Related

Scraping csv file from url with React script

I want to scrape sample_info.csv file from https://depmap.org/portal/download/.
Since the site is rendered by a React script, it is not straightforward to reach the file through an appropriate tag with BeautifulSoup. I approached this from many angles; the one that gave the best results is shown below. It returns the executed script, in which all downloadable files are listed together with other data. My idea then was to strip the tags and store the information as JSON. However, I think there must be some kind of mistake in the data, because it cannot be parsed as JSON.
import json
import requests
from bs4 import BeautifulSoup

url = 'https://depmap.org/portal/download/'
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
all_scripts = soup.find_all('script')
script = str(all_scripts[32])
last_char_index = script.rfind("}]")
first_char_index = script.find("[{")
script_cleaned = script[first_char_index:last_char_index+2]
script_json = json.loads(script_cleaned)
This code gives me an error
JSONDecodeError: Extra data: line 1 column 7250 (char 7249)
I know that my solution might not be elegant, but it took me closest to the goal, i.e. downloading the sample_info.csv file from the website. I'm not sure how to proceed. Are there other options? I tried Selenium, but that will not be feasible for the end user of my script because of the driver path declaration.
It is probably easier in this context to use regular expressions, since the string is invalid JSON.
This RegEx tool (https://pythex.org/) can be useful for testing expressions.
import re
re.findall(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"', script_cleaned)
#[
# ('https://ndownloader.figshare.com/files/26261524', 'CCLE_gene_cn.csv'),
# ('https://ndownloader.figshare.com/files/26261527', 'CCLE_mutations.csv'),
# ('https://ndownloader.figshare.com/files/26261293', 'Achilles_gene_effect.csv'),
# ('https://ndownloader.figshare.com/files/26261569', 'sample_info.csv'),
# ('https://ndownloader.figshare.com/files/26261476', 'CCLE_expression.csv'),
# ('https://ndownloader.figshare.com/files/17741420', 'primary_replicate_collapsed_logfold_change_v2.csv'),
# ('https://gygi.med.harvard.edu/publications/ccle', 'protein_quant_current_normalized.csv'),
# ('https://ndownloader.figshare.com/files/13515395', 'D2_combined_gene_dep_scores.csv')
# ]
Edit: This also works by passing html_content directly (no need for BeautifulSoup).
url = 'https://depmap.org/portal/download/'
html_content = requests.get(url).text
re.findall(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"', html_content)
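To get from the list of matches to the one file the question is after, the pairs can be turned into a lookup by file name. This is only a sketch: the sample string is a hand-made fragment shaped like the script text (the two URLs are copied from the output above), and url_by_filename is a made-up helper name.

```python
import re

# Hypothetical fragment shaped like the script text on the page.
sample = (
    '{"downloadUrl": "https://ndownloader.figshare.com/files/26261569", '
    '"fileName": "sample_info.csv"}, '
    '{"downloadUrl": "https://ndownloader.figshare.com/files/26261476", '
    '"fileName": "CCLE_expression.csv"}'
)

PATTERN = re.compile(r'"downloadUrl": "(.*?)".*?"fileName": "(.*?)"')


def url_by_filename(text):
    """Return {fileName: downloadUrl} for every pair found in text."""
    return {name: url for url, name in PATTERN.findall(text)}


files = url_by_filename(sample)
print(files["sample_info.csv"])
# Fetching the file itself could then be done with, for example:
#   requests.get(files["sample_info.csv"]).content
```

In the real script, html_content would be passed in place of sample.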

Parsing Json in python 3, get email from API

I'm trying to do a little code that gets the emails (and other things in the future) from an API. But I'm getting "TypeError: list indices must be integers or slices, not str" and I don't know what to do about it. I've been looking at other questions here but I still don't get it. I might be a bit slow when it comes to this.
I've also been watching some tutorials on the tube, and done the same as them, but still getting different errors. I run Python 3.5.
Here is my code:
from urllib.request import urlopen
import json, re
# Opens the url for the API
url = 'https://jsonplaceholder.typicode.com/posts/1/comments'
r = urlopen(url)
# This should put the response from API in a Dict
result= r.read().decode('utf-8')
data = json.loads(result)
# This should get all the names from the Dict
for name in data['name']: #TypeError here.
    print(name)
I know that I could regex the text and get the result that I want.
Code for that:
from urllib.request import urlopen
import re
url = 'https://jsonplaceholder.typicode.com/posts/1/comments'
r = urlopen(url)
result = r.read().decode('utf-8')
f = re.findall('"email": "(\w+\S\w+)', result)
print(f)
But that seems like the wrong way to do this.
Can someone please help me understand what I'm doing wrong here?
data is a list of dicts; indexing it with a string key (data['name']) is what raises the TypeError.
The way to go is something like this:
for item in data: # item is {"name": "foo", "email": "foo@mail..."}
    print(item['name'])
    print(item['email'])
@PiAreSquared's comment is correct, just a bit more explanation here:
from urllib.request import urlopen
import json, re
# Opens the url for the API
url = 'https://jsonplaceholder.typicode.com/posts/1/comments'
r = urlopen(url)
# This parses the response from the API; the result is a list of dicts
result= r.read().decode('utf-8')
data = json.loads(result)
# your data is a list of elements
# and each element is a dict object, so you can loop over the data
# to get the dict element, and then access the keys and values as you wish
# see below for some example
for element in data:
    name = element['name']
    email = element['email']
# if you want to get all names, you should do
names = [element['name'] for element in data]
# same to get all emails
emails = [element['email'] for element in data]
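To see the difference without hitting the network, here is a self-contained sketch. The JSON text is a trimmed, made-up stand-in for the API response (same shape, placeholder values).

```python
import json

# Trimmed, hypothetical stand-in for the API response body.
result = """
[
  {"postId": 1, "id": 1, "name": "first comment", "email": "a@example.com"},
  {"postId": 1, "id": 2, "name": "second comment", "email": "b@example.com"}
]
"""

data = json.loads(result)

# The top-level JSON value is an array, so json.loads returns a list.
print(type(data).__name__)

try:
    data["name"]  # indexing a list with a string
except TypeError as exc:
    print("TypeError:", exc)

# Looping over the list gives one dict per comment.
emails = [item["email"] for item in data]
print(emails)
```

If the top-level value were a JSON object instead, json.loads would return a dict and data['name'] would be a normal key lookup.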

Fetch News Data from right Scrollbar using Beautifulsoup

I am using the following webpage https://www.google.com/finance?q=NYSE%3AF&ei=LvflU_itN8zbkgW0i4GABQ
to get the data from the right hand side scroller.
I have attached the screen shot where there is a red arrow marking the segment.
I have used the following code:
def parse():
    mainPage = urllib2.urlopen("https://www.google.com/finance?q=NYSE%3AF&ei=LvflU_itN8zbkgW0i4GABQ")
    lSoupPage = BeautifulSoup(mainPage)
    for index in lSoupPage.findAll("div", {"class" : "jfk-scrollbar"}):
        for item in index.findAll("div", {"class" : "news-item"}):
            print item.a.text.strip()
I am not able to fetch the news-url by doing this. Please help.
The sidebar is loaded over AJAX and is not part of the page itself.
The page has a content id:
cid = lSoupPage.find('link', rel='canonical')['href'].rpartition('=')[-1]
Use this to get the news data:
newsdata = urllib2.urlopen('https://www.google.com/finance/kd?output=json&keydevs=1&recnews=0&cid=' + cid)
Unfortunately, the data returned is not valid JSON; the keys are not using quotes. It is valid ECMAScript, just not valid JSON.
You can either 'repair' this by using a regular expression, or use a lenient parser that accepts ECMAScript object notation.
The latter can be done with the external demjson library:
>>> import demjson
>>> r = requests.get(
>>> data = demjson.decode(r.content)
>>> data.keys()
[u'clusters', u'result_total_articles', u'results_per_page', u'result_end_num', u'result_start_num']
>>> data['clusters'][0]['a'][0]['t']
u'Former Ford Motor Co. CEO joins Google board'
Repairing with a regular expression:
import re
import json
repaired_data = re.sub(r'(?<={|,)\s*(\w+)(?=:)', r'"\1"', broken_data)
data = json.loads(repaired_data)
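A self-contained sketch of the repair approach on a made-up payload with unquoted keys. The broken_data string and the repair helper are assumptions for illustration; the character class [{,] is used in the lookbehind as an equivalent spelling of the {|, alternation above.

```python
import json
import re

# Hypothetical response in the same style: object keys are not quoted.
broken_data = (
    '{clusters:[{a:[{t:"Former Ford Motor Co. CEO joins Google board",'
    'u:"http://example.com/story"}]}],results_per_page:10}'
)

# A bare word that follows "{" or "," and precedes ":" is taken to be a key.
KEY = re.compile(r'(?<=[{,])\s*(\w+)(?=:)')


def repair(text):
    """Wrap bare object keys in double quotes."""
    return KEY.sub(r'"\1"', text)


data = json.loads(repair(broken_data))
print(data["clusters"][0]["a"][0]["t"])
```

The pattern works on the text alone, so a string value that happens to contain something like ", word:" would be altered too. That is one reason a lenient parser can be the safer choice.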

Parsing JSON output using Mechanize and Python Django View

I'm currently doing a site search like site:somedomain.com on Bing using Python and Mechanize.
It submits fine to Bing and returns output that looks like JSON. I can't seem to figure out a good way to further parse the results. Is it JSON?
I'm getting an output like:
Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=478', text='SomeSite - Professor Rating of Louis Scerbo', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=478'), ('h', 'ID=SERP,5105.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=527', text='SomeSite - Professor Rating of Jahan \xe2\x80\xa6', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=527'), ('h', 'ID=SERP,5118.1')])Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', url='http://www.somesite.com/prof.php?pID=645', text='SomeSite - Professor Rating of David Kutzik', tag='a', attrs=[('href', 'http://www.somesite.com/prof.php?pID=645'), ('h', 'ID=SERP,5131.1')])
I want to get all the urls like:
http://www.somesite.com/prof.php?pID=478
http://www.somesite.com/prof.php?pID=527
http://www.somesite.com/prof.php?pID=645
and so on, so the url attribute within
How can I further do this with mechanize within my code? Keep in mind, some urls in the future might look like:
http://www.anothersite.com/dir/dir/dir/send.php?pID=100
Thank you !
Well, mechanize is more of a browser-like package for Python. For parsing HTML/XML I would recommend lxml: you can feed that data to lxml and look for URLs. Another option is to use regular expressions to look for URLs; this approach would be more flexible.
import re
url_regex = re.compile('http:[^\']+')
urls = re.findall(url_regex, html_text)
Edit:
Instead of printing output, pass output (in place of html_text) to re.findall() and then print urls.
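On the printed text from the question, the plain http: pattern also picks up the base_url and the href copies, so a narrower pattern may be more useful. The sketch below works on a shortened copy of that output as a string; the URL_FIELD pattern is an assumption about how the Link objects print, not part of mechanize.

```python
import re

# Shortened copy of the printed Link objects from the question.
output = (
    "Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', "
    "url='http://www.somesite.com/prof.php?pID=478', "
    "text='SomeSite - Professor Rating of Louis Scerbo', tag='a', "
    "attrs=[('href', 'http://www.somesite.com/prof.php?pID=478'), "
    "('h', 'ID=SERP,5105.1')])"
    "Link(base_url=u'http://www.bing.com/search?q=site%3Asomesite.com', "
    "url='http://www.somesite.com/prof.php?pID=527', "
    "text='SomeSite - Professor Rating of Jahan', tag='a', "
    "attrs=[('href', 'http://www.somesite.com/prof.php?pID=527'), "
    "('h', 'ID=SERP,5118.1')])"
)

# The space before "url=" keeps "base_url=" from matching.
URL_FIELD = re.compile(r" url='([^']+)'")

urls = URL_FIELD.findall(output)
print(urls)
```

Because the pattern only looks at the url field, it is not tied to one domain or path depth, so longer URLs such as the anothersite.com example would be captured the same way.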
Using Microsoft's Azure Datamarket API with Python requests, you can request JSON strings directly:
import requests, urllib
q = u'Hello World'
q = urllib.quote(q.encode('utf8'), '')
req = requests.get(
    u'https://api.datamarket.azure.com/Data.ashx/Bing/SearchWeb/Web?$format=JSON&Query=%%27%s%%27' % q,
    auth=('', u'YOUR_API_KEY')
)
# print req.json()
results = req.json()['d']['results']
list_of_urls = [ r['Url'] for r in results]
Depending on your input data you may or may not need the .encode('utf8') part of "q". The "site:xy.com" query should work, too, but I didn't test this. Additionally, we occasionally had some strange encodings returned from Bing ... so we had to re-encode returned URLs like so:
url = r['Url'].encode('latin1')
But those were really special cases ...
You need to register for the Azure API (free of charge) and up to 5000 Bing search requests per month are free: http://datamarket.azure.com/dataset/bing/search
There are several params to fine tune your results: http://datamarket.azure.com/dataset/bing/search#schema

Extracting parts of a webpage with python

So I have a data retrieval/entry project and I want to extract a certain part of a webpage and store it in a text file. I have a text file of urls and the program is supposed to extract the same part of the page for each url.
Specifically, the program copies the legal statute following "Legal Authority:" on pages such as this. As you can see, there is only one statute listed. However, some of the urls also look like this, meaning that there are multiple separated statutes.
My code works for pages of the first kind:
from sys import argv
from urllib2 import urlopen

script, urlfile, legalfile = argv
input = open(urlfile, "r")
output = open(legalfile, "w")

def get_legal(page):
    # this is where Legal Authority: starts in the code
    start_link = page.find('Legal Authority:')
    start_legal = page.find('">', start_link+1)
    end_link = page.find('<', start_legal+1)
    legal = page[start_legal+2: end_link]
    return legal

for line in input:
    pg = urlopen(line).read()
    statute = get_legal(pg)
    output.write(get_legal(pg))
Giving me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names. I've tried something like this:
def get_legal(page):
    # this is where Legal Authority: starts in the code
    end_link = ""
    legal = ""
    start_link = page.find('Legal Authority:')
    while (end_link != '</a> '):
        start_legal = page.find('">', start_link+1)
        end_link = page.find('<', start_legal+1)
        end2 = page.find('</a> ', end_link+1)
        legal += page[start_legal+2: end_link]
        if
            break
    return legal
Since every list of statutes ends with '</a> ' (inspect the source of either of the two links) I thought I could use that fact (having it as the end of the index) to loop through and collect all the statutes in one string. Any ideas?
I would suggest using BeautifulSoup to parse and search your html. This will be much easier than doing basic string searches.
Here's a sample that pulls all the <a> tags found within the <td> tag that contains the <b>Legal Authority:</b> tag. (Note that I'm using requests library to fetch page content here - this is just a recommended and very easy to use alternative to urlopen.)
import requests
from BeautifulSoup import BeautifulSoup

# fetch the content of the page with requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)

# parse the html
html = BeautifulSoup(response.content)

# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})

def fetch_parent_tag(tags):
    # fetch the parent <td> tag of the first <a> tag
    # whose "previous sibling" is the <b>Legal Authority:</b> tag.
    for tag in tags:
        sibling = tag.findPreviousSibling()
        if not sibling:
            continue
        if sibling.getText() == 'Legal Authority:':
            return tag.findParent()

# now, just find all the child <a> tags of the parent.
# i.e. finding the parent of one child, find all the children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')
for tag in tags_you_want:
    print 'statute: ' + tag.getText()
If this isn't exactly what you needed to do, BeautifulSoup is still the tool you likely want to use for sifting through html.
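For comparison with the string-search attempt in the question, here is a standard-library-only sketch that collects every linked statute in the same cell. The sample markup and statute names are hypothetical, and the code assumes the statutes sit in <a> tags between "Legal Authority:" and the closing </td>; inspect the real page source before relying on that.

```python
import re

# Hypothetical markup modelled on the description in the question.
page = (
    '<td><b>Legal Authority:</b> '
    '<a class="pageSubNavTxt" href="#">Statute A</a> ; '
    '<a class="pageSubNavTxt" href="#">Statute B</a> </td>'
    '<td><a href="#">unrelated link</a></td>'
)

ANCHOR_TEXT = re.compile(r'<a\b[^>]*>(.*?)</a>', re.S)


def get_legal(page):
    """Return every statute linked after 'Legal Authority:' in its cell."""
    start = page.find('Legal Authority:')
    if start == -1:
        return []
    end = page.find('</td>', start)  # assumed end of the enclosing cell
    if end == -1:
        end = len(page)
    return [s.strip() for s in ANCHOR_TEXT.findall(page[start:end])]


print(get_legal(page))
```

Slicing the page down to one cell first is what replaces the while loop: once the slice is fixed, a single findall picks up one statute or many.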
They provide XML data over there, see my comment. If you think you can't download that many files (or the other end could dislike so many HTTP GET requests), I'd recommend asking their admins if they would kindly provide you with a different way of accessing the data.
I have done so twice in the past (with scientific databases). In one instance the sheer size of the dataset prohibited a download; they ran a SQL query of mine and e-mailed the results (but had previously offered to mail a DVD or hard disk). In another case, I could have made some million HTTP requests to a web service (and they were OK with that), each fetching about 1k bytes. This would have taken a long time and would have been quite inconvenient: it would have required some error handling, since some of these requests would always time out, and it would have been non-atomic due to paging. I was mailed a DVD.
I'd imagine that the Office of Management and Budget could be similarly accommodating.
