Trouble with http request from Google Compute Engine - python

I'm trying to set up a Google Compute Engine server to pull options data using Python Pandas. When I make this request from my Mac at home, I only have problems late at night when Yahoo! is resetting its servers (the data is being pulled from Yahoo! Finance). But when I try doing the same thing from my Compute Engine server, the request always fails for some of the stocks I'm interested in, although it typically works for options on larger companies, such as 'aapl' or 'ge'. On my computer at home, running it at the same time, the same requests succeed for both small and large companies.
The requests typically take a few seconds, maybe as many as 15. Is there a way to get more detailed logs of what is going on when I make these requests on the Google servers? The only causes I can think of are that there is some permissions issue with these specific HTTP requests, or that a configured timeout is interfering. But as far as I can tell, the general timeout should be 75 seconds for that kind of request, and there's no way it's taking that long.
Here's a sample of what I see from the python shell:
>>> from pandas.io.data import Options
>>> spwr = Options('spwr', 'yahoo')
>>> data = spwr.get_all_data()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/disk1/venv/optbot/local/lib/python2.7/site-packages/pandas/io/data.py", line 1090, in get_all_data
    return self._get_data_in_date_range(dates=expiry_dates, call=call, put=put)
  File "/mnt/disk1/venv/optbot/local/lib/python2.7/site-packages/pandas/io/data.py", line 1104, in _get_data_in_date_range
    frame = self._get_option_data(expiry=expiry_date, name=name)
  File "/mnt/disk1/venv/optbot/local/lib/python2.7/site-packages/pandas/io/data.py", line 723, in _get_option_data
    frames = self._get_option_frames_from_yahoo(expiry)
  File "/mnt/disk1/venv/optbot/local/lib/python2.7/site-packages/pandas/io/data.py", line 655, in _get_option_frames_from_yahoo
    option_frames = self._option_frames_from_url(url)
  File "/mnt/disk1/venv/optbot/local/lib/python2.7/site-packages/pandas/io/data.py", line 692, in _option_frames_from_url
    raise RemoteDataError('Received no data from Yahoo at url: %s' % url)
pandas.io.data.RemoteDataError: Received no data from Yahoo at url: http://finance.yahoo.com/q/op?s=SPWR&date=1430438400
>>> aapl = Options('aapl', 'yahoo')
>>> data = aapl.get_all_data()
>>>
I've never yet been successful in getting the options data for 'spwr', but usually it will work for larger companies.
Any ideas how I might fix the issue? Or get to logs that will tell me more about what's happening here?

This is caused by an issue in Pandas 0.15.2. When I reverted to Pandas 0.15.1, it started working again. The issue has been filed with Pandas; check there to see whether it has been resolved in later releases.
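As a quick sanity check before retrying, here is a minimal sketch (assuming the downgrade was done with something like pip install pandas==0.15.1) that confirms the installed version and repeats the call that was failing:

import pandas as pd
print(pd.__version__)  # expect '0.15.1' after the downgrade

from pandas.io.data import Options

# The same request that previously raised RemoteDataError should now return data.
spwr = Options('spwr', 'yahoo')
data = spwr.get_all_data()
print(data.head())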

Related

Trouble getting the trade-price using "Requests-HTML" library

I've written a script in Python to get the last trade price from a JavaScript-rendered webpage. I can get the content if I choose to go with Selenium. My goal here is not to use any browser simulator like Selenium, because the latest release of Requests-HTML is supposed to be able to parse JavaScript-generated content. However, I haven't been able to make it work. When I run the script, I get the following error. Any help on this will be highly appreciated.
Site address : webpage_link
The script I've tried with:
import requests_html

with requests_html.HTMLSession() as session:
    r = session.get('https://www.gdax.com/trade/LTC-EUR')
    js = r.html.render()
    item = js.find('.MarketInfo_market-num_1lAXs', first=True).text
    print(item)
This is the complete traceback:
Exception in callback NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49
handle: <Handle NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49>
Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\asyncio\events.py", line 145, in _run
    self._callback(*self._args)
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 52, in watchdog_cb
    self._timeout)
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 40, in _raise_error
    raise error
concurrent.futures._base.TimeoutError: Navigation Timeout Exceeded: 3000 ms exceeded
Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\experiment.py", line 6, in <module>
    item = js.find('.MarketInfo_market-num_1lAXs',first=True).text
AttributeError: 'NoneType' object has no attribute 'find'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\shutil.py", line 387, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\ar\\.pyppeteer\\.dev_profile\\tmp1gng46sw\\CrashpadMetrics-active.pma'
The price I'm after is shown at the top of the page, like this: "177.59 EUR Last trade price". I wish to get 177.59, or whatever the current price is.
You have several errors. The first is a 'navigation' timeout, showing that the page didn’t complete rendering:
Exception in callback NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49
handle: <Handle NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49>
Traceback (most recent call last):
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\asyncio\events.py", line 145, in _run
    self._callback(*self._args)
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 52, in watchdog_cb
    self._timeout)
  File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 40, in _raise_error
    raise error
concurrent.futures._base.TimeoutError: Navigation Timeout Exceeded: 3000 ms exceeded
This traceback is not raised in the main thread, so your code was not aborted because of it. Your page may or may not be complete; you may want to set a longer timeout or introduce a sleep cycle so the browser has time to process AJAX responses.
Next, the response.html.render() method returns None. It loads the HTML into a headless Chromium browser, leaves JavaScript rendering to that browser, then copies the page HTML back into the response.html data structure in place, so nothing needs to be returned. That means js is set to None, not a new HTML instance, which causes your next traceback.
Use the existing response.html object to search, after rendering:
r.html.render()
item = r.html.find('.MarketInfo_market-num_1lAXs', first=True)
There is most likely no such CSS class, because the last 5 characters are generated on each page render, after JSON data is loaded over AJAX. This makes it hard to use CSS to find the element in question.
Moreover, I found that without a sleep cycle, the browser has no time to fetch AJAX resources and render the information you wanted to load. Give it, say, 10 seconds of sleep to do some work before copying back the HTML. Set a longer timeout (the default is 8 seconds) if you see network timeouts:
r.html.render(timeout=10, sleep=10)
You could set the timeout to 0 too, to remove the timeout and just wait indefinitely until the page has loaded.
Hopefully a future API update also provides features to wait for network activity to cease.
You can use the included parse library to find the matching CSS classes:
# search for CSS suffixes
suffixes = [r[0] for r in r.html.search_all('MarketInfo_market-num_{:w}')]
for suffix in suffixes:
    # for each suffix, find all matching elements with that class
    items = r.html.find('.MarketInfo_market-num_{}'.format(suffix))
    for item in items:
        print(item.text)
Now we get this output:
169.81 EUR
+
1.01 %
18,420 LTC
169.81 EUR
+
1.01 %
18,420 LTC
169.81 EUR
+
1.01 %
18,420 LTC
169.81 EUR
+
1.01 %
18,420 LTC
Your last traceback shows that the Chromium user data path could not be cleaned up. The underlying Pyppeteer library configures the headless Chromium browser with a temporary user data path, and in your case the directory contains some still-locked resource. You can ignore the error, although you may want to try and remove any remaining files in the .pyppeteer folder at a later time.
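A minimal sketch of that manual clean-up, assuming the leftover profiles live under the .pyppeteer\.dev_profile path shown in the traceback:

import glob
import os.path
import shutil

# Remove temporary Chromium profile directories that pyppeteer failed to delete.
profile_root = os.path.expanduser(r'~\.pyppeteer\.dev_profile')
for leftover in glob.glob(os.path.join(profile_root, 'tmp*')):
    shutil.rmtree(leftover, ignore_errors=True)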
Do you need it to go through Requests-HTML? On the day you posted, the repo was 4 days old and in the 3 days that have passed there have been 50 commits. It's not going to be completely stable for some time.
See here:
https://github.com/kennethreitz/requests-html/graphs/commit-activity
OTOH, there is an API for gdax.
https://docs.gdax.com/#market-data
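For instance, a minimal sketch that pulls the last trade price straight from the public market-data endpoint (the endpoint path and the 'price' field here are assumptions based on those docs, not something taken from the answer above):

import requests

# Public ticker for the LTC-EUR product; no authentication is needed for market data.
resp = requests.get('https://api.gdax.com/products/LTC-EUR/ticker')
resp.raise_for_status()
print(resp.json()['price'])  # e.g. '177.59'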
Now if you're dead set on using Py3, there is a python client listed on the GDAX website. Upfront I'll mention that it's the unofficial client; however, if you use this you'd be able to quickly and easily get responses from the official GDAX api.
https://github.com/danpaquin/gdax-python
In case you want to go another way, here is the same scrape done with Selenium:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException

chrome_path = r"C:\Users\Mike\Desktop\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("https://www.gdax.com/trade/LTC-EUR")
item = driver.find_element_by_xpath("//span[@class='MarketInfo_market-num_1lAXs']")
item = item.text
print(item)
driver.close()
Result: 177.60 EUR

Error with Requests using Pandas/BeautifulSoup: requests.exceptions.TooManyRedirects: Exceeded 30 redirects

I'm using Python 3 to scrape the webpages listed in a Pandas data frame I've created from a csv file that contains the source URLs of 63,067 webpages. The for-loop is supposed to scrape news articles for a project and place them into giant text files for cleaning later on.
I'm a bit rusty with Python and this project is the reason I've started programming in it again. I haven't used BeautifulSoup before, so I'm having some difficulty and just got the for-loop to work on the Pandas data frame with BeautifulSoup.
This is for one of the three data sets I'm using (the other two are programmed into the code below to repeat the same process for different data sets, which is why I'm mentioning this).
from bs4 import BeautifulSoup as BS
import requests, csv
import pandas as pd

negativedata = pd.read_csv('negativedata.csv')
positivedata = pd.read_csv('positivedata.csv')
neutraldata = pd.read_csv('neutraldata.csv')

negativedf = pd.DataFrame(negativedata)
positivedf = pd.DataFrame(positivedata)
neutraldf = pd.DataFrame(neutraldata)

negativeURLS = negativedf[['sourceURL']]

for link in negativeURLS.iterrows():
    url = link[1]['sourceURL']
    negative = requests.get(url)
    negative_content = negative.text

    negativesoup = BS(negative_content, "lxml")
    for text in negativesoup.find_all('a', href=True):
        text.append((text.get('href')))
I think I finally got my for-loop to run through all of the source URLs. However, I then get this error:
Traceback (most recent call last):
  File "./datacollection.py", line 18, in <module>
    negative = requests.get(url)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 640, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 640, in <listcomp>
    history = [resp for resp in gen] if allow_redirects else []
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 140, in resolve_redirects
    raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
I know that the issue occurs when I'm requesting the URLs, but I'm not sure which URL (if any single one) is the problem, given the number of webpages in the data frame being iterated through. Is the problem one URL, or do I simply have too many and should use a different package like scrapy?
I would suggest using a module like mechanize for scraping. mechanize has a way of handling robots.txt and is much better if your application is scraping data from URLs of different websites. But in your case, the redirect is probably because the request has no User-Agent in the headers, as mentioned here (https://github.com/requests/requests/issues/3596). And here's how you set headers with requests (Sending "User-agent" using Requests library in Python).
P.S.: mechanize is only available for Python 2.x. If you wish to use Python 3.x, there are other options (Installing mechanize for python 3.4).
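To illustrate that suggestion, here is a minimal sketch that sends a browser-like User-Agent and skips any URL that still redirects endlessly; the header string and the skip-and-continue handling are my own assumptions, not part of the linked issue:

import pandas as pd
import requests
from requests.exceptions import TooManyRedirects

# A browser-like User-Agent; some sites redirect the default python-requests agent in a loop.
headers = {'User-Agent': 'Mozilla/5.0 (compatible; research-scraper/0.1)'}

negativedf = pd.read_csv('negativedata.csv')
for url in negativedf['sourceURL']:
    try:
        negative = requests.get(url, headers=headers, timeout=30)
        negative.raise_for_status()
    except TooManyRedirects:
        print('Skipping redirect loop:', url)
        continue
    except requests.RequestException as exc:
        print('Skipping', url, '->', exc)
        continue
    # ...parse negative.text with BeautifulSoup as before...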

Unexpected query string parameters error when page is not in base directory with Cherrypy

I am currently building a website using CherryPy and have run into an error when attempting to open a page with a query string in the URL. When I request http://localhost:8080/protected/my_page.html?pk=3e4e285c-ed33-403e-a7a1-6b79b5f8356d I get an error:
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\cherrypy\_cprequest.py", line 670, in respond
    response.body = self.handler()
  File "C:\Python27\lib\site-packages\cherrypy\lib\encoding.py", line 221, in __call__
    self.body = self.oldhandler(*args, **kwargs)
  File "C:\Python27\lib\site-packages\cherrypy\_cpdispatch.py", line 66, in __call__
    raise sys.exc_info()[1]
HTTPError: (404, 'Unexpected query string parameters: pk')
I currently host this page in a /protected/ directory which requires authentication. But if I move a copy of the page back to my root dir (/wwwroot) and then attempt to load the exact same URL (minus the /protected/), http://localhost:8080/my_page.html?pk=3e4e285c-ed33-403e-a7a1-6b79b5f8356d, my page loads successfully and all the fields are populated with the data I expect matching my query. That, though, leaves the page available for anyone to access rather than requiring authentication.
The page should load, and if the query string is in the URL it should get the value from this line of Python code: pk = context.get("pk", default = ""). Then a DB lookup is done using the cx_Oracle Python module and the pk variable to find the matching record to display on the page. I tried commenting out the pk variable assignment from my page to see if it would work, but I am greeted with the same error.
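For reference, the flow being described is roughly the following; this is only a hypothetical sketch, with the handler, table, and connection details invented for illustration rather than taken from the actual site:

import cherrypy
import cx_Oracle

class ProtectedPage(object):
    @cherrypy.expose
    def my_page(self, pk=""):
        # pk arrives from the ?pk=... query string; the handler signature has to
        # accept it, otherwise CherryPy's dispatcher rejects the request with
        # "Unexpected query string parameters".
        connection = cx_Oracle.connect("user/password@dsn")  # hypothetical credentials
        cursor = connection.cursor()
        cursor.execute("SELECT * FROM records WHERE pk = :pk", pk=pk)
        row = cursor.fetchone()
        return "Matching record: %r" % (row,)

cherrypy.quickstart(ProtectedPage(), '/protected')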
Besides just moving the page to my base directory to make it work, does anyone have any idea how to resolve this error?

Download xls file from url

I am unable to download an xls file from a URL. I have tried with both urlopen and urlretrieve, but I receive a really long error message starting with:
Traceback (most recent call last):
  File "C:/Users/Henrik/Documents/Development/Python/Projects/ImportFromWeb.py", line 6, in <module>
    f = ur.urlopen(dls)
  File "C:\Users\Henrik\AppData\Local\Programs\Python\Python35\lib\urllib\request.py", line 163, in urlopen
    return opener.open(url, data, timeout)
and ending with:
urllib.error.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
Unfortunately I can't provide the URL I am using since the data is sensitive. However, I will give you the URL with some parts removed.
https://xxxx.xxxx.com/xxxxlogistics/w/functions/transportinvoicelist?0-8.IBehaviorListener.2-ListPageForm-table-TableForm-exportToolbar-xlsExport&antiCache=1477160491504
As you can see, the URL doesn't end with "/file.xls", for example. I don't know if that matters, but most of the threads regarding this issue have had those types of links.
If I enter the URL in my address bar, the file download window appears:
Image of download window
The code I have written looks like this:
import urllib.request as ur
import openpyxl as pyxl
dls = 'https://xxxx.xxxx.com/xxxxlogistics/w/functions/transportinvoicelist?0-8.IBehaviorListener.2-ListPageForm-table-TableForm-exportToolbar-xlsExport&antiCache=1477160491504'
f = ur.urlopen(dls)
I am grateful for any help you can provide!

HTTPError with example biopython code querying pubmed

I want to query PubMed through Python. I found a nice biology-related library to do this:
http://biopython.org/DIST/docs/tutorial/Tutorial.html
I found some example code here:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc116
from Bio import Entrez

Entrez.email = "A.N.Other@example.com"
handle = Entrez.egquery(term="orchid")
record = Entrez.read(handle)
for row in record["eGQueryResult"]:
    if row["DbName"] == "pubmed":
        print row["Count"]
When I change the email and run this code I get the following error:
Traceback (most recent call last):
  File "pubmed.py", line 15, in <module>
    handle = Entrez.egquery(term=my_query)
  File "/usr/lib/pymodules/python2.7/Bio/Entrez/__init__.py", line 299, in egquery
    return _open(cgi, variables)
  File "/usr/lib/pymodules/python2.7/Bio/Entrez/__init__.py", line 442, in _open
    raise exception
urllib2.HTTPError: HTTP Error 404: Not Found
There is not much of a lead to the source of the problem. I don't know what URL it is trying to access. When I search "pubmed entrez urllib2.HTTPError: HTTP Error 404: Not Found", I get 8 results, none of which are related (aside from this thread).
The example works for me. It looks like it was a temporary NCBI issue, although the "Error 404" is quite unusual and not typical of the network problems I have seen with Entrez. In general with any network resource, give it a few hours or a day before worrying that something has broken.
There is also an Entrez Utilities announcement mailing list you may wish to subscribe to, although if there was a planned service outage recently it was not mentioned here:
http://www.ncbi.nlm.nih.gov/mailman/listinfo/utilities-announce
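If it is indeed a transient outage, a small retry loop is one way to ride it out. The sketch below (the retry count and delay are arbitrary assumptions, written in the Python 2 style of the question) simply repeats the query a few times before giving up:

import time
from urllib2 import HTTPError

from Bio import Entrez

Entrez.email = "A.N.Other@example.com"

# Try the query a few times, pausing between attempts, in case NCBI is
# temporarily unavailable.
handle = None
for attempt in range(3):
    try:
        handle = Entrez.egquery(term="orchid")
        break
    except HTTPError as err:
        print "attempt %d failed: %s" % (attempt + 1, err)
        time.sleep(60)

if handle is not None:
    record = Entrez.read(handle)
    for row in record["eGQueryResult"]:
        if row["DbName"] == "pubmed":
            print row["Count"]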
