I have been running a cron job on Google App Engine for over a month now without any issues. The job does a variety of things, one being that it uses urllib2 to make a call to retrieve a JSON response from Reddit as well as a few other sites. About two weeks ago I started seeing errors when invoking Reddit, but no errors when invoking the other sites. The error I am receiving is HTTP Error 429.
I have tried executing the same code outside of Google App Engine and do not have any issues. I tried using urlfetch, but receive the same error.
You can see the error when using App Engine's interactive shell with the following code.
import urllib2
data = urllib2.urlopen('http://www.reddit.com/r/Music/.json', timeout=60)
Edit: Not sure why it consistently fails for me and not for others. This is the error that I receive:
>>> import urllib2
>>> data = urllib2.urlopen('http://www.reddit.com/r/Music/.json', timeout=60)
Traceback (most recent call last):
File "/base/data/home/apps/s~shell-27/1.356011914885973647/shell.py", line 267, in get
exec compiled in statement_module.__dict__
File "<string>", line 1, in <module>
File "/base/python27_runtime/python27_dist/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/base/python27_runtime/python27_dist/lib/python2.7/urllib2.py", line 400, in open
response = meth(req, response)
File "/base/python27_runtime/python27_dist/lib/python2.7/urllib2.py", line 513, in http_response
'http', request, response, code, msg, hdrs)
File "/base/python27_runtime/python27_dist/lib/python2.7/urllib2.py", line 438, in error
return self._call_chain(*args)
File "/base/python27_runtime/python27_dist/lib/python2.7/urllib2.py", line 372, in _call_chain
result = func(*args)
File "/base/python27_runtime/python27_dist/lib/python2.7/urllib2.py", line 521, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 429: Unknown
Similar code running outside of App Engine works with no problem:
print urllib2.urlopen('http://www.reddit.com/r/Music/.json').read()
At first I thought it had to do with a timeout, since it was originally working, but since I'm not getting a timeout error but rather this strange HTTPError code, I'm not sure.
Any ideas?
Reddit rate limits the API pretty severely for the default user agent of the Python shell. You need to set a unique user agent with your reddit username in it, like this:
User-Agent: super happy flair bot by /u/spladug
More info about the Reddit API is here: https://github.com/reddit/reddit/wiki/API.
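With plain urllib2 that means building a Request object with an explicit User-Agent header. A minimal sketch, where the bot description is a placeholder you replace with your own username:
import urllib2

# Placeholder bot description; substitute your own reddit username.
req = urllib2.Request('http://www.reddit.com/r/Music/.json',
                      headers={'User-Agent': 'my cron fetcher by /u/yourusername'})
data = urllib2.urlopen(req, timeout=60)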
It's possible that Reddit is counting calls based on IP - which means that other applications on GAE which share your IP might already be exhausting the quota.
This might get better if you use Reddit API keys (I don't know if they issue them) or if they agree to rate limit API calls based on the app header.
The problem begins with this link:
https://i1.pixiv.net/img-zip-ugoira/img/2017/04/05/00/24/41/62259492_ugoira600x600.zip
The file downloaded with a downloader is complete.
And when I try to use Python to download the file:
from urllib import request
import sys
request.urlretrieve('https://i1.pixiv.net/img-zip-ugoira/img/2017/04/05/00/24/41/62259492_ugoira600x600.zip', '123.zip')
Traceback (most recent call last):
File "C:/Users/ssshooter/PycharmProjects/first/111.py", line 3, in <module>
request.urlretrieve('https://i1.pixiv.net/img-zip-ugoira/img/2017/04/05/00/24/41/62259492_ugoira600x600.zip', '123.zip')
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 248, in urlretrieve
with contextlib.closing(urlopen(url, data)) as fp:
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 223, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 532, in open
response = meth(req, response)
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 642, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 570, in error
return self._call_chain(*args)
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 504, in _call_chain
result = func(*args)
File "C:\Users\ssshooter\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 650, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
It doesn't work.
The differences are:
You're using different SSL information: your browser has a built-in set of certificate authorities, while Python uses a set that comes with the OS. They differ, and if the site you're accessing uses one known to your browser but not to Python, Python will throw an exception.
You're accessing the site with different User-Agents. Your browser tells the server it's Chrome or IE or whatever; Python tells the server it's Python. For whatever reason, the server may decide it doesn't like that and return Forbidden.
The server may be working harder than you think: while it appears the request is for a simple file, you're really requesting a resource. It may be (though unlikely in this case) that the resource you're requesting results in multiple interactions between the server and your browser -- cookies, javascript, etc. -- which are executed successfully in your browser and returned to the server, which then delivers the file. Your Python request is not doing any of that.
Your browser may have existing state which your Python code does not. You say you can access the file using your browser, but it could be that this works only because you've accessed other resources on the site, or logged in, or whatever. Your browser is communicating that information (perhaps a session id via cookie?) and the server recognizes it. Your Python code starts with no previous state, so the server forbids the request.
Which is it in this case? You'll need to investigate. Can you get wget or curl to work? Debug your browser's access: what headers are being sent, what are you receiving in reply?
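If the cause turns out to be the User-Agent (or similar headers), here is a minimal Python 3 sketch that sends browser-like headers. The exact header values are assumptions, not something this server is documented to require:
from urllib import request

url = ('https://i1.pixiv.net/img-zip-ugoira/img/2017/04/05/00/24/41/'
       '62259492_ugoira600x600.zip')
req = request.Request(url, headers={
    # Pretend to be a regular browser; the exact string is an assumption.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    # Some image hosts also check where the request came from.
    'Referer': 'https://www.pixiv.net/',
})
with request.urlopen(req) as resp, open('123.zip', 'wb') as out:
    out.write(resp.read())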
This is my error message when running my Python script on a Raspberry Pi:
Traceback (most recent call last):
File "test.py", line 6, in <module>
import appengineauth
File "/home/pi/Downloads/google_appengine/appengineauth.py", line 30, in <module>
auth_resp = urllib2.urlopen(auth_req)
File "/usr/lib/python2.7/urllib2.py", line 154, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python2.7/urllib2.py", line 437, in open
response = meth(req, response)
File "/usr/lib/python2.7/urllib2.py", line 550, in http_response
'http', request, response, code, msg, hdrs)
File "/usr/lib/python2.7/urllib2.py", line 475, in error
return self._call_chain(*args)
File "/usr/lib/python2.7/urllib2.py", line 409, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 558, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 404: Not Found
I'm able to access the website in a browser. Not too sure what the actual problem is.
If you're using https://github.com/adafruit/Tweet-a-Watt/blob/master/appengineauth.py (you don't tell us where you got your appengineauth.py from, thus forcing us to guess), and its line
auth_uri = 'https://www.google.com/accounts/ClientLogin'
then you're likely running into the deprecation documented at https://developers.google.com/identity/protocols/AuthForInstalledApps , and I quote:
Important: ClientLogin has been officially deprecated since April 20, 2012 and is now no longer available. Requests to ClientLogin will fail with a HTTP 404 response. We encourage you to migrate to OAuth 2.0 as soon as possible.
I.e., the 404 you're getting would then be exactly the symptom the warning tells you about, now that ClientLogin has been removed, more than 3.5 years after the original deprecation warning.
Not sure how best to connect your Raspberry Pi to App Engine (or any other Google service requiring authentication) with OAuth 2.0 (since ClientLogin is not an option any more). http://guy.carpenter.id.au/gaugette/2012/11/06/using-google-oauth2-for-devices/ (written shortly after the deprecation but smartly avoiding reliance on the already-deprecated ClientLogin service) recommends an "OAuth2 for Devices" library and summarizes how to use it; I haven't tried that library myself (and I don't have a Raspberry Pi to try it on) but it does seem like a potentially fruitful avenue for you to explore.
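For a rough idea of what that device flow looks like on the wire, here is an untested urllib2 sketch; the endpoint URLs and field names are assumptions based on Google's OAuth 2.0 for devices documentation, so verify them against the current docs before relying on this:
import json
import time
import urllib
import urllib2

# Assumed endpoints and credentials; check Google's current documentation.
DEVICE_URL = 'https://oauth2.googleapis.com/device/code'
TOKEN_URL = 'https://oauth2.googleapis.com/token'
CLIENT_ID = 'your-client-id'          # placeholder
CLIENT_SECRET = 'your-client-secret'  # placeholder

# Step 1: request a code the user enters in a browser on another device.
body = urllib.urlencode({'client_id': CLIENT_ID, 'scope': 'email'})
device = json.load(urllib2.urlopen(DEVICE_URL, body))
print 'Visit %s and enter code %s' % (device['verification_url'],
                                      device['user_code'])

# Step 2: poll the token endpoint while the user authorizes.
while True:
    time.sleep(device['interval'])
    body = urllib.urlencode({
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'device_code': device['device_code'],
        'grant_type': 'urn:ietf:params:oauth:grant-type:device_code',
    })
    try:
        token = json.load(urllib2.urlopen(TOKEN_URL, body))
        print 'Got access token:', token['access_token']
        break
    except urllib2.HTTPError:
        pass  # typically "authorization_pending"; keep polling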
I have learned a lot from MOOCs, so I wanted to give something back to them. For this purpose I was thinking of designing a small app in Kivy, which requires a Python implementation. What I wanted to achieve was to log in to my Coursera account programmatically and collect information about the courses I am currently pursuing. For that I first have to log in to Coursera (https://accounts.coursera.org/signin?post_redirect=https%3A%2F%2Fwww.coursera.org%2F). Upon searching the web I came across this piece of code:
import urllib2, cookielib, urllib
username = "abcdef#abcdef.com"
password = "uvwxyz"
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
login_data = urllib.urlencode({'username' : username, 'password' : password})
info = opener.open("https://accounts.coursera.org/signin",login_data)
for line in info:
    print line
and some similar snippets as well, but none worked for me; every approach led to this type of error:
Traceback (most recent call last):
File "C:\Python27\Practice\web programming\coursera login.py", line 9, in <module>
info = opener.open("https://accounts.coursera.org/signin",login_data)
File "C:\Python27\lib\urllib2.py", line 410, in open
response = meth(req, response)
File "C:\Python27\lib\urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Python27\lib\urllib2.py", line 448, in error
return self._call_chain(*args)
File "C:\Python27\lib\urllib2.py", line 382, in _call_chain
result = func(*args)
File "C:\Python27\lib\urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 404: Not Found
Is the error due to the HTTPS protocol, or is there something that I am missing?
I don't want to use any 3rd party libraries.
I'm using requests for this purpose and I think it is a great Python library. Here is some example code showing how it could work:
import requests
from requests.auth import HTTPBasicAuth
credentials = HTTPBasicAuth('username', 'password')
response = requests.get("https://accounts.coursera.org/signin", auth=credentials)
print response.status_code
# if everything was fine then it prints 200
Here is the link to requests:
http://docs.python-requests.org/en/latest/
I think you need to use the HTTPBasicAuthHandler module of urllib2. Check the 'Basic Authentication' section: https://docs.python.org/2/howto/urllib2.html
And I strongly recommend the requests module. It will make your code better. http://docs.python-requests.org/en/latest/
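For reference, here is a minimal urllib2-only sketch of the HTTPBasicAuthHandler wiring from that howto. Note this is just a sketch: Coursera's signin endpoint appears to be form-based rather than HTTP Basic auth, so this pattern may still not log you in there.
import urllib2

# None as the first argument applies these credentials to any realm.
password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, "https://accounts.coursera.org/",
                          "username", "password")
handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)
response = opener.open("https://accounts.coursera.org/signin")
print response.read()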
After updating from 1.7.5 (where everything worked fine) I'm getting HTTP Error 403: Forbidden when trying to open any site via localhost. The strange thing is that I have pretty much the same setup at home as here at work, and everything works there... It might be an issue with the proxy server we're using at work, since that's the only difference I can think of. Here's the error log I'm getting, so if anyone knows what's going on please help (;
Traceback (most recent call last):
File "U:\Dev\GAE\lib\cherrypy\cherrypy\wsgiserver\wsgiserver2.py", line 1302, in communicate
req.respond()
File "U:\Dev\GAE\lib\cherrypy\cherrypy\wsgiserver\wsgiserver2.py", line 831, in respond
self.server.gateway(self).respond()
File "U:\Dev\GAE\lib\cherrypy\cherrypy\wsgiserver\wsgiserver2.py", line 2115, in respond
response = self.req.server.wsgi_app(self.env, self.start_response)
File "U:\Dev\GAE\google\appengine\tools\devappserver2\wsgi_server.py", line 246, in __call__
return app(environ, start_response)
File "U:\Dev\GAE\google\appengine\tools\devappserver2\request_rewriter.py", line 311, in _rewriter_middleware
response_body = iter(application(environ, wrapped_start_response))
File "U:\Dev\GAE\google\appengine\tools\devappserver2\python\request_handler.py", line 89, in __call__
self._flush_logs(response.get('logs', []))
File "U:\Dev\GAE\google\appengine\tools\devappserver2\python\request_handler.py", line 220, in _flush_logs
apiproxy_stub_map.MakeSyncCall('logservice', 'Flush', request, response)
File "U:\Dev\GAE\google\appengine\api\apiproxy_stub_map.py", line 94, in MakeSyncCall
return stubmap.MakeSyncCall(service, call, request, response)
File "U:\Dev\GAE\google\appengine\api\apiproxy_stub_map.py", line 320, in MakeSyncCall
rpc.CheckSuccess()
File "U:\Dev\GAE\google\appengine\api\apiproxy_rpc.py", line 156, in _WaitImpl
self.request, self.response)
File "U:\Dev\GAE\google\appengine\ext\remote_api\remote_api_stub.py", line 200, in MakeSyncCall
self._MakeRealSyncCall(service, call, request, response)
File "U:\Dev\GAE\google\appengine\ext\remote_api\remote_api_stub.py", line 226, in _MakeRealSyncCall
encoded_response = self._server.Send(self._path, encoded_request)
File "U:\Dev\GAE\google\appengine\tools\appengine_rpc.py", line 393, in Send
f = self.opener.open(req)
File "U:\Dev\Python\lib\urllib2.py", line 410, in open
response = meth(req, response)
File "U:\Dev\Python\lib\urllib2.py", line 523, in http_response
'http', request, response, code, msg, hdrs)
File "U:\Dev\Python\lib\urllib2.py", line 448, in error
return self._call_chain(*args)
File "U:\Dev\Python\lib\urllib2.py", line 382, in _call_chain
result = func(*args)
File "U:\Dev\Python\lib\urllib2.py", line 531, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 403: Forbidden
INFO 2013-04-19 12:28:52,576 server.py:561] default: "GET / HTTP/1.1" 500 -
INFO 2013-04-19 12:28:52,619 server.py:561] default: "GET /favicon.ico HTTP/1.1" 304 -
Also, the launcher throws an error when closing:
Traceback (most recent call last):
File "launcher\mainframe.pyc", line 327, in OnStop
File "launcher\taskcontroller.pyc", line 167, in Stop
File "launcher\dev_appserver_task_thread.pyc", line 82, in stop
File "launcher\taskthread.pyc", line 107, in stop
File "launcher\platform.pyc", line 397, in KillProcess
pywintypes.error: (5, 'TerminateProcess', 'Access is denied.')
I had this very same issue on my Mac OS X machine when using a proxy server with Google App Engine Launcher 1.8.6. Apparently there's an issue with proxy_bypass in urllib2.py.
There are two possible solutions:
Downgrade to 1.7.5, but who wants to downgrade?
Edit "[GAE Instalattion path]/google/appengine/tools/appengine_rpc.py" and look for the line that says
opener.add_handler(fancy_urllib.FancyProxyHandler())
On my computer it was line 578. Put a hash (#) at the beginning of the line, like this:
#opener.add_handler(fancy_urllib.FancyProxyHandler())
Save the file, stop and then restart your application. Now dev_appserver.py shouldn't try to use any proxy server at all.
If your application uses any external resources like a SOAP web service or something like that and you can't reach the server without the proxy server, then you'll have to downgrade. Please keep in mind that external javascript files (like the Facebook SDK or similar) are loaded by your browser, not by your application.
Since I'm not using any external REST or SOAP services it worked for me!
Hopefully it will work for you as well.
Try either:
-Accessing it through a different proxy, i.e. a proxy within a proxy.
-Accessing it through your local IP, e.g. 192.168.1.1.
I faced the same issue with version 1.9.5. Seems that the API proxy is sending some RPCs to the proxy server, which are then being rejected with HTTP 403 (since proxy servers are generally configured to reject connection attempts to arbitrary ports). In my case I was using the urlfetch module in my app to access external web pages, so disabling the proxy server was not a choice for me.
This is how I worked around the issue some time back (most probably it was based on comments found under this issue, but I cannot remember the exact sources).
NOTE:
For this approach to work, you'll have to know the hostname/IP address and default port of your proxy server, and change them appropriately in the code if you happen to connect to a different proxy server.
When you are not behind the proxy server, you will have to revert the applied changes in order to return to a working state (if you want internet access inside your app).
Here it goes:
Disable proxy settings for the Python (Google App Engine Launcher) environment in some way. (In my case it was easy since I was launching the dev_appserver.py from a Terminal shell (on Linux), and the unset http_proxy and unset https_proxy commands did the trick.)
Edit {App Engine SDK root}/google/appengine/api/urlfetch_stub.py. Find the code block
if _CONNECTION_SUPPORTS_TIMEOUT:
  connection = connection_class(host, timeout=deadline)
else:
  connection = connection_class(host)
(lines 376-379 in my case) and replace it with:
if _CONNECTION_SUPPORTS_TIMEOUT:
  if host[:9] == 'localhost' or host[:9] == '127.0.0.1':
    connection = connection_class(host, timeout=deadline)
  else:
    connection = connection_class('your_proxy_host_goes_here', your_proxy_port_number_goes_here, timeout=deadline)
else:
  if host[:9] == 'localhost' or host[:9] == '127.0.0.1':
    connection = connection_class(host)
  else:
    connection = connection_class('your_proxy_host_goes_here', your_proxy_port_number_goes_here)
replacing the placeholders your_proxy_host_goes_here and your_proxy_port_number_goes_here with appropriate values.
(I believe this code can be written more elegantly, though... any Python geeks out there? :) )
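For what it's worth, a slightly tidier way to express the same patch might be the following untested sketch (same placeholders as above):
if host[:9] == 'localhost' or host[:9] == '127.0.0.1':
  args = (host,)
else:
  # Placeholders: substitute your proxy's host and port.
  args = ('your_proxy_host_goes_here', your_proxy_port_number_goes_here)
if _CONNECTION_SUPPORTS_TIMEOUT:
  connection = connection_class(*args, timeout=deadline)
else:
  connection = connection_class(*args)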
In my case, I also had to delete the existing compiled file urlfetch_stub.pyc (located in the same directory as urlfetch_stub.py) because the SDK didn't seem to pick up the changes until I did so.
Now you can use dev_appserver to launch your app, and use urlfetch-backed services within the app, free from HTTP 403 errors.
I have a strange bug when trying to urlopen a certain page from Wikipedia. This is the page:
http://en.wikipedia.org/wiki/OpenCola_(drink)
This is the shell session:
>>> f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
Traceback (most recent call last):
File "C:\Program Files\Wing IDE 4.0\src\debug\tserver\_sandbox.py", line 1, in <module>
# Used internally for debug sandbox under external interpreter
File "c:\Python26\Lib\urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "c:\Python26\Lib\urllib2.py", line 397, in open
response = meth(req, response)
File "c:\Python26\Lib\urllib2.py", line 510, in http_response
'http', request, response, code, msg, hdrs)
File "c:\Python26\Lib\urllib2.py", line 435, in error
return self._call_chain(*args)
File "c:\Python26\Lib\urllib2.py", line 369, in _call_chain
result = func(*args)
File "c:\Python26\Lib\urllib2.py", line 518, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
This happened to me on two different systems in different continents. Does anyone have an idea why this happens?
Wikipedia's stance is:
Data retrieval: Bots may not be used to retrieve bulk content for any use not directly related to an approved bot task. This includes dynamically loading pages from another website, which may result in the website being blacklisted and permanently denied access. If you would like to download bulk content or mirror a project, please do so by downloading or hosting your own copy of our database.
That is why Python is blocked. You're supposed to download data dumps.
Anyways, you can read pages like this in Python 2:
import urllib2
url = 'http://en.wikipedia.org/wiki/OpenCola_(drink)'
req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
con = urllib2.urlopen(req)
print con.read()
Or in Python 3:
import urllib.request
url = 'http://en.wikipedia.org/wiki/OpenCola_(drink)'
req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
con = urllib.request.urlopen(req)
print(con.read())
To debug this, you'll need to trap that exception.
try:
    f = urllib2.urlopen('http://en.wikipedia.org/wiki/OpenCola_(drink)')
except urllib2.HTTPError, e:
    print e.fp.read()
When I print the resulting message, it includes the following:
"English: Our servers are currently experiencing a technical problem. This is probably temporary and should be fixed soon. Please try again in a few minutes."
Often, websites filter access by checking whether they are being accessed by a recognised user agent. Wikipedia is just treating your script as a bot and rejecting it. Try spoofing as a browser. The following link takes you to an article showing you how:
http://wolfprojects.altervista.org/changeua.php
Some websites will block access from scripts to avoid 'unnecessary' usage of their servers by reading the headers urllib sends. I don't know and can't imagine why Wikipedia does/would do this, but have you tried spoofing your headers?
As Jochen Ritzel mentioned, Wikipedia blocks bots.
However, bots will not get blocked if they use the MediaWiki API (api.php).
To get the Wikipedia page titled "love":
http://en.wikipedia.org/w/api.php?format=json&action=query&titles=love&prop=revisions&rvprop=content
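A minimal Python 2 sketch of fetching that URL and parsing the JSON (the User-Agent string is just an example; the query/pages layout is the standard shape of that API's response):
import json
import urllib2

url = ('http://en.wikipedia.org/w/api.php?format=json&action=query'
       '&titles=love&prop=revisions&rvprop=content')
# A descriptive User-Agent keeps the request from looking like a generic bot.
req = urllib2.Request(url, headers={'User-Agent': 'MyWikiReader/0.1 (example)'})
data = json.load(urllib2.urlopen(req))
# Results are keyed by page id under data['query']['pages'].
for page in data['query']['pages'].values():
    print page['title']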
I made a workaround for this using PHP, which is not blocked by the site I needed.
It can be accessed like this:
import urllib2

path = 'http://phillippowers.com/redirects/get.php?file=http://website_you_need_to_load.com'
req = urllib2.Request(path)
response = urllib2.urlopen(req)
vdata = response.read()
This will return the HTML code to you.