I am trying to use Python to log in to a website and gather information from several webpages, and I get the following error:
Traceback (most recent call last):
  File "extract_test.py", line 43, in <module>
    response=br.open(v)
  File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 203, in open
    return self._mech_open(url, data, timeout=timeout)
  File "/usr/local/lib/python2.7/dist-packages/mechanize/_mechanize.py", line 255, in _mech_open
    raise response
mechanize._response.httperror_seek_wrapper: HTTP Error 429: Unknown Response Code
I used time.sleep() and it works, but it seems unintelligent and unreliable. Is there any other way to dodge this error?
Here's my code:
import mechanize
import cookielib
import re
first=("example.com/page1")
second=("example.com/page2")
third=("example.com/page3")
fourth=("example.com/page4")
## I have seven URL's I want to open
urls_list=[first,second,third,fourth]
br = mechanize.Browser()
# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
# Log in credentials
br.open("example.com")
br.select_form(nr=0)
br["username"] = "username"
br["password"] = "password"
br.submit()
for url in urls_list:
    response = br.open(url)
    # re.findall needs both a pattern and the text to search
    print re.findall("Some String", response.read())
Receiving a status 429 is not an error; it is the other server "kindly" asking you to please stop spamming requests. Evidently, your rate of requests has been too high and the server is not willing to accept this.
You should not seek to "dodge" this, or even try to circumvent server security settings by trying to spoof your IP; you should simply respect the server's answer by not sending too many requests.
If everything is set up properly, you will also have received a "Retry-After" header along with the 429 response. This header specifies how long you should wait before making another call, usually as a number of seconds (it may also be an HTTP date). The proper way to deal with this "problem" is to read this header and to sleep your process for that long.
You can find more information on status 429 here: https://www.rfc-editor.org/rfc/rfc6585#page-3
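As a minimal sketch of "read the header and sleep" (not taken from the original answer), the code below uses only the standard library. The helper names retry_after_seconds and open_politely, the default delay and the attempt limit are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(value, default=1.0, now=None):
    """Turn a Retry-After header value into a non-negative delay in seconds.

    Hypothetical helper: accepts either delay-seconds ("120") or an HTTP date.
    """
    if value is None:
        return default
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Unparseable header: fall back to the default delay.
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def open_politely(url, max_attempts=5):
    """Open url, sleeping as instructed whenever the server answers 429."""
    for _ in range(max_attempts):
        try:
            return urllib.request.urlopen(url, timeout=10)
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise
            time.sleep(retry_after_seconds(err.headers.get("Retry-After")))
    raise RuntimeError("still rate limited after %d attempts" % max_attempts)
```

The same retry_after_seconds helper could be fed response.headers.get("Retry-After") from whichever HTTP library is in use.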
Writing this piece of code when requesting fixed my problem:
requests.get(link, headers = {'User-agent': 'your bot 0.1'})
This works because sites sometimes return a Too Many Requests (429) error when there isn't a user agent provided. For example, Reddit's API only works when a user agent is applied.
As MRA said, you shouldn't try to dodge a 429 Too Many Requests but instead handle it accordingly. You have several options depending on your use-case:
1) Sleep your process. The server usually includes a Retry-After header in the response with the number of seconds you are supposed to wait before retrying. Keep in mind that sleeping a process might cause problems, e.g. in a task queue, where you should instead retry the task at a later time to free up the worker for other things.
2) Exponential backoff. If the server does not tell you how long to wait, you can retry your request using increasing pauses in between. The popular task queue Celery has this feature built right in.
3) Token bucket. This technique is useful if you know in advance how many requests you are able to make in a given time. Each time you access the API you first fetch a token from the bucket. The bucket is refilled at a constant rate. If the bucket is empty, you know you'll have to wait before hitting the API again. Token buckets are usually implemented on the other end (the API) but you can also use them as a proxy to avoid ever getting a 429 Too Many Requests. Celery's rate_limit feature uses a token bucket algorithm.
Here is an example of a Python/Celery app using exponential backoff and rate-limiting/token bucket:
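import requests
from celery import shared_task as task  # assumption: Celery 4.2+ (needed for retry_backoff)
from requests.exceptions import ConnectTimeout

class TooManyRequests(Exception):
    """Too many requests"""

@task(
    rate_limit='10/s',
    autoretry_for=(ConnectTimeout, TooManyRequests,),
    retry_backoff=True)
def api(*args, **kwargs):
    r = requests.get('placeholder-external-api')
    # Raising here lets Celery retry the task with exponential backoff
    if r.status_code == 429:
        raise TooManyRequests()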
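For readers not using Celery, options 2 and 3 can also be sketched in plain Python. This is only an illustration under stated assumptions: TokenBucket and backoff_delays are hypothetical names, and the rate, capacity and delay values are placeholders rather than limits of any real API.

```python
import time


class TokenBucket:
    """Client-side token bucket: `rate` tokens per second, at most `capacity` stored."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def _refill(self):
        # Add the tokens earned since the last check, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now

    def try_acquire(self):
        """Take one token if available; return False if the caller should wait."""
        self._refill()
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

    def wait_time(self):
        """Seconds until the next token becomes available (0 if one is ready)."""
        self._refill()
        return max(0.0, (1 - self.tokens) / self.rate)


def backoff_delays(base=1.0, factor=2.0, cap=60.0, attempts=5):
    """Yield exponentially growing pauses: base, base*factor, ... limited by cap."""
    for attempt in range(attempts):
        yield min(cap, base * factor ** attempt)
```

A caller would check try_acquire() before each request and sleep for wait_time() when it returns False; after a 429 without a Retry-After header, it would sleep for the next value from backoff_delays() before retrying.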
if response.status_code == 429:
    time.sleep(int(response.headers["Retry-After"]))
Another workaround would be to spoof your IP using some sort of public VPN or the Tor network. This assumes the server applies rate limiting at the IP level.
There is a brief blog post demonstrating a way to use Tor along with urllib2:
http://blog.flip-edesign.com/?p=119
I've found a nice workaround for IP blocking when scraping sites. It lets you run a scraper indefinitely by running it from Google App Engine and redeploying it automatically when you get a 429.
Check out this article
In many cases, continuing to scrape data from a website even when the server is requesting you not to is unethical. However, in the cases where it isn't, you can utilize a list of public proxies in order to scrape a website with many different IP addresses.
Related
I want to fetch the source of some websites for a project. When I try to get a response, the program just hangs and waits. No matter how long I wait, there is no timeout and no response. Here is my code:
import urllib.request

link = "https://eu.mouser.com/"
linkResponse = urllib.request.urlopen(link)
readedResponse = linkResponse.readlines()
writer = open("html.txt", "w")
for line in readedResponse:
    writer.write(str(line))
    writer.write("\n")
writer.close()
When I try other websites, urlopen returns their response. But when I try "eu.mouser.com" and "uk.farnell.com", it never returns. I would be happy to skip them, but urlopen does not even time out. What is the problem here? Is there another way to fetch the websites' sources? (Sorry for my bad English.)
The urllib.request.urlopen docs state that
The optional timeout parameter specifies a timeout in seconds for
blocking operations like the connection attempt (if not specified, the
global default timeout setting will be used). This actually only works
for HTTP, HTTPS and FTP connections.
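without explaining how to find said default. (It is whatever socket.getdefaulttimeout() returns, which is None, i.e. wait indefinitely, unless it has been changed.) I managed to provoke a timeout by directly passing 5 (seconds) as the timeout: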
import urllib.request
url = "https://uk.farnell.com"
urllib.request.urlopen(url, timeout=5)
gives
socket.timeout: The read operation timed out
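Building on that, here is a rough sketch of how a script could skip sites that never answer instead of hanging. fetch_source is a hypothetical helper, and the 5-second default is just the value used above, not a recommendation.

```python
import socket
import urllib.error
import urllib.request


def fetch_source(url, timeout=5):
    """Return the page body as text, or None if the site cannot be fetched in time.

    fetch_source is a hypothetical helper, not part of urllib.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except socket.timeout:
        # Raised when the server accepts the connection but does not reply in time.
        return None
    except urllib.error.URLError:
        # Raised for DNS failures, refused connections, timeouts while connecting,
        # and HTTP error statuses (HTTPError is a subclass of URLError).
        return None
```

A caller can then loop over its URLs and simply ignore every None result.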
There are some sites that protect themselves from automated crawlers by implementing mechanisms that detect such bots. These can be very diverse and also change over time. If you really want to do everything you can to get the page crawled automatically, this usually means that you have to implement steps yourself to circumvent these protective barriers.
One example of this is the header information that is sent with every request. This can be changed before making the request, e.g. via requests' header customization. But there are probably more things to do here and there.
If you're interested in starting developing such a thing (leaving aside the question of whether this is allowed at all), you can take this as a starting point:
from collections import namedtuple
from contextlib import suppress

import requests
from requests import ReadTimeout

Link = namedtuple("Link", ["url", "filename"])

links = {
    Link("https://eu.mouser.com/", "mouser.com"),
    Link("https://example.com/", "example1.com"),
    Link("https://example.com/", "example2.com"),
}

for link in links:
    with suppress(ReadTimeout):
        response = requests.get(link.url, timeout=3)
        with open(f"html-{link.filename}.txt", "w", encoding="utf-8") as file:
            file.write(response.text)
where protected sites that lead to ReadTimeout errors are simply ignored, with the possibility to go further, e.g. by passing a suitable headers parameter to requests.get(link.url, timeout=3). But as I already mentioned, this is probably not the only customization that has to be done, and the legal aspects should also be clarified.
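To make the idea of a "suitable headers parameter" a little more concrete, here is a minimal standard-library sketch. The header values are illustrative assumptions (they are not known to satisfy any particular site), and with_browser_headers is a hypothetical helper.

```python
import urllib.request

# Illustrative values only; which headers a given site checks is an assumption.
BROWSER_LIKE_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) my-crawler/0.1",
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-GB,en;q=0.8",
}


def with_browser_headers(url, extra=None):
    """Build a Request carrying the default headers, optionally overridden by `extra`."""
    headers = dict(BROWSER_LIKE_HEADERS)
    headers.update(extra or {})
    return urllib.request.Request(url, headers=headers)
```

The resulting object can be passed to urllib.request.urlopen together with a timeout; the same dictionary could equally be handed to requests.get through its headers argument.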
Hey! I am new here, so let me describe my issue clearly. Please ignore any mistakes.
I am making a request to a page which relies heavily on JS.
Actually, it is the Paytm payment response page for UPI.
Whenever I make the request, the response is {'POLL_STATUS':"STOP_POLLING"}.
But the problem is that my request gets this response while the browser gets a different response with loaded HTML.
I tried everything, like disabling redirects and printing the raw content; nothing works.
I think maybe a urllib POST request might work, but I do not know how to use it.
Can anyone please tell me how to get the exact HTML response that the browser gets?
Note[0]: Please don't suggest Selenium, because this issue occurs in the middle of my script.
Note[1]: Friendly answers appreciated.
for i in range(0,15):
    resp_check_transaction=self.s.post("https://secure.website.in/theia/upi/transactionStatus?MID="+str(Merchant_ID)+"&ORDER_ID="+str(ORDER_ID),headers=check_transaction(str(ORDER_ID)),data=check_transaction_payload(Merchant_ID,ORDER_ID,TRANSID,CASHIERID))
    print(resp_check_transaction.text)
    resp_check_transaction=resp_check_transaction.json()
    if resp_check_transaction['POLL_STATUS']=="STOP_POLLING":
        print("Breaking loop")
        break
    time.sleep(4)
self.clear_header()
parrms={
    "MID": str(Merchant_ID),
    "ORDER_ID": str(ORDER_ID)
}
resp_transaction_pass=requests.post("https://secure.website.in/theia/upi/transactionStatus",headers=transaction_pass(str(ORDER_ID)),data=transaction_pass_payload(CASHIERID,UPISTATUSURL,Merchant_ID,ORDER_ID,TRANSID,TXN_AMOUNT),params=parrms,allow_redirects=True)
print("Printing response")
print(resp_transaction_pass.text)
print(resp_transaction_pass.content)
And in the web browser, the bank response shows Status Code: 302 Moved Temporarily. :(
About the 302 status code
You mention that the web browser shows a 302 status code in response to the request. In the simplest terms, the 302 status code is just the web server's way of saying "Hey, I know what you're looking for, but it is actually located at this other URL."
Basically all modern browsers and HTTP request libraries like Python's Requests will automatically follow a 302 redirect and act as though you sent the request to the new URL instead. (Your browser's developer tools may show that a 302 redirect has happened, but as far as the JavaScript is concerned it just got a normal 200 response.)
If you really want to see if your Python script receives a 302 status you can do so by setting the allow_redirects option to False, but this means you will manually have to get the stuff from the new URL.
import requests
r1 = requests.get('https://httpstat.us/302', allow_redirects=False)
r2 = requests.get('https://httpstat.us/302', allow_redirects=True)
print('No redirects:', r1.status_code) # 302
print('Redirects on:', r2.status_code) # 200 (status code of page it redirects to)
Note that allow_redirects is already set to True by default, I just wanted to make the example a bit more verbose so the difference is obvious.
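If you do turn redirects off, "manually getting the stuff from the new URL" mostly means reading the Location header and resolving it against the URL you requested. A minimal sketch of that step, where redirect_target is a hypothetical helper and headers is assumed to be a mapping such as the headers attribute of a Requests response:

```python
from urllib.parse import urljoin

REDIRECT_CODES = {301, 302, 303, 307, 308}


def redirect_target(request_url, status_code, headers):
    """Return the absolute URL a redirect response points to, or None.

    Hypothetical helper: Location may be relative, so it is resolved
    against the URL that was originally requested.
    """
    if status_code not in REDIRECT_CODES:
        return None
    location = headers.get("Location")
    if not location:
        return None
    return urljoin(request_url, location)
```

With Requests this could be called as redirect_target(r1.url, r1.status_code, r1.headers), followed by a second request to the returned URL.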
So why is the response content different?
So even though the browser and the Requests library both follow the 302 redirect automatically, the responses they get still differ. You didn't share any screenshots of the browser's requests or responses, so I can only give a few educated guesses, but it boils down to the fact that the request made by your Python code is somehow different from the one made by the JavaScript loaded in the web browser.
Some things to consider:
Are you sure you are using the correct HTTP method? Is the browser also making a POST request?
If so, are you sure the body of the request is the same as, and in the same format as, the one sent by the web browser?
Perhaps the browser has a session cookie it is sending along with the request (note that this is usually not explicit in the JS but happens automatically).
Alternatively the JS might include some API key/credentials in the HTTP auth header (this should be explicitly visible in JS).
Although unlikely it could be that whatever API you're trying to query is trying to block reverse engineering attempts by blocking the Requests library's user agent string.
Luckily all of these differences can be easily examined with some print statements and your browser's developer tools :p.
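One way to do that comparison systematically is sketched below. diff_headers is a hypothetical helper; it assumes you paste the request headers shown in the browser's developer tools into a dict and compare them with the headers your script actually sent (with Requests, these are available as response.request.headers).

```python
def diff_headers(browser, script):
    """Compare two header mappings case-insensitively.

    Returns which header names the script is missing, which it sends in
    addition, and which are present in both but with different values.
    """
    b = {name.lower(): value for name, value in browser.items()}
    s = {name.lower(): value for name, value in script.items()}
    return {
        "missing": sorted(set(b) - set(s)),
        "extra": sorted(set(s) - set(b)),
        "different": sorted(k for k in set(b) & set(s) if b[k] != s[k]),
    }
```

Missing cookies or authorization headers, or a different content type, are the kind of thing this would surface quickly.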
I have a Django App that accepts messages from a remote device as a POST message.
This fits well with Django's framework! I used the generic View class (from django.views import View) and defined my own POST function.
But the remote device requires a special reply that I cannot generate in Django (yet). So, I use the Requests library to re-create the POST message and send it up to the manufacturer's cloud server.
That server processes the data and responds with the special message in the body. Ideally, the entire HTTP response message should go back to the remote device. If the device does not get a valid reply, it will re-send the message, which would be annoying!
I've been googling, but am having trouble getting a clear picture on how to either:
(a): Reply back in Django with the Requests.response object without any edits.
(b): Build a Django response and send it back.
Actually, I think I do know how to do (b), but it's work. I would rather do (a) if it's possible.
Thanks in Advance!
Rich.
Thanks for the comments and questions!
The perils of late night programming: you might over-think something, or miss the obvious. I was so focused on finding a way to return the request.response without any changes/edits I did not even sketch out what option (b) would be.
Well, it turns out it's pretty simple:
s = Session()
# Populate POST to cloud with data from remote device request:
req = Request('POST', url, data=data, headers=headers)
prepped = req.prepare()
timeout = 10
retries = 3
while retries > 0:
    try:
        logger.debug("POST data to remote host")
        resp = s.send(prepped, timeout=timeout)
        break
    except Exception:
        logger.debug("remote host connection failed, retry")
        retries -= 1
        logger.debug("retries left: %d", retries)
        time.sleep(.3)
if retries == 0:
    # There isn't anything I can do if this fails repeatedly; resp was never
    # set, so give up and let the remote device re-send.
    return HttpResponse(status=502)
# Build reply to remote device:
r = HttpResponse(resp.content,
    content_type = resp.headers['Content-Type'],
    status = resp.status_code,
    reason = resp.reason,
    )
r['Server'] = resp.headers['Server']
r['Connection'] = resp.headers['Connection']
logger.debug("Returning Server response to remote device")
return r
The Session "s" allows one to use "prepped" and "send", which lets one monkey with the request object before it's sent and retry the send. I think at least some of it can be removed in a refactor, making this process even simpler.
There are 3 HTTP objects at play here:
"req" is the POST I send up to the cloud server to get back a special (encrypted) reply.
"resp" is the reply back from the cloud server. The body (.content) contains the special reply.
"r" is the Django HTTP response I need to send back to the remote device that started this ball rolling by POSTing data to my view.
It's pretty simple to populate the response with the data, and set headers to the values returned by the cloud server.
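If more of the cloud server's headers need to be relayed, one possible refinement (a sketch, not what the code above does) is to copy everything except headers that only make sense for a single connection or for the original encoded body. forwardable_headers is a hypothetical helper; the hop-by-hop list follows the HTTP/1.1 specification.

```python
# Hop-by-hop headers describe one connection and are not meant to be relayed.
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailers", "transfer-encoding", "upgrade",
}
# requests hands back an already-decoded body in resp.content, so these two
# may no longer describe the bytes being relayed.
BODY_SPECIFIC = {"content-encoding", "content-length"}


def forwardable_headers(upstream_headers):
    """Return the subset of upstream headers that can be copied to the reply."""
    skip = HOP_BY_HOP | BODY_SPECIFIC
    return {name: value for name, value in upstream_headers.items()
            if name.lower() not in skip}
```

In the view this could be used as: for name, value in forwardable_headers(resp.headers).items(): r[name] = value. Whether the remote device actually needs the Connection header copied, as the code above does, is something only testing against the hardware can answer.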
I know this works because the remote device does not POST the same data twice! If there was a mistake anyplace in this process, it would re-send the same data over and over. I copied the While/try loop from a Socket repeater module. I don't know if that is really applicable to HTTP. I have been testing this on live hardware for over 48 hours and so far it has never failed. Timeouts are a question mark too, in that I know the remote device and cloud server have strict limits. So if there is an error in my "repeater", re-trying may not work if the process takes too long. It might be better to just discard/give up on the current POST. And wait for the remote device to re-try. Sorry, refactoring out loud...
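The "give up and let the device retry" idea could look roughly like the sketch below: the retry loop is bounded both by a number of attempts and by an overall deadline. send_with_deadline is a hypothetical helper and the numbers are placeholders; the real limits of the device and cloud server are not known here. It retries on OSError, which covers the exceptions raised by Requests since RequestException derives from IOError.

```python
import time


def send_with_deadline(send, attempts=3, pause=0.3, deadline=5.0,
                       retry_on=(OSError,), clock=time.monotonic,
                       sleep=time.sleep):
    """Call send() until it succeeds, attempts run out, or the deadline passes.

    Returns the result of send(), or None if every allowed attempt failed.
    """
    start = clock()
    for _ in range(attempts):
        if clock() - start >= deadline:
            # Too late for the reply to be useful; stop retrying.
            break
        try:
            return send()
        except retry_on:
            sleep(pause)
    return None
```

In the view above this could be called as send_with_deadline(lambda: s.send(prepped, timeout=timeout)), returning an error response when the result is None.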
Update:
I switched this back from answered as I tried the solution posed in cogent Nick's answer and switched to Google's urlfetch:
logging.debug("starting urlfetch for http://%s%s" % (self.host, self.url))
result = urlfetch.fetch("http://%s%s" % (self.host, self.url), payload=self.body, method="POST", headers=self.headers, allow_truncated=True, deadline=5)
logging.debug("finished urlfetch")
but unfortunately "finished urlfetch" is never printed. I see the timeout happen in the logs (it returns 200 after 5 seconds), but execution doesn't seem to return.
Hi All-
I'm attempting to play around with Twitter's Streaming (aka firehose) API with Google App Engine (I'm aware this probably isn't a great long term play as you can't keep the connection perpetually open with GAE), but so far I haven't had any luck getting my program to actually parse the results returned by Twitter.
Some code:
logging.debug("firing up urllib2")
req = urllib2.Request(url="http://%s%s" % (self.host, self.url), data=self.body, headers=self.headers)
logging.debug("called urlopen for %s %s, about to call urlopen" % (self.host, self.url))
fobj = urllib2.urlopen(req)
logging.debug("called urlopen")
When this executes, unfortunately, my debug output never shows the called urlopen line printed. I suspect what's happening is that Twitter keeps the connection open and urllib2 doesn't return because the server doesn't terminate the connection.
Wireshark shows the request being sent properly and a response returned with results.
I tried adding Connection: close to my request header, but that didn't yield a successful result.
Any ideas on how to get this to work?
urllib on App Engine is a thin wrapper around the urlfetch API. You're right about what's happening: Twitter's streaming API never terminates its response, so it times out, and urlfetch throws an exception.
If you use urlfetch directly, you can set the timeout (up to 10 seconds), and set allow_truncated to True so you can get the partial result. The Twitter streaming API really isn't a good match for App Engine, though, because App Engine requests are limited to 30 seconds of execution time, and urlfetch requests can't send back results progressively, or take more than 10 seconds. Using Twitter's 'standard' API would be a better option.
InstaMapper is a GPS tracking service that updates the device's position more frequently when the device is being tracked live on the InstaMapper webpage. I'd like to have this happen all the time, so I thought I'd write a Python script to log in to my account and access the page periodically.
import urllib2, urllib, cookielib
cj = cookielib.LWPCookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
urllib2.install_opener(opener)
params = urllib.urlencode(dict(username_hb='user', password_hb='hunter2'))
opener.open('http://www.instamapper.com/fe?action=login', params)
if 'id' not in [cookie.name for cookie in cj]:
    raise ValueError, "Login failed"
# try secured page
resp = opener.open('http://www.instamapper.com/fe?page=track&device_key=abc')
print resp.read()
resp.close()
The ValueError is raised each time. If I remove this and read the response, the page thinks I have disabled cookies and blocks access to that page. Why isn't cj grabbing the InstaMapper cookie?
Are there better ways to make the tracking service think I'm viewing my account constantly?
action=login is part of the parameters, and should be treated accordingly:
params = urllib.urlencode(dict(action='login', username_hb='user', password_hb='hunter2'))
opener.open('http://www.instamapper.com/fe', params)
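For anyone on Python 3, where urllib2 and cookielib no longer exist, the same fix could be sketched as follows. The form field names come from the question; make_opener and login_request are hypothetical helpers, and whether the site accepts this exact form is an assumption.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_opener():
    """Return an opener that stores cookies, plus the jar it stores them in."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def login_request(url, username, password):
    """Build the login POST with action=login in the form body, not the URL."""
    fields = {"action": "login", "username_hb": username, "password_hb": password}
    body = urllib.parse.urlencode(fields).encode("ascii")
    # Supplying data makes urllib send a POST request.
    return urllib.request.Request(url, data=body)
```

Usage would be opener.open(login_request('http://www.instamapper.com/fe', 'user', 'hunter2')), after which the jar can be checked for the session cookie.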
(Also, this particular username/password combination is invalid. I assume that you use a valid username and password in your actual code; otherwise the login fails correctly.)
Have you looked at whether there is a cookie specifically designed to foil your attempts? I suggest using Wireshark or other inspector to see if there is a cookie that changes (via javascript, etc) when you manually log in.
(Ethical note: You may be violating the terms of service and incurring much more cost to the company than you are paying for. I used to run a service like this and every additional/unplanned location update cost between $0.01 and $0.05, but I'm sure it's come down.)