I am using gevent to download some html pages.
Some websites are way too slow, some stop serving requests after period of time. That is why I had to limit total time for a group of requests I make. For that I use gevent "Timeout".
timeout = Timeout(10)
timeout.start()
def downloadSite():
# code to download site's url one by one
url1 = downloadUrl()
url2 = downloadUrl()
url3 = downloadUrl()
try:
gevent.spawn(downloadSite).join()
except Timeout:
print 'Lost state here'
But the problem with it is that i loose all the state when exception fires up.
Imagine I crawl site 'www.test.com'. I have managed to download 10 urls right before site admins decided to switch webserver for maintenance. In such case i will lose information about crawled pages when exception fires up.
The question is - how do I save state and process the data even if Timeout happens ?
Why not try something like:
timeout = Timeout(10)
def downloadSite(url):
with Timeout(10):
downloadUrl(url)
urls = ["url1", "url2", "url3"]
workers = []
limit = 5
counter = 0
for i in urls:
# limit to 5 URL requests at a time
if counter < limit:
workers.append(gevent.spawn(downloadSite, i))
counter += 1
else:
gevent.joinall(workers)
workers = [i,]
counter = 0
gevent.joinall(workers)
You could also save a status in a dict or something for every URL, or append the ones that fail in a different array, to retry later.
A self-contained example:
import gevent
from gevent import monkey
from gevent import Timeout
gevent.monkey.patch_all()
import urllib2
def get_source(url):
req = urllib2.Request(url)
data = None
with Timeout(2):
response = urllib2.urlopen(req)
data = response.read()
return data
N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]
print contents[5]
It implements one timeout for each request. In this example, contents contains 10 times the HTML source of google.com, each retrieved in an independent request. If one of the requests had timed out, the corresponding element in contents would be None. If you have questions about this code, don't hesitate to ask in the comments.
I saw your last comment. Defining one timeout per request definitely is not wrong from the programming point of view. If you need to throttle traffic to the website, then just don't spawn 100 greenlets simultaneously. Spawn 5, wait until they returned. Then, you can possibly wait for a given amount of time, and spawn the next 5 (already shown in the other answer by Gabriel Samfira, as I see now). For my code above, this would mean, that you would have to repeatedly call
N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]
whereas N should not be too high.
Related
I have the following python code that fetches data from a remote json file. The processing of the remote json file can sometimes be quick or sometimes take a little while. So I put the please wait print message before the post request. This works fine. However, I find that for the requests that are quick, the please wait is pointless. Is there a way I can display the please wait message if request is taking longer than x seconds?
try:
print("Please wait")
r = requests.post(url = "http://localhost/test.php")
r_data = r.json()
you can do it using multiple threads as follows:
import threading
from urllib import request
from asyncio import sleep
def th():
sleep(2) # if download takes more than 2 seconds
if not isDone:
print("Please wait...")
dl_thread = threading.Thread(target=th) # create new thread that executes function th when the thread is started
dl_thread.start() # start the thread
isDone = False # variable to track the request status
r = request.post(url="http://localhost/test.php")
isDone = True
r_data = r.json()
I am trying to validate what the throttle limit for an endpoint using python code.
Basically I have set Throttlelimit on the endpoint I am testing is 3calls/sec. The test does 4 calls and checks the status codes to have atleast 1 429 response.
The validation I have fails sometimes because it looks like the responses take more than a second to respond. The code I tried are:
Method1:
request = requests.Request(method='GET', url=GLOBALS["url"], params=context.payload, headers=context.headers)
context.upperlimit = int(GLOBALS["ThrottleLimit"]) + 1
reqs = [request for i in range(0, context.upperlimit)]
with BaseThrottler(name='base-throttler', reqs_over_time=(context.upperlimit, 1)) as bt:
throttled_requests = bt.multi_submit(reqs)
context.responses = [tr.response for tr in throttled_requests]
assert(429 in [ i.status_code for i in context.responses])
Method2:
request = requests.get(url=GLOBALS["url"], params=context.payload, headers=context.headers)
url = request.url
urls = set([])
for i in range(0, context.upperlimit):
urls.add(grequests.get(url))
context.responses = grequests.map(urls)
assert(429 in [ i.status_code for i in context.responses])
Is there a way that I can make sure all the responses came back in the same second and if not it should try again before failing the test.
I suppose you are using requests and grequests library. You can set a timeout as explained in the docs and also for grequests.
Plain requests
requests.get(url, timeout=1)
Using grequests
grequests.get(url, timeout=1)
Timeout value is "number of seconds"
Using timeout won't necessarily ensure the condition that you are looking for, which is that all 4 requests were received by the endpoint within one second (not that each individual response was received within one second of sending the request).
One quick and dirty way to solve this is to simply time the execution of the code, and ensure that all responses were received in less than a second (using the timeit module)
start_time = timeit.default_timer()
context.responses = grequests.map(urls)
elapsed = timeit.default_timer() - start_time
if elapsed < 1:
assert(429 in [ i.status_code for i in context.responses])
This is crude because it is checking round trip time, but will ensure that all requests were received within a second. If you need more specificity, or find that the condition is not met often enough, you could add a header to the response with the exact time the request was received by the endpoint, and then verify that all requests hit the endpoint within one second of each other.
I want to make a lot of url requets to a REST webserivce. Typically between 75-90k. However, I need to throttle the number of concurrent connections to the webservice.
I started playing around with grequests in the following manner, but quickly started chewing up opened sockets.
concurrent_limit = 30
urllist = buildUrls()
hdrs = {'Host' : 'hostserver'}
g_requests = (grequests.get(url, headers=hdrs) for url in urls)
g_responses = grequests.map(g_requests, size=concurrent_limit)
As this runs for a minute or so, I get hit with 'maximum number of sockets reached' errors.
As far as I can tell, each one of the requests.get calls in grequests uses it's own session which means a new socket is opened for each request.
I found a note on github referring how to make grequests use a single session. But this seems to effectively bottleneck all requests into a single shared pool. That seems to defeat the purpose of asynchronous http requests.
s = requests.session()
rs = [grequests.get(url, session=s) for url in urls]
grequests.map(rs)
Is is possible to use grequests or gevent.Pool in a way that creates a number of sessions?
Put another way: How can I make many concurrent http requests using either through queuing or connection pooling?
I ended up not using grequests to solve my problem. I'm still hopeful it might be possible.
I used threading:
class MyAwesomeThread(Thread):
"""
Threading wrapper to handle counting and processing of tasks
"""
def __init__(self, session, q):
self.q = q
self.count = 0
self.session = session
self.response = None
Thread.__init__(self)
def run(self):
"""TASK RUN BY THREADING"""
while True:
url, host = self.q.get()
httpHeaders = {'Host' : host}
self.response = session.get(url, headers=httpHeaders)
# handle response here
self.count+= 1
self.q.task_done()
return
q=Queue()
threads = []
for i in range(CONCURRENT):
session = requests.session()
t=MyAwesomeThread(session,q)
t.daemon=True # allows us to send an interrupt
threads.append(t)
## build urls and add them to the Queue
for url in buildurls():
q.put_nowait((url,host))
## start the threads
for t in threads:
t.start()
rs is a AsyncRequest list。each AsyncRequest have it's own session.
rs = [grequests.get(url) for url in urls]
grequests.map(rs)
for ar in rs:
print(ar.session.cookies)
Something like this:
NUM_SESSIONS = 50
sessions = [requests.Session() for i in range(NUM_SESSIONS)]
reqs = []
i = 0
for url in urls:
reqs.append(grequests.get(url, session=sessions[i % NUM_SESSIONS]
i+=1
responses = grequests.map(reqs, size=NUM_SESSIONS*5)
That should spread the requests over 50 different sessions.
I have a Google AppEngine (in Python) application where I need to perform 4 to 5 url fetches, and then combine the data before I print it out to the response.
I can do this without any problems using a synchronous workflow, but since the urls that I am fetching are not related or dependent on each other, performing this asynchronously would be the most ideal (and quickest).
I have read and re-read the documentation here, but I just can't figure out how to get read the contents for each url. I've also searched the web for a small example (which is really what I am in need of). I have seen this SO question, but again, here they don't mention anything about reading the contents of these individual asynchronous url fetches.
Does anyone have any simple examples of how to perform 4 or 5 asynchronous url fetches with AppEngine? And then combine the results before I print it to the response?
Here is what I have so far:
rpcs = []
for album in result_object['data']:
total_facebook_photo_count = total_facebook_photo_count + album['count']
facebook_albumid_array.append(album['id'])
#Get the photos in the photo album
facebook_photos_url = 'https://graph.facebook.com/%s/photos?access_token=%s&limit=1000' % (album['id'], access_token)
rpc = urlfetch.create_rpc()
urlfetch.make_fetch_call(rpc, facebook_photos_url)
rpcs.append(rpc)
for rpc in rpcs:
result = rpc.get_result()
self.response.out.write(result.content)
However, it still looks like the line: result = rpc.get_result() is forcing it to wait for the first request to finish, then the second, then the third, and so forth. Is there a way to simply put the results in a variables as they are received?
Thanks!
In the example, text = result.content is where you get the content (body).
To do url fetches in parallell, you could set them up, add to a list and check results afterwards. Expanding on the example already mentioned, it could look something like:
from google.appengine.api import urlfetch
futures = []
for url in urls:
rpc = urlfetch.create_rpc()
urlfetch.make_fetch_call(rpc, url)
futures.append(rpc)
contents = []
for rpc in futures:
try:
result = rpc.get_result()
if result.status_code == 200:
contents.append(result.content)
# ...
except urlfetch.DownloadError:
# Request timed out or failed.
# ...
concatenated_result = '\n'.join(contents)
In this example, we assemble the body of all the requests that returned status code 200, and concatenate with linebreak between them.
Or with ndb, my personal preference for anything async on GAE, something like:
#ndb.tasklet
def get_urls(urls):
ctx = ndb.get_context()
result = yield map(ctx.urlfetch, urls)
contents = [r.content for r in result if r.status_code==200]
raise ndb.Return('\n'.join(contents))
I use this code (implmented before I learned about ndb tasklets):
while rpcs:
rpc = UserRPC.wait_any(rpcs)
result = rpc.get_result()
# process result here
rpcs.remove(rpc)
I am trying to collect the retweets data from a Chinese microblog Sina Weibo, you can see the following code. However, I am suffering from the problem of IP request out of limit.
To solve this problem, I have to set time.sleep() for the code. You can see I attempted to add a line of ' time.sleep(10) # to opress the ip request limit' in the code. Thus python will sleep 10 secs after crawling a page of retweets (one page contains 200 retweets).
However, it still not sufficient to deal with the IP problem.
Thus, I am planning to more systematically make python sleep 60 secs after it has crawled every 20 pages. Your ideas will be appreciated.
ids=[3388154704688495, 3388154704688494, 3388154704688492]
addressForSavingData= "C:/Python27/weibo/Weibo_repost/repostOwsSave1.csv"
file = open(addressForSavingData,'wb') # save to csv file
for id in ids:
if api.rate_limit_status().remaining_hits >= 205:
for object in api.counts(ids=id):
repost_count=object.__getattribute__('rt')
print id, repost_count
pages= repost_count/200 +2 # why should it be 2? cuz python starts from 0
for page in range(1, pages):
time.sleep(10) # to opress the ip request limit
for object in api.repost_timeline(id=id, count=200, page=page): # get the repost_timeline of a weibo
"""1.1 reposts"""
mid = object.__getattribute__("id")
text = object.__getattribute__("text").encode('gb18030') # add encode here
"""1.2 reposts.user"""
user = object.__getattribute__("user") # for object in user
user_id = user.id
"""2.1 retweeted_status"""
rts = object.__getattribute__("retweeted_status")
rts_mid = rts.id # the id of weibo
"""2.2 retweeted_status.user"""
rtsuser_id = rts.user[u'id']
try:
w = csv.writer(file,delimiter=',',quotechar='|', quoting=csv.QUOTE_MINIMAL)
w.writerow(( mid,
user_id, rts_mid,
rtsuser_id, text)) # write it out
except: # Exception of UnicodeEncodeError
pass
elif api.rate_limit_status().remaining_hits < 205:
sleep_time=api.rate_limit_status().reset_time_in_seconds # time.time()
print sleep_time, api.rate_limit_status().reset_time
time.sleep(sleep_time+2)
file.close()
pass
Can you not just pace the script instead?
I suggest to make your script sleep in between each request instead of making a requests all at the same time. And say span over a minute.. This way you will also avoid any flooding bans and this is considered good behaviour.
Pacing your requests may also allow you to do things more quickly if the server does not time you out for sending too many requests.
If there is a limit to the IP sometimes their are no great and easy solutions. For example if you run apache http://opensource.adnovum.ch/mod_qos/ limits bandwidth and connections and specifically it limits;
The maximum number of concurrent requests
Limitation of the bandwidth such as the maximum allowed number of requests per second to an URL or the maximum/minimum of downloaded kbytes per second.
Limits the number of request events per second
Generic request line and header filter to deny unauthorized operations.
Request body data limitation and filtering
the maximum number of allowed connections from a single IP source address or dynamic keep-alive control.
You may want to start with these. You could send referrer URL's with your requests and make only single connections, not multiple connections.
You could also refer to this question
I figure out the solution:
first, give an integer, e.g 0
i = 0
second, in the for page loop, add the following code
for page in range(1, 300):
i += 1
if (i % 25 ==0):
print i, "find i which could be exactly divided by 25"