I want to make some modifications to my setup.
Within a single request made to my server, I want to issue multiple API calls, combine their results, and return them as the response. Up to this point everything follows the examples in the gevent documentation and over here. The catch is that I want to pass the response back incrementally: as soon as the first API call returns its result, I send it to the frontend over the one long-lived request, then wait for the other API calls and pass their results back over the same request.
I have tried to do this in code but I don't know how to proceed with this setup. gevent's .joinall() and .join() block until all the greenlets have finished getting their responses.
Is there any way I can proceed with gevent in this setting?
The code I am using is from https://bitbucket.org/denis/gevent/src/tip/examples/concurrent_download.py . The .joinall() in the last statement waits until all URLs have finished responding; I want it to be non-blocking so that I can process the responses in the callback function print_head() and return them incrementally.
#!/usr/bin/python
# Copyright (c) 2009 Denis Bilenko. See LICENSE for details.
"""Spawn multiple workers and wait for them to complete"""
urls = ['http://www.google.com', 'http://www.yandex.ru', 'http://www.python.org']

import gevent
from gevent import monkey

# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()

import urllib2

def print_head(url):
    print ('Starting %s' % url)
    data = urllib2.urlopen(url).read()
    print ('%s: %s bytes: %r' % (url, len(data), data[:50]))

jobs = [gevent.spawn(print_head, url) for url in urls]

gevent.joinall(jobs)
If you want to collect results from multiple greenlets, modify print_head() to return the result and then use the .get() method to collect them all.
Put this after joinall():
total_result = [x.get() for x in jobs]
Actually, joinall() is not even necessary in this case.
If print_head() looks like this:
def print_head(url):
    print ('Starting %s' % url)
    return urllib2.urlopen(url).read()
Then total_result will be a list of size 3 containing the responses from all the requests.
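Putting those pieces together, a minimal sketch of the collected version (same URLs and Python 2 style as the example above) might look like this:
import gevent
from gevent import monkey
monkey.patch_all()

import urllib2

urls = ['http://www.google.com', 'http://www.yandex.ru', 'http://www.python.org']

def print_head(url):
    print ('Starting %s' % url)
    # return the body instead of printing it, so the greenlet carries a result
    return urllib2.urlopen(url).read()

jobs = [gevent.spawn(print_head, url) for url in urls]
gevent.joinall(jobs)

# .get() returns the value returned by print_head (or re-raises its exception)
total_result = [job.get() for job in jobs]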
This code (snippet_1) is adapted from the ThreadPoolExecutor example in the docs:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/']

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        data = future.result()  # the page contents returned by load_url
        print('%r page is %d bytes' % (url, len(data)))

print('after')
which works well, and gets
'http://www.foxnews.com/' page is 990869 bytes
'http://www.cnn.com/' page is 990869 bytes
'http://www.bbc.co.uk/' page is 990869 bytes
'http://europe.wsj.com/' page is 990869 bytes
after
This code (snippet_2) is my own; it implements the same job with direct function calls:
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/']

for url in URLS:
    with urllib.request.urlopen(url, timeout=60) as conn:
        data = conn.read()  # the page contents
        print('%r page is %d bytes' % (url, len(data)))

print('after')
snippet_1 seems to be more common, but why?
When you are reading things from a network, your application will probably spend most of its time waiting on a reply.
Normally, the Global Interpreter Lock inside CPython (the Python implementation you are probably using) ensures that only one thread at a time is executing Python bytecode.
But when waiting for I/O (including network I/O) the GIL is released, giving other threads an opportunity to run. That means that multiple reads are effectively running in parallel instead of one after another, shortening the overall execution time.
For a handful of URIs that won't make much of a difference, but the more URIs you use, the more noticeable it gets.
So the ThreadPoolExecutor is mainly useful for running I/O operations in parallel. The ProcessPoolExecutor on the other hand is useful for running CPU intensive tasks in parallel. Since it uses multiple processes, the restriction of the GIL doesn't apply.
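To illustrate the contrast, here is a minimal sketch of ProcessPoolExecutor running a CPU-bound task in parallel; the is_prime helper and the numbers are illustrative assumptions of mine, not part of the snippets above:
import concurrent.futures
import math

# Made-up CPU-bound task: a naive primality check
NUMBERS = [112272535095293, 112582705942171, 115280095190773, 1099726899285419]

def is_prime(n):
    if n < 2:
        return False
    for i in range(2, math.isqrt(n) + 1):
        if n % i == 0:
            return False
    return True

if __name__ == '__main__':
    # Each number is checked in a separate process, so the GIL is not a bottleneck
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for number, prime in zip(NUMBERS, executor.map(is_prime, NUMBERS)):
            print('%d is prime: %s' % (number, prime))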
I'm currently taking a web scraping class with other students, and we are supposed to make ‘get’ requests to a dummy site, parse it, and visit another site.
The problem is that the content of the dummy site is only up for several minutes before it disappears, and then it comes back up at a certain interval. During the time the content is available, everyone tries to make the ‘get’ requests, so mine just hangs until everyone else clears up, and by then the content has disappeared. So I end up not being able to successfully make the ‘get’ request:
import requests
from splinter import Browser
browser = Browser('chrome')
# Hangs here
requests.get('http://dummysite.ca').text
# Even if get is successful hangs here as well
browser.visit(parsed_url)
So my question is, what's the fastest/best way to make endless concurrent 'get' requests until I get a response?
Decide to use either requests or splinter
Read about Requests: HTTP for Humans
Read about Splinter
Related
Read about keep-alive
Read about blocking-or-non-blocking
Read about timeouts
Read about errors-and-exceptions
If you are able to get requests that don't hang, you can think of simply repeating the request until it succeeds, for instance:
import time
import requests

while True:
    response = requests.get('http://dummysite.ca')
    if response.status_code == 200:  # request was successful
        break
    time.sleep(1)
Gevent provides a framework for running asynchronous network requests.
It can patch Python's standard library so that existing libraries like requests and splinter work out of the box.
Here is a short example, based on the above code, of how to make 10 concurrent requests and get their responses.
from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

pool = gevent.pool.Pool(size=10)
greenlets = [pool.spawn(requests.get, 'http://dummysite.ca')
             for _ in range(10)]

# Wait for all requests to complete
pool.join()

for greenlet in greenlets:
    # This will raise any exceptions raised by the request.
    # Need to catch errors, or check if an exception was
    # thrown by checking `greenlet.exception`
    response = greenlet.get()
    text_response = response.text
You could also use map and a response-handling function instead of get.
See gevent documentation for more information.
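As a minimal sketch of that map variant (same placeholder URL as above), where handle_response is a hypothetical handler of my own and not part of the original code:
from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

def fetch(url):
    # each call runs in its own greenlet once handed to pool.map
    return requests.get(url)

def handle_response(response):
    # hypothetical handler: just report the status and size of each response
    print(response.status_code, len(response.content))

pool = gevent.pool.Pool(size=10)
# pool.map blocks until all greenlets are done and returns results in order
responses = pool.map(fetch, ['http://dummysite.ca'] * 10)
for response in responses:
    handle_response(response)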
In this situation, concurrency will not help much, since the server seems to be the limiting factor. One solution is to send a request with a timeout interval; if the interval is exceeded, try the request again after a few seconds, gradually increasing the time between retries until you get the data that you want. For instance, your code might look like this:
import time
import requests

def get_content(url, timeout):
    # raises a Timeout exception if more than `timeout` seconds have passed
    resp = requests.get(url, timeout=timeout)
    # raise a generic exception if the request is unsuccessful
    if resp.status_code != 200:
        raise LookupError('status is not 200')
    return resp.content

timeout = 5  # seconds
retry_interval = 0
max_retry_interval = 120

while True:
    try:
        response = get_content('https://example.com', timeout=timeout)
        retry_interval = 0  # reset retry interval after success
        break
    except (LookupError, requests.exceptions.Timeout):
        retry_interval += 10
        if retry_interval > max_retry_interval:
            retry_interval = max_retry_interval
        time.sleep(retry_interval)

# process response
If concurrency is required, consider the Scrapy project. It uses the Twisted framework. In Scrapy you can replace time.sleep with reactor.callLater(delay, fn, *args, **kw) or use one of hundreds of middleware plugins.
From the documentation for requests:
If the remote server is very slow, you can tell Requests to wait
forever for a response, by passing None as a timeout value and then
retrieving a cup of coffee.
import requests
#Wait potentially forever
r = requests.get('http://dummysite.ca', timeout=None)
#Check the status code to see how the server is handling the request
print r.status_code
Status codes beginning with 2 mean the request was received, understood, and accepted. 200 means the request was a success and the information was returned. 503 means the server is overloaded or undergoing maintenance.
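As an illustration of using that status check (reusing the dummy URL from the question), one sketch would be to keep retrying while the server reports 503:
import time
import requests

while True:
    # wait as long as it takes for a reply
    r = requests.get('http://dummysite.ca', timeout=None)
    if r.status_code == 200:
        print(r.text)  # success: the content is available
        break
    if r.status_code == 503:
        time.sleep(5)  # server overloaded or in maintenance; back off and retry
    else:
        r.raise_for_status()  # any other error: surface it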
Requests used to include a module called async which could send concurrent requests. It is now an independent module named grequests,
which you can use to make concurrent requests endlessly until you get a 200 response:
import grequests

urls = [
    'http://python-requests.org', # Just include one url if you want
    'http://httpbin.org',
    'http://python-guide.org',
    'http://kennethreitz.com'
]

def keep_going():
    rs = (grequests.get(u) for u in urls) # Make a set of unsent Requests
    out = grequests.map(rs)               # Send them all at the same time
    # grequests.map returns responses in request order, and None for
    # requests that failed entirely
    for url, resp in zip(list(urls), out):
        if resp is not None and resp.status_code == 200:
            print resp.text
            urls.remove(url) # If we have the content, delete the URL
    return

while urls:
    keep_going()
I'm trying to download some products from a web page. This web page (according to robots.txt) allows me to send 2000 requests per minute. The problem is that sending the requests sequentially and then processing them is too time-consuming.
I've realised that the method which sends the request can be moved into a pool, which is much better in terms of time. That's probably because the processor doesn't need to wait for a response and instead sends another request right away.
So I have a pool, and the responses are appended to the list RESPONSES.
Simple code:
from multiprocessing.pool import ThreadPool as Pool
import requests

RESPONSES = []

with open('products.txt') as f:
    LINES = f.readlines()[:100]

def post_request(url):
    html = requests.get(url).content
    RESPONSES.append(html)

def parse_html_return_object(resp):
    #some code here
    pass

def insert_object_into_database():
    pass

pool = Pool(100)
for line in LINES:
    pool.apply_async(post_request, args=(line[:-1],))

pool.close()
pool.join()
What I want is to process those RESPONSES (HTMLs), popping responses from the RESPONSES list and parsing them while the remaining requests are still being made.
So it could be like this (Time -->):
post_request(line1)->post_request(line2)->Response_line1->parse_html_return_object(response)->post_request...
Is there some simple way to do that?
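One possible way to get that interleaving with the same ThreadPool, sketched here as an assumption rather than a definitive answer, is imap_unordered, which yields each result as soon as its worker finishes; fetch is a hypothetical variant of post_request that returns the HTML instead of appending it:
from multiprocessing.pool import ThreadPool as Pool
import requests

with open('products.txt') as f:
    LINES = f.readlines()[:100]

def fetch(url):
    # worker: download one page and hand the HTML back to the main thread
    return requests.get(url).content

def parse_html_return_object(resp):
    # some code here
    pass

pool = Pool(100)
# imap_unordered yields each HTML as soon as its request finishes,
# so parsing overlaps with the downloads that are still in flight
for html in pool.imap_unordered(fetch, (line.strip() for line in LINES)):
    obj = parse_html_return_object(html)

pool.close()
pool.join()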
I'm using an API for doing HTTP requests that return JSON. Calling the API, however, requires a start and an end page to be indicated, such as this:
import requests

def API_request(URL):
    while(True):
        try:
            Response = requests.get(URL)
            Data = Response.json()
            return(Data['data'])
        except Exception as APIError:
            print(APIError)
            continue
        break

def build_orglist(start_page, end_page):
    APILink = ("http://sc-api.com/?api_source=live&system=organizations&action="
               "all_organizations&source=rsi&start_page={0}&end_page={1}&items_"
               "per_page=500&sort_method=&sort_direction=ascending&expedite=1&f"
               "ormat=json".format(start_page, end_page))
    return(API_request(APILink))
The only way to know that you're no longer at an existing page is when the returned JSON is null.
If I wanted to run multiple build_orglist calls, going over every single page asynchronously until I reach the end (null JSON), how could I do so?
I went with a mix of @LukasGraf's answer of using sessions to unify all of my HTTP connections into a single session, and made use of grequests for making the group of HTTP requests in parallel.
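Since the final code isn't shown, here is only a rough sketch of that combination; the page range, the one-page-per-request split, and the build_orglink helper are assumptions of mine, not part of the original:
import requests
import grequests

session = requests.Session()  # one session so connections are reused (keep-alive)

def build_orglink(start_page, end_page):
    # hypothetical helper wrapping the URL from the question
    return ("http://sc-api.com/?api_source=live&system=organizations&action="
            "all_organizations&source=rsi&start_page={0}&end_page={1}&items_"
            "per_page=500&sort_method=&sort_direction=ascending&expedite=1&f"
            "ormat=json".format(start_page, end_page))

# assumed page range for illustration: pages 1-10, one page per request
reqs = (grequests.get(build_orglink(page, page), session=session)
        for page in range(1, 11))

for resp in grequests.map(reqs):
    if resp is None:
        continue  # the request itself failed
    data = resp.json()['data']
    if data is None:  # null JSON marks the end of the pages
        break
    # process data here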
I'm using Python Flask + nginx with FCGI.
On some requests, I have to output large responses. Usually those responses are fetched from a socket. Currently I build the response like this:
response = []
while True:
    recv = s.recv(1024)
    if not recv: break
    response.append(recv)
s.close()

response = ''.join(response)
return flask.make_response(response, 200, {
    'Content-type': 'binary/octet-stream',
    'Content-length': len(response),
    'Content-transfer-encoding': 'binary',
})
The problem is that I don't actually need the whole body in memory; I also have a way to determine the exact response length to be fetched from the socket. So I need a good way to send the HTTP headers, then start outputting directly from the socket, instead of collecting everything in memory and only then handing it to nginx (probably via some sort of stream).
I was unable to find a solution to this seemingly common issue. How can that be achieved?
Thank you!
If the response passed to flask.make_response is an iterable, it will be iterated over to produce the response, and each string is written to the output stream on its own.
What this means is that you can also return a generator which will yield the output when iterated over. If you know the content length, you can (and should) pass it as a header.
A simple example:
from flask import Flask
app = Flask(__name__)

import sys
import time
import flask

@app.route('/')
def generated_response_example():
    n = 20

    def response_generator():
        for i in range(n):
            print >>sys.stderr, i
            yield "%03d\n" % i
            time.sleep(.2)

    print >>sys.stderr, "returning generator..."
    gen = response_generator()
    # the call to flask.make_response is not really needed as it happens implicitly
    # if you return a tuple.
    return flask.make_response(gen, "200 OK", {'Content-length': 4*n})

if __name__ == '__main__':
    app.run()
If you run this and try it in a browser, you should see a nice incremental count...
(The content type is not set because it seems that if I do so, my browser waits until the whole content has been streamed before rendering the page. wget -qO - localhost:5000 doesn't have this problem.)
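Applying the same generator idea to the socket case from the question above might look roughly like this; the socket s and the precomputed content_length are assumed to come from the surrounding view code:
import flask

def socket_response(s, content_length):
    def stream_from_socket():
        # read from the socket in chunks and hand each chunk straight to the client
        while True:
            chunk = s.recv(1024)
            if not chunk:
                break
            yield chunk
        s.close()

    # headers go out first; the body is then streamed chunk by chunk
    return flask.make_response(stream_from_socket(), 200, {
        'Content-type': 'binary/octet-stream',
        'Content-length': content_length,
        'Content-transfer-encoding': 'binary',
    })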