I have a function get_data(request) that requests some data from a server. Every time this function is called, it requests the data from a different server. All of them should return the same response.
I would like to get the response as soon as possible. I need to create a function that calls get_data several times, and returns the first response it gets.
EDIT:
I came up with the idea of using multiprocessing.Pipe(), but I have the feeling this is a very bad way to solve it. What do you think?
import multiprocessing
from threading import Thread

def get_data(request, pipe):
    data = ...  # makes the request to a server; this can take a random amount of time
    pipe.send(data)

def multiple_requests(request, num_servers):
    my_pipe, his_pipe = multiprocessing.Pipe()
    for i in range(num_servers):
        Thread(target=get_data, args=(request, his_pipe)).start()
    return my_pipe.recv()

multiple_requests("the_request_string", 6)
I think this is a bad way of doing it because you are passing the same pipe to all threads; I don't know for sure, but I suspect that is very unsafe.
I think Redis RQ would be a good fit for this. get_data is a job that you put in the queue six times. Jobs execute asynchronously, and the docs also explain how to work with the results.
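If you would rather stay in the standard library, a minimal sketch of the same "first response wins" idea could use concurrent.futures; it assumes get_data(request) simply returns the data instead of writing it to a pipe:

from concurrent.futures import ThreadPoolExecutor, as_completed

def multiple_requests(request, num_servers):
    executor = ThreadPoolExecutor(max_workers=num_servers)
    futures = [executor.submit(get_data, request) for _ in range(num_servers)]
    try:
        for future in as_completed(futures):
            # as_completed yields futures in completion order, so the first one
            # yielded holds the first response that came back
            return future.result()
    finally:
        # return immediately; let the slower requests finish in the background
        executor.shutdown(wait=False)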
What is the best and fastest Pythonic way to use multithreading for a PUT request that sits inside a for loop? Right now it is synchronous and takes too long to run, so we would like to add multithreading to improve the runtime.
Synchronous:
def econ_post_customers(self, file, data):
    try:
        for i in range(0, len(file['collection'])):
            rp = requests.put(url=self.url, headers=self.headers, params=self.params, data=data)
    except StopIteration:
        pass
We attempted to use threading, but starting a thread for every iteration seems unnecessary: we have thousands of iterations and might end up with many more, so that would become a big mess of threads. Maybe using a pool would solve the problem, but this is where I am stuck.
Does anyone have an idea of how to solve this?
Parallel:
def econ_post_customers(self, file, data):
    try:
        for i in range(0, len(file['collection'])):
            threading.Thread(target=lambda: request_put(url, self.headers, self.params, data)).start()
    except StopIteration:
        pass

def request_put(url, headers, params, single):
    return requests.put(url=url, headers=headers, params=params, data=single)
Any help is highly appreciated. Thank you for your time!
If you want to use multithreading, then the following should work. However, I am a bit confused about a few things. You seem to be doing PUT requests in a loop but all with the same exact arguments. And I don't quite see how you can get a StopIteration exception in the code you posted. Also using a lambda expression as your target argument rather than just specifying the function name and then passing the arguments as a separate tuple or list (as is done below) is a bit unusual. Assuming that loop variable i in reality is being used to index one value that actually varies in the call to request_put, then function map could be a better choice than apply_async. It probably does not matter significantly for multithreading, but could make a performance difference for multiprocessing if you had a very large list of elements on which you were looping.
import requests
from multiprocessing.pool import ThreadPool

def econ_post_customers(self, file, data):
    MAX_THREADS = 100  # some suitable value
    n_tasks = len(file['collection'])
    pool_size = min(MAX_THREADS, n_tasks)
    pool = ThreadPool(pool_size)
    for i in range(n_tasks):
        # self.url, since that is what the synchronous version uses
        pool.apply_async(request_put, args=(self.url, self.headers, self.params, data))
    # wait for all tasks to complete:
    pool.close()
    pool.join()

def request_put(url, headers, params, single):
    return requests.put(url=url, headers=headers, params=params, data=single)
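If, as suggested above, the loop variable really selects one value that varies per request (say, one body per element of file['collection']), a map-based variant might look like this sketch; treating each collection element as the request body is an assumption, not something stated in the question:

from functools import partial
from multiprocessing.pool import ThreadPool

def econ_post_customers(self, file, data):
    MAX_THREADS = 100
    items = file['collection']  # assumption: each element is one request body
    pool = ThreadPool(min(MAX_THREADS, len(items)))
    # map blocks until every PUT has finished and returns the responses in order
    results = pool.map(partial(request_put, self.url, self.headers, self.params), items)
    pool.close()
    pool.join()
    return results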
Do try the grequests module, which works with gevent (requests is not designed for async use). If you try it, you should get good results.
(If it does not work, please do say so.)
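For reference, a grequests version of the same loop might look roughly like this sketch; as above, it assumes each element of file['collection'] is the body to send, which the question does not actually state:

import grequests

def econ_post_customers(self, file, data):
    reqs = [grequests.put(self.url, headers=self.headers, params=self.params, data=item)
            for item in file['collection']]  # assumption: one request body per element
    # size caps how many requests are in flight at the same time
    return grequests.map(reqs, size=20)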
I am trying to execute a time-consuming back-end job, executed by a front-end call. This back-end job should execute a callback method when it is completed, which will release a semaphore. The front end shouldn't have to wait for the long process to finish in order to get a response from the call to kick off the job.
I'm trying to use the Pool class from the multiprocessing library to solve this issue, but I'm running into some issues. Namely that it seems like the only way to actually execute the method passed into apply_async is to call the .get() method in the ApplyResult object that is returned by the apply_async call.
In order to solve this, I thought to create a Process object with the target being apply_result.get. But this doesn't seem to work.
Is there a basic understanding that I'm missing here? What would you folks suggest to solve this issue?
Here is a snippet example of what I have right now:
p = Pool(1)
result = p.apply_async(long_process, args=(config, requester), callback=complete_long_process)
Process(target=result.get).start()
response = {'status': 'success', 'message': 'Job started for {0}'.format(requester)}
return jsonify(response)
Thanks for the help in advance!
I don't quite understand why you would need a Process object here. Look at this snippet:
#!/usr/bin/python
from multiprocessing import Pool
from multiprocessing.managers import BaseManager
from itertools import repeat
from time import sleep

def complete_long_process(foo):
    print "completed", foo

def long_process(a, b):
    print a, b
    sleep(10)

p = Pool(1)
result = p.apply_async(long_process, args=(1, 42),
                       callback=complete_long_process)
print "submitted"
sleep(20)
If I understand what you are trying to achieve, this does exactly that. As soon as you call apply_async, it launches the long_process function and execution of the main program continues. As soon as long_process completes, complete_long_process is called. There is no need to use the get method to execute long_process, and the code does not block or wait for anything.
If your long_process does not appear to run, I assume your problem is somewhere within long_process.
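Tying that back to the snippet in the question, a minimal sketch might look like the following; long_process, config, requester and the semaphore the front end waits on are taken from the question's description, and the stand-in body of long_process is purely illustrative:

from multiprocessing import Pool
from threading import Semaphore
from time import sleep

job_done = Semaphore(0)  # assumption: whatever the front end waits on

def long_process(config, requester):
    # stand-in for the real time-consuming back-end job
    sleep(10)
    return requester

def complete_long_process(result):
    # called in the main process as soon as long_process finishes
    job_done.release()

pool = Pool(1)

def start_job(config, requester):
    # apply_async returns immediately; no .get() and no extra Process object needed
    pool.apply_async(long_process, args=(config, requester),
                     callback=complete_long_process)
    return {'status': 'success', 'message': 'Job started for {0}'.format(requester)}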
Hannu
So I was following a guide at http://tavendo.com/blog/post/going-asynchronous-from-flask-to-twisted-klein/ to create an asynchronous web service.
In my code, I had a function that sends out a request, like this:
@inlineCallbacks
def query(text):
    resp = yield treq.get("http://api.Iwanttoquery")
    content = yield treq.content(resp)
    returnValue(content)

@inlineCallbacks
def caller():
    output1 = yield query("one")
    output2 = yield query("two")
Since each query to the API usually takes about 3 seconds, with my current code the result comes back after 6 seconds. Is there a way to send out both queries at the same time, so that after 3 seconds I can get the content of both output1 and output2? Thanks.
What you need to do is use a DeferredList instead of inlineCallbacks. Basically, you provide a list of deferreds, and once all of them have completed, a final callback is executed with the results of all the deferreds.
import treq
from twisted.internet import defer, reactor

def query(text):
    get = treq.get('http://google.com')
    get.addCallback(treq.content)
    return get

output1 = query('one')
output2 = query('two')
final = defer.DeferredList([output1, output2])  # wait for both queries to finish
final.addCallback(print)  # print the results from all the queries in the list
reactor.run()
Each call to query() starts its request and returns a Deferred almost immediately, so output1 and output2 are effectively executing at the same time. You then put the deferreds (i.e. output1 and output2) in a list and pass it to DeferredList, which itself returns a Deferred. Finally, you add a callback to the DeferredList to do something with the results (in this case I just print them). This is all done without the use of threads, which is the best part in my opinion! Hope this makes sense, and please comment if it doesn't.
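If you would rather keep the caller() structure from the question, you can also yield the DeferredList from inside an inlineCallbacks function; this sketch assumes the same query() as above:

from twisted.internet import defer

@defer.inlineCallbacks
def caller():
    # start both queries before yielding so they run concurrently
    d1 = query('one')
    d2 = query('two')
    results = yield defer.DeferredList([d1, d2])
    # results is a list of (success, value) pairs, one per deferred
    defer.returnValue([value for success, value in results])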
PS
If you need further help with Klein, I'm working on revamping the documentation here https://github.com/notoriousno/klein-basics (hopefully I'll make a blog post one of these days). Please take a look at some of the docs (the files with .rst). My shameless plug is now concluded :D
Here is the situation.
I need to send an AJAX request to a Django view function every second, and this view function sends some asynchronous requests to a third-party API via grequests to get some data. The data is rendered to HTML after the view function returns.
Here is the code:
desc_ip_list = ['58.222.24.253', '58.222.17.38']
reqs = [grequests.get('%s%s' % ('http://int.dpool.sina.com.cn/iplookup/iplookup.php?format=json&ip=', desc_ip))
        for desc_ip in desc_ip_list]
response = grequests.map(reqs)
When I run the Django development server and send this AJAX request, the number of Python threads keeps increasing until the error "can't start new thread" happens.
error: can't start new thread
<Greenlet at 0x110473b90: <bound method AsyncRequest.send of <grequests.AsyncRequest object at 0x1103fd1d0>>(stream=False)> failed with error
How can I control the number of threads? I have no idea how, because I'm a Python beginner.
Thanks a lot.
Maybe your desc_ip_list is too long, so for, let's say, a hundred IPs, you'd be spawning 100 requests, made by 100 threads!
See here in the grequests code.
What you should do:
You should probably specify the size param in the map() call as a reasonable number, at most around (2*n + 1) where n is the number of cores in your CPU. That makes sure you don't process all the IPs in desc_ip_list at the same time and therefore don't spawn that many threads.
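Concretely, using the code from the question, that would look something like this; the value 10 is just an illustration, not a recommendation for your machine:

import grequests

desc_ip_list = ['58.222.24.253', '58.222.17.38']
reqs = [grequests.get('http://int.dpool.sina.com.cn/iplookup/iplookup.php?format=json&ip=' + desc_ip)
        for desc_ip in desc_ip_list]
# size caps how many requests (and therefore greenlets) run at the same time
response = grequests.map(reqs, size=10)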
EDIT: More info, from a gevent doc page:
The Pool class, which is a subclass of Group, provides a way to limit concurrency: its spawn method blocks if the number of greenlets in the pool has already reached the limit, until there is a free slot.
Why am I mentioning this?
Let's trace it back from grequests:
In map(), at lines 113-114, we have:
pool = Pool(size) if size else None
jobs = [send(r, pool, stream=stream) for r in requests]
And at line 85, in send(), we have:
return gevent.spawn(r.send, stream=stream)
This is the return statement that will be executed from send(), because its pool param will be None, since you didn't specify size in map(). Now go back a few lines and re-read what I quoted from the gevent docs.
I want to move to ndb, and have been wondering whether to use async urlfetch tasklets. I'm not sure I fully understand how it works, as the documentation is somewhat poor, but it seems quite promising for this particular use case.
Currently I use async urlfetch like this. It is far from actual threading or parallel code, but it has still improved performance quite significantly, compared to just sequential requests.
def http_get(url):
    rpc = urlfetch.create_rpc(deadline=3)
    urlfetch.make_fetch_call(rpc, url)
    return rpc

rpcs = []
urls = [...]  # hundreds of urls

while len(rpcs) < 10:
    rpcs.append(http_get(urls.pop()))

while rpcs:
    rpc = rpcs.pop(0)
    result = rpc.get_result()
    if result.status_code == 200:
        # append another item to rpcs
        # process result
        pass
    else:
        # re-append same item to rpcs
        pass
Please note that this code is simplified. The actual code catches exceptions, has some additional checks, and only tries to re-append the same item a few times. It makes no difference for this case.
I should add that processing the result does not involve any db operations.
Actually, yes, it's a good idea to use async urlfetch here. How it works (rough explanation):
- Your code reaches the point of the async call. It triggers a long background task and doesn't wait for its result, but continues to execute.
- The task works in the background, and when the result is ready it is stored somewhere until you ask for it.
Simple example:
def get_fetch_all():
    urls = ["http://www.example.com/", "http://mirror.example.com/"]
    ctx = ndb.get_context()
    futures = [ctx.urlfetch(url) for url in urls]
    ndb.Future.wait_all(futures)  # wait_all only waits; it does not return the results
    results = [f.get_result() for f in futures]
    # do something with results here
If you want to store the result in ndb and make it more optimal, it's a good idea to write a custom tasklet for this.
@ndb.tasklet
def get_data_and_store(url):
    ctx = ndb.get_context()
    # until the result arrives here, this function is "paused", allowing other
    # parallel tasks to run; once the data has been fetched, control returns here
    result = yield ctx.urlfetch(url)
    if result.status_code == 200:
        store = Storage(data=result.content)
        # async job to put the data
        yield store.put_async()
        raise ndb.Return(True)
    else:
        raise ndb.Return(False)
And you can use this tasklet combined with the loop from the first sample, as in the sketch below. You should get a list of True/False values indicating the success of each fetch.
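A minimal sketch of that combination, reusing get_data_and_store from above (the list of URLs is only a placeholder):

from google.appengine.ext import ndb

def fetch_and_store_all(urls):
    # start one tasklet per url; each call returns a future immediately
    futures = [get_data_and_store(url) for url in urls]
    ndb.Future.wait_all(futures)
    # each tasklet resolved to True on a successful fetch and store, False otherwise
    return [f.get_result() for f in futures]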
I'm not sure how much this will boost overall performance (it depends on Google's side), but it should help.