Are my Python threads tripping over each other with requests?

Are my Python threads tripping over each other with requests? - python

I have a program that needs to access data from an API. I need to get a list from it and then for every item in that list, request more data from the API. When I get a list, I get them in batches of 50 when there are around 600 items on this list. I thought that I could do this using requests and threading. Here is what it looks like:
I have essentially a helper method to call the API:
call_api_method(method, token, params={}):
params_to_send = params.copy()
params_to_send['auth'] = token
response = requests.get('{0}/rest{1}'.format(DOMAIN, method), params = params_to_send)
return response.json()
I then have a recursive threading function to get all info. I thought I could use threads to go ahead and request the next batch of info while making threads to request the info per item:
def import_item_info(auth_token, start = None):
if start is None:
start = 0
threads = []
result = call_api_method('get_list', auth_token, {'start': start})
#the call returns next which is the index of the next batch
if result['next']:
thread = threading.Thread(target=import_item_info, args=(auth_token, result['next'])
thread.start()
threads.append(thread)
for list_item in result['result']:
thread = threading.Thread(target=get_item_info, args=(auth_token, item['ID'])
thread.start()
threads.append(thread)
for thread in threads:
thread.join()
This is get_item_info, which makes a call to the api using the id of the item to get the specific details about the item:
def get_item_info(auth_token, item_id):
item = call_api_method('get_item', auth_token, {'id': item_id})
print(item['key'])
I've abstracted a lot of the info, but essentially what is going on is that sometimes the requests.get returns something slightly garbled and I get a JSONDecodeError: Expecting value: line 1 column 1 (char 0).
I highly suspect that this is a threading problem because the first request goes through just fine. I can't seem to find what I'm doing wrong.

Okay. Sorry... I thought I checked but apparently I hit the query limit and that's why it is starting to do weird things.

Related

Celery wait for a task from another request

Let's say there is a long task that takes 1 minute. When a user makes a request /get-info and waiting for the response it should return a result. I'm using delay(), wait() and everything works. Now I want if another 5 users make same request /get-info I want them 'connect' to the same task and get result once the task is finished. I'm trying to save task id in redis. But so far I'm having 2 problems.
If I use AsyncResult() and wait() the second request hangs.
If I use AsyncResult() and state, the first request hangs. How can I implement that?
#main.route('/get-info', methods=['POST'])
def get_info():
if redis.exists('getInfoTaskId'):
taks_id = redis.get('getInfoTaskId')
task = add_together.AsyncResult(taks_id)
result = task.wait()
# result = task.state - if uncomment and comment the line above the first req hangs
else:
task = add_together.delay(23, 42)
redis.set('getInfoTaskId', task.id, ex=600)
result = task.wait()
redis.delete('getInfoTaskId')
return f"task result is {result}"

Python muiltithreading is mixing the data of different request in django

I am using python muiltithreading for achieving a task which is like 2 to 3 mins long ,i have made one api endpoint in django project.
Here is my code--
from threading import Thread
def myendpoint(request):
print("hello")
lis = [ *args ]
obj = Model.objects.get(name =" jax")
T1 = MyThreadClass(lis, obj)
T1.start()
T1.deamon = True
return HttpResponse("successful", status=200)
Class MyThreadClass(Thread):
def __init__(self,lis,obj):
Thread.__init__(self)
self.lis = lis
self.obj = obj
def run(self):
for i in lis:
res =Func1(i)
self.obj.someattribute = res
self.obj.save()
def Func1(i):
'''Some big codes'''
context =func2(*args)
return context
def func2(*args):
"' some codes "'
return res
By this muiltithreading i can achieve the quick response from the django server on calling the endpoint function as the big task is thrown in another tread and execution of the endpoint thread is terminated on its return statement without keeping track of the spawned thread.
This part works for me correctly if i hit the url once , but if i hit the url 2 times as soon as 1st execution starts then on 2nd request i can see my request on console. But i cant get any response from it.
And if i hit the same url from 2 different client at the same time , both the individual datas are getting mixed up and i see few records of one client's request on other client data.
I am testing it to my local django runserver.
So guys please help , and i know about celery so dont recommend celery. Just tell me why this thing is happening or can it be fixed . As my task is not that long to use celery. I want to achieve it by muiltithreading.

Combining semaphore and time limiting in python-trio with asks http request

I'm trying to use Python in an async manner in order to speed up my requests to a server. The server has a slow response time (often several seconds, but also sometimes faster than a second), but works well in parallel. I have no access to this server and can't change anything about it. So, I have a big list of URLs (in the code below, pages) which I know beforehand, and want to speed up their loading by making NO_TASKS=5 requests at a time. On the other hand, I don't want to overload the server, so I want a minimum pause between every request of 1 second (i. e. a limit of 1 request per second).
So far I have successfully implemented the semaphore part (five requests at a time) using a Trio queue.
import asks
import time
import trio
NO_TASKS = 5
asks.init('trio')
asks_session = asks.Session()
queue = trio.Queue(NO_TASKS)
next_request_at = 0
results = []
pages = [
'https://www.yahoo.com/',
'http://www.cnn.com',
'http://www.python.org',
'http://www.jython.org',
'http://www.pypy.org',
'http://www.perl.org',
'http://www.cisco.com',
'http://www.facebook.com',
'http://www.twitter.com',
'http://www.macrumors.com/',
'http://arstechnica.com/',
'http://www.reuters.com/',
'http://abcnews.go.com/',
'http://www.cnbc.com/',
]
async def async_load_page(url):
global next_request_at
sleep = next_request_at
next_request_at = max(trio.current_time() + 1, next_request_at)
await trio.sleep_until(sleep)
next_request_at = max(trio.current_time() + 1, next_request_at)
print('start loading page {} at {} seconds'.format(url, trio.current_time()))
req = await asks_session.get(url)
results.append(req.text)
async def producer(url):
await queue.put(url)
async def consumer():
while True:
if queue.empty():
print('queue empty')
return
url = await queue.get()
await async_load_page(url)
async def main():
async with trio.open_nursery() as nursery:
for page in pages:
nursery.start_soon(producer, page)
await trio.sleep(0.2)
for _ in range(NO_TASKS):
nursery.start_soon(consumer)
start = time.time()
trio.run(main)
However, I'm missing the implementation of the limiting part, i. e. the implementation of max. 1 request per second. You can see above my attempt to do so (first five lines of async_load_page), but as you can see when you execute the code, this is not working:
start loading page http://www.reuters.com/ at 58097.12261669573 seconds
start loading page http://www.python.org at 58098.12367392373 seconds
start loading page http://www.pypy.org at 58098.12380622773 seconds
start loading page http://www.macrumors.com/ at 58098.12389389973 seconds
start loading page http://www.cisco.com at 58098.12397854373 seconds
start loading page http://arstechnica.com/ at 58098.12405119873 seconds
start loading page http://www.facebook.com at 58099.12458010273 seconds
start loading page http://www.twitter.com at 58099.37738939873 seconds
start loading page http://www.perl.org at 58100.37830828273 seconds
start loading page http://www.cnbc.com/ at 58100.91712723473 seconds
start loading page http://abcnews.go.com/ at 58101.91770178373 seconds
start loading page http://www.jython.org at 58102.91875295573 seconds
start loading page https://www.yahoo.com/ at 58103.91993155273 seconds
start loading page http://www.cnn.com at 58104.48031027673 seconds
queue empty
queue empty
queue empty
queue empty
queue empty
I've spent some time searching for answers but couldn't find any.

One of the ways to achieve your goal would be using a mutex acquired by a worker before sending a request and released in a separate task after some interval:
async def fetch_urls(urls: Iterator, responses, n_workers, throttle):
# Using binary `trio.Semaphore` to be able
# to release it from a separate task.
mutex = trio.Semaphore(1)
async def tick():
await trio.sleep(throttle)
mutex.release()
async def worker():
for url in urls:
await mutex.acquire()
nursery.start_soon(tick)
response = await asks.get(url)
responses.append(response)
async with trio.open_nursery() as nursery:
for _ in range(n_workers):
nursery.start_soon(worker)
If a worker gets response sooner than after throttle seconds, it will block on await mutex.acquire(). Otherwise the mutex will be released by the tick and another worker will be able to acquire it.
This is similar to how leaky bucket algorithm works:
Workers waiting for the mutex are like water in a bucket.
Each tick is like a bucket leaking at a constant rate.
If you add a bit of logging just before sending a request you should get an output similar to this:
0.00169 started
0.001821 n_workers: 5
0.001833 throttle: 1
0.002152 fetching https://httpbin.org/delay/4
1.012 fetching https://httpbin.org/delay/2
2.014 fetching https://httpbin.org/delay/2
3.017 fetching https://httpbin.org/delay/3
4.02 fetching https://httpbin.org/delay/0
5.022 fetching https://httpbin.org/delay/2
6.024 fetching https://httpbin.org/delay/2
7.026 fetching https://httpbin.org/delay/3
8.029 fetching https://httpbin.org/delay/0
9.031 fetching https://httpbin.org/delay/0
10.61 finished

Using trio.current_time() for this is much too complicated IMHO.
The easiest way to do rate limiting is a rate limiter, i.e. a separate task that basically does this:
async def ratelimit(queue,tick, task_status=trio.TASK_STATUS_IGNORED):
with trio.open_cancel_scope() as scope:
task_status.started(scope)
while True:
await queue.put()
await trio.sleep(tick)
Example use:
async with trio.open_nursery() as nursery:
q = trio.Queue(0) # can use >0 for burst modes
limiter = await nursery.start(ratelimit, q, 1)
while whatever:
await q.get(None) # will return at most once per second
do_whatever()
limiter.cancel()
in other words, you start that task with
q = trio.Queue(0)
limiter = await nursery.start(ratelimit, q, 1)
and then you can be sure that at most one call of
await q.put(None)
per second will return, as the zero-length queue acts as a rendezvous point. When you're done, call
limiter.cancel()
to stop the rate limiting task, otherwise your nursery won't exit.
If your use case includes starting sub-tasks which you need to finish before the limiter gets cancelled, the easiest way to do that is to rin them in another nursery, i.e. instead of
while whatever:
await q.put(None) # will return at most once per second
do_whatever()
limiter.cancel()
you'd use something like
async with trio.open_nursery() as inner_nursery:
await start_tasks(inner_nursery, q)
limiter.cancel()
which would wait for the tasks to finish before touching the limiter.
NB: You can easily adapt this for "burst" mode, i.e. allow a certain number of requests before the rate limiting kicks in, by simply increasing the queue's length.

Motivation and origin of this solution
Some months have passed since I asked this question. Python has improved since then, so has trio (and my knowledge of them). So I thought it was time for a little update using Python 3.6 with type annotations and trio-0.10 memory channels.
I developed my own improvement of the original version, but after reading #Roman Novatorov's great solution, adapted it again and this is the result. Kudos to him for the main structure of the function (and the idea to use httpbin.org for illustration purposes). I chose to use memory channels instead of a mutex to be able to take out any token re-release logic out of the worker.
Explanation of solution
I can rephrase the original problem like this:
I want to have a number of workers that start the request independently of each other (thus, they will be realized as asynchronous functions).
There is zero or one token released at any point; any worker starting a request to the server consumes a token, and the next token will not be issued until a minimum time has passed. In my solution, I use trio's memory channels to coordinate between the token issuer and the token consumers (workers)
In case your not familiar with memory channels and their syntax, you can read about them in the trio doc. I think the logic of async with memory_channel and memory_channel.clone() can be confusing in the first moment.
from typing import List, Iterator
import asks
import trio
asks.init('trio')
links: List[str] = [
'https://httpbin.org/delay/7',
'https://httpbin.org/delay/6',
'https://httpbin.org/delay/4'
] * 3
async def fetch_urls(urls: List[str], number_workers: int, throttle_rate: float):
async def token_issuer(token_sender: trio.abc.SendChannel, number_tokens: int):
async with token_sender:
for _ in range(number_tokens):
await token_sender.send(None)
await trio.sleep(1 / throttle_rate)
async def worker(url_iterator: Iterator, token_receiver: trio.abc.ReceiveChannel):
async with token_receiver:
for url in url_iterator:
await token_receiver.receive()
print(f'[{round(trio.current_time(), 2)}] Start loading link: {url}')
response = await asks.get(url)
# print(f'[{round(trio.current_time(), 2)}] Loaded link: {url}')
responses.append(response)
responses = []
url_iterator = iter(urls)
token_send_channel, token_receive_channel = trio.open_memory_channel(0)
async with trio.open_nursery() as nursery:
async with token_receive_channel:
nursery.start_soon(token_issuer, token_send_channel.clone(), len(urls))
for _ in range(number_workers):
nursery.start_soon(worker, url_iterator, token_receive_channel.clone())
return responses
responses = trio.run(fetch_urls, links, 5, 1.)
Example of logging output:
As you see, the minimum time between all page requests is one second:
[177878.99] Start loading link: https://httpbin.org/delay/7
[177879.99] Start loading link: https://httpbin.org/delay/6
[177880.99] Start loading link: https://httpbin.org/delay/4
[177881.99] Start loading link: https://httpbin.org/delay/7
[177882.99] Start loading link: https://httpbin.org/delay/6
[177886.20] Start loading link: https://httpbin.org/delay/4
[177887.20] Start loading link: https://httpbin.org/delay/7
[177888.20] Start loading link: https://httpbin.org/delay/6
[177889.44] Start loading link: https://httpbin.org/delay/4
Comments on the solution
As not untypical for asynchronous code, this solution does not maintain the original order of the requested urls. One way to solve this is to associate an id to the original url, e. g. with a tuple structure, put the responses into a response dictionary and later grab the responses one after the other to put them into a response list (saves sorting and has linear complexity).

You need to increment next_request_at by 1 every time you come into async_load_page. Try using next_request_at = max(trio.current_time() + 1, next_request_at + 1). Also I think you only need to set it once. You may get into trouble if you're setting it around awaits, where you're giving the opportunity for other tasks to change it before examining it again.

How to open Post Urls in multithreads in python

I am using python 2.7 on Windows machine. I have an array of urls accompanied by data and headers, so POST method is required.
In simple execution it works well:
rescodeinvalid =[]
success = []
for i in range(0,len(HostArray)):
data = urllib.urlencode(post_data)
req = urllib2.Request(HostArray[i], data)
response = urllib2.urlopen(req)
rescode=response.getcode()
if responsecode == 400:
rescodeinvalid.append(HostArray[i])
if responsecode == 200:
success.append(HostArray[i])
My question is if HostArray length is very large, then it is taking much time in loop.
So, how to check each url of HostArray in a multithread. If response code of each url is 200, then I am doing different operation. I have arrays to store 200 and 400 responses.
So, how to do this in multithread in python

If you want to do each one in a separate thread you could do something like:
rescodeinvalid =[]
success = []
def post_and_handle(url,post_data)
data = urllib.urlencode(post_data)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
rescode=response.getcode()
if responsecode == 400:
rescodeinvalid.append(url) # Append is thread safe
elif responsecode == 200:
success.append(url) # Append is thread safe
workers = []
for i in range(0,len(HostArray)):
t = threading.Thread(target=post_and_handle,args=(HostArray[i],post_data))
t.start()
workers.append(t)
# Wait for all of the requests to complete
for t in workers:
t.join()
I'd also suggest using requests: http://docs.python-requests.org/en/latest/
as well as a thread pool:
Threading pool similar to the multiprocessing Pool?
Thread pool usage:
from multiprocessing.pool import ThreadPool
# Done here because this must be done in the main thread
pool = ThreadPool(processes=50) # use a max of 50 threads
# do this instead of Thread(target=func,args=args,kwargs=kwargs))
pool.apply_async(func,args,kwargs)
pool.close() # I think
pool.join()

scrapy uses twisted library to call multiple urls in parallel without the overhead of opening a new thread per request, it also manage internal queue to accumulate and even prioritize them as a bonus you can also restrict number of parallel requests by settings maximum concurrent requests, you can either launch a scrapy spider as an external process or from your code, just set spider start_urls = HostArray

Your case (basically processing a list into another list) looks like an ideal candidate for concurrent.futures (see for example this answer) or you may go all the way to Executor.map. And of course use ThreadPoolExecutor to limit the number of concurrently running threads to something reasonable.

GAE - How can I combine the results of several asynchronous url fetches?

I have a Google AppEngine (in Python) application where I need to perform 4 to 5 url fetches, and then combine the data before I print it out to the response.
I can do this without any problems using a synchronous workflow, but since the urls that I am fetching are not related or dependent on each other, performing this asynchronously would be the most ideal (and quickest).
I have read and re-read the documentation here, but I just can't figure out how to get read the contents for each url. I've also searched the web for a small example (which is really what I am in need of). I have seen this SO question, but again, here they don't mention anything about reading the contents of these individual asynchronous url fetches.
Does anyone have any simple examples of how to perform 4 or 5 asynchronous url fetches with AppEngine? And then combine the results before I print it to the response?
Here is what I have so far:
rpcs = []
for album in result_object['data']:
total_facebook_photo_count = total_facebook_photo_count + album['count']
facebook_albumid_array.append(album['id'])
#Get the photos in the photo album
facebook_photos_url = 'https://graph.facebook.com/%s/photos?access_token=%s&limit=1000' % (album['id'], access_token)
rpc = urlfetch.create_rpc()
urlfetch.make_fetch_call(rpc, facebook_photos_url)
rpcs.append(rpc)
for rpc in rpcs:
result = rpc.get_result()
self.response.out.write(result.content)
However, it still looks like the line: result = rpc.get_result() is forcing it to wait for the first request to finish, then the second, then the third, and so forth. Is there a way to simply put the results in a variables as they are received?
Thanks!

In the example, text = result.content is where you get the content (body).
To do url fetches in parallell, you could set them up, add to a list and check results afterwards. Expanding on the example already mentioned, it could look something like:
from google.appengine.api import urlfetch
futures = []
for url in urls:
rpc = urlfetch.create_rpc()
urlfetch.make_fetch_call(rpc, url)
futures.append(rpc)
contents = []
for rpc in futures:
try:
result = rpc.get_result()
if result.status_code == 200:
contents.append(result.content)
# ...
except urlfetch.DownloadError:
# Request timed out or failed.
# ...
concatenated_result = '\n'.join(contents)
In this example, we assemble the body of all the requests that returned status code 200, and concatenate with linebreak between them.
Or with ndb, my personal preference for anything async on GAE, something like:
#ndb.tasklet
def get_urls(urls):
ctx = ndb.get_context()
result = yield map(ctx.urlfetch, urls)
contents = [r.content for r in result if r.status_code==200]
raise ndb.Return('\n'.join(contents))

I use this code (implmented before I learned about ndb tasklets):
while rpcs:
rpc = UserRPC.wait_any(rpcs)
result = rpc.get_result()
# process result here
rpcs.remove(rpc)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Are my Python threads tripping over each other with requests? - python

Okay. Sorry... I thought I checked but apparently I hit the query limit and that's why it is starting to do weird things.

Related

Celery wait for a task from another request

Python muiltithreading is mixing the data of different request in django

Combining semaphore and time limiting in python-trio with asks http request

How to open Post Urls in multithreads in python

GAE - How can I combine the results of several asynchronous url fetches?

Categories

Resources