How to do parallel concurrent HTTP requests - python

I have a list of 100 ids, and I need to do a lookup for each one of them. The lookup takes approximately 3 s to run. Here is the sequential code that would be needed to run it:
ids = [102225077, 102225085, 102225090, 102225097, 102225105, ...]
for id in ids:
    run_updates(id)
I would like to run ten (10) of these concurrently at a time, using either gevent or multiprocessing. How would I do this? Here is what I tried with gevent, but it's quite slow:
def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

ids = [102225077, 102225085, 102225090, 102225097, 102225105, ...]

if __name__ == '__main__':
    for list_of_ids in list(chunks(ids, 10)):
        jobs = [gevent.spawn(run_updates(id)) for id in list_of_ids]
        gevent.joinall(jobs, timeout=200)
What would be the correct way to split up the ids list and run ten at a time? I would even be open to using multiprocessing or gevent (I'm not too familiar with either).
Doing it sequentially takes 364 seconds for 100 ids.
Using multiprocessing takes about 207 seconds on 100 ids, doing 5 at a time:
pool = Pool(processes=5)
pool.map(run_updates, list_of_apple_ids)
Using gevent takes somewhere in between the two:
jobs = [gevent.spawn(run_updates, apple_id) for apple_id in list_of_apple_ids]
Is there any way I can get better performance than the Pool.map? I have a pretty decent computer here with a fast internet connection; it should be able to do this much quicker...

Check out the grequests library. You can do something like:
import grequests

for list_of_ids in list(chunks(ids, 10)):
    urls = [''.join(('http://www.example.com/id?=', str(id))) for id in list_of_ids]
    requests = (grequests.get(url) for url in urls)
    responses = grequests.map(requests)
    for response in responses:
        print response.content
I know this breaks your model somewhat because you have your request encapsulated in a run_updates method, but I think it may be worth exploring nonetheless.
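If you still want at most ten requests in flight at once, grequests.map takes a size argument that caps the concurrency (as I recall from its docs), so the manual chunking could be dropped entirely. A rough sketch, reusing the ids from the question:

import grequests

ids = [102225077, 102225085, 102225090]  # the real list would have all 100 ids

urls = ['http://www.example.com/id?={}'.format(id) for id in ids]
requests = (grequests.get(url) for url in urls)

# size=10 caps how many requests run concurrently at any moment
responses = grequests.map(requests, size=10)

for response in responses:
    print(response.content)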

from multiprocessing import Process
from random import random

ids = [random() for _ in range(100)]  # make some fake ids, whatever

def do_thing(arg):
    print arg  # Here's where you'd do the lookup

while ids:
    curs, ids = ids[:10], ids[10:]
    procs = [Process(target=do_thing, args=(c,)) for c in curs]
    for proc in procs:
        proc.start()
    for proc in procs:
        proc.join()
This is roughly how I'd do it, I guess.
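Since run_updates spends nearly all of its time waiting on the network, a plain thread pool is another option worth trying; it avoids spawning a process per id and can keep exactly ten lookups in flight. A minimal sketch on Python 3 (or Python 2 with the futures backport); the run_updates body here is only a stand-in for the real lookup:

import time
from concurrent.futures import ThreadPoolExecutor

def run_updates(id):
    # stand-in for the real ~3 s HTTP lookup
    time.sleep(3)
    return id

ids = [102225077, 102225085, 102225090, 102225097, 102225105]

# ten lookups in flight at a time; results come back in input order
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(run_updates, ids))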

Related

Multithreading for similarity test in Python

Hello, I've been working on a huge CSV file which needs similarity tests done. There are 1.16 million rows, and testing the similarity between each pair of rows takes approximately 7 hours. I want to use multiple threads to reduce the time this takes. My function which does the similarity test is:
def similarity():
    for i in range(0, 1000):
        for j in range(i+1, 1000):
            longestSentence = 0
            commonWords = 0
            row1 = dff['Product'].iloc[i]
            row2 = dff['Product'].iloc[j]
            wordsRow1 = row1.split()
            wordsRow2 = row2.split()
            # words that appear in both sentences
            common = list(set(wordsRow1).intersection(wordsRow2))
            if len(wordsRow1) > len(wordsRow2):
                longestSentence = len(wordsRow1)
                commonWords = calculate(common, wordsRow1)
            else:
                longestSentence = len(wordsRow2)
                commonWords = calculate(common, wordsRow2)
            print(i, j, (commonWords / longestSentence) * 100)

def calculate(common, longestRow):  # count occurrences of the shared words
    sum = 0
    for word in common:
        sum += longestRow.count(word)
    return sum
I am using ThreadPoolExecutor to do multithreading and the code to do so is:
with ThreadPoolExecutor(max_workers=500) as executor:
    for result in executor.map(similarity()):
        print(result)
But even if I set max_workers to incredible amounts, the code runs in the same time. How can I make the code run faster? Is there any other way?
I tried to do it with the threading library, but it doesn't work, because it just starts the threads to do the same job over and over again. So if I use 10 threads, it just starts the function 10 times to do the same thing. Thanks in advance for any help.
ThreadPoolExecutor will not actually help a lot here, because a thread pool is meant for I/O-bound tasks. If you were making, say, 500 API calls it would work, but since you are doing heavy CPU-bound work it does not. You should use ProcessPoolExecutor instead, but also note that setting max_workers higher than the number of cores you have will not gain you anything.
Also, your syntax is incorrect, because executor.map(similarity()) calls the function once up front instead of handing work to the pool.
But I think you need to change your algorithm to make this work properly. There is definitely something wrong with your time complexity.
from concurrent.futures import ProcessPoolExecutor

values = [3, 4, 5, 6]

def cube(x):
    return f'Cube of {x}: {x*x*x}'

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=5) as exe:
        exe.submit(cube, 2)
        # Maps the function 'cube' over an iterable of values
        result = exe.map(cube, values)
        for r in result:
            print(r)
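To apply this to the similarity loop itself, one rough option is to hand each outer index i to a separate process task. A sketch, assuming the 'Product' column has first been pulled into a plain list of strings (products = dff['Product'].tolist()); the function names here are my own:

from concurrent.futures import ProcessPoolExecutor

# toy data; in the real code this would be products = dff['Product'].tolist()
products = ["red apple juice", "green apple juice", "red grape juice"]

def count_common(common, longest_words):
    return sum(longest_words.count(word) for word in common)

def similarity_for_row(i):
    """Compare row i against every later row and return (i, j, score) tuples."""
    scores = []
    words_i = products[i].split()
    for j in range(i + 1, len(products)):
        words_j = products[j].split()
        common = set(words_i).intersection(words_j)
        longest_words = words_i if len(words_i) > len(words_j) else words_j
        score = count_common(common, longest_words) / len(longest_words) * 100
        scores.append((i, j, score))
    return scores

if __name__ == '__main__':
    # one task per outer row; keep max_workers at or below your core count
    with ProcessPoolExecutor(max_workers=4) as executor:
        for row_scores in executor.map(similarity_for_row, range(len(products))):
            for i, j, score in row_scores:
                print(i, j, score)

Even so, the pairwise comparison stays O(n^2), so for 1.16 million rows you will likely still need an algorithmic change, not just more cores.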

Fastest way to build a list by iterating through generator objects

I am using python-gitlab to get a list of GitLab projects, returned as a generator that fetches them in batches of 100. If a project has a tag of "snow" I want to add it to a list that will get converted to a JSON object. Here is the code I have that does this:
gl_data = []
gl_prj_list = gl_conn.projects.list(as_list=False)
for p in gl_prj_list:
    if "snow" in p.tag_list:
        prj = {"id": p.id}
        prj["name"] = p.path_with_namespace
        gl_data.append(prj)
return json.dumps(gl_data), 200, {'Content-Type': 'text/plain'}
So ultimately I want a result that might look like this: (only 2 of the 100 projects had the snow tag)
[{"id": 7077, "name": "robr/snow-cli"}, {"id": 4995, "name": "test/prod-deploy-iaas-spring-starter"}]
This works fine and all but seems a bit slow. The response time is usually between 3.5-5 seconds. And since I will have to do this over 10-20 batches I'd like to improve on the response time.
Is there a better way to check for the "snow" value in the tag_list attribute of the generator and return the result?
Assuming that the bottleneck is not the API call, you can use multiprocessing.Pool() for this.
from multiprocessing import Pool

def f(p):
    if "snow" in p.tag_list:
        return {"id": p.id, "name": p.path_with_namespace}
    return False

gl_prj_list = gl_conn.projects.list(as_list=False)
with Pool(10) as pool:  # 10 processes in parallel (change this to the number of cores you have available)
    gl_data = pool.map(f, gl_prj_list)
gl_data = [i for i in gl_data if i]  # get rid of the False items
json.dumps(gl_data), 200, {'Content-Type': 'text/plain'}
If the bottleneck is the API call and you want to call the API multiple times, then add the call inside f() and use the same trick. You will call the API 10 times in parallel instead of sequentially.
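For example, if the slow part is fetching the project pages themselves, the pages could be fetched in parallel. A rough sketch (I'm assuming here that python-gitlab forwards the usual page/per_page pagination parameters and that gl_conn is set up as in your code; the page count of 20 is just an example):

from multiprocessing import Pool

def fetch_and_filter(page):
    # fetch one page of 100 projects and keep only the "snow" ones
    projects = gl_conn.projects.list(page=page, per_page=100)
    return [{"id": p.id, "name": p.path_with_namespace}
            for p in projects if "snow" in p.tag_list]

if __name__ == '__main__':
    with Pool(10) as pool:
        pages = pool.map(fetch_and_filter, range(1, 21))
    gl_data = [prj for page in pages for prj in page]

Since this is I/O-bound, multiprocessing.dummy.Pool (the thread-based version with the same API) would work just as well and avoids pickling the GitLab objects.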

Replacing foreach with threading

My program basically has to get around 6000 items from the DB and call an external API for each item. This takes almost 30 minutes to complete. I thought of using threads here, where I could create multiple threads and split the work to reduce the time, so I came up with something like this. But I have two questions here: how do I store the response from the API that is processed by the function?
api = externalAPI()
for x in instruments:
    response = api.getProcessedItems(x.symbol, days, return_perc)
    if response > float(return_perc):
        return_response.append([x.trading_symbol, x.name, response])
So in the above example the for loop runs 6000 times (len(instruments) == 6000).
Now let's say I have split the 6000 items into 2 * 3000 items and do something like this:
class externalApi:
    def handleThread(self, symbol, days, perc):
        # I call the external API and process the items
        # how do I store the processed data?
        pass

    def getProcessedItems(self, symbol, days, perc):
        _thread.start_new_thread(self.handleThread, (symbol, days, perc))
        _thread.start_new_thread(self.handleThread, (symbol, days, perc))
        return self.thread_response
I am just starting out with threads. It would be helpful to know whether this is the right thing to do to reduce the time here.
P.S.: Time is important here. I want to reduce it from 30 minutes to 1 minute.
I suggest using the worker-queue pattern, like so: you have a queue of jobs; each worker takes a job from it, works on it, and puts the result on another queue; when all workers are done, the result queue is read and the results are processed.
import Queue
import threading

def worker(pool, result_q):
    while True:
        job = pool.get()
        result = handle(job)  # handle the job
        result_q.put(result)
        pool.task_done()

q = Queue.Queue()
res_q = Queue.Queue()

for i in range(num_worker_threads):
    t = threading.Thread(target=worker, args=(q, res_q))
    t.setDaemon(True)
    t.start()

for job in jobs:
    q.put(job)
q.join()

while not res_q.empty():
    result = res_q.get()
    # do something with the result
The worker-queue pattern suggested in shahaf's answer works fine, but Python provides even higher-level abstractions in concurrent.futures, namely a ThreadPoolExecutor, which takes care of the queueing and starting of threads for you:
from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(max_workers=30)
responses = executor.map(process_item, (x.symbol for x in instruments))
The main complication with using executor.map() is that it can only map over one argument, meaning that there can be only one input to process_item (namely symbol).
However, if more arguments are needed, it is possible to define a new function which fixes all arguments but one. This can be done either manually or using Python's partial, found in functools:
from functools import partial
process_item = partial(api.handleThread, days=days, perc=return_perc)
Applying the ThreadPoolExecutor strategy to your current problem would then give a solution similar to:
from concurrent.futures import ThreadPoolExecutor
from functools import partial

class Instrument:
    def __init__(self, symbol, name):
        self.symbol = symbol
        self.name = name

instruments = [Instrument('SMB', 'Name'), Instrument('FNK', 'Funky')]

class externalApi:
    def handleThread(self, symbol, days, perc):
        # Call the external API and process the items
        # Example, to give something back:
        if symbol == 'FNK':
            return days*3
        else:
            return days

def process_item_generator(api, days, perc):
    return partial(api.handleThread, days=days, perc=perc)

days = 5
return_perc = 10
api = externalApi()
process_item = process_item_generator(api, days, return_perc)

executor = ThreadPoolExecutor(max_workers=30)
responses = executor.map(process_item, (x.symbol for x in instruments))

return_response = ([x.symbol, x.name, response]
                   for x, response in zip(instruments, responses)
                   if response > float(return_perc))
Here I have assumed that x.symbol is the same as x.trading_symbol and I have made a dummy implementation of your API call, to get some type of return value, but it should give a good idea of how to do this. Due to this, the code is a bit longer, but then again, it becomes a runnable example.

Python multiprocessing Pool.map not faster than calling the function once

I have a very large list of strings (originally from a text file) that I need to process using python. Eventually I am trying to go for a map-reduce style of parallel processing.
I have written a "mapper" function and fed it to multiprocessing.Pool.map(), but it takes the same amount of time as simply calling the mapper function with the full set of data. I must be doing something wrong.
I have tried multiple approaches, all with similar results.
def initial_map(lines):
    results = []
    for line in lines:
        processed = # process line (O(1) operation)
        results.append(processed)
    return results

def chunks(l, n):
    for i in xrange(0, len(l), n):
        yield l[i:i+n]

if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    pool = Pool(processes=8)
    partitions = chunks(lines, len(lines)/8)
    results = pool.map(initial_map, partitions, 1)
So the chunks function makes a list of sublists of the original set of lines to give to the pool.map(), then it should hand these 8 sublists to 8 different processes and run them through the mapper function. When I run this I can see all 8 of my cores peak at 100%. Yet it takes 22-24 seconds.
When I simply run this (single process/thread):
lines = list(open("../../log.txt", 'r'))
results = initial_map(lines)
It takes about the same amount of time. ~24 seconds. I only see one process getting to 100% CPU.
I have also tried letting the pool split up the lines itself and have the mapper function only handle one line at a time, with similar results.
def initial_map(line):
    processed = # process line (O(1) operation)
    return processed

if __name__ == "__main__":
    lines = list(open("../../log.txt", 'r'))
    pool = Pool(processes=8)
    pool.map(initial_map, lines)
~22 seconds.
Why is this happening? Parallelizing this should result in faster results, shouldn't it?
If the amount of work done in one iteration is very small, you're spending a big proportion of the time just communicating with your subprocesses, which is expensive. Instead, try to pass bigger slices of your data to the processing function. Something like the following:
slices = (data[i:i+100] for i in range(0, len(data), 100))

def process_slice(data):
    return [initial_map(x) for x in data]

pool.map(process_slice, slices)
# and then itertools.chain the output to flatten it
(don't have my comp. so can't give you a full working solution nor verify what I said)
Edit: or see the 3rd comment on your question by @ubomb.
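For what it's worth, a fleshed-out version of that slicing idea might look like the following (untested against the real log file; process_line is a placeholder for whatever the real per-line work is):

import itertools
from multiprocessing import Pool

def process_line(line):
    return len(line)  # placeholder for the real O(1) per-line processing

def process_slice(slice_of_lines):
    return [process_line(line) for line in slice_of_lines]

if __name__ == "__main__":
    with open("../../log.txt") as f:
        lines = f.readlines()

    slices = [lines[i:i + 100] for i in range(0, len(lines), 100)]

    pool = Pool(processes=8)
    nested = pool.map(process_slice, slices)
    results = list(itertools.chain.from_iterable(nested))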

Python urllib3 and proxy

I am trying to figure out how to use proxy and multithreading.
This code works:
requester = urllib3.PoolManager(maxsize = 10, headers = self.headers)
thread_pool = workerpool.WorkerPool()
thread_pool.map(grab_wrapper, [item['link'] for item in products])
thread_pool.shutdown()
thread_pool.wait()
Then in grab_wrapper
requested_page = requester.request('GET', url, assert_same_host = False, headers = self.headers)
Headers consist of: Accept, Accept-Charset, Accept-Encoding, Accept-Language and User-Agent
But this does not work in production, since it has to go through a proxy (no authorization is required).
I tried different things (passing proxies to request, in headers, etc.). The only thing that works is this:
requester = urllib3.proxy_from_url(self._PROXY_URL, maxsize = 7, headers = self.headers)
thread_pool = workerpool.WorkerPool(size = 10)
thread_pool.map(grab_wrapper, [item['link'] for item in products])
thread_pool.shutdown()
thread_pool.wait()
Now, when I run the program, it will make 10 requests (10 threads) and then... stop. No error, no warning whatsoever. This is the only way I can get through the proxy, but it seems like it's not possible to use proxy_from_url and WorkerPool together.
Any ideas how to combine those two into working code? I would rather avoid rewriting it in Scrapy, etc. due to time limitations.
Regards
First of all, I would suggest avoiding urllib like the plague and instead using requests, which has really easy support for proxies: http://docs.python-requests.org/en/latest/user/advanced/#proxies
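For reference, passing a proxy with requests is just a dict argument (this mirrors the example in the requests docs; the proxy addresses are placeholders):

import requests

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
response = requests.get('http://example.org', proxies=proxies)
print(response.status_code)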
Besides that, I haven't used it with multithreading but with multiprocessing, and that worked really well. The only thing you have to figure out is whether you have a dynamic queue or a fairly fixed list that you can spread over workers. Here is an example of the latter, which spreads a list of URLs evenly over x processes:
import math
import multiprocessing

# *** prepare multiprocessing
nr_processes = 4
chunksize = int(math.ceil(total_nr_urls / float(nr_processes)))
procs = []

# *** start up processes
for i in range(nr_processes):
    start_row = chunksize * i
    end_row = min(chunksize * (i + 1), total_nr_urls)
    p = multiprocessing.Process(
        target=url_loop,
        args=(start_row, end_row, str(i), job_id_input))
    procs.append(p)
    p.start()

# *** wait for all worker processes to finish
for p in procs:
    p.join()
Every url_loop process writes its own set of data to tables in a database, so I don't have to worry about joining it back together in Python.
Edit: On sharing data between processes ->
For details see: http://docs.python.org/2/library/multiprocessing.html?highlight=multiprocessing#multiprocessing
from multiprocessing import Process, Value, Array

def f(n, a):
    n.value = 3.1415927
    for i in range(len(a)):
        a[i] = -a[i]

if __name__ == '__main__':
    num = Value('d', 0.0)
    arr = Array('i', range(10))
    p = Process(target=f, args=(num, arr))
    p.start()
    p.join()
    print num.value
    print arr[:]
But as you can see, these special types (Value and Array) basically enable sharing of data between the processes.
If you are instead looking for a queue to do round-robin-like processing, you can use a JoinableQueue.
Hope this helps!
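For the JoinableQueue variant mentioned above, a rough sketch might look like this (the worker body is just a placeholder for the real fetch/parse work):

from multiprocessing import JoinableQueue, Process, Queue

def worker(jobs, results):
    while True:
        url = jobs.get()
        results.put(len(url))  # placeholder for the real request/parse work
        jobs.task_done()

if __name__ == '__main__':
    jobs, results = JoinableQueue(), Queue()
    for _ in range(4):
        Process(target=worker, args=(jobs, results), daemon=True).start()
    for url in ['http://example.org/a', 'http://example.org/b']:
        jobs.put(url)
    jobs.join()  # blocks until every job has been marked task_done()
    while not results.empty():
        print(results.get())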
It seems you are discarding the result of the call to thread_pool.map()
Try assigning it to a variable:
requester = urllib3.proxy_from_url(PROXY, maxsize=7)
thread_pool = workerpool.WorkerPool(size=10)
def grab_wrapper(url):
    return requester.request('GET', url)
results = thread_pool.map(grab_wrapper, LINKS)
thread_pool.shutdown()
thread_pool.wait()
Note:
If you are using Python 3.2 or greater, you can use concurrent.futures.ThreadPoolExecutor. It is quite similar to workerpool but is included in the standard library.
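For completeness, a ThreadPoolExecutor version of the snippet above might look roughly like this (PROXY and LINKS are placeholder values, as before):

from concurrent.futures import ThreadPoolExecutor

import urllib3

PROXY = 'http://10.10.1.10:3128'                          # placeholder proxy URL
LINKS = ['http://example.org/a', 'http://example.org/b']  # placeholder links

requester = urllib3.proxy_from_url(PROXY, maxsize=7)

def grab_wrapper(url):
    return requester.request('GET', url)

with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(grab_wrapper, LINKS))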
