I am trying to figure out how to use proxy and multithreading.
This code works:
requester = urllib3.PoolManager(maxsize = 10, headers = self.headers)
thread_pool = workerpool.WorkerPool()
thread_pool.map(grab_wrapper, [item['link'] for item in products])
thread_pool.shutdown()
thread_pool.wait()
Then in grab_wrapper
requested_page = requester.request('GET', url, assert_same_host = False, headers = self.headers)
Headers consist of: Accept, Accept-Charset, Accept-Encoding, Accept-Language and User-Agent
But this does not work in production, since it has to pass proxy, no authorization is required.
I tried different things (passing proxies to request, in headers, etc.). The only thing that works is this:
requester = urllib3.proxy_from_url(self._PROXY_URL, maxsize = 7, headers = self.headers)
thread_pool = workerpool.WorkerPool(size = 10)
thread_pool.map(grab_wrapper, [item['link'] for item in products])
thread_pool.shutdown()
thread_pool.wait()
Now, when I run the program, it will make 10 requests (10 threads) and then... stop. No error, no warning whatsoever. This is the only way I can bypass proxy, but it seems like its not possible to use proxy_from_url and WorkerPool together.
Any ideas how to combine those two into a working code? I would rather avoid rewriting it into scrappy, etc. due to time limitation
Regards
first of all I would suggest to avoid urllib like the plague and instead use requests, which has really easy support for proxies: http://docs.python-requests.org/en/latest/user/advanced/#proxies
Next to that, I haven't use it with multi-threading but with multi-processing and that worked really well, the only thing that you have to figure out is whether you have a dynamic queue or a fairly fixed list that you can spread over workers, an example of the latter which spreads a list of urls evenly over x processes:
# *** prepare multi processing
nr_processes = 4
chunksize = int(math.ceil(total_nr_urls / float(nr_processes)))
procs = []
# *** start up processes
for i in range(nr_processes):
start_row = chunksize * i
end_row = min(chunksize * (i + 1), total_nr_store)
p = multiprocessing.Process(
target=url_loop,
args=(start_row, end_row, str(i), job_id_input))
procs.append(p)
p.start()
# *** Wait for all worker processes to finish
for p in procs:
p.join()
every url_loop process writes away its own sets of data to tables in a database, so I don't have to worry about joining it together in python.
Edit: On sharing data between processes ->
For details see: http://docs.python.org/2/library/multiprocessing.html?highlight=multiprocessing#multiprocessing
from multiprocessing import Process, Value, Array
def f(n, a):
n.value = 3.1415927
for i in range(len(a)):
a[i] = -a[i]
if __name__ == '__main__':
num = Value('d', 0.0)
arr = Array('i', range(10))
p = Process(target=f, args=(num, arr))
p.start()
p.join()
print num.value
print arr[:]
But as you see, basically these special types (Value & Array) enable sharing of data between the processes.
If you instead look for a queue to do a roundrobin like process, you can use JoinableQueue.
Hope this helps!
It seems you are discarding the result of the call to thread_pool.map()
Try assigning it to a variable:
requester = urllib3.proxy_from_url(PROXY, maxsize=7)
thread_pool = workerpool.WorkerPool(size=10)
def grab_wrapper(url):
return requester.request('GET', url)
results = thread_pool.map(grab_wrapper, LINKS)
thread_pool.shutdown()
thread_pool.wait()
Note:
If you are using a python 3.2 or greater you can use concurrent.futures.ThreadPoolExecutor. It is quote similar to workerpool but is included in the standard library.
Related
I'm performing analyses of time-series of simulations. Basically, it's doing the same tasks for every time steps. As there is a very high number of time steps, and as the analyze of each of them is independant, I wanted to create a function that can multiprocess another function. The latter will have arguments, and return a result.
Using a shared dictionnary and the lib concurrent.futures, I managed to write this :
import concurrent.futures as Cfut
def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
# function : function that is running in parallel
# param_list : list of items
# group_size : size of the groups
# Nworkers : number of group/items running in the same time
# **param_fixed : passing parameters
manager = mlp.Manager()
dic = manager.dict()
executor = Cfut.ProcessPoolExecutor(Nworkers)
futures = [executor.submit(function, param, dic, *args)
for param in grouper(param_list, group_size)]
Cfut.wait(futures)
return [dic[i] for i in sorted(dic.keys())]
Typically, I can use it like this :
def read_file(files, dictionnary):
for file in files:
i = int(file[4:9])
#print(str(i))
if 'bz2' in file:
os.system('bunzip2 ' + file)
file = file[:-4]
dictionnary[i] = np.loadtxt(file)
os.system('bzip2 ' + file)
Map = np.array(multiprocess_loop_grouped(read_file, list_alti, Group_size, N_thread))
or like this :
def autocorr(x):
result = np.correlate(x, x, mode='full')
return result[result.size//2:]
def find_lambda_finger(indexes, dic, Deviation):
for i in indexes :
#print(str(i))
# Beach = Deviation[i,:] - np.mean(Deviation[i,:])
dic[i] = Anls.find_first_max(autocorr(Deviation[i,:]), valmax = True)
args = [Deviation]
Temp = Rescal.multiprocess_loop_grouped(find_lambda_finger, range(Nalti), Group_size, N_thread, *args)
Basically, it is working. But it is not working well. Sometimes it crashes. Sometimes it actually launches a number of python processes equal to Nworkers, and sometimes there is only 2 or 3 of them running at a time while I specified Nworkers = 15.
For example, a classic error I obtain is described in the following topic I raised : Calling matplotlib AFTER multiprocessing sometimes results in error : main thread not in main loop
What is the more Pythonic way to achieve what I want ? How can I improve the control this function ? How can I control more the number of running python process ?
One of the basic concepts for Python multi-processing is using queues. It works quite well when you have an input list that can be iterated and which does not need to be altered by the sub-processes. It also gives you a good control over all the processes, because you spawn the number you want, you can run them idle or stop them.
It is also a lot easier to debug. Sharing data explicitly is usually an approach that is much more difficult to setup correctly.
Queues can hold anything as they are iterables by definition. So you can fill them with filepath strings for reading files, non-iterable numbers for doing calculations or even images for drawing.
In your case a layout could look like that:
import multiprocessing as mp
import numpy as np
import itertools as it
def worker1(in_queue, out_queue):
#holds when nothing is available, stops when 'STOP' is seen
for a in iter(in_queue.get, 'STOP'):
#do something
out_queue.put({a: result}) #return your result linked to the input
def worker2(in_queue, out_queue):
for a in iter(in_queue.get, 'STOP'):
#do something differently
out_queue.put({a: result}) //return your result linked to the input
def multiprocess_loop_grouped(function, param_list, group_size, Nworkers, *args):
# your final result
result = {}
in_queue = mp.Queue()
out_queue = mp.Queue()
# fill your input
for a in param_list:
in_queue.put(a)
# stop command at end of input
for n in range(Nworkers):
in_queue.put('STOP')
# setup your worker process doing task as specified
process = [mp.Process(target=function,
args=(in_queue, out_queue), daemon=True) for x in range(Nworkers)]
# run processes
for p in process:
p.start()
# wait for processes to finish
for p in process:
p.join()
# collect your results from the calculations
for a in param_list:
result.update(out_queue.get())
return result
temp = multiprocess_loop_grouped(worker1, param_list, group_size, Nworkers, *args)
map = multiprocess_loop_grouped(worker2, param_list, group_size, Nworkers, *args)
It can be made a bit more dynamic when you are afraid that your queues will run out of memory. Than you need to fill and empty the queues while the processes are running. See this example here.
Final words: it is not more Pythonic as you requested. But it is easier to understand for a newbie ;-)
My program basically has to get around 6000 items from the DB and calls an external API for each item. This almost takes 30 min to complete. I just thought of using threads here where i could create multi threads and split the process and reduce the time. So i came up with some thing like this. But I have two questions here. How do i store the response from the API that is processed by the function.
api = externalAPI()
for x in instruments:
response = api.getProcessedItems(x.symbol, days, return_perc);
if(response > float(return_perc)):
return_response.append([x.trading_symbol, x.name, response])
So in the above example the for loop runs for 6000 times(len(instruments) == 6000)
Now lets take i have splited the 6000 items to 2 * 3000 items and do something like this
class externalApi:
def handleThread(self, symbol, days, perc):
//I call the external API and process the items
// how do i store the processed data
def getProcessedItems(self,symbol, days, perc):
_thread.start_new_thread(self.handleThread, (symbol, days, perc))
_thread.start_new_thread(self.handleThread, (symbol, days, perc))
return self.thread_response
I am just starting out with thread. would be helpful if i know this is the right thing to do to reduce the time here.
P.S : Time is important here. I want to reduce it to 1 min from 30 min.
I suggest using worker-queue pattern like so...
you will have a queue of jobs, each worker will take a job and work on it, the result it will put at another queue, when all workers are done, the result queue will be read and process the results
def worker(pool, result_q):
while True:
job = pool.get()
result = handle(job) #handle job
result_q.put(result)
pool.task_done()
q = Queue.Queue()
res_q = Queue.Queue()
for i in range(num_worker_threads):
t = threading.Thread(target=worker, args=(q, res_q))
t.setDaemon(True)
t.start()
for job in jobs:
q.put(job)
q.join()
while not res_q.empty():
result = res_q.get()
# do smth with result
The worker-queue pattern suggested in shahaf's answer works fine, but Python provides even higher level abstractions, in concurret.futures. Namely a ThreadPoolExecutor, which will take care of the queueing and starting of threads for you:
from concurrent.futures import ThreadPoolExecutor
executor = ThreadPoolExecutor(max_workers=30)
responses = executor.map(process_item, (x.symbol for x in instruments))
The main complication with using the excutor.map() is that it can only map over one argument, meaning that there can be only one input to proces_item namely symbol).
However, if more arguments are needed, it is possible to define a new function, which will fixate all arguments but one. This can either be done manually or using the special Python partial call, found in functools:
from functools import partial
process_item = partial(api.handleThread, days=days, perc=return_perc)
Applying the ThreadPoolExecutor strategy to your current probelm would then have a solution similar to:
from concurrent.futures import ThreadPoolExecutor
from functools import partial
class Instrument:
def __init__(self, symbol, name):
self.symbol = symbol
self.name = name
instruments = [Instrument('SMB', 'Name'), Instrument('FNK', 'Funky')]
class externalApi:
def handleThread(self, symbol, days, perc):
# Call the external API and process the items
# Example, to give something back:
if symbol == 'FNK':
return days*3
else:
return days
def process_item_generator(api, days, perc):
return partial(api.handleThread, days=days, perc=perc)
days = 5
return_perc = 10
api = externalApi()
process_item = process_item_generator(api, days, return_perc)
executor = ThreadPoolExecutor(max_workers=30)
responses = executor.map(process_item, (x.symbol for x in instruments))
return_response = ([x.symbol, x.name, response]
for x, response in zip(instruments, responses)
if response > float(return_perc))
Here I have assumed that x.symbol is the same as x.trading_symbol and I have made a dummy implementation of your API call, to get some type of return value, but it should give a good idea of how to do this. Due to this, the code is a bit longer, but then again, it becomes a runnable example.
I have a nested for loop of the form
while x<lat2[0]:
while y>lat3[1]:
if (is_inside_nepal([x,y])):
print("inside")
else:
print("not")
y = y - (1/150.0)
y = lat2[1]
x = x + (1/150.0)
#here lat2[0] represents a large number
Now this normally takes around 50s for executing.
And I have changed this loop to a multiprocessing code.
def v1find_coordinates(q):
while not(q.empty()):
x1 = q.get()
x2 = x1 + incfactor
while x1<x2:
def func(x1):
while y>lat3[1]:
if (is_inside([x1,y])):
print x1,y,"inside"
else:
print x1,y,"not inside"
y = y - (1/150.0)
func(x1)
y = lat2[1]
x1 = x1 + (1/150.0)
incfactor = 0.7
xvalues = drange(x,lat2[0],incfactor)
#this drange function is to get list with increment factor as decimal
cores = mp.cpu_count()
q = Queue()
for i in xvalues:
q.put(i)
for i in range(0,cores):
p = Process(target = v1find_coordinates,args=(q,) )
p.start()
p.Daemon = True
processes.append(p)
for i in processes:
print ("now joining")
i.join()
This multiprocessing code also takes around 50s execution time. This means there is no difference of time between the two.
I also have tried using pools. I have also managed the chunk size. I have googled and searched through other stackoverflow. But can't find any satisfying answer.
The only answer I could find was time was taken in process management to make both the result same. If this is the reason then how can I get the multiprocessing work to obtain faster results?
Will implementing in C from Python give faster results?
I am not expecting drastic results but by common sense one can tell that running on 4 cores should be a lot faster than running in 1 core. But I am getting similar results. Any kind of help would be appreciated.
You seem to be using a thread Queue (from Queue import Queue). This does not work as expected as Process uses fork() and it clones the entire Queue into each worker Process
Use:
from multiprocessing import Queue
I have a list of 100 ids, and I need to do a lookup for each one of them. The lookup takes approximate 3s to run. Here is the sequential code that would be needed to run it:
ids = [102225077, 102225085, 102225090, 102225097, 102225105, ...]
for id in ids:
run_updates(id)
I would like to run ten (10) of these concurrently at a time, using either gevent or multiprocessor. How would I do this? Here is what I tried for gevent but it's quite slow:
def chunks(l, n):
""" Yield successive n-sized chunks from l.
"""
for i in xrange(0, len(l), n):
yield l[i:i+n]
ids = [102225077, 102225085, 102225090, 102225097, 102225105, ...]
if __name__ == '__main__':
for list_of_ids in list(chunks(ids, 10)):
jobs = [gevent.spawn(run_updates(id)) for id in list_of_ids]
gevent.joinall(jobs, timeout=200)
What would be the correct way to split up the ids list and run ten-at-a-time? I would even be open to using multiprocessor or gevent (not too familiar with either).
Doing it sequentially takes 364 seconds for 100 ids.
Using multiprocessor takes about 207 seconds on 100 ids, doing 5 at a time:
pool = Pool(processes=5)
pool.map(run_updates, list_of_apple_ids)
Using gevent takes somewhere in between the two:
jobs = [gevent.spawn(run_updates, apple_id) for apple_id in list_of_apple_ids]
Is there any way I can get better performance than the Pool.map? I have a pretty decent computer here with a fast internet connection, it should be able to do it much quicker...
Check out the grequests library. You can do something like:
import grequests
for list_of_ids in list(chunks(ids, 10)):
urls = [''.join(('http://www.example.com/id?=', id)) for id in list_of_ids]
requests = (grequests.get(url) for url in urls)
responses = grequests.map(requests)
for response in responses:
print response.content
I know this breaks your model somewhat because you have your request encapsulated in a run_updates method, but I think it may be worth exploring nonetheless.
from multiprocessing import Process
from random import Random.random
ids = [random() for _ in range(100)] # make some fake ids, whatever
def do_thing(arg):
print arg # Here's where you'd do lookup
while ids:
curs, ids = ids[:10], ids[10:]
procs = [Process(target=do_thing, args=(c,)) for c in curs]
for proc in procs:
proc.run()
This is roughly how I'd do it, I guess.
I have been fiddling with Python's multiprocessing functionality for upwards of an hour now, trying to parallelize a rather complex graph traversal function using multiprocessing.Process and multiprocessing.Manager:
import networkx as nx
import csv
import time
from operator import itemgetter
import os
import multiprocessing as mp
cutoff = 1
exclusionlist = ["cpd:C00024"]
DG = nx.read_gml("KeggComplete.gml", relabel=True)
for exclusion in exclusionlist:
DG.remove_node(exclusion)
# checks if 'memorizedPaths exists, and if not, creates it
fn = os.path.join(os.path.dirname(__file__),
'memorizedPaths' + str(cutoff+1))
if not os.path.exists(fn):
os.makedirs(fn)
manager = mp.Manager()
memorizedPaths = manager.dict()
filepaths = manager.dict()
degreelist = sorted(DG.degree_iter(),
key=itemgetter(1),
reverse=True)
def _all_simple_paths_graph(item, DG, cutoff, memorizedPaths, filepaths):
source = item[0]
uniqueTreePaths = []
if cutoff < 1:
return
visited = [source]
stack = [iter(DG[source])]
while stack:
children = stack[-1]
child = next(children, None)
if child is None:
stack.pop()
visited.pop()
elif child in memorizedPaths:
for path in memorizedPaths[child]:
newPath = (tuple(visited) + tuple(path))
if (len(newPath) <= cutoff) and
(len(set(visited) & set(path)) == 0):
uniqueTreePaths.append(newPath)
continue
elif len(visited) < cutoff:
if child not in visited:
visited.append(child)
stack.append(iter(DG[child]))
if visited not in uniqueTreePaths:
uniqueTreePaths.append(tuple(visited))
else: # len(visited) == cutoff:
if (visited not in uniqueTreePaths) and
(child not in visited):
uniqueTreePaths.append(tuple(visited + [child]))
stack.pop()
visited.pop()
# writes the absolute path of the node path file into the hash table
filepaths[source] = str(fn) + "/" + str(source) + "path.txt"
with open (filepaths[source], "wb") as csvfile2:
writer = csv.writer(csvfile2, delimiter=" ", quotechar="|")
for path in uniqueTreePaths:
writer.writerow(path)
memorizedPaths[source] = uniqueTreePaths
############################################################################
if __name__ == '__main__':
start = time.clock()
for item in degreelist:
test = mp.Process(target=_all_simple_paths_graph,
args=(DG, cutoff, item, memorizedPaths, filepaths))
test.start()
test.join()
end = time.clock()
print (end-start)
Currently - though luck and magic - it works (sort of). My problem is I'm only using 12 of my 24 cores.
Can someone explain why this might be the case? Perhaps my code isn't the best multiprocessing solution, or is it a feature of my architecture Intel Xeon CPU E5-2640 # 2.50GHz x18 running on Ubuntu 13.04 x64?
EDIT:
I managed to get:
p = mp.Pool()
for item in degreelist:
p.apply_async(_all_simple_paths_graph,
args=(DG, cutoff, item, memorizedPaths, filepaths))
p.close()
p.join()
Working, however, it's VERY SLOW! So I assume I'm using the wrong function for the job. hopefully it helps clarify exactly what I'm trying to accomplish!
EDIT2: .map attempt:
partialfunc = partial(_all_simple_paths_graph,
DG=DG,
cutoff=cutoff,
memorizedPaths=memorizedPaths,
filepaths=filepaths)
p = mp.Pool()
for item in processList:
processVar = p.map(partialfunc, xrange(len(processList)))
p.close()
p.join()
Works, is slower than singlecore. Time to optimize!
Too much piling up here to address in comments, so, where mp is multiprocessing:
mp.cpu_count() should return the number of processors. But test it. Some platforms are funky, and this info isn't always easy to get. Python does the best it can.
If you start 24 processes, they'll do exactly what you tell them to do ;-) Looks like mp.Pool() would be most convenient for you. You pass the number of processes you want to create to its constructor. mp.Pool(processes=None) will use mp.cpu_count() for the number of processors.
Then you can use, for example, .imap_unordered(...) on your Pool instance to spread your degreelist across processes. Or maybe some other Pool method would work better for you - experiment.
If you can't bash the problem into Pool's view of the world, you could instead create an mp.Queue to create a work queue, .put()'ing nodes (or slices of nodes, to reduce overhead) to work on in the main program, and write the workers to .get() work items off that queue. Ask if you need examples. Note that you need to put sentinel values (one per process) on the queue, after all the "real" work items, so that worker processes can test for the sentinel to know when they're done.
FYI, I like queues because they're more explicit. Many others like Pools better because they're more magical ;-)
Pool Example
Here's an executable prototype for you. This shows one way to use imap_unordered with Pool and chunksize that doesn't require changing any function signatures. Of course you'll have to plug in your real code ;-) Note that the init_worker approach allows passing "most of" the arguments only once per processor, not once for every item in your degreeslist. Cutting the amount of inter-process communication can be crucial for speed.
import multiprocessing as mp
def init_worker(mps, fps, cut):
global memorizedPaths, filepaths, cutoff
global DG
print "process initializing", mp.current_process()
memorizedPaths, filepaths, cutoff = mps, fps, cut
DG = 1##nx.read_gml("KeggComplete.gml", relabel = True)
def work(item):
_all_simple_paths_graph(DG, cutoff, item, memorizedPaths, filepaths)
def _all_simple_paths_graph(DG, cutoff, item, memorizedPaths, filepaths):
pass # print "doing " + str(item)
if __name__ == "__main__":
m = mp.Manager()
memorizedPaths = m.dict()
filepaths = m.dict()
cutoff = 1 ##
# use all available CPUs
p = mp.Pool(initializer=init_worker, initargs=(memorizedPaths,
filepaths,
cutoff))
degreelist = range(100000) ##
for _ in p.imap_unordered(work, degreelist, chunksize=500):
pass
p.close()
p.join()
I strongly advise running this exactly as-is, so you can see that it's blazing fast. Then add things to it a bit a time, to see how that affects the time. For example, just adding
memorizedPaths[item] = item
to _all_simple_paths_graph() slows it down enormously. Why? Because the dict gets bigger and bigger with each addition, and this process-safe dict has to be synchronized (under the covers) among all the processes. The unit of synchronization is "the entire dict" - there's no internal structure the mp machinery can exploit to do incremental updates to the shared dict.
If you can't afford this expense, then you can't use a Manager.dict() for this. Opportunities for cleverness abound ;-)