I'm curious if threading is the right approach for my use case. I'm working on a genetic algorithm, which needs to evaluate the fitness of genes 1...n. The evaluation of each gene is independent of the others for the most part. Yet each gene will be passed through the same function, eval(gene).
My intention is that once all genes have been evaluated, I will sort by fitness, and only retain top x.
From this tutorial, it seems that I should be able to do the following, where the specifics of eval are out of scope for this question, but suppose each function call updates a common dictionary of the form {gene: fitness}:
for gene in gene_pool:
    thread_i = threading.Thread(target=eval(gene), name=f"fitness gene{i}")
    thread_i.start()

for i in range(len(genes)):
    thread_i.join()
In the tutorial, I don't see the code actually invoking the function, eval(), but rather just referencing its name, eval. I'm not sure if this will be problematic for my use case.
My first question is: Is this the right approach? Should I consider a different approach to threading?
I don't believe that I will need to account for race conditions or locks because, while every thread will update the same dictionary, the keys and values will be independent.
And my last question: Is multiprocessing generally a better bet? It seems to be a bit higher level, which might be ideal for someone new to parallelization.
In Python, threading is constrained by the GIL, so it is very limited performance-wise. For IO-bound code (reading/writing files, network requests, ...), async is the way to go.
But from what you explain, your code is rather CPU-bound (computing many things). If you want your code to go fast, you need to circumvent the Python GIL. There are two main ways:
multiprocessing (having multiple different Python processes in parallel)
or calling code written in lower-level languages (Cython, C, ...), typically wrapped in a nice library
If you want something simple, stick to multiprocessing: at the start, create a pool whose size is the number of competing genes (N), then at each iteration submit N new tasks to the pool and wait for their results (the pool.map function); repeat as many times as you want.
I think it is the simplest way to get full parallelization, which should give you decent speed.
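To make that concrete, here is a minimal sketch (my own, not code from the question) of one generation evaluated with a pool. The names eval_fitness, gene_pool and top_x are hypothetical stand-ins, and the evaluation function is deliberately not called eval, to avoid shadowing the Python builtin.

from multiprocessing import Pool

def eval_fitness(gene):
    # placeholder for the real CPU-heavy evaluation; independent for each gene
    return sum(gene)

if __name__ == "__main__":
    gene_pool = [[1, 2, 3], [4, 5, 6], [0, 1, 1]]      # hypothetical genes
    top_x = 2                                          # how many to keep

    with Pool() as pool:                               # defaults to one worker per CPU core
        fitnesses = pool.map(eval_fitness, gene_pool)  # one score per gene, same order

    ranked = sorted(zip(gene_pool, fitnesses), key=lambda gf: gf[1], reverse=True)
    survivors = [gene for gene, _ in ranked[:top_x]]
    print(survivors)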
I am trying to run a sort of simulation where there are fixed parameters I need to iterate over, to find the combination that has the least cost. I am using Python multiprocessing for this purpose, but the time consumed is too high. Is there something wrong with my implementation? Or is there a better solution? Thanks in advance.
import multiprocessing

class Iters(object):
    # parameters for iterations
    iters = {}
    iters['cwm'] = {'min': 100, 'max': 130, 'step': 5}
    iters['fx'] = {'min': 1.45, 'max': 1.45, 'step': 0.01}
    iters['lvt'] = {'min': 106, 'max': 110, 'step': 1}
    iters['lvw'] = {'min': 9.2, 'max': 10, 'step': 0.1}
    iters['lvk'] = {'min': 3.3, 'max': 4.3, 'step': 0.1}
    iters['hvw'] = {'min': 1, 'max': 2, 'step': 0.1}
    iters['lvh'] = {'min': 6, 'max': 7, 'step': 1}

    def run_mp(self):
        mps = []
        m = multiprocessing.Manager()
        q = m.list()
        cmain = self.iters['cwm']['min']
        while cmain <= self.iters['cwm']['max']:
            t2 = multiprocessing.Process(target=mp_main, args=(cmain, self.iters, q))
            mps.append(t2)
            t2.start()
            cmain = cmain + self.iters['cwm']['step']
        for mp in mps:
            mp.join()
        r1 = sorted(q, key=lambda x: x['costing'])
        returning = [r1[0], r1[1], r1[2], r1[3], r1[4], r1[5], r1[6], r1[7], r1[8], r1[9], r1[10], r1[11], r1[12], r1[13], r1[14], r1[15], r1[16], r1[17], r1[18], r1[19]]
        self.counter = len(q)
        return returning

def mp_main(cmain, iters, q):
    fmain = iters['fx']['min']
    while fmain <= iters['fx']['max']:
        lvtmain = iters['lvt']['min']
        while lvtmain <= iters['lvt']['max']:
            lvwmain = iters['lvw']['min']
            while lvwmain <= iters['lvw']['max']:
                lvkmain = iters['lvk']['min']
                while lvkmain <= iters['lvk']['max']:
                    hvwmain = iters['hvw']['min']
                    while hvwmain <= iters['hvw']['max']:
                        lvhmain = iters['lvh']['min']
                        while lvhmain <= iters['lvh']['max']:
                            test = {'cmain': cmain, 'fmain': fmain, 'lvtmain': lvtmain,
                                    'lvwmain': lvwmain, 'lvkmain': lvkmain,
                                    'hvwmain': hvwmain, 'lvhmain': lvhmain}
                            y = calculations(test, q)
                            lvhmain = lvhmain + iters['lvh']['step']
                        hvwmain = hvwmain + iters['hvw']['step']
                    lvkmain = lvkmain + iters['lvk']['step']
                lvwmain = lvwmain + iters['lvw']['step']
            lvtmain = lvtmain + iters['lvt']['step']
        fmain = fmain + iters['fx']['step']

def calculations(test, que):
    # perform huge number of calculations here
    output = {}
    output['data'] = test
    output['costing'] = 'foo'
    que.append(output)

x = Iters()
x.run_mp()
From a theoretical standpoint:
You're iterating over every possible combination of 6 different variables. Unless your search space is very small, or you want just a very rough solution, there's no way you'll get any meaningful results within a reasonable time.
i need to iterate on and find out the combinations which has the least cost
This very much sounds like an optimization problem.
There are many different efficient ways of dealing with these problems, depending on the properties of the function you're trying to optimize. If it has a straightforward "shape" (e.g., it's unimodal), you can use a greedy algorithm such as hill climbing, or gradient descent. If it's more complex, you can try shotgun (random-restart) hill climbing.
There are far more complex algorithms, but these are the basics, and they will probably help you a lot in this situation.
From a more practical programming standpoint:
You are using very large steps - so large, in fact, that you'll only probe the function 19,200 times. If this is what you want, it seems very feasible. In fact, if I comment out the y=calculations(test,q) line, this returns instantly on my computer.
As you indicate, there's a "huge number of calculations" there - so maybe that is your real problem, and not the code you're asking for help with.
As to multiprocessing, my honest advice is not to use it until you already have your code executing reasonably fast. Unless you're running a supercomputing cluster (you're not programming a supercomputing cluster in Python, are you??), parallel processing will get you speedups of 2-4x. That's absolutely negligible compared to the gains you get from the kind of algorithmic changes I mentioned.
As an aside, I don't think I've ever seen that many nested loops in my life (excluding code jokes). If you don't want to switch to another algorithm, you might want to consider using itertools.product together with numpy.arange.
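For illustration, here is a rough sketch (mine, not the answerer's) of what that could look like for the parameter grid from the question; the call to calculations is left commented out since it belongs to the asker's code, and note that numpy.arange with float steps can overshoot the upper bound slightly.

import itertools
import numpy as np

# same parameter grid as the question's iters dict
iters = {
    'cwm': {'min': 100, 'max': 130, 'step': 5},
    'fx':  {'min': 1.45, 'max': 1.45, 'step': 0.01},
    'lvt': {'min': 106, 'max': 110, 'step': 1},
    'lvw': {'min': 9.2, 'max': 10, 'step': 0.1},
    'lvk': {'min': 3.3, 'max': 4.3, 'step': 0.1},
    'hvw': {'min': 1, 'max': 2, 'step': 0.1},
    'lvh': {'min': 6, 'max': 7, 'step': 1},
}

names = list(iters)
# max + step makes the upper bound inclusive, mirroring the <= tests in the while-loops
ranges = [np.arange(p['min'], p['max'] + p['step'], p['step']) for p in iters.values()]

for combo in itertools.product(*ranges):
    test = dict(zip(names, combo))
    # y = calculations(test, q)   # the expensive part from the question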
A quick question about parallel processing in Python. Let's say I have a big shared data structure and want to apply many functions to it in parallel. These functions only read the data structure, but perform mutations in a result object:
def compute_heavy_task(self):
    big_shared_object = self.big_shared_object
    result_refs = self.result_refs
    for ref in result_refs:
        some_expensive_task(ref, big_shared_object)
How do I run these in parallel, say 5 at a time, or 10 at a time? And how about one per processor?
You cannot usefully do this with threads in Python (at least not the CPython implementation you're probably using). The Global Interpreter Lock means that, instead of the near-800% efficiency you'd like out of 8 cores, you only get 90%.
But you can do this with separate processes. There are two options for this built into the standard library: concurrent.futures and multiprocessing. In general, futures is simpler in simple cases and often easier to compose; multiprocessing is more flexible and powerful in general. futures also only comes with Python 3.2 or later, but there's a backport for 2.5-3.1 at PyPI.
One of the cases where you want the flexibility of multiprocessing is when you have a big shared data structure. See Sharing state between processes and the sections directly above, below, and linked from it for details.
If your data structure is really simple, like a giant array of ints, this is pretty simple:
import multiprocessing

class MyClass(object):
    def __init__(self, giant_iterator_of_ints):
        self.big_shared_object = multiprocessing.Array('i', giant_iterator_of_ints)

    def compute_heavy_task(self):
        lock = multiprocessing.Lock()
        def subtask(my_range):
            return some_expensive_task(self.big_shared_object, lock, my_range)
        pool = multiprocessing.pool.Pool(5)
        my_ranges = split_into_chunks_appropriately(len(self.big_shared_object))
        results = pool.map_async(subtask, my_ranges)
        pool.close()
        pool.join()
Note that the some_expensive_task function now takes a lock object—it has to make sure to acquire the lock around every access to the shared object (or, more often, every "transaction" made up of one or more accesses). Lock discipline can be tricky, but there's really no way around it if you want to use direct data sharing.
Also note that it takes a my_range. If you just call the same function 5 times on the same object, it'll do the same thing 5 times, which probably isn't very useful. One common way to parallelize things is to give each task a sub-range of the overall data set. (Besides being usually simple to describe, if you're careful with this, with the right kinds of algorithms, you can even avoid a lot of locking this way.)
If you instead want to map a bunch of different functions to the same dataset, you obviously need some collection of functions to work on, rather than just using some_expensive_task repeatedly. You can then, e.g., iterate over these functions calling apply_async on each one. But you can also just turn it around: write a single applier function, as a closure around the data, that takes a function and applies it to the data. Then, just map that function over the collection of functions.
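As a rough sketch of that last idea (mine, with hypothetical function names): a bare closure won't pickle under multiprocessing's default start methods, so here the data is handed to each worker once via a pool initializer, and a module-level applier is mapped over the collection of functions.

import multiprocessing

# hypothetical analysis functions; module-level so they can be pickled by name
def mean(xs):
    return sum(xs) / len(xs)

def spread(xs):
    return max(xs) - min(xs)

def total(xs):
    return sum(xs)

_shared = None  # each worker gets its own reference, set once by the initializer

def _init(data):
    global _shared
    _shared = data

def apply_to_shared(func):
    # the "applier": takes a function and applies it to the worker-local data
    return func(_shared)

if __name__ == "__main__":
    big_data = list(range(1000000))
    with multiprocessing.Pool(5, initializer=_init, initargs=(big_data,)) as pool:
        results = pool.map(apply_to_shared, [mean, spread, total])
    print(results)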
I've also assumed that your data structure is something you can define with multiprocessing.Array. If not, you're going to have to design the data structure in C style, implement it as a ctypes Array of Structures or vice-versa, and then use the multiprocessing.sharedctypes stuff.
I've also moved the result object into results that just get passed back. If they're also huge and need to be shared, use the same trick to make them sharable.
Before going further with this, you should ask yourself whether you really do need to share the data. Doing things this way, you're going to spend 80% of your debugging, performance-tuning, etc. time adding and removing locks, making them more or less granular, etc. If you can get away with passing immutable data structures around, or work on files, or a database, or almost any other alternative, that 80% can go toward the rest of your code.
while True:
    Number = len(SomeList)
    OtherList = array([None] * Number)
    for i in xrange(Number):
        # NumPy array calculation using only the i-th element of Array_1, Array_2, and Array_3
        OtherList[i] = ...
The 'Number' elements of OtherList (and of the other arrays) can be calculated separately.
However, as the program is time-dependent, we cannot proceed with further work until all 'Number' elements have been processed.
Will multiprocessing be a good solution for this operation?
I need to speed this process up as much as possible.
If it is, please suggest example code.
It is possible to use numpy arrays with multiprocessing but you shouldn't do it yet.
Read A beginners guide to using Python for performance computing and its Cython version: Speeding up Python (NumPy, Cython, and Weave).
Without knowing the specific calculations or the sizes of the arrays, here are some generic guidelines, in no particular order:
measure the performance of your code. Find hot spots. Your code might spend longer loading input data than doing all the calculations. Set your goal; define what trade-offs are acceptable
check with automated tests that you get expected results
check whether you could use optimized libraries to solve your problem
make sure the algorithm has adequate time complexity. An O(n) algorithm in pure Python can be faster than an O(n**2) algorithm in C for large n
use slicing and vectorized (automatic looping) calculations to replace the explicit loops of the Python-only solution (see the sketch after this list).
rewrite the places that need optimization using weave, f2py, cython or similar. Provide type information. Explore compiler options. Decide whether the speedup is worth keeping the C extensions.
minimize allocation and data copying. Make it cache friendly.
explore whether multiple threads might be useful in your case e.g., cython.parallel.prange(). Release GIL.
Compare with the multiprocessing approach. The link above contains an example of how to compute different slices of an array in parallel.
iterate: measure again, find the new hot spots, and repeat
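To illustrate the slicing/vectorization guideline with the kind of loop in the question, here is a small sketch (mine); the element-wise formula is a made-up stand-in for the real calculation.

import numpy as np

Array_1 = np.random.rand(100000)
Array_2 = np.random.rand(100000)
Array_3 = np.random.rand(100000)

# Loop version: one element at a time, as in the question.
OtherList = np.empty_like(Array_1)
for i in range(len(Array_1)):
    OtherList[i] = Array_1[i] * Array_2[i] + np.sin(Array_3[i])

# Vectorized version: same result, one whole-array expression, no Python-level loop.
OtherList_vec = Array_1 * Array_2 + np.sin(Array_3)

assert np.allclose(OtherList, OtherList_vec)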
Since you have a while True clause there, I will assume you will run a lot of iterations, so the potential gains will eventually outweigh the slowdown from spawning the multiprocessing pool. I will also assume you have more than one logical core on your machine, for obvious reasons. Then the question becomes whether the cost of serializing the inputs and de-serializing the results is offset by the gains.
The best way to know if there is anything to be gained, in my experience, is to try it out. I would suggest that:
You pass on any constant inputs at start time. Thus, if any of Array_1, Array_2, and Array_3 never changes, pass it as args when calling Process(). This way you reduce the amount of data that needs to be pickled and passed on via IPC (which is what multiprocessing does)
You use a work queue and add to it tasks as soon as they are available. This way, you can make sure there is always more work waiting when a process is done with a task.
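A minimal sketch of both suggestions together (my own code, with a made-up per-element calculation): the constant arrays are passed once at Process() start time, and workers pull indices from a task queue until they see a sentinel.

import multiprocessing as mp
import numpy as np

def worker(task_q, result_q, Array_1, Array_2, Array_3):
    # the constant arrays arrive once, at start time, not with every task
    while True:
        i = task_q.get()
        if i is None:                     # sentinel: no more work
            break
        result_q.put((i, Array_1[i] * Array_2[i] + Array_3[i]))  # stand-in calculation

if __name__ == "__main__":
    a1, a2, a3 = (np.random.rand(1000) for _ in range(3))
    task_q, result_q = mp.Queue(), mp.Queue()
    n_workers = mp.cpu_count()

    procs = [mp.Process(target=worker, args=(task_q, result_q, a1, a2, a3))
             for _ in range(n_workers)]
    for p in procs:
        p.start()

    for i in range(len(a1)):              # enqueue tasks as soon as they are available
        task_q.put(i)
    for _ in procs:                       # one sentinel per worker
        task_q.put(None)

    results = [result_q.get() for _ in range(len(a1))]
    for p in procs:
        p.join()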
This time I'm facing a "design" problem. Using Python, I have to implement a mathematical algorithm which uses 5 parameters. To find the best combination of these 5 parameters, I used a 5-layer nested loop to enumerate all possible combinations in a given range. The time it takes to finish turned out to be far beyond my expectation. So I think it's time to use multithreading...
The task at the core of the nested loops is calculation and saving. In the current code, the result of every calculation is appended to a list, and the list is written to a file at the end of the program.
Since I don't have much experience with multithreading in any language, not to mention Python, I would like to ask for some hints on how to structure this problem: namely, how the calculations should be assigned to threads dynamically, and how the threads should save their results and later combine them into one file. I hope the number of threads can be adjustable.
Any illustration with code will be very helpful.
thank you very much for your time, I appreciate it.
Update (day 2):
Thanks for all the helpful answers; now I know that it is multiprocessing I need instead of multithreading. I always confuse these two concepts, because I thought that if a program is multithreaded then the OS will automatically use multiple processors to run it when available.
I will find time to have some hands-on with multiprocessing tonight.
You can try using jug, a library I wrote for very similar problems. Your code would then look something like this:
from jug import TaskGenerator
evaluate = TaskGenerator(evaluate)

results = []
for p0 in [1, 2, 3]:
    for p1 in xrange(10):
        for p2 in xrange(10, 20):
            for p3 in [True, False]:
                for p4 in xrange(100):
                    results.append(evaluate(p0, p1, p2, p3, p4))
Now you could run as many processes as you'd like (even across a network if you have access to a computer cluster).
Multithreading in Python won't win you anything in this kind of problem, since Python doesn't execute threads in parallel (it uses them for I/O concurrency, mostly).
You want multiprocessing instead, or a friendly wrapper for it such as joblib:
from joblib import Parallel, delayed
# -1 == use all available processors
results = Parallel(n_jobs=-1)(delayed(evaluate)(x) for x in enum_combinations())
print best_of(results)
Where enum_combinations would enumerate all combinations of your five parameters; you can likely implement it by putting a yield at the bottom of your nested loop.
joblib distributes the combinations over multiple worker processes, taking care of some load balancing.
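A possible sketch of enum_combinations, reusing the parameter ranges from the jug example above (Python 3 range here, in place of the xrange used earlier):

def enum_combinations():
    # yield at the bottom of the nested loops, as suggested above
    for p0 in [1, 2, 3]:
        for p1 in range(10):
            for p2 in range(10, 20):
                for p3 in [True, False]:
                    for p4 in range(100):
                        yield (p0, p1, p2, p3, p4)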
Assuming this is a calculation-heavy problem (and thus CPU-bound), multi-threading won't help you much in Python due to the GIL.
What you can, however, do is split the calculation across multiple processes to take advantage of extra CPU cores. The easiest way to do this is with the multiprocessing library.
There are a number of examples for how to use multiprocessing on the docs page for it.
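For example, a plain multiprocessing.Pool version of the same idea might look roughly like this (my sketch; evaluate is a dummy stand-in for the real 5-parameter calculation):

import itertools
from multiprocessing import Pool

def evaluate(params):
    # dummy stand-in for the real calculation; returns (parameters, cost)
    p0, p1, p2, p3, p4 = params
    return params, p0 + p1 + p2 + p4 + (1 if p3 else 0)

if __name__ == "__main__":
    combos = itertools.product([1, 2, 3], range(10), range(10, 20),
                               [True, False], range(100))
    with Pool() as pool:                               # one worker per CPU core by default
        results = pool.map(evaluate, combos, chunksize=100)
    best = min(results, key=lambda pr: pr[1])
    print(best)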
I'm working in the Google App Engine environment and programming in Python. I am creating a function that essentially generates a random number/letter string and then stores it in memcache.
def generate_random_string():
    # return a random 6-digit long string
    ...

def check_and_store_to_memcache():
    randomstring = generate_random_string()
    # check against memcache
    # if ok, then store key value with another value
    # if not ok, run generate_random_string() again and check again
Does creating two functions instead of just one big one affect performance? I prefer two, as it better matches how I think, but don't mind combining them if that's "best practice".
Focus on being able to read and easily understand your code.
Once you've done this, if you have a performance problem, then look into what might be causing it.
Most languages, Python included, tend to have fairly low overhead for making method calls. Putting this code into a single function is not going to (dramatically) change the performance metrics - I'd guess that your random number generation will be the bulk of the time, not the fact that you have 2 functions.
That being said, splitting functions does have a (very, very minor) impact on performance. However, I'd think of it this way - it may take you from going 80 mph on the highway to 79.99mph (which you'll never really notice). The important things to watch for are avoiding stoplights and traffic jams, since they're going to make you have to stop altogether...
In almost all cases, "inlining" functions to increase speed is like getting a hair cut to lose weight.
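If you ever want to see the scale of the overhead being discussed, a quick timeit comparison along these lines (a made-up micro-benchmark, not code from the question) shows that each extra call costs on the order of tens of nanoseconds in CPython:

import timeit

def square(i):
    return i * i

def inline_version():
    x = 0
    for i in range(1000):
        x += i * i          # work done inline
    return x

def call_version():
    x = 0
    for i in range(1000):
        x += square(i)      # same work behind a function call
    return x

print(timeit.timeit(inline_version, number=10000))
print(timeit.timeit(call_version, number=10000))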
Reed is right. For the change you're considering, the cost of a function call is a small number of cycles, and you'd have to be doing it 10^8 or so times per second before you'd notice.
However, I would caution that often people go to the other extreme, and then it is as if function calls were costly. I've seen this in over-designed systems where there were many layers of abstraction.
What happens is there is some human psychology that says if something is easy to call, then it is fast. This leads to writing more function calls than strictly necessary, and when this occurs over multiple layers of abstraction, the wastage can be exponential.
Following Reed's driving example, a function call can be like a detour, and if the detour contains detours, and if those also contain detours, soon there is tremendous time being wasted, for no obvious reason, because each function call looks innocent.
Like others have said, I wouldn't worry about it in this particular scenario. The very small overhead involved in function calls would pale in comparison to what is done inside each function. And as long as these functions don't get called in rapid succession, it probably wouldn't matter much anyway.
It is a good question though. In some cases it's best not to break code into multiple functions. For example, when working with math intensive tasks with nested loops it's best to make as few function calls as possible in the inner loop. That's because the simple math operations themselves are very cheap, and next to that the function-call-overhead can cause a noticeable performance penalty.
Years ago I discovered the hypot (hypotenuse) function in the math library I was using in a VC++ app was very slow. It seemed ridiculous to me because it's such a simple set of functionality -- return sqrt(a * a + b * b) -- how hard is that? So I wrote my own and managed to improve performance 16X over. Then I added the "inline" keyword to the function and made it 3X faster than that (about 50X faster at this point). Then I took the code out of the function and put it in my loop itself and saw yet another small performance increase. So... yeah, those are the types of scenarios where you can see a difference.