I'm using Python to write an ideal gas simulator, and right now the collision detection is the most intensive part of the program. At the moment, though, I'm only using one of my 8 cores. (I'm using an i7 3770 @ 3.4GHz.)
After minimal googling I found the multiprocessing module for python (2.7.4). And I've tried it. With a bit of thought I've realised the only thing I can really run in parallel is here, where I loop through all the particles to detect collisions:
for ball in self.Objects:
    if not foo == ball:
        foo.CollideBall(ball, self.InternalTimestep)
Here foo is the particle that I'm testing against all the others.
So I tried doing this:
for ball in self.Objects:
    if not foo == ball:
        p = multiprocessing.Process(target=foo.CollideBall, args=(ball, self.InternalTimestep))
        p.start()
Although the program does run a little faster, it's still only using 1.5 cores to their fullest extent; the rest are just idle, and it's not detecting any collisions either! I've read that if you create too many processes at once (more than the number of cores) then you get a backlog (this is a loop of 196 particles), so this might explain the lower speed than I was expecting, but it doesn't explain the fact that I'm still not using all my cores!
Either way it's too slow!!! So is there a way I can create 8 processes, and only create a new one when there are fewer than 8 processes running already? Will that even solve my problem? And how do I use all of my cores, and why isn't this code doing so already?
I only found out about multiprocessing in python yesterday, so I'm afraid any answers will have to be spelt out to me.
Thanks for any help though!
---EDIT---
In response to Carson, I tried adding p.join() directly after p.start() and this slowed the programme right down. Instead of taking 0.2 seconds per cycle it's taking 24 seconds per cycle!
As far as I understand, you test one particle against all others and then perform that operation on each particle in turn. Based on this, I'd say your problem is that you try to optimize your code to work on all cores without trying to optimize your code itself.
Instead you could partition your particles so that you only check those that are close to each other. One possible means of doing so is a quad tree: see http://en.wikipedia.org/wiki/Quadtree.
In a second step you can parallelize everything. For quad trees you resolve the topmost level by hand and create a new process for each subtree. This way the processes are independent of each other and don't block. I'd expect a quadratic speed-up (think of the square root of your current run time) from the quad tree, which in turn enables a further linear speed-up (divide by the number of processes) through parallelization.
Sorry, I can't spell it out in Python.
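For readers who want a starting point in Python, here is a minimal sketch of the quad tree idea, not a tuned implementation. The class name QuadTree, the helper candidate_pairs, and the parameters capacity, max_depth and reach are all assumptions made for illustration. It only finds pairs of particles that are close enough to be worth passing to an exact test such as CollideBall.

```python
class QuadTree:
    """Point quad tree over the square [x, x + size) x [y, y + size)."""

    def __init__(self, x, y, size, capacity=4, depth=0, max_depth=8):
        self.x, self.y, self.size = x, y, size
        self.capacity = capacity
        self.depth, self.max_depth = depth, max_depth
        self.points = []        # (px, py, payload), only used in leaves
        self.children = None    # four sub-squares once the node is split

    def insert(self, px, py, payload):
        if self.children is not None:
            self._child_for(px, py).insert(px, py, payload)
            return
        self.points.append((px, py, payload))
        # Split a crowded leaf, unless the tree is already very deep.
        if len(self.points) > self.capacity and self.depth < self.max_depth:
            self._split()

    def _split(self):
        half = self.size / 2
        self.children = [
            QuadTree(self.x + dx * half, self.y + dy * half, half,
                     self.capacity, self.depth + 1, self.max_depth)
            for dy in (0, 1) for dx in (0, 1)
        ]
        points, self.points = self.points, []
        for px, py, payload in points:
            self._child_for(px, py).insert(px, py, payload)

    def _child_for(self, px, py):
        half = self.size / 2
        col = 1 if px >= self.x + half else 0
        row = 1 if py >= self.y + half else 0
        return self.children[row * 2 + col]

    def query(self, qx, qy, reach, found=None):
        """Collect payloads inside the square of half-width `reach` around (qx, qy)."""
        if found is None:
            found = []
        # Skip whole subtrees whose square cannot overlap the query square.
        if (qx + reach < self.x or qx - reach > self.x + self.size or
                qy + reach < self.y or qy - reach > self.y + self.size):
            return found
        if self.children is None:
            for px, py, payload in self.points:
                if abs(px - qx) <= reach and abs(py - qy) <= reach:
                    found.append(payload)
        else:
            for child in self.children:
                child.query(qx, qy, reach, found)
        return found


def candidate_pairs(positions, world_size, reach):
    """Return index pairs (i, j), i < j, that are close enough to test exactly."""
    tree = QuadTree(0.0, 0.0, world_size)
    for index, (px, py) in enumerate(positions):
        tree.insert(px, py, index)
    pairs = set()
    for index, (px, py) in enumerate(positions):
        for other in tree.query(px, py, reach):
            if other > index:
                pairs.add((index, other))
    return pairs
```

Only the pairs returned by candidate_pairs would then go through the expensive collision test, instead of every particle against every other one. A sensible value for reach would be something like twice the largest particle radius plus the distance a particle can travel in one timestep, but that depends on the simulation.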
With a working quad tree, you could set up a thread pool (as a class) and define jobs (another class) that are allocated to individual threads (yet another class, if possible from a threading framework). In your case a job contains a list of quad tree nodes that have to be inspected. Initially each top level quad tree node (4 in 2D / 8 in 3D) resides in its own job.
So you can have up to 4 (or 8, respectively) threads, each inspecting an independent subtree of the quad tree. If you need more threads to fully use your machine's processing power, you can have threads put part of their jobs back into the thread pool if they encounter many deep subtrees.
For this, I'd use a BFS (breadth first search) with the list of quadtree nodes from the job. If the list gets longer than expected, I'd put part of it back to the thread pool. Knowledge in maths/statistics/stochastics helps finding a good parameterization for what length is to be expected.
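A rough Python sketch of that job scheme follows. The function name run_jobs, the max_job_len threshold, and the representation of the tree as nested lists (a list is an inner node, anything else is a leaf) are assumptions made to keep the example short. Note that in CPython the GIL prevents threads from speeding up CPU-bound work, so this only illustrates the structure; for real gains the same scheme would need processes.

```python
import queue
import threading


def run_jobs(top_level_nodes, visit_leaf, num_workers=4, max_job_len=8):
    """Breadth-first traversal where each job is a list of nodes to inspect."""
    jobs = queue.Queue()
    for node in top_level_nodes:
        jobs.put([node])            # initially one job per top-level node

    def worker():
        while True:
            frontier = jobs.get()
            if frontier is None:    # sentinel: no more work
                jobs.task_done()
                return
            while frontier:
                node = frontier.pop(0)
                if isinstance(node, list):
                    frontier.extend(node)    # inner node: enqueue children
                else:
                    visit_leaf(node)         # leaf: do the actual work
                if len(frontier) > max_job_len:
                    # The list grew longer than expected: hand half of it
                    # back to the pool as a new job.
                    half = len(frontier) // 2
                    jobs.put(frontier[half:])
                    del frontier[half:]
            jobs.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    jobs.join()                     # wait until every job has been processed
    for _ in threads:
        jobs.put(None)              # then tell the workers to exit
    for thread in threads:
        thread.join()
```

Choosing max_job_len is the parameterization mentioned above: too small and the workers spend their time shuffling jobs, too large and some workers sit idle while one works through a deep subtree.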
I've also written a quad tree implementation that parameterizes itself according to expected number of objects given the "world" size and calculating the average object size.
Search for the open source project d-collide. Although it's in C++, there should be some useful sample code. Please respect its license, which doesn't ask for much as it's BSD-style.
I added this as a second answer, because the first one was about optimizing your code to achieve your implied goal: better run time (although it's via better efficiency)
This second answer is about achieving your written goal: stronger parallelization.
The quad tree enables this second step, but don't expect the second speed-up to be as large as the first. Especially when it comes to many objects, nothing beats an optimized algorithm. But don't lose yourself in micro-optimizations: see the runtime discussion in Cancelling a Task is throwing an exception
I'm curious if threading is the right approach for my use case. I'm working on a genetic algorithm, which needs to evaluate the fitness of genes 1...n. The evaluation of each is independent of others for the most part. Yet, each gene will be passed through the same function, eval(gene).
My intention is that once all genes have been evaluated, I will sort by fitness, and only retain top x.
From this tutorial, it seems that I should be able to do the following, where the specifics of eval are out of scope for this question, but suppose each function call updates a common dictionary of form, {gene : fitness}:
for gene in gene_pool:
    thread_i = threading.Thread(target=eval(gene), name=f"fitness gene{i}")
    thread_i.start()
for i in range(len(genes)):
    thread_i.join()
In the tutorial, I don't see the function actually invoking the function eval(), but rather just referencing its name, eval. I'm not sure if this will be problematic for my use case.
My first question is: Is this the right approach? Should I consider a different approach to threading?
I don't believe that I will need to account for race conditions or locks because, while every thread will update the same dictionary, the keys and values will be independent.
And my last question: is multiprocessing generally a better bet? It seems that it's a bit higher level, which might be ideal for someone new to parallelization.
In Python, threading is constrained by the GIL, so that it is very limited performance-wise. In case of IO-bound code (reading/writing files, requests on the network, ...) async is the way to go.
But from what you explain, your code is rather CPU-bound (computing many things). If you want it to go fast, you need to circumvent the Python GIL. There are two main ways:
multiprocessing (having multiple different Python processes in parallel)
or calling code written in lower-level languages (Cython, C, ...), typically wrapped in a nice library
If you want something simple, stick to multiprocessing: at the start, create a pool whose size is the number of competing genes (N); then at each iteration submit N new tasks to the pool and wait for their results (the pool.map function). Repeat as many times as you want.
I think it is the simplest way to get a full parallelization, which should give you decent speed.
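As a minimal sketch of the pool.map approach: the fitness function below is a made-up stand-in for the real evaluation, and select_top and evaluate_generation are hypothetical helper names. One deliberate difference, which is my assumption rather than part of the advice above: the pool size is left at its default (the number of CPUs) unless processes is passed explicitly.

```python
from multiprocessing import Pool


def fitness(gene):
    """Hypothetical stand-in for the real evaluation: higher is better."""
    return sum(gene)


def select_top(genes, scores, keep):
    """Return the `keep` genes with the highest scores."""
    ranked = sorted(zip(scores, genes), key=lambda pair: pair[0], reverse=True)
    return [gene for _, gene in ranked[:keep]]


def evaluate_generation(genes, keep, processes=None):
    """Score every gene in parallel, then keep only the best ones."""
    with Pool(processes=processes) as pool:
        # map() blocks until all results are back and preserves input order,
        # so no shared dictionary and no locking is needed.
        scores = pool.map(fitness, genes)
    return select_top(genes, scores, keep)
```

A call such as evaluate_generation(population, keep=2) should be made from code guarded by if __name__ == "__main__":, because on platforms that start worker processes by re-importing the main module an unguarded call would try to create pools recursively. The function passed to pool.map has to be defined at module level so that it can be pickled.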
How do you write (and run) a correct micro-benchmark in Java?
I'm looking for some code samples and comments illustrating various things to think about.
Example: Should the benchmark measure time/iteration or iterations/time, and why?
Related: Is stopwatch benchmarking acceptable?
Tips about writing micro benchmarks from the creators of Java HotSpot:
Rule 0: Read a reputable paper on JVMs and micro-benchmarking. A good one is Brian Goetz, 2005. Do not expect too much from micro-benchmarks; they measure only a limited range of JVM performance characteristics.
Rule 1: Always include a warmup phase which runs your test kernel all the way through, enough to trigger all initializations and compilations before timing phase(s). (Fewer iterations is OK on the warmup phase. The rule of thumb is several tens of thousands of inner loop iterations.)
Rule 2: Always run with -XX:+PrintCompilation, -verbose:gc, etc., so you can verify that the compiler and other parts of the JVM are not doing unexpected work during your timing phase.
Rule 2.1: Print messages at the beginning and end of timing and warmup phases, so you can verify that there is no output from Rule 2 during the timing phase.
Rule 3: Be aware of the difference between -client and -server, and OSR and regular compilations. The -XX:+PrintCompilation flag reports OSR compilations with an at-sign to denote the non-initial entry point, for example: Trouble$1::run @ 2 (41 bytes). Prefer server to client, and regular to OSR, if you are after best performance.
Rule 4: Be aware of initialization effects. Do not print for the first time during your timing phase, since printing loads and initializes classes. Do not load new classes outside of the warmup phase (or final reporting phase), unless you are testing class loading specifically (and in that case load only the test classes). Rule 2 is your first line of defense against such effects.
Rule 5: Be aware of deoptimization and recompilation effects. Do not take any code path for the first time in the timing phase, because the compiler may junk and recompile the code, based on an earlier optimistic assumption that the path was not going to be used at all. Rule 2 is your first line of defense against such effects.
Rule 6: Use appropriate tools to read the compiler's mind, and expect to be surprised by the code it produces. Inspect the code yourself before forming theories about what makes something faster or slower.
Rule 7: Reduce noise in your measurements. Run your benchmark on a quiet machine, and run it several times, discarding outliers. Use -Xbatch to serialize the compiler with the application, and consider setting -XX:CICompilerCount=1 to prevent the compiler from running in parallel with itself. Try your best to reduce GC overhead: set Xmx (large enough) equal to Xms, and use UseEpsilonGC if it is available.
Rule 8: Use a library for your benchmark as it is probably more efficient and was already debugged for this sole purpose. Such as JMH, Caliper or Bill and Paul's Excellent UCSD Benchmarks for Java.
I know this question has been marked as answered but I wanted to mention two libraries that help us to write micro benchmarks
Caliper from Google
Getting started tutorials
http://codingjunkie.net/micro-benchmarking-with-caliper/
http://vertexlabs.co.uk/blog/caliper
JMH from OpenJDK
Getting started tutorials
Avoiding Benchmarking Pitfalls on the JVM
Using JMH for Java Microbenchmarking
Introduction to JMH
Important things for Java benchmarks are:
Warm up the JIT first by running the code several times before timing it
Make sure you run it for long enough to be able to measure the results in seconds or (better) tens of seconds
While you can't call System.gc() between iterations, it's a good idea to run it between tests, so that each test will hopefully get a "clean" memory space to work with. (Yes, gc() is more of a hint than a guarantee, but it's very likely that it really will garbage collect in my experience.)
I like to display iterations and time, and a score of time/iteration which can be scaled such that the "best" algorithm gets a score of 1.0 and others are scored in a relative fashion. This means you can run all algorithms for a longish time, varying both number of iterations and time, but still getting comparable results.
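The scaling itself is independent of the language being benchmarked. As a rough sketch in Python (the function name relative_scores and the input format, a mapping from algorithm name to an (iterations, elapsed seconds) pair, are assumptions made for illustration):

```python
def relative_scores(results):
    """Map each algorithm to time/iteration relative to the fastest one.

    `results` maps a name to (iterations, elapsed_seconds). The fastest
    algorithm gets a score of 1.0; a score of 2.0 means twice as slow.
    """
    per_iteration = {
        name: elapsed / iterations
        for name, (iterations, elapsed) in results.items()
    }
    best = min(per_iteration.values())
    return {name: t / best for name, t in per_iteration.items()}
```

Because the score is based on time per iteration, two algorithms can be run for different iteration counts and different wall-clock times and still be compared directly.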
I'm just in the process of blogging about the design of a benchmarking framework in .NET. I've got a couple of earlier posts which may be able to give you some ideas - not everything will be appropriate, of course, but some of it may be.
jmh is a recent addition to OpenJDK and has been written by some performance engineers from Oracle. Certainly worth a look.
The jmh is a Java harness for building, running, and analysing nano/micro/macro benchmarks written in Java and other languages targetting the JVM.
Very interesting pieces of information buried in the sample tests comments.
See also:
Avoiding Benchmarking Pitfalls on the JVM
Discussion on the main strengths of jmh.
Should the benchmark measure time/iteration or iterations/time, and why?
It depends on what you are trying to test.
If you are interested in latency, use time/iteration and if you are interested in throughput, use iterations/time.
Make sure you somehow use results which are computed in benchmarked code. Otherwise your code can be optimized away.
If you are trying to compare two algorithms, do at least two benchmarks for each, alternating the order. i.e.:
for(i=1..n)
    alg1();
for(i=1..n)
    alg2();
for(i=1..n)
    alg2();
for(i=1..n)
    alg1();
I have found some noticeable differences (5-10% sometimes) in the runtime of the same algorithm in different passes.
Also, make sure that n is very large, so that the runtime of each loop is at the very least 10 seconds or so. The more iterations, the more significant figures in your benchmark time and the more reliable that data is.
There are many possible pitfalls for writing micro-benchmarks in Java.
First: You have to reckon with all sorts of events that take a more or less random amount of time: garbage collection, caching effects (of the OS for files and of the CPU for memory), IO, etc.
Second: You cannot trust the accuracy of the measured times for very short intervals.
Third: The JVM optimizes your code while executing. So different runs in the same JVM-instance will become faster and faster.
My recommendations: Make your benchmark run for some seconds; that is more reliable than a runtime of milliseconds. Warm up the JVM (that is, run the benchmark at least once without measuring, so that the JVM can apply its optimizations). Run your benchmark multiple times (maybe 5 times) and take the median value. Run every micro-benchmark in a new JVM instance (start a new java process for every benchmark), otherwise optimization effects of the JVM can influence tests that run later. Don't execute things that aren't executed in the warmup phase (as this could trigger class loading and recompilation).
It should also be noted that it might also be important to analyze the results of the micro benchmark when comparing different implementations. Therefore a significance test should be made.
This is because implementation A might be faster during most of the runs of the benchmark than implementation B. But A might also have a higher spread, so the measured performance benefit of A won't be of any significance when compared with B.
So it is important not only to write and run a micro benchmark correctly, but also to analyze it correctly.
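One common choice for such a test, though not the only one, is Welch's t-test, which does not assume that both implementations have the same spread. The following is a small sketch of the test statistic in Python, applied to run times exported from the benchmark; the function name welch_t is an assumption, and turning the statistic into a p-value additionally needs the t distribution (for example via scipy.stats.ttest_ind with equal_var=False).

```python
import math
import statistics


def welch_t(times_a, times_b):
    """Welch's t statistic for two lists of measured run times.

    Values far from zero suggest a real difference between A and B;
    values near zero mean the difference is small relative to the spread.
    """
    var_a = statistics.variance(times_a)   # sample variance (n - 1)
    var_b = statistics.variance(times_b)
    standard_error = math.sqrt(var_a / len(times_a) + var_b / len(times_b))
    return (statistics.mean(times_a) - statistics.mean(times_b)) / standard_error
```

With tightly clustered timings around 10 ms for A and 20 ms for B the statistic is large in magnitude, while two widely scattered samples with nearly equal means give a value close to zero, which is exactly the "higher spread" case described above.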
To add to the other excellent advice, I'd also be mindful of the following:
For some CPUs (e.g. Intel Core i5 range with TurboBoost), the temperature (and number of cores currently being used, as well as their utilisation percent) affects the clock speed. Since CPUs are dynamically clocked, this can affect your results. For example, if you have a single-threaded application, the maximum clock speed (with TurboBoost) is higher than for an application using all cores. This can therefore interfere with comparisons of single and multi-threaded performance on some systems. Bear in mind that the temperature and voltages also affect how long Turbo frequency is maintained.
Perhaps a more fundamentally important aspect that you have direct control over: make sure you're measuring the right thing! For example, if you're using System.nanoTime() to benchmark a particular bit of code, put the calls to the assignment in places that make sense to avoid measuring things which you aren't interested in. For example, don't do:
long startTime = System.nanoTime();
//code here...
System.out.println("Code took "+(System.nanoTime()-startTime)+"nano seconds");
Problem is you're not immediately getting the end time when the code has finished. Instead, try the following:
final long endTime, startTime = System.nanoTime();
//code here...
endTime = System.nanoTime();
System.out.println("Code took "+(endTime-startTime)+"nano seconds");
http://opt.sourceforge.net/ Java Micro Benchmark - control tasks required to determine the comparative performance characteristics of the computer system on different platforms. Can be used to guide optimization decisions and to compare different Java implementations.
The problem is that I'm finding it difficult to understand how DFBB works, what the parameters and output should be for this case.
I'm working on creating an AI for the game StarCraft 2 that will handle the build order in the game (for team Terran). I was planning to follow the approach described in the link (see below) which followed a very similar thing that I was going for. To summarize what I'm planning to do:
A list of different type of buildings that need to be built will be given to me. Buildings cost minerals and gas (this is the currency in the game), some buildings have prerequisites (meaning other buildings need to be built before it's possible to build it) and they take a certain amount of time to build.
In the article they used Depth-First Branch and Bound to figure out the optimal build order, meaning the fastest way possible to build the buildings in that list. This was their pseudocode:
Where the state S is represented by S = (current game time, resources available, actions in progress but not completed, worker income data). How S´ is derived is described in the article and is done through three functions, so that bit I understand.
As mentioned earlier I'm struggling to understand what the starting status S, goal G, time limit t and bound b should be represented by in the pseudocode that they are describing.
I only know three things for sure: the list of buildings that need to be built, what consumables I have at the moment (minerals and gas), and resources (that is, buildings I already have in the game). This should then be applied to the algorithm somehow, but it is unclear what the input to the function should be. The output should be a list sorted in the right order, so that if I were to build the buildings in the order they come in, it should all work out and take the optimal possible time.
For example, should I iterate through the list of buildings and run DFBB on every element, with the goal being to see whether the building can be built? But what should the time limit be set to, and what does bound mean in this case? Is it simply the cost?
Please explain how this function should be run on the list in order to find the optimal path of building it. The article is fairly easy to read, but I need some help understanding how it is meant to work and how I can apply it to my problem.
Link to article: https://ai.dmi.unibas.ch/research/reading_group/churchill-buro-aiide2011.pdf
Starting status S is the initial state at the start of the game. I believe you have 100 minerals, a Command Center and 12(?) SCVs, so that's your start.
The goal here is the list of buildings you want to have. The "satisfies" condition is: are all buildings in the goal also in S?
The time limit is the amount of real time you are willing to spend to get the result. If you set it to 5 seconds it will probably give you a sub-optimal solution, but it will do it in 5 seconds. If the algorithm finishes the search it will return earlier. If you don't care, leave it out, but make sure you write solutions to a file in case something happens.
Bound b is the in-game time limit for building everything. You initially set it to infinite or some obvious value (like 10 minutes?). When you find a solution the b gets updated so every new solution you find MUST be faster (in-game) than the previous one.
A few notes: make sure that the possible actions (children in step 9) include doing nothing (waiting for more resources) and building an SCV.
Another thing that might be missing is a correct modelling of SCV movement speed. The units need to move to a place to build something and it also takes time for them to get back to mining.
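To make the roles of S, G, t and b concrete, here is a deliberately simplified Python sketch of depth-first branch and bound. It is not the paper's model: the building data, the constant INCOME, and the helper names dfbb and plan are all hypothetical, buildings are built strictly one after another, gas and SCV production are ignored, and waiting for minerals is folded into each build action instead of being a separate "do nothing" child.

```python
import math
import time

# Hypothetical toy data: name -> (mineral cost, build seconds, prerequisites).
BUILDINGS = {
    "depot": (100, 20, ()),
    "barracks": (150, 40, ("depot",)),
    "refinery": (75, 20, ()),
}
INCOME = 5  # minerals per second, assumed constant for this sketch


def dfbb(state, goal, deadline, best):
    """state = (game_time, minerals, owned, order); best = [bound, order]."""
    game_time, minerals, owned, order = state
    if time.monotonic() > deadline:     # t: real-time search budget used up
        return
    if game_time >= best[0]:            # b: cannot beat the best known plan
        return
    if goal <= owned:                   # G satisfied: tighten the bound
        best[0], best[1] = game_time, order
        return
    for name, (cost, build_time, prereqs) in BUILDINGS.items():
        if name in owned or not set(prereqs) <= owned:
            continue
        # Wait until the building is affordable, then build it.
        wait = max(0, math.ceil((cost - minerals) / INCOME))
        elapsed = wait + build_time
        child = (game_time + elapsed,
                 minerals + elapsed * INCOME - cost,
                 owned | {name},
                 order + (name,))
        dfbb(child, goal, deadline, best)


def plan(goal, start_minerals=50, time_limit=5.0):
    """Return (in-game seconds, build order) of the best plan found in time."""
    best = [math.inf, None]             # bound b starts at infinity
    start = (0, start_minerals, frozenset(), ())
    dfbb(start, frozenset(goal), time.monotonic() + time_limit, best)
    return best[0], best[1]
```

So the function is run once on the whole goal set, for example plan({"barracks", "refinery"}), not once per building. The returned order is the build list, and the returned time is the final value of the bound.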
I have a structure, looking a lot like a graph but I can 'sort' it. Therefore I can have two graphs, that are equivalent, but one is sorted and not the other. My goal is to compute a minimal dominant set (with a custom algorithm that fits my specific problem, so please do not link to other 'efficient' algorithms).
The thing is, I search for dominant sets of size one, then two, etc until I find one. If there isn't a dominant set of size i, using the sorted graph is a lot more efficient. If there is one, using the unsorted graph is much better.
I thought about using threads/multiprocessing, so that both graphs are explored at the same time and once one finds an answer (no solution or a specific solution), the other one stops and we go to the next step or end the algorithm. This didn't work, it just makes the process much slower (even though I would expect it to just double the time required for each step, compared to using the optimal graph without threads/multiprocessing).
I don't know why this didn't work and wonder if there is a better way, one that maybe doesn't even require the use of threads/multiprocessing. Any clue?
If you don't want an algorithm suggestion, then lazy evaluation seems like the way to go.
Set up the two solvers in a data structure, each exposing a method such as class_instance.next_step(work_to_do_this_step), where a class instance is a solver for one graph type. You'll need two of them. You can then have each graph move one "step" (whatever you define a step to be) forward. By careful selection (possibly dynamically, based on how things are going) of what a step is, you can efficiently alternate between how much work/time is being spent on the sorted vs unsorted graph approaches. Of course this is only useful if there is at least a chance that either algorithm may finish before the other.
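In Python, generators give you this kind of stepping almost for free. The sketch below uses generators in place of the class with a next_step method; the names scan and race and the toy search are assumptions for illustration, with each solver yielding None after a unit of work and a result tuple when it is done.

```python
def scan(candidates, is_solution):
    """Toy solver: test candidates one per step, in the given order."""
    for candidate in candidates:
        if is_solution(candidate):
            yield ("found", candidate)
            return
        yield None                      # one step of work done, no answer yet
    yield ("no solution", None)


def race(solvers):
    """Advance each solver one step in turn; return (name, result) of the first to finish."""
    active = dict(solvers)
    while active:
        for name, solver in list(active.items()):
            try:
                result = next(solver)
            except StopIteration:
                del active[name]        # solver ended without reporting
                continue
            if result is not None:
                return name, result
    return None
```

Everything runs in a single thread, so there is no process start-up or communication cost, and the total work is at most about twice that of the better solver when the steps are of comparable size. Giving one solver several steps per round is an easy way to bias the split.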
In theory if you can independently define what those steps are, then you could split up the work to run them in parallel, but it's important that each process/thread is doing roughly the same amount of "work" so they all finish about the same time. Though writing parallel algorithms for these kinds of things can be a bit tricky.
Sounds like you're not doing what you describe. Possibly you're waiting for BOTH to finish somehow? Try doing that, and seeing if the time changes.
I am trying to run a set of simulations with fixed parameters that I need to iterate over to find the combinations with the least cost. I am using Python multiprocessing for this purpose, but the time consumed is too high. Is there something wrong with my implementation? Or is there a better solution? Thanks in advance.
import multiprocessing

class Iters(object):
    #parameters for iterations
    iters={}
    iters['cwm']={'min':100,'max':130,'step':5}
    iters['fx']={'min':1.45,'max':1.45,'step':0.01}
    iters['lvt']={'min':106,'max':110,'step':1}
    iters['lvw']={'min':9.2,'max':10,'step':0.1}
    iters['lvk']={'min':3.3,'max':4.3,'step':0.1}
    iters['hvw']={'min':1,'max':2,'step':0.1}
    iters['lvh']={'min':6,'max':7,'step':1}

    def run_mp(self):
        mps=[]
        m=multiprocessing.Manager()
        q=m.list()
        cmain=self.iters['cwm']['min']
        #one worker process per value of cwm
        while(cmain<=self.iters['cwm']['max']):
            t2=multiprocessing.Process(target=mp_main,args=(cmain,self.iters,q))
            mps.append(t2)
            t2.start()
            cmain=cmain+self.iters['cwm']['step']
        for mp in mps:
            mp.join()
        r1=sorted(q,key=lambda x:x['costing'])
        returning=r1[:20] #the 20 cheapest results
        self.counter=len(q)
        return returning

def mp_main(cmain,iters,q):
    fmain=iters['fx']['min']
    while(fmain<=iters['fx']['max']):
        lvtmain=iters['lvt']['min']
        while (lvtmain<=iters['lvt']['max']):
            lvwmain=iters['lvw']['min']
            while (lvwmain<=iters['lvw']['max']):
                lvkmain=iters['lvk']['min']
                while (lvkmain<=iters['lvk']['max']):
                    hvwmain=iters['hvw']['min']
                    while (hvwmain<=iters['hvw']['max']):
                        lvhmain=iters['lvh']['min']
                        while (lvhmain<=iters['lvh']['max']):
                            test={'cmain':cmain,'fmain':fmain,'lvtmain':lvtmain,'lvwmain':lvwmain,'lvkmain':lvkmain,'hvwmain':hvwmain,'lvhmain':lvhmain}
                            y=calculations(test,q)
                            lvhmain=lvhmain+iters['lvh']['step']
                        hvwmain=hvwmain+iters['hvw']['step']
                    lvkmain=lvkmain+iters['lvk']['step']
                lvwmain=lvwmain+iters['lvw']['step']
            lvtmain=lvtmain+iters['lvt']['step']
        fmain=fmain+iters['fx']['step']

def calculations(test,que):
    #perform huge number of calculations here
    output={}
    output['data']=test
    output['costing']='foo'
    que.append(output)

if __name__=='__main__':
    x=Iters()
    x.run_mp()
From a theoretical standpoint:
You're iterating every possible combination of 6 different variables. Unless your search space is very small, or you wanted just a very rough solution, there's no way you'll get any meaningful results within reasonable time.
i need to iterate on and find out the combinations which has the least cost
This very much sounds like an optimization problem.
There are many different efficient ways of dealing with these problems, depending on the properties of the function you're trying to optimize. If it has a straightforward "shape" (it's injective), you can use a greedy algorithm such as hill climbing, or gradient descent. If it's more complex, you can try shotgun hill climbing.
There are a lot more complex algorithms, but these are the basic, and will probably help you a lot in this situation.
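As a minimal sketch of plain hill climbing on a stepped parameter grid: the function name hill_climb, its parameters, and the toy cost function in the usage note are assumptions for illustration, and a real cost function with several local minima would need restarts from different starting points (the "shotgun" variant).

```python
def hill_climb(cost, start, steps, bounds, max_iters=10_000):
    """Greedy descent: move to the best neighbouring grid point until none is better.

    cost   -- function taking a tuple of parameter values, lower is better
    start  -- starting parameter values
    steps  -- step size per parameter
    bounds -- (min, max) per parameter
    """
    current = tuple(start)
    current_cost = cost(current)
    for _ in range(max_iters):
        best, best_cost = current, current_cost
        for i, step in enumerate(steps):
            for delta in (-step, step):
                candidate = list(current)
                candidate[i] += delta
                low, high = bounds[i]
                if not low <= candidate[i] <= high:
                    continue            # stay inside the search space
                candidate_cost = cost(tuple(candidate))
                if candidate_cost < best_cost:
                    best, best_cost = tuple(candidate), candidate_cost
        if best == current:             # no neighbour improves: local minimum
            break
        current, current_cost = best, best_cost
    return current, current_cost
```

For a toy cost such as lambda p: (p[0] - 3) ** 2 + (p[1] + 1) ** 2 starting from (0, 0) with unit steps, this evaluates a handful of points rather than the whole grid. The number of evaluations grows with the path length to the minimum, not with the product of all parameter ranges.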
From a more practical programming standpoint:
You are using very large steps - so large, in fact, that you'll only probe the function 19,200 times. If this is what you want, it seems very feasible. In fact, if I comment out the y=calculations(test,q), this returns instantly on my computer.
As you indicate, there's a "huge number of calculations" there - so maybe that is your real problem, and not the code you're asking for help with.
As to multiprocessing, my honest advice is to not use it until you already have your code executing reasonably fast. Unless you're running a supercomputing cluster (you're not programming a supercomputing cluster in python, are you??), parallel processing will get you speedups of 2-4x. That's absolutely negligible, compared to the gains you get by the kind of algorithmic changes I mentioned.
As an aside, I don't think I've ever seen that many nested loops in my life (excluding code jokes). If you don't want to switch to another algorithm, you might want to consider using itertools.product together with numpy.arange
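A sketch of what flattening the nested loops with itertools.product could look like. To keep the example free of third-party imports it uses a small hypothetical helper, frange, in place of numpy.arange; the helper computes the number of steps up front, which also avoids the accumulated floating-point drift that repeated additions such as lvwmain+0.1 can cause at the upper bound. The function name grid is likewise an assumption.

```python
import itertools


def frange(low, high, step):
    """Inclusive range of evenly spaced values from low to high."""
    count = int(round((high - low) / step)) + 1
    return [round(low + i * step, 10) for i in range(count)]


def grid(iters):
    """Yield one dict per parameter combination, like the nested while loops."""
    names = list(iters)
    axes = [frange(p['min'], p['max'], p['step']) for p in iters.values()]
    for combination in itertools.product(*axes):
        yield dict(zip(names, combination))
```

Each dict yielded by grid(iters) plays the role of the test dictionary built in the innermost loop (with the keys of iters rather than the ...main names), so the six levels of nesting collapse into a single for loop. The resulting sequence of combinations is also easy to hand to a multiprocessing pool in chunks.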