Problem: How long does it take to generate a Python list of prime numbers from 1 to N? Plot a graph of time taken against N.
I used SymPy to generate the list of primes.
I expected the time to increase monotonically.
But why is there a dip?
import numpy as np
import matplotlib.pyplot as plt
from time import perf_counter as timer
from sympy import sieve

T = []
N = np.logspace(1, 8, 30)
for Nup in N:
    tic = timer()
    A = list(sieve.primerange(1, Nup))
    toc = timer()
    T.append(toc - tic)

plt.loglog(N, T, 'x-')
plt.grid()
plt.show()
(Plot: time taken to generate primes up to N.)
The sieve's runtime grows roughly like O(N log log N), i.e. close to linearly in N, so plotting the pure runtime of a sieve on a log-log scale should come out as roughly a straight line for large N.
In your copy of the plot it looks like it's actually getting a bit worse over time. When I run your script, the curve isn't perfectly straight either, but towards the end it is close to a straight line on the log scale, with a bend at the start just like in your result.
This makes sense: the sieve caches previous results, but initially it gets little benefit from that, and there is a small overhead for setting up and growing the cache, which diminishes over time. More importantly, there is the overhead of the actual call into the sieve routine. Also, this kind of performance measurement is very sensitive to anything else going on on your system, including whatever Python and your IDE are doing.
Here's your code with an outer loop added that warms the sieve's cache to a different level before every run; it shows the effect pretty clearly:
import numpy as np
import matplotlib.pyplot as plt
from time import perf_counter as timer, sleep
from sympy import sieve

for warmup_step in range(0, 5):
    warmup = 100 ** warmup_step
    sieve._reset()  # this resets the internal cache of the sieve
    _ = list(sieve.primerange(1, warmup))  # warm the sieve's cache
    _ = timer()  # avoid initial delays from other elements of the code
    sleep(3)
    print('Start')

    times = []
    numbers = np.logspace(1, 8, 30)
    for n in numbers:
        tic = timer()
        _ = list(sieve.primerange(1, n))
        toc = timer()
        times.append(toc - tic)
        print(toc - tic, n)  # provide some visual feedback of speed

    plt.loglog(numbers, times, 'x-')
    plt.title(f'Warmup: {warmup}')
    plt.ylim(1e-6, 1e+1)  # fix the y-axis, so the charts are easily comparable
    plt.grid()
    plt.show()
The lesson to be learned here is that you need to consider overhead, both of your own code and the libraries you use, and of the entire system around them: the Python VM, your IDE, whatever else is running on your workstation, the OS that runs it, the hardware.
The test above is better, but if you want really nice results, run the whole thing a dozen times and average the results over the runs, as sketched below.
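For example, a minimal sketch of that repeat-and-average approach, reusing sieve and timer from the script above; the run count of 12 is arbitrary:

import numpy as np

runs = []
for _ in range(12):
    sieve._reset()  # start every run from a cold cache
    T = []
    for Nup in np.logspace(1, 8, 30):
        tic = timer()
        _ = list(sieve.primerange(1, Nup))
        T.append(timer() - tic)
    runs.append(T)

mean_T = np.mean(runs, axis=0)  # per-N average over all runs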
I have a bunch of independent N-body sims I want to run in parallel in Python. The walltime for individual sims will vary dramatically depending on the parameters of the bodies. It seemed like the best way to do this would be to build a pool of processes with the multiprocessing module, hand them the sim jobs with the starmap() function, and have them save the results to separate files based on the process ID. However, I've been getting awful parallel performance: there is no speedup between 2 and 4 processes (I have 4 CPUs on my laptop), and the Unix time utility reports a CPU usage percentage of ~150%, which is terrible. Below is my code:
import rebound
import numpy as np
import multiprocessing as mp

def two_orbits_one_pool(orbit1, orbit2):
    #######################################
    print('process number', mp.current_process().name)
    #######################################
    # build simulation
    sim = rebound.Simulation()
    # add sun
    sim.add(m=1.)
    # add two overlapping orbits
    sim.add(primary=sim.particles[0], m=orbit1['m'], a=orbit1['a'], e=orbit1['e'],
            inc=orbit1['i'], pomega=orbit1['lop'], Omega=orbit1['lan'], M=orbit1['M'])
    sim.add(primary=sim.particles[0], m=orbit2['m'], a=orbit2['a'], e=orbit2['e'],
            inc=orbit2['i'], pomega=orbit2['lop'], Omega=orbit2['lan'], M=orbit2['M'])
    sim.move_to_com()
    # integrate for 10 orbits of orbit1
    P = 2.*np.pi * np.sqrt(orbit1['a']**3)
    sim.automateSimulationArchive("archive-{}.bin".format(mp.current_process().name), interval=P)
    sim.integrate(10.*P)

if __name__ == "__main__":
    # orbit definitions
    N_M = 10
    N_lop = 10
    m = 1e-6
    a, e = 1., 0.3
    inc, lop, lan = 0., 0., 0.
    M = np.linspace(0., 2*np.pi, endpoint=False, num=N_M)
    dlop = np.linspace(0., 0.05, num=N_lop)
    # orbit dictionaries
    args = []
    for i in range(dlop.shape[0]):
        for j in range(M.shape[0]):
            for k in range(M.shape[0]):
                args.append(({'m': m, 'a': a, 'e': e, 'i': inc,
                              'lop': lop, 'lan': lan, 'M': M[j]},
                             {'m': m, 'a': a, 'e': e, 'i': inc,
                              'lop': lop + dlop[i], 'lan': lan, 'M': M[k]}))
    # fill the pool with orbit jobs
    with mp.Pool() as pool:
        pool.starmap(two_orbits_one_pool, args)
Could someone explain why this is performing so poorly? I'm much more used to OpenMP and MPI and am not that familiar with parallel programming in Python. Overall, I've been quite disappointed in the multiprocessing module; maybe I should try the numba module instead?
EDIT:
In response to Roland Smith's answer, I profiled the integration and save times for my code. Here is a stripplot showing the results. As you can see, both Roland Smith's and J_H's suggestions were correct: there is a subset of initial conditions that result in extremely long integration times due to close encounters between the bodies, and, in general, the save time was about 5 times longer than the integration time. The job suffers from stragglers and is disk-I/O bound.
If there is no discernable speedup, then probably your code is not CPU-bound.
In general, writing to a disk (even an SSD) is much slower than running code on the CPU.
If several worker processes are writing significant amounts of data to disk, that might be the bottleneck.
To diagnose the problem, you have to measure.
You should separate the calculations from the saving of the data: e.g. run sim.integrate() followed by sim.simulationarchive_snapshot() ten times, and sandwich each of those calls between time.monotonic() calls. Then return the average times of the integration and snapshot steps, as shown below.
import time

def two_orbits_one_pool(orbit1, orbit2):
    #######################################
    print('process number', mp.current_process().name)
    #######################################
    # build simulation
    sim = rebound.Simulation()
    # add sun
    sim.add(m=1.)
    # add two overlapping orbits
    sim.add(primary=sim.particles[0], m=orbit1['m'], a=orbit1['a'], e=orbit1['e'],
            inc=orbit1['i'], pomega=orbit1['lop'], Omega=orbit1['lan'], M=orbit1['M'])
    sim.add(primary=sim.particles[0], m=orbit2['m'], a=orbit2['a'], e=orbit2['e'],
            inc=orbit2['i'], pomega=orbit2['lop'], Omega=orbit2['lan'], M=orbit2['M'])
    sim.move_to_com()
    # integrate for 10 orbits of orbit1, timing each step separately
    P = 2.*np.pi * np.sqrt(orbit1['a']**3)
    arname = "archive-{}.bin".format(mp.current_process().name)
    itime, stime = 0.0, 0.0
    for k in range(10):
        start = time.monotonic()
        sim.integrate(P)
        itime += time.monotonic() - start
        start = time.monotonic()
        sim.simulationarchive_snapshot(arname)
        stime += time.monotonic() - start
    return (mp.current_process().name, itime/10, stime/10)

# Run the calculations
with mp.Pool() as pool:
    data = pool.starmap(two_orbits_one_pool, args)

# Print the times that it took.
for name, itime, stime in data:
    print(f"worker {name}: itime {itime} s, stime {stime} s")
That should tell you what the bottleneck is.
Possible solutions if writing to disk is the bottleneck:
Use an SSD to store the simulation results.
Use a RAM disk to store the simulation results (although compared to an SSD, not a huge performance boost).
Check if you can tune your OS for maximum write performance.
Edit1: Given your measurement result, the obvious performance improvement is to save less often.
Another option that might be worth looking at is staggering the writes. That only makes sense if there is significant overlap between the writes from different processes, and if those concurrent writes can saturate the disk I/O subsystem. So you'd have to measure that first.
If there is overlap, create a Lock object in the parent process, acquire the lock before (explicitly) saving, and release it afterwards; a sketch follows below. This won't work with automateSimulationArchive.
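A minimal sketch of that locking approach, assuming the worker saves explicitly via simulationarchive_snapshot (with sim and arname set up as in the code above). The initializer is needed because a Lock cannot be passed through starmap's argument pickling:

import multiprocessing as mp

save_lock = None

def init_worker(lock):
    # Runs once in every worker process; stores the shared lock in a global.
    global save_lock
    save_lock = lock

def two_orbits_one_pool(orbit1, orbit2):
    # ... build and integrate the simulation as before ...
    with save_lock:  # only one process writes to disk at a time
        sim.simulationarchive_snapshot(arname)

if __name__ == "__main__":
    lock = mp.Lock()
    with mp.Pool(initializer=init_worker, initargs=(lock,)) as pool:
        pool.starmap(two_orbits_one_pool, args)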
A last option is to write your own save function using mmap. Using mmap is somewhat clunky compared to normal file handling in Python, but it can significantly improve performance. However, I am unsure that the gains justify the effort in this case; a bare-bones sketch follows.
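For illustration, a bare-bones sketch of an mmap-based save; mmap_save and payload are hypothetical names, and it assumes you can serialize the simulation state to bytes yourself:

import mmap

def mmap_save(path, payload):
    # Pre-size the file to the payload length, then copy through a memory map.
    with open(path, "wb") as f:
        f.truncate(len(payload))
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), len(payload)) as mm:
            mm[:] = payload  # one bulk copy into the mapped region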
The straggler effect can have a big impact on such jobs.
straggler effect
Suppose you have N tasks for N cores, and each task has a different duration. Order the tasks by duration to find min_time and max_time. All N cores will be busy up through min_time, but then they go idle, one by one. Just before max_time, only a single "straggler" core is still being used.
predictions
If you can make a decent guess about task duration beforehand, use it to sort the tasks in descending order. For T tasks > N cores, schedule the long tasks first. Then N tasks run for a while, the shortest of those completes, and the now-idle core picks up a task of "medium" duration. By the time we get to the T-th task, each core has a random amount of work still to do, and we're scheduling a "short" task. So cores are mostly busy doing useful work, right up till near the end; see the sketch below.
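A minimal sketch of that ordering for the multiprocessing code above; estimate_duration(orbit1, orbit2) is a hypothetical per-task cost guess, and chunksize=1 keeps the scheduling dynamic:

# Hand out the longest-running jobs first.
sorted_args = sorted(args, key=lambda pair: estimate_duration(*pair), reverse=True)
with mp.Pool() as pool:
    pool.starmap(two_orbits_one_pool, sorted_args, chunksize=1)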
logging
If you cannot make a useful duration estimate a priori, at least record the start times and durations. Use them to figure out whether stragglers are causing you grief, or whether it's something else, like L3 cache thrashing; a sketch of such a timing wrapper follows.
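As a sketch, a wrapper (hypothetical name timed_two_orbits) that records per-task timing alongside the worker name:

import time
import multiprocessing as mp

def timed_two_orbits(orbit1, orbit2):
    # Record which worker ran the task, when it started, and how long it took.
    start = time.monotonic()
    two_orbits_one_pool(orbit1, orbit2)
    return (mp.current_process().name, start, time.monotonic() - start)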
I am having an issue with my attempt at speeding up my program. In the serial Python version of my code, I compute the values of a function f(x), which returns a float, over sliding windows of a NumPy array, as can be seen below:
import numpy as np

a = np.array([i for i in range(1, 10000000)])  # Some data here
N = 100
result = []
for i in range(N, len(a)):
    result.append(f(a[i - N:i]))
Since the NumPy array is really large and f(x)'s runtime is high, I tried to apply multiprocessing to speed up my code. Through my research I found that charm4py might be a good solution: it has a Pool feature, which breaks an array into chunks and distributes the work between spawned processes. I've implemented charm4py's multiprocessing example and then translated it to my case:
# Split an array into subarrays for sequential processing (takes only 5 seconds)
a = np.array([a[i - N:i] for i in range(N, len(a))])
result = charm.pool.map(f, a, chunksize=512, ncores=-1)
# I'm running this code through "charmrun +p18 example.py"
The issue I've encountered is that the code runs a lot slower, despite being executed on a more powerful instance (18 physical cores vs 6 physical cores).
I expected to see a ~3x improvement, but it didn't happen. While searching for solutions, I learned that there can be significant overhead from expensive deserialization and from spinning up new processes, but I am not sure whether that is the case here.
I would really appreciate any feedback or suggestions on how to implement fast parallel processing of a NumPy array, assuming that the function f(x) is not vectorized, takes a long time to compute, and internally makes a large number of specific/individual calls that cannot be parallelized.
Thank you!
It sounds like you're trying to parallelize this operation with either Charm or Ray (it's not clear how you would use both together).
If you choose to use Ray, and your data is a numpy array, you can take advantage of zero-copy reads to avoid any deserialization overhead.
You may want to optimize your sliding window function a bit, but it will likely look like this:
import numpy as np
import ray

@ray.remote
def apply_rolling(f, arr, start, end, window_size):
    results_arr = []
    for i in range(start, end - window_size):
        results_arr.append(f(arr[i : i + window_size]))
    return np.array(results_arr)
Note that this structure lets us call f multiple times within a single task (a.k.a. batching).
To use our function:
# Some small setup
big_arr = np.arange(10000000)
big_arr_ref = ray.put(big_arr)
batch_size = len(big_arr) // int(ray.available_resources()["CPU"])
window_size = 100

# Kick off our tasks
result_refs = []
for i in range(0, len(big_arr), batch_size):
    end_point = min(i + batch_size, len(big_arr))
    ref = apply_rolling.remote(f, big_arr_ref, i, end_point, window_size)
    result_refs.append(ref)

# Handle the results
flattened = []
for section in ray.get(result_refs):
    flattened.extend(section)
I'm sure you'll want to customize this code, but here are two important properties that you'll likely want to maintain.
Batching: We're utilizing batching to avoid starting too many tasks. In any system, parallelizing incurs overhead, so we always want to be careful and make sure we don't start too many tasks. Furthermore, we are calculating batch_size = len(big_arr) // ray.available_resources()["CPU"] to make sure we use exactly the same number of batches as we have CPUs.
Shared memory: Since Ray's object store supports zero copy reads from numpy arrays, calling ray.get or reading from a numpy array is pretty much free (on a single machine where there are no network costs). There is some overhead in serializing/calling ray.put though, so this approach only calls put (the expensive operation) once, and ray.get (which is implicitly called) many times.
Tip: Be careful when passing arrays as parameters directly into remote functions: Ray will call ray.put on the argument for every submission, even if you pass the same object. A small illustration follows.
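To illustrate, assuming the apply_rolling task and names from above:

# Anti-pattern: big_arr is re-serialized for every task submission.
refs = [apply_rolling.remote(f, big_arr, i, i + batch_size, window_size)
        for i in range(0, len(big_arr), batch_size)]

# Better: put the array in the object store once and pass the reference.
big_arr_ref = ray.put(big_arr)
refs = [apply_rolling.remote(f, big_arr_ref, i, i + batch_size, window_size)
        for i in range(0, len(big_arr), batch_size)]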
Here's an example based on your code snippet that uses Ray to parallelize the array computations.
Note that the best way to do this will depend on what your function f looks like.
import numpy as np
import ray
import time

ray.init()

N = 100000
a = np.arange(10**7)
a_id = ray.put(a)

@ray.remote
def f(array, index):
    # Do processing
    time.sleep(0.2)
    return 1

result_ids = []
for i in range(len(a) // N):
    result_ids.append(f.remote(a_id, i))
results = ray.get(result_ids)
My class has been learning Python's turtle module recently (which I gather uses tkinter), and I was wondering if there was a way to adjust the rate at which tkinter/turtle executes its code, because it doesn't seem (from my limited understanding) to be limited by the computational abilities of my computer. I say that because in Task Manager (I'm on Windows, if that affects anything), the Python shell only uses a small percentage of the CPU's capacity (~2%), and likewise for the GPU, RAM, disk, etc. Additionally, increasing its operational priority neither affects how much of my CPU is used nor increases the rate at which it executes its code.
Note that I'm not referring to the speed at which the turtle executes each action as determined by turtle.speed(); I've already set that to '0' so each action is effectively instantaneous. My problem instead lies with what seems to be the time taken between each step, which appears to be limited to about 80 actions per second (more on this later).
For example, the following code draws an approximation of a parabola, given some precision. The higher the precision, the better the approximation but the longer it takes to draw, as it's taking more, smaller steps.
import math
import turtle as t

precision = 0.1
t.penup()
t.goto(-250, 150)
t.pendown()
for n in range(int(800 * precision)):
    t.setheading(math.degrees(math.atan(0.02 * n - 8)))
    t.fd(1)
Effectively, for precisions close to or above 1, it takes far longer than I would like, and in general, drawing fine curves in tkinter is too slow, so I want to know if there's a way to adjust this speed.
Part of my difficulty when trying to research a solution has been that I simply don't know what the relevant terminology is, so I've tried using vaguely related terms, including some hardware-based analogues and various other roughly analogous things, e.g.:
clock speed
refresh rate
frame rate
tick speed (Minecraft ftw?)
step-through rate
execution rate
actions per second
steps per second
But all to no avail, attempting to describe the issue in Google fails too.
Additionally, I simply don't understand what the underlying bottleneck is (or even if there is a single bottleneck) that's causing it to be so slow, which makes the issue difficult to solve.
I've noticed that if a command for the turtle takes a significant amount of time to calculate (for example, by forcing it to do a ridiculous amount of computation to work out a simple value), then each step simply takes longer to execute, suggesting that maybe it is just a hardware limitation. However, when timing the execution with Python's timeit module, it seems to execute almost exactly the same number of actions per second for any function, regardless of the complexity of the individual action, up to a point, beyond which the complexity begins to slow it down. So it's as though there's some cap on the rate. This specific limit also seems to change occasionally, suggesting that the computer's state influences it to some degree.
Also, just in case, this is the timeit setup I used:
import timeit

mysetup = """
import math
import turtle as t

def DefaultDerivative(x):
    return 2*x - x

def GeneralEquation(precision=1, XShift=0, YShift=0, Derivative=DefaultDerivative):
    t.penup()
    t.goto(XShift, YShift)
    t.pendown()
    for n in range(0, int(800*precision)):
        t.setheading(math.degrees(math.atan(Derivative(((0.01*n) - (4*precision))/precision))))
        t.fd(1/precision)

def equation1(x):
    return (2*(x**2)) + (2*x)

def equation2(x):
    return x**2

def equation3(x):
    return math.cos(x)

def equation4(x):
    return 2*x

t.speed(0)
"""

mycode = """
GeneralEquation(5, -350, 300, equation4)
"""

print("time: " + str(timeit.timeit(setup=mysetup, stmt=mycode, number=10)))
Anyway, this is my first question so I hope I explained myself well enough.
Thank you.
Is this quick enough for your purposes?
import timeit

mysetup = """
import turtle
from math import atan, cos

def DefaultDerivative(x):
    return 2 * x - x

def GeneralEquation(precision=1, XShift=0, YShift=0, Derivative=DefaultDerivative):
    turtle.radians()
    turtle.tracer(False)
    turtle.penup()
    turtle.goto(XShift, YShift)
    turtle.pendown()
    for n in range(0, int(800 * precision)):
        turtle.setheading(atan(Derivative((0.01 * n - 4 * precision) / precision)))
        turtle.forward(1 / precision)
    turtle.tracer(True)

def equation1(x):
    return 2 * x ** 2 + 2 * x

def equation2(x):
    return x ** 2

def equation3(x):
    return cos(x)

def equation4(x):
    return 2 * x
"""

mycode = """
GeneralEquation(5, -350, 300, equation4)
"""

print("time: " + str(timeit.timeit(setup=mysetup, stmt=mycode, number=10)))
Basically, I've turned off turtle's attempts at animation with tracer(). I also threw in a command to make turtle think in radians, so you don't need to call the degrees() function over and over. If you want to see some animation, you can tweak the argument to tracer(), e.g. turtle.tracer(20); a standalone illustration of the pattern follows.
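For reference, a minimal standalone sketch of the tracer pattern with the numeric arguments (tracer(0) disables animation; update() redraws once at the end):

import turtle

turtle.tracer(0)  # turn off animation entirely
for n in range(1000):
    turtle.forward(2)
    turtle.left(0.5)
turtle.update()  # draw the finished picture in one go
turtle.done()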
Hi, I am trying to estimate the run time of my FFT code from numpy with different input lengths N. The following is my code:
import cmath
import math
from random import uniform
from numpy.fft import fft
import time

for i in range(3, 10):
    N = 2**i
    x = [uniform(-32768, 32767) for _ in range(N)]
    t0 = time.clock()
    X = fft(x)
    t1 = time.clock()
    print t1 - t0
This is the result I got. The first line, for the smallest input length (N = 2**3 = 8), should be the fastest one, but no matter how many times I run it, the first measurement is always the largest. I guess this is a problem with the timer, but I don't know the exact reason for it. Can anyone explain this to me?
Output:
4.8e-05
3e-05
1.7e-05
6e-05
3.1e-05
5.4e-05
9.6e-05
The time interval is too small to be measured accurately by time.clock(), as there is latency jitter in the OS call. Instead, do enough work (loop each FFT a few thousand or million times) so that the work being measured takes at least a few seconds. Also repeat each measurement several times and take an average (or the minimum), as there may be other system overheads (cache flushes, process switches, etc.) that vary the performance; a sketch follows.
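A minimal sketch of that approach using the timeit module, which picks a high-resolution timer for you; the repetition counts are arbitrary:

import timeit
import numpy as np

for i in range(3, 10):
    N = 2**i
    x = np.random.uniform(-32768, 32767, N)
    # Run the FFT 10000 times per sample, repeat 5 times, keep the best sample.
    best = min(timeit.repeat(lambda: np.fft.fft(x), number=10000, repeat=5))
    print(N, best / 10000)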
I want to generate a square clock waveform for an external device.
I use Python 2.7 with Windows 7 32-bit on an old PC with an LPT1 port.
The code is simple:
import parallel
import time

p = parallel.Parallel()  # open LPT1
x = 0
while (x == 0):
    p.setData(0xFF)
    time.sleep(0.0005)
    p.setData(0x00)
I do see the square wave (using a scope), but not with the expected period.
I will be grateful for any help.
It gives the expected performance for a while... Continuing to reduce the times:
import parallel
import time

x = 0
while (x < 2000):
    p = parallel.Parallel()  # open LPT1
    time.sleep(0.01)
    p.setData(0xFF)
    p = parallel.Parallel()  # open LPT1
    time.sleep(0.01)
    p.setData(0x00)
    x = x + 1
Generating signals like that in software is hard. One reason is that the process only gets scheduled again some time after the sleep interval has elapsed; sleep() guarantees a minimum delay, not an exact one.
Found this post about sleep precision with an accepted answer that is great:
How accurate is python's time.sleep()?
another source of information: http://www.pythoncentral.io/pythons-time-sleep-pause-wait-sleep-stop-your-code/
What that information tells you is that Windows can only sleep for a minimum of about 10 ms; on Linux the minimum is approximately 1 ms, but it may vary. You can check this on your own machine with something like the sketch below.
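As a quick check, a small sketch that measures how long time.sleep() actually takes for a few requested durations (time.time() has coarse resolution on Windows, so the results are indicative only):

import time

for requested in (0.0005, 0.001, 0.005, 0.01):
    start = time.time()
    time.sleep(requested)
    actual = time.time() - start
    print("requested %.1f ms, slept %.2f ms" % (requested * 1e3, actual * 1e3))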
Update
I made a function that makes it possible to sleep for less than 10 ms, but its precision is very sketchy.
The attached code includes a test that shows how the precision behaves. If you want higher precision, I strongly recommend reading the links from my original answer.
from time import time, sleep
import timeit

def timer_sleep(duration):
    """timer_sleep() sleeps for a given duration in seconds"""
    stop_time = time() + duration
    while (time() - stop_time) < 0:
        # Throw in something that will take a little time to process.
        # According to measurements from the comments, it takes approx.
        # 2 microseconds to handle this one.
        sleep(0)

if __name__ == "__main__":
    for u_time in range(1, 100):
        u_constant = 1000000.0
        duration = u_time / u_constant
        result = timeit.timeit(stmt='timer_sleep({time})'.format(time=duration),
                               setup="from __main__ import timer_sleep",
                               number=1)
        print('===== RUN # {nr} ====='.format(nr=u_time))
        print('Returns after \t{time:.10f} seconds'.format(time=result))
        print('It should take\t{time:.10f} seconds'.format(time=duration))
Happy hacking