I want to repeat a task several times during an interval of time, and I want the executions to be spread uniformly: if I want to run the task 4 times in 1 second, it should execute at t = 0, 0.25, 0.5 and 0.75.
So now I am doing:
import math
import socket
import time

s = socket.socket(...)  # not important

time_step = 1. / num_times_executed
for _ in range(num_times_executed):
    now = time.time()
    s.sendto(...)  # action I do
    time.sleep(max(0, time_step - (time.time() - now)))
However there is a lot of overhead, and the bigger the loop, the more drift I get. For example with num_times_executed = 800 it takes 1.1 seconds, so roughly 10% off...
Is there a way to do this with good precision?
time_step = 1. / num_times_executed
start = time.time()
for i in range(num_times_executed):
    s.sendto(...)  # action I do
    next_send_time = start + (i + 1) * time_step
    time.sleep(max(0, next_send_time - time.time()))
Now you won't accumulate drift, because each sleep deadline is an absolute offset from the single start time. Previously, the small amount of work done before setting now = time.time() added a tiny error on every iteration; with this version, as long as time_step is long enough to execute the s.sendto(...) call, you shouldn't see any drift.
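If it helps, here is a minimal sketch of the same fixed-schedule idea wrapped in a reusable helper; the function name run_uniformly and the switch to time.monotonic() are my own choices, not something from the question:

import time

def run_uniformly(action, num_times, interval=1.0):
    """Call `action` num_times, spread evenly over `interval` seconds.

    Deadlines are computed from a single start time, so per-iteration
    overhead does not accumulate into drift.
    """
    time_step = interval / num_times
    start = time.monotonic()          # monotonic clock: immune to wall-clock jumps
    for i in range(num_times):
        action()
        next_deadline = start + (i + 1) * time_step
        time.sleep(max(0.0, next_deadline - time.monotonic()))

# Example usage (prints 4 timestamps roughly 0.25 s apart):
# run_uniformly(lambda: print(time.monotonic()), num_times=4, interval=1.0)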
I've tried to run this little piece of code: it picks random points (here 50k, close to what I have in reality) and returns the 10 nearest points for each randomly selected point.
Unfortunately this is (really!) slow, almost certainly because of the loop.
As I'm pretty new to code optimization, is there a trick to make this much faster? (Faster at Python scale, I know I'm not coding in C++.)
Here is a reproducible example with a data size close to what I have:
import time
import numpy as np
from numpy import random
from scipy.spatial import distance

# USEFUL FUNCTION
start_time = time.time()

def closest_node(node, nodes):
    nodes = np.asarray(nodes)
    deltas = nodes - node
    dist_2 = np.einsum("ij,ij->i", deltas, deltas)  # squared distances
    ndx = dist_2.argsort()
    return data[ndx[:10]]

# REPRODUCIBLE DATA
mean = np.array([0.0, 0.0, 0.0])
cov = np.array([[1.0, -0.5, 0.8], [-0.5, 1.1, 0.0], [0.8, 0.0, 1.0]])
data = np.random.multivariate_normal(mean, cov, 500000)

# START RUNNING
points = data[np.random.choice(data.shape[0], int(np.round(0.1 * len(data), 0)))]
print(len(points))

for w in points:
    closest_node(w, data)

print("--- %s seconds ---" % (time.time() - start_time))
The time it takes to run argsort on your 500,000-element array in every loop iteration is huge. The only improvement I can think of is to use something that returns the smallest 10 elements without fully sorting the whole array:
A fast way to find the largest N elements in a numpy array
So instead of
ndx = dist_2.argsort()
return data[ndx[:10]]
It would be
ndx = np.argpartition(dist_2, 10)[:10]
return data[ndx[:10]]
I only benchmarked on 500 points because it already took quite some time to run on my PC.
N=500
Using argsort: 25.625439167022705 seconds
Using argpartition: 6.637120485305786 seconds
You would probably be best off analyzing the slowest points with a profiler: How do I find out what parts of my code are inefficient in Python
One thing that stands out at first glance is that you should move as much work as possible outside the loop. If you are going to convert points via np.asarray(), it would probably be better to do it once for all points before the loop and use the result inside the function, rather than calling np.asarray() on every iteration.
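As a rough sketch combining both suggestions (argpartition instead of a full sort, and converting the data once outside the loop); the helper name closest_nodes_partition is made up for illustration, the data here is just a toy stand-in for the question's arrays, and the 10 neighbours come back unordered:

import numpy as np

def closest_nodes_partition(node, nodes):
    """Return the 10 nearest rows of `nodes` to `node`, in arbitrary order."""
    deltas = nodes - node
    dist_2 = np.einsum("ij,ij->i", deltas, deltas)   # squared distances
    ndx = np.argpartition(dist_2, 10)[:10]           # partial selection, no full sort
    return nodes[ndx]

# Convert once, outside the loop, instead of np.asarray() on every call.
data = np.asarray(np.random.multivariate_normal(np.zeros(3), np.eye(3), 50000))
points = data[np.random.choice(data.shape[0], 5000)]

neighbours = [closest_nodes_partition(w, data) for w in points]

Depending on how many queries you run against the same point set, building a scipy.spatial.cKDTree(data) once and calling .query(points, k=10) on it may also be worth benchmarking.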
Newbie starting with Numba/CUDA here.
I wrote this little test script to compare @jit and @cuda.jit speeds, just to get a feel for it. It calculates 10M steps of a logistic equation for 256 separate instances.
The CUDA part takes approximately 1.2 s to finish.
The CPU 'jitted' part finishes in close to 5 s (just one thread used on the CPU).
So there is a speedup of about 4x from going to the GPU (a dedicated GTX 1080 Ti not doing anything else). I expected the CUDA part, doing all 256 instances in parallel, to be much faster. What am I doing wrong?
Here is the working example:
#!/usr/bin/python3
# logistic equation on gpu/cpu comparison
import os, sys

# Set environment variables (needed for numba 0.42 to find nvvm)
os.environ['NUMBAPRO_NVVM'] = '/usr/lib/x86_64-linux-gnu/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/lib/nvidia-cuda-toolkit/libdevice/'

from time import time
from scipy import *
from numba import cuda, jit
from numba import int64, int32, float64

@cuda.jit
def logistic_cuda(array_in, array_out):
    pos = cuda.grid(1)
    x = array_in[pos]
    for k in range(10*1000*1000):
        x = 3.9 * x * (1 - x)
    array_out[pos] = x

@jit
def logistic_cpu(array_in, array_out):
    for pos, x in enumerate(array_in):
        for k in range(10*1000*1000):
            x = 3.9 * x * (1 - x)
        array_out[pos] = x

if __name__ == '__main__':
    N = 256
    in_ary = random.uniform(low=0.2, high=0.9, size=N).astype('float32')
    out_ary = zeros(N, dtype='float32')

    t0 = time()
    # explicit copying; not really needed
    d_in_ary = cuda.to_device(in_ary)
    d_out_ary = cuda.to_device(out_ary)
    t1 = time()
    logistic_cuda[1, N](d_in_ary, d_out_ary)
    cuda.synchronize()
    t2 = time()
    out_ary = d_out_ary.copy_to_host()
    t3 = time()
    print(out_ary)
    print('Total time cuda: %g seconds.' % (t3 - t0))

    out_ary2 = zeros(N)
    t4 = time()
    logistic_cpu(in_ary, out_ary2)
    t5 = time()
    print('Total time cpu: %g seconds.' % (t5 - t4))

    print('\nDifference:')
    print(out_ary2 - out_ary)

# Total time cuda: 1.19364 seconds.
# Total time cpu: 5.01788 seconds.
Thanks!
The problem likely comes from the very small amount of data and the loop dependency. Modern Nvidia GPUs can execute thousands of CUDA threads simultaneously (packed in warps of 32 threads) thanks to their large number of CUDA cores. In your case, each thread performs a computation on one cell of array_out using a sequential loop, but there are only 256 cells. Thus, at most 256 threads (8 warps) can run simultaneously, only a tiny fraction of the number of simultaneous threads your GPU is able to manage. As a result, if you want a better speed-up, you need to provide more parallelism to the GPU (for example by increasing the data size or by running multiple independent computations at the same time).
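For illustration only, here is a rough sketch of what giving the GPU more work could look like; the problem size of 256 * 1024 instances and the blocks/threads split are arbitrary choices, not values from the post, and the kernel is the one from the question with an added bounds guard:

import numpy as np
from numba import cuda

@cuda.jit
def logistic_cuda(array_in, array_out):
    # Same logistic-map kernel as in the question, plus a bounds guard.
    pos = cuda.grid(1)
    if pos < array_in.size:
        x = array_in[pos]
        for k in range(10 * 1000 * 1000):
            x = 3.9 * x * (1 - x)
        array_out[pos] = x

N = 256 * 1024                      # many more independent instances than 256
threads_per_block = 256
blocks = (N + threads_per_block - 1) // threads_per_block   # enough blocks to cover N

in_ary = np.random.uniform(0.2, 0.9, size=N).astype('float32')
out_ary = np.zeros(N, dtype='float32')

d_in = cuda.to_device(in_ary)
d_out = cuda.to_device(out_ary)
logistic_cuda[blocks, threads_per_block](d_in, d_out)   # launch a full grid of blocks
cuda.synchronize()
out_ary = d_out.copy_to_host()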
I am simulating flipping 999 coins 1000 times and drawing a distribution of the sample mean, which takes a long time (about 21 seconds). Is there a better way to do this? A faster way to run the for loop, for instance? Would vectorizing be useful?
import datetime
import numpy as np

sample_mean_dis = []
start_time = datetime.datetime.now()

# to draw a distribution of the sample mean
for i in range(1000):
    if not (i % 100):
        print('iterate: ', i)
    sums_1000coins = []
    # simulate 1k repetitions of experiment_1,
    # treat that as a sample and compute the sample mean
    for j in range(1000):
        # this simulates experiment_1, which flips 999 coins
        # and sums the heads
        coins = np.random.randint(2, size=999)
        sums_1000coins.append(np.sum(1 == coins))
    sample_mean_dis.append(np.mean(sums_1000coins))

end_time = datetime.datetime.now()
elapsedTime = end_time - start_time
print("Elapsed time: %d seconds" % (elapsedTime.total_seconds()))
To flip 999 coins and see which come up heads, read 999 bits of random data (a bit is either 0 or 1 with probability 50/50, just like a coin) and then count how many bits are set to 1.
import random
bin(random.getrandbits(999)).count("1")
the above will probably return a number close to 499.5
To flip 999 coins 1000 times do the above in a for loop:
num_heads = [bin(random.getrandbits(999)).count("1") for _ in range(1000)]
num_heads will be a list of 1000 integers normally distributed around 499.5 (999/2).
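Since the question also asks whether vectorizing helps, here is a hedged NumPy sketch that replaces both loops with a single call to np.random.binomial (the variable names are mine):

import numpy as np

# Each entry of `heads` is the number of heads in one flip of 999 fair coins;
# shape (1000, 1000) covers 1000 samples of 1000 repetitions each.
heads = np.random.binomial(n=999, p=0.5, size=(1000, 1000))

# The sample mean of each row reproduces sample_mean_dis from the question.
sample_mean_dis = heads.mean(axis=1)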
I am using Python's multiprocessing module to launch some Monte Carlo simulations in order to speed up the computation. The code I have looks like this:
def main():
    # (various parameters are being set up...)

    start = 0
    end = 10
    count = int(1e4)
    time = np.linspace(start, end, num=count)

    num_procs = 12
    mc_iterations_per_proc = int(1e5)
    mc_iterations = num_procs * mc_iterations_per_proc

    mean_estimate, mean_estimate_variance = np.zeros(count), np.zeros(count)

    pool = multiprocessing.Pool(num_procs)
    for index, (estimate, estimate_variance) in enumerate(pool.imap_unordered(
            mc_linear_estimate,
            ((disorder_mean, intensity, wiener_std, time) for index in xrange(mc_iterations)),
            chunksize=mc_iterations_per_proc)):
        delta = estimate - mean_estimate
        mean_estimate = mean_estimate + delta / float(index + 1)
        mean_estimate_variance = mean_estimate_variance + delta * (estimate - mean_estimate)

    mean_estimate_variance = mean_estimate_variance / float(index)
Ok, now mc_linear_estimate is a function taking *args and creating additional variables inside it. It looks like this:
def mc_linear_estimate(*args):
    disorder_mean, intensity, wiener_std, time = args[0]

    theta_process = source_process(time, intensity, disorder_mean)
    xi_process = observed_process(time, theta_process, wiener_std)
    gamma = error_estimate(time, intensity, wiener_std, disorder_mean)
    estimate = signal_estimate(time, intensity, wiener_std, disorder_mean, gamma, xi_process)

    estimate_variance = (estimate - theta_process) ** 2
    return estimate, estimate_variance
As you can see, the number of iterations is pretty large (1.2M) and each of the arrays holds 10K doubles, so I use Welford's algorithm to compute the mean and variance, because it does not require storing every element of the considered sequences in memory. However, this does not help.
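For reference, here is a minimal scalar sketch of the Welford-style running update mentioned above (just to make the recurrence explicit; it is not taken from the code in the question):

def welford(samples):
    """Running mean and (population) variance without storing the samples."""
    mean, m2 = 0.0, 0.0
    for n, x in enumerate(samples, start=1):
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)      # uses the *updated* mean, as in Welford's algorithm
    return mean, m2 / n               # divide by (n - 1) instead for the sample variance

# Example: welford([1.0, 2.0, 3.0]) -> (2.0, 0.666...)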
The problem: I run out of memory. When I launch the application, 12 processes appear (as seen with the top program on my Linux machine). They immediately start consuming a lot of memory, but since the machine has 49G of RAM, things are OK for a while. Then, as each of the processes climbs to around 4G of RAM, one of them fails and shows up as <defunct> in top. Then another process drops out, and this continues until only one process is left, which ultimately fails with an "Out of memory" exception.
The questions:
What am I possibly doing wrong?
How could I improve the code so that it wouldn't consume all the memory?
I recently asked about trying to optimise a Python loop for a scientific application, and received an excellent, smart way of recoding it within NumPy which reduced execution time by a factor of around 100 for me!
However, calculation of the B value is actually nested within a few other loops, because it is evaluated at a regular grid of positions. Is there a similarly smart NumPy rewrite to shave time off this procedure?
I suspect the performance gain for this part would be less marked, and the disadvantages would presumably be that it would no longer be possible to report progress to the user during the calculation, that the results could not be written to the output file until the end of the calculation, and possibly that doing this in one enormous step would have memory implications. Is it possible to circumvent any of these?
import numpy as np
import time

def reshape_vector(v):
    b = np.empty((3, 1))
    for i in range(3):
        b[i][0] = v[i]
    return b

def unit_vectors(r):
    return r / np.sqrt((r*r).sum(0))

def calculate_dipole(mu, r_i, mom_i):
    relative = mu - r_i
    r_unit = unit_vectors(relative)
    A = 1e-7
    num = A*(3*np.sum(mom_i*r_unit, 0)*r_unit - mom_i)
    den = np.sqrt(np.sum(relative*relative, 0))**3
    B = np.sum(num/den, 1)
    return B

N = 20000                          # number of dipoles
r_i = np.random.random((3, N))     # positions of dipoles
mom_i = np.random.random((3, N))   # moments of dipoles
a = np.random.random((3, 3))       # three basis vectors for this crystal
n = [10, 10, 10]                   # points at which to evaluate sum
gamma_mu = 135.5                   # a constant

t_start = time.clock()
for i in range(n[0]):
    r_frac_x = np.float(i)/np.float(n[0])
    r_test_x = r_frac_x * a[0]
    for j in range(n[1]):
        r_frac_y = np.float(j)/np.float(n[1])
        r_test_y = r_frac_y * a[1]
        for k in range(n[2]):
            r_frac_z = np.float(k)/np.float(n[2])
            r_test = r_test_x + r_test_y + r_frac_z * a[2]
            r_test_fast = reshape_vector(r_test)
            B = calculate_dipole(r_test_fast, r_i, mom_i)
            omega = gamma_mu*np.sqrt(np.dot(B, B))
            # write r_test, B and omega to a file
    frac_done = np.float(i+1)/(n[0]+1)
    t_elapsed = (time.clock()-t_start)
    t_remain = (1-frac_done)*t_elapsed/frac_done
    print frac_done*100, '% done in', t_elapsed/60., 'minutes...approximately', t_remain/60., 'minutes remaining'
One obvious thing you can do is replace the line
r_test_fast = reshape_vector(r_test)
with
r_test_fast = r_test.reshape((3,1))
Probably won't make any big difference in performance, but in any case it makes sense to use the numpy builtins instead of reinventing the wheel.
Generally speaking, as you probably have noticed by now, the trick with optimizing numpy is to express the algorithm with the help of numpy whole-array operations or at least with slices instead of iterating over each element in python code. What tends to prevent this kind of "vectorization" is so-called loop-carried dependencies, i.e. loops where each iteration is dependent on the result of a previous iteration. Looking briefly at your code, you have no such thing, and it should be possible to vectorize your code just fine.
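As a small illustrative aside (not from the original answer), the distinction looks like this:

import numpy as np

x = np.random.random(10**5)

# Vectorizable: each output element depends only on the inputs,
# so the Python loop can be replaced by one whole-array operation.
y = 3.0 * x + 1.0

# Loop-carried dependency: each element needs the previous result,
# so a straightforward whole-array rewrite is not possible
# (although np.cumsum happens to cover this particular recurrence).
z = np.empty_like(x)
z[0] = x[0]
for i in range(1, len(x)):
    z[i] = z[i - 1] + x[i]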
EDIT: One solution
I haven't verified this is correct, but should give you an idea of how to approach it.
First, take the cartesian() function, which we'll use. Then
def calculate_dipole_vect(mus, r_i, mom_i):
    # Treat each mu sequentially
    Bs = []
    omega = []
    for mu in mus:
        rel = mu - r_i
        r_norm = np.sqrt((rel * rel).sum(1))
        r_unit = rel / r_norm[:, np.newaxis]
        A = 1e-7
        # dot product per dipole: sum over axis 1 in the row-major (N, 3) layout
        num = A * (3 * np.sum(mom_i * r_unit, 1)[:, np.newaxis] * r_unit - mom_i)
        den = r_norm ** 3
        B = np.sum(num / den[:, np.newaxis], 0)
        Bs.append(B)
        omega.append(gamma_mu * np.sqrt(np.dot(B, B)))
    return Bs, omega
# Transpose to get more "natural" ordering with row-major numpy
r_i = r_i.T
mom_i = mom_i.T

t_start = time.clock()
r_frac = cartesian((np.arange(n[0]) / float(n[0]),
                    np.arange(n[1]) / float(n[1]),
                    np.arange(n[2]) / float(n[2])))
r_test = np.dot(r_frac, a)
B, omega = calculate_dipole_vect(r_test, r_i, mom_i)
print 'Total time for vectorized: %f s' % (time.clock() - t_start)
Well, in my testing this is in fact slightly slower than the loop-based approach I started from. The thing is, the original version in the question was already vectorized with whole-array operations over arrays of shape (20000, 3), so any further vectorization doesn't really bring much additional benefit. In fact, it may worsen the performance, as above, perhaps because of the large temporary arrays.
If you profile your code, you'll see that 99% of the running time is in calculate_dipole so reducing the time for this looping really won't give a noticeable reduction in execution time. You still need to focus on calculate_dipole if you want to make this faster. I tried my Cython code for calculate_dipole on this and got a reduction by about a factor of 2 in the overall time. There might be other ways to improve the Cython code too.