I'm trying to do an operation on each pair of rows that are n rows apart, and get the minimum (also maximum and mean) of the results for each distance n from 0 to N-1, where N is the number of rows. For example, if Data=[1,2,3,4] and the operation is addition, Minimum=[2,3,4,5] and Maximum=[8,7,6,5], and Mean=[5,5,5,5].
I have the following code that uses ratio as the operation which works OK for a small data size but takes more than 10 seconds for 10,000 rows. Since I will be working with data that can have 1,000,000 rows, what would be a better way to do this?
import pandas as pd
import numpy as np
low=250
high=5000
length=10
x=pd.DataFrame({'A': np.random.uniform(low, high=high, size=length)})
x['mean']=x['min']=x['max']=x['A'].copy()
for i in range(0,len(x)):
    ratio=x['A']/x['A'].shift(i)
    x['mean'].iloc[[i]]=ratio.mean()
    x['max'].iloc[[i]]=ratio.max()
    x['min'].iloc[[i]]=ratio.min()
print (x)
Approach #1 : With up to 1,000,000 rows in mind, I would suggest keeping a similar loop but working on the underlying NumPy array, and using array slicing so that each iteration operates on a gradually shrinking slice of the data. Together these two changes should bring a noticeable performance boost.
Thus, an implementation would be -
a = x['A'].values
N = len(a)
out = np.zeros((N,4))
out[:,0] = a
for i in range(N):
    ratio = a[i:]/a[:N-i]
    out[i,1] = ratio.mean()
    out[i,2] = ratio.min()
    out[i,3] = ratio.max()
df_out = pd.DataFrame(out, columns= (('A','mean','min','max')))
Approach #2 : For smaller data sizes, we can use a vectorized solution that creates a square 2D array of shape (N,N) holding shifted versions of the input data. Since that array grows as N**2 (about 800 MB of float64 values for N = 10,000), this only suits small inputs. We then mask out the upper triangular region with NaNs and use numpy.nanmean, numpy.nanmin and numpy.nanmax as NaN-skipping equivalents of the pandas mean, min and max -
a = x['A'].values
N = len(a)
r = np.arange(N)
shifting_idx = (r[:,None] - r)%N
vals = a[:,None]/a[shifting_idx]
upper_tri_mask = r[:,None] < r
vals[upper_tri_mask] = np.nan
out = np.zeros((N,4))
out[:,0] = a
out[:,1] = np.nanmean(vals, 0)
out[:,2] = np.nanmin(vals, 0)
out[:,3] = np.nanmax(vals, 0)
df_out = pd.DataFrame(out, columns= (('A','mean','min','max')))
Runtime test
Approaches -
def org_app(x):
    x['mean']=x['min']=x['max']=x['A'].copy()
    for i in range(0,len(x)):
        ratio=x['A']/x['A'].shift(i)
        x['mean'].iloc[[i]]=ratio.mean()
        x['max'].iloc[[i]]=ratio.max()
        x['min'].iloc[[i]]=ratio.min()
    return x
def app1(x):
    a = x['A'].values
    N = len(a)
    out = np.zeros((N,4))
    out[:,0] = a
    for i in range(N):
        ratio = a[i:]/a[:N-i]
        out[i,1] = ratio.mean()
        out[i,2] = ratio.min()
        out[i,3] = ratio.max()
    return pd.DataFrame(out, columns= (('A','mean','min','max')))
Timings -
In [3]: low=250
...: high=5000
...: length=10000
...: x=pd.DataFrame({'A': np.random.uniform(low, high=high, size=length)})
...:
In [4]: %timeit app1(x)
1 loop, best of 3: 185 ms per loop
In [5]: %timeit org_app(x)
1 loop, best of 3: 8.59 s per loop
In [6]: 8590.0/185
Out[6]: 46.432432432432435
46x+ speedup on 10,000 rows data!
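As a quick sanity check of Approach #1, the same slicing loop can be run on the worked example from the question (Data = [1, 2, 3, 4] with addition). This is only a sketch: the helper name pairwise_stats and its op parameter are hypothetical, introduced here for illustration.

```python
import numpy as np

def pairwise_stats(a, op):
    """For each distance i, apply op to every pair (a[j+i], a[j]) and
    return the per-distance mean, min and max (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    N = len(a)
    out = np.empty((N, 3))
    for i in range(N):
        # a[i:] and a[:N-i] line up all pairs that are i rows apart
        vals = op(a[i:], a[:N - i])
        out[i] = vals.mean(), vals.min(), vals.max()
    return out

stats = pairwise_stats([1, 2, 3, 4], np.add)
print(stats[:, 1])  # minimum per distance
print(stats[:, 2])  # maximum per distance
print(stats[:, 0])  # mean per distance
```

Passing np.divide instead of np.add gives the ratio statistics used in the question.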
Part of my Python program contains the following piece of code, where a new grid is calculated based on data found in the old grid.
The grid is a two-dimensional list of floats. The code uses three for-loops:
for t in xrange(0, t, step):
    for h in xrange(1, height-1):
        for w in xrange(1, width-1):
            new_gr[h][w] = gr[h][w] + gr[h][w-1] + gr[h-1][w] + t * gr[h+1][w-1]-2 * (gr[h][w-1] + t * gr[h-1][w])
    gr = new_gr
return gr
The code is extremely slow for a large grid and a large time t.
I've tried to use Numpy to speed up this code, by substituting the inner loop
with:
J = np.arange(1, width-1)
new_gr[h][J] = gr[h][J] + gr[h][J-1] ...
But the results produced (the floats in the array) are about 10% smaller than
their list-calculation counterparts.
What loss of accuracy is to be expected when converting lists of floats to Numpy array of floats using np.array(pylist) and then doing a calculation?
How should I go about converting a triple for-loop to pretty and fast Numpy code? (or are there other suggestions for speeding up the code significantly?)
If gr is a list of floats, the first step if you are looking to vectorize with NumPy would be to convert gr to a NumPy array with np.array().
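On the accuracy question: Python floats are already 64-bit doubles, and np.array() on a list of them produces a float64 array, so the conversion itself should not change any value. A minimal sketch to check this (the sample values are made up):

```python
import numpy as np

# made-up sample grid, including values that are awkward in binary floating point
pylist = [[0.1, 0.2, 1e-17], [3.14159, 2.0 / 3.0, 1e300]]
arr = np.array(pylist)

print(arr.dtype)               # dtype of the converted array
print(arr.tolist() == pylist)  # round trip through NumPy and back
```

A difference as large as 10% therefore more likely comes from the computation itself than from the conversion. One possible cause, offered only as a guess: after gr = new_gr the list version reads cells that were already updated in the same pass, while a row-wise NumPy expression reads only old values.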
Next up, I am assuming that you have new_gr initialized with zeros of shape (height,width). The calculations being performed in the two innermost loops basically represent 2D convolution. So, you can use signal.convolve2d with an appropriate kernel. To decide on the kernel, we need to look at the scaling factors and make a 3 x 3 kernel out of them and negate them to simulate the calculations we are doing with each iteration. Thus, you would have a vectorized solution with the two innermost loops being removed for better performance, like so -
import numpy as np
from scipy import signal
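# Get the scaling factors and negate them to get kernel.
# convolve2d flips the kernel, so t goes in the top row and 1-2*t in the bottom row
# (after negation); with the two swapped the result only matches the loop at t = 1.
kernel = -np.array([[0,-t,0],[-1,1,0,],[2*t-1,0,0]])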
# Initialize output array and run 2D convolution and set values into it
out = np.zeros((height,width))
out[1:-1,1:-1] = signal.convolve2d(gr, kernel, mode='same')[1:-1,:-2]
Verify output and runtime tests
Define functions :
def org_app(gr,t):
    new_gr = np.zeros((height,width))
    for h in xrange(1, height-1):
        for w in xrange(1, width-1):
            new_gr[h][w] = gr[h][w] + gr[h][w-1] + gr[h-1][w] + t * gr[h+1][w-1]-2 * (gr[h][w-1] + t * gr[h-1][w])
    return new_gr
def proposed_app(gr,t):
    # t goes in the top row and 1-2*t in the bottom row (after negation);
    # with the two swapped the result only matches org_app at t = 1
    kernel = -np.array([[0,-t,0],[-1,1,0,],[2*t-1,0,0]])
    out = np.zeros((height,width))
    out[1:-1,1:-1] = signal.convolve2d(gr, kernel, mode='same')[1:-1,:-2]
    return out
Verify -
In [244]: # Inputs
...: gr = np.random.rand(40,50)
...: height,width = gr.shape
...: t = 1
...:
In [245]: np.allclose(org_app(gr,t),proposed_app(gr,t))
Out[245]: True
Timings -
In [246]: # Inputs
...: gr = np.random.rand(400,500)
...: height,width = gr.shape
...: t = 1
...:
In [247]: %timeit org_app(gr,t)
1 loops, best of 3: 2.13 s per loop
In [248]: %timeit proposed_app(gr,t)
10 loops, best of 3: 19.4 ms per loop
@Divakar, I tried a couple of variations on your org_app. The fully vectorized version is:
def org_app4(gr,t):
    new_gr = np.zeros((height,width))
    I = np.arange(1,height-1)[:,None]
    J = np.arange(1,width-1)
    new_gr[I,J] = gr[I,J] + gr[I,J-1] + gr[I-1,J] + t * gr[I+1,J-1]-2 * (gr[I,J-1] + t * gr[I-1,J])
    return new_gr
While it runs at about half the speed of your proposed_app, it is closer in style to the original, and thus may help with understanding how nested loops can be vectorized.
An important step is the conversion of I into a column array, so that together I,J index a block of values.
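The same block of values can also be addressed with plain slices instead of index arrays. The sketch below is one way to write that, checked against a loop version for a t other than 1; the function names loop_step and slice_step are hypothetical, and a single pass over a NumPy array gr is assumed.

```python
import numpy as np

def loop_step(gr, t):
    """Reference: one pass of the original double loop."""
    height, width = gr.shape
    new_gr = np.zeros((height, width))
    for h in range(1, height - 1):
        for w in range(1, width - 1):
            new_gr[h][w] = (gr[h][w] + gr[h][w-1] + gr[h-1][w]
                            + t * gr[h+1][w-1]
                            - 2 * (gr[h][w-1] + t * gr[h-1][w]))
    return new_gr

def slice_step(gr, t):
    """Same pass written with shifted views of gr (no index arrays)."""
    new_gr = np.zeros_like(gr)
    centre = gr[1:-1, 1:-1]    # gr[h][w]
    left = gr[1:-1, :-2]       # gr[h][w-1]
    up = gr[:-2, 1:-1]         # gr[h-1][w]
    down_left = gr[2:, :-2]    # gr[h+1][w-1]
    new_gr[1:-1, 1:-1] = (centre + left + up + t * down_left
                          - 2 * (left + t * up))
    return new_gr

gr = np.random.rand(40, 50)
print(np.allclose(loop_step(gr, 0.3), slice_step(gr, 0.3)))
```

Each slice is a view shifted by one row or column, so the expression on the right mirrors the loop body term by term.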
I have an array that represents a function between 3D points, so it is indexed by 6-tuples. Now I want to apply a function to the elements of this array, but this function depends not only on the value of an element but on its index as well. So if A is the array, m and n are the 3D points whose value is stored in A[m,n], and k is a value between 0 and 3, then f(A,k)[m,n] is equal to:
-m[k]**2 if m==n
-m[k]**2-n[k]**2 otherwise
The following is my code:
import numpy as np
def func(a,k):
    b=np.empty(a.shape)
    for i in range(a.flatten().size):
        ind=np.unravel_index(i,a.shape)
        if ind[0:3]==ind[3:6]:
            b[ind]=a[ind]*ind[0:3][k]**2
        else:
            b[ind]=a[ind]*(ind[0:3][k]**2-ind[3:6][k]**2)
    return b
a=np.arange(729).reshape((3,3,3,3,3,3))
print func(a,2)
Is there any way of vectorizing this code?
P.S. This is a simplified version of what I actually need to do.
Use numpy.indices() to create the index arrays; then you can vectorize the calculation:
import numpy as np
def func(a,k):
    b=np.empty(a.shape)
    for i in range(a.flatten().size):
        ind=np.unravel_index(i,a.shape)
        if ind[0:3]==ind[3:6]:
            b[ind]=a[ind]*ind[0:3][k]**2
        else:
            b[ind]=a[ind]*(ind[0:3][k]**2-ind[3:6][k]**2)
    return b
def func2(a,k):
    b = np.empty(a.shape)
    ind = np.indices(a.shape).reshape(6, -1)
    mask = np.all(ind[:3] == ind[3:6], axis=0)
    ar = a.ravel()
    br = b.ravel()
    br[mask] = ar[mask]*ind[k, mask]**2
    mask = ~mask
    br[mask] = ar[mask]*(ind[k, mask]**2 - ind[3+k, mask]**2)
    return b
a = np.arange(729).reshape((3,3,3,3,3,3))
b1 = func(a, 2)
b2 = func2(a, 2)
np.allclose(b1, b2)
here is the %timeit result:
%timeit func(a, 2)
%timeit func2(a, 2)
output:
100 loops, best of 3: 16.4 ms per loop
1000 loops, best of 3: 579 µs per loop
You can optimize it a little for your case:
def func3(a,k):
    b = np.empty(a.shape)
    ind = np.indices(a.shape).reshape(6, -1)
    mask = ~np.all(ind[:3] == ind[3:6], axis=0)
    ar = a.ravel()
    br = b.ravel()
    br[:] = ar*ind[k]**2
    br[mask] -= ar[mask]*ind[3+k, mask]**2
    return b
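np.indices() materializes six full-size index arrays. As a possible alternative, the sketch below uses open grids from np.ogrid and lets broadcasting expand them, then checks the result against a plain loop. The names func_broadcast and func_loop are hypothetical, and a 6-dimensional input is assumed as in the question.

```python
import numpy as np

def func_loop(a, k):
    """Reference implementation: visit every 6-tuple index."""
    b = np.empty(a.shape)
    for ind in np.ndindex(*a.shape):
        m, n = ind[:3], ind[3:]
        if m == n:
            b[ind] = a[ind] * m[k] ** 2
        else:
            b[ind] = a[ind] * (m[k] ** 2 - n[k] ** 2)
    return b

def func_broadcast(a, k):
    """Broadcasting variant: each index grid has length 1 on all but one axis."""
    idx = np.ogrid[tuple(slice(0, s) for s in a.shape)]
    m_sq = idx[k] ** 2        # m[k]**2, varies along axis k only
    n_sq = idx[3 + k] ** 2    # n[k]**2, varies along axis 3+k only
    # True where the first index triple equals the second one (m == n)
    same = (idx[0] == idx[3]) & (idx[1] == idx[4]) & (idx[2] == idx[5])
    return np.where(same, a * m_sq, a * (m_sq - n_sq)).astype(float)

a = np.arange(729).reshape((3, 3, 3, 3, 3, 3))
print(np.allclose(func_loop(a, 2), func_broadcast(a, 2)))
```

Only the boolean mask and the final products reach the full array shape; the index grids themselves stay one-dimensional in effect.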
I want to be able to vectorize this code:
def sobHypot(rec):
    a, b, c = rec.shape
    hype = np.ones((a,b,c))
    for i in xrange(c):
        x=ndimage.sobel(abs(rec[...,i])**2,axis=0, mode='constant')
        y=ndimage.sobel(abs(rec[...,i])**2,axis=1, mode='constant')
        hype[...,i] = np.hypot(x,y)
        hype[...,i] = hype[...,i].mean()
    index = hype.argmax()
    return index
where rec.shape returns (1024,1024,20)
Here's how you can avoid the for-loop with the sobel filter:
import numpy as np
from scipy.ndimage import sobel
def sobHypot_vec(rec):
    r = np.abs(rec)
    x = sobel(r, 0, mode='constant')
    y = sobel(r, 1, mode='constant')
    h = np.hypot(x, y)
    h = np.apply_over_axes(np.mean, h, [0,1])
    return h.argmax()
I'm not sure if the sobel filter is particularly necessary in your application, and this is hard to test without your particular 20-layer 'image', but you could try using np.gradient instead of running the sobel twice. The advantage is that gradient runs in three dimensions. You can ignore the component in the third, and take the hypot of just the first two. This seems wasteful but is actually still faster in my tests.
For a variety of randomly generated images, r = np.random.rand(1024,1024,20) + np.random.rand(1024,1024,20)*1j, this gives the same answer as your code, but test it to be sure, and possibly fiddle with the spacing arguments (dx, dy) of np.gradient:
def grad_max(rec):
    g = np.gradient(np.abs(rec))[:2] # ignore derivative in third dimension
    h = np.hypot(*g)
    h = np.apply_over_axes(np.mean, h, [0,1]) # mean along first and second dimension
    return h.argmax()
Using this code for timing:
def sobHypot_clean(rec):
    rs = rec.shape
    hype = np.ones(rs)
    r = np.abs(rec)
    for i in xrange(rs[-1]):
        ri = r[...,i]
        x = sobel(ri, 0, mode='constant')
        y = sobel(ri, 1, mode='constant')
        hype[...,i] = np.hypot(x,y).mean()
    return hype.argmax()
Timing:
In [1]: r = np.random.rand(1024,1024,20) + np.random.rand(1024,1024,20)*1j
# Original Post
In [2]: timeit sobHypot(r)
1 loops, best of 3: 9.85 s per loop
#cleaned up a bit:
In [3]: timeit sobHypot_clean(r)
1 loops, best of 3: 7.64 s per loop
# vectorized:
In [4]: timeit sobHypot_vec(r)
1 loops, best of 3: 5.98 s per loop
# using np.gradient:
In [5]: timeit grad_max(r)
1 loops, best of 3: 4.12 s per loop
Please test any of these functions on your own images to be sure they give the desired output, since different types of arrays could react differently from the simple random tests I did.
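The np.gradient variant computes a derivative along the third axis only to discard it. A possible refinement, sketched below, is to pass axis=(0, 1) so that only the two needed derivatives are computed. The name grad_max_axes and the small scaled random test array are assumptions made for this illustration, and no timing is claimed for it.

```python
import numpy as np

def grad_max(rec):
    """The np.gradient version from above, kept for comparison."""
    g = np.gradient(np.abs(rec))[:2]
    h = np.hypot(*g)
    h = np.apply_over_axes(np.mean, h, [0, 1])
    return h.argmax()

def grad_max_axes(rec):
    """Variant that only differentiates along axes 0 and 1."""
    # axis=(0, 1) skips the unused derivative along the layer axis
    gx, gy = np.gradient(np.abs(rec), axis=(0, 1))
    h = np.hypot(gx, gy)
    # average each layer over both image axes, then pick the largest
    return h.mean(axis=(0, 1)).argmax()

rng = np.random.default_rng(0)
# made-up data: five layers with increasing amplitude
r = (rng.random((64, 64, 5)) + 1j * rng.random((64, 64, 5))) * np.arange(1, 6)
print(grad_max(r), grad_max_axes(r))
```

As with the other functions, it is worth checking this against real images before relying on it.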
I am having performance issues with my code.
Step # IIII consumes hours of time. I used to materialize the itertools.product before, but thanks to a user I don't do pro_data = product(array_b,array_a) anymore. This helped me with memory issues, but the step is still heavily time consuming.
I would like to parallelize it with multithreading or multiprocessing; whatever you can suggest, I am grateful.
Explanation: I have two arrays that contain x and y values of particles. For each particle (defined by two coordinates) I want to calculate a function with another. For combinations I use the itertools.product method and loop over every particle. I run over 50000 particles in total, so I have N*N/2 combinations to calculate.
Thanks in advance
import numpy as np
import matplotlib.pyplot as plt
from itertools import product,combinations_with_replacement
def func(ar1,ar2,ar3,ar4): #example func that takes four arguments
    return (ar1*ar2**22+np.sin(ar3)+ar4)
def newdist(a):
    return func(a[0][0],a[0][1],a[1][0],a[1][1])
x_edges = np.logspace(-3,1, num=25) #prepare x-axis for histogram
x_mean = 10**((np.log10(x_edges[:-1])+np.log10(x_edges[1:]))/2)
x_width=x_edges[1:]-x_edges[:-1]
hist_data=np.zeros([len(x_edges)-1])
array1=np.random.uniform(0.,10.,100)
array2=np.random.uniform(0.,10.,100)
array_a = np.dstack((array1,array1))[0]
array_b = np.dstack((array2,array2))[0]
# IIII
for i in product(array_a,array_b):
    (result,bins) = np.histogram(newdist(i),bins=x_edges)
    hist_data+=result
hist_data = np.array(map(float, hist_data))
plt.bar(x_mean,hist_data,width=x_width,color='r')
plt.show()
-----EDIT-----
I used this code now:
def mp_dist(array_a,array_b, d, bins): #d chunks AND processes
    def worker(array_ab, out_q):
        """ push result in queue """
        outdict = {}
        outdict = vec_chunk(array_ab, bins)
        out_q.put(outdict)
    out_q = mp.Queue()
    a = np.swapaxes(array_a, 0 ,1)
    b = np.swapaxes(array_b, 0 ,1)
    array_size_a=len(array_a)-(len(array_a)%d)
    array_size_b=len(array_b)-(len(array_b)%d)
    a_chunk = array_size_a / d
    b_chunk = array_size_b / d
    procs = []
    #prepare arrays for mp
    array_ab = np.empty((4, a_chunk, b_chunk))
    for j in xrange(d):
        for k in xrange(d):
            array_ab[[0, 1]] = a[:, a_chunk * j:a_chunk * (j + 1), None]
            array_ab[[2, 3]] = b[:, None, b_chunk * k:b_chunk * (k + 1)]
            p = mp.Process(target=worker, args=(array_ab, out_q))
            procs.append(p)
            p.start()
    resultarray = np.empty(len(bins)-1)
    for i in range(d):
        resultarray+=out_q.get()
    # Wait for all worker processes to finish
    for pro in procs:
        pro.join()
    print resultarray
    return resultarray
The problem here is that I cannot control the number of processes. How can I use an mp.Pool() instead?
First, let's look at a straightforward vectorization of your problem. I have a feeling that you want your array_a and array_b to be exactly the same, i.e. the coordinates of the particles, but I am keeping them separate here.
I have turned your code into a function, to make timing easier:
def IIII(array_a, array_b, bins) :
    hist_data=np.zeros([len(bins)-1])
    for i in product(array_a,array_b):
        (result,bins) = np.histogram(newdist(i), bins=bins)
        hist_data+=result
    hist_data = np.array(map(float, hist_data))
    return hist_data
You can, by the way, generate your sample data in a less convoluted way as follows:
n = 100
array_a = np.random.uniform(0, 10, size=(n, 2))
array_b = np.random.uniform(0, 10, size=(n, 2))
So first we need to vectorize your func. I have done it so it can take any array of shape (4, ...). To spare memory, it is doing the calculation in place, and returning the first plane, i.e. array[0].
def func_vectorized(a) :
    a[1] **= 22
    np.sin(a[2], out=a[2])
    a[0] *= a[1]
    a[0] += a[2]
    a[0] += a[3]
    return a[0]
With this function in place, we can write a vectorized version of IIII:
def IIII_vec(array_a, array_b, bins) :
    array_ab = np.empty((4, len(array_a), len(array_b)))
    a = np.swapaxes(array_a, 0 ,1)
    b = np.swapaxes(array_b, 0 ,1)
    array_ab[[0, 1]] = a[:, :, None]
    array_ab[[2, 3]] = b[:, None, :]
    newdist = func_vectorized(array_ab)
    hist, _ = np.histogram(newdist, bins=bins)
    return hist
With n = 100 points, they both return the same:
In [2]: h1 = IIII(array_a, array_b, x_edges)
In [3]: h2 = IIII_vec(array_a, array_b, x_edges)
In [4]: np.testing.assert_almost_equal(h1, h2)
But the timing differences are already very relevant:
In [5]: %timeit IIII(array_a, array_b, x_edges)
1 loops, best of 3: 654 ms per loop
In [6]: %timeit IIII_vec(array_a, array_b, x_edges)
100 loops, best of 3: 2.08 ms per loop
A 300x speedup! If you try it again with longer sample data, n = 1000, you can see that they both scale equally badly, as n**2, so the 300x stays there:
In [10]: %timeit IIII(array_a, array_b, x_edges)
1 loops, best of 3: 68.2 s per loop
In [11]: %timeit IIII_vec(array_a, array_b, x_edges)
1 loops, best of 3: 229 ms per loop
So for n = 50000, which means 2500 times as many pairs as n = 1000, you are still looking at a good 10 min. of processing, which is not really that much when compared to the roughly 2 days that your current solution would require.
Of course, for things to be so nice, you will need to fit a (4, 50000, 50000) array of floats into memory (80 GB as float64), something that my system cannot handle. But you can still keep things relatively fast by processing it in chunks. The following version of IIII_vec divides each array into d chunks. As written, the length of the array should be divisible by d. It wouldn't be too hard to overcome that limitation, but it would obfuscate the true purpose:
def IIII_vec_bis(array_a, array_b, bins, d=1) :
    a = np.swapaxes(array_a, 0 ,1)
    b = np.swapaxes(array_b, 0 ,1)
    a_chunk = len(array_a) // d
    b_chunk = len(array_b) // d
    array_ab = np.empty((4, a_chunk, b_chunk))
    hist_data = np.zeros((len(bins) - 1,))
    for j in xrange(d) :
        for k in xrange(d) :
            array_ab[[0, 1]] = a[:, a_chunk * j:a_chunk * (j + 1), None]
            array_ab[[2, 3]] = b[:, None, b_chunk * k:b_chunk * (k + 1)]
            newdist = func_vectorized(array_ab)
            hist, _ = np.histogram(newdist, bins=bins)
            hist_data += hist
    return hist_data
First, let's check that it really works:
In [4]: h1 = IIII_vec(array_a, array_b, x_edges)
In [5]: h2 = IIII_vec_bis(array_a, array_b, x_edges, d=10)
In [6]: np.testing.assert_almost_equal(h1, h2)
And now some timings. With n = 100:
In [7]: %timeit IIII_vec(array_a, array_b, x_edges)
100 loops, best of 3: 2.02 ms per loop
In [8]: %timeit IIII_vec_bis(array_a, array_b, x_edges, d=10)
100 loops, best of 3: 12 ms per loop
But as you start having to have a larger and larger array in memory, doing it in chunks starts to pay off. With n = 1000:
In [12]: %timeit IIII_vec(array_a, array_b, x_edges)
1 loops, best of 3: 223 ms per loop
In [13]: %timeit IIII_vec_bis(array_a, array_b, x_edges, d=10)
1 loops, best of 3: 208 ms per loop
With n = 10000 I can no longer call IIII_vec without an "array is too big" error, but the chunked version still runs:
In [18]: %timeit IIII_vec_bis(array_a, array_b, x_edges, d=10)
1 loops, best of 3: 21.8 s per loop
And just to show that it can be done, I have run it once with n = 50000:
In [23]: %timeit -n1 -r1 IIII_vec_bis(array_a, array_b, x_edges, d=50)
1 loops, best of 1: 543 s per loop
So a good 9 minutes of number crunching, which is not all that bad given it has computed 2.5 billion interactions.
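The divisibility restriction on d can be lifted in several ways. One option, sketched here under the assumption that the non-in-place func from the question is acceptable, is to let np.array_split produce blocks of nearly equal size. The name hist_chunked is hypothetical.

```python
import numpy as np

def func(ar1, ar2, ar3, ar4):
    # example function from the question
    return ar1 * ar2**22 + np.sin(ar3) + ar4

def hist_chunked(array_a, array_b, bins, d=1):
    """Histogram of func over all (a, b) pairs, d x d blocks at a time."""
    hist_data = np.zeros(len(bins) - 1)
    # array_split tolerates lengths that are not a multiple of d
    for a_blk in np.array_split(array_a, d):
        for b_blk in np.array_split(array_b, d):
            # broadcast a column of a-rows against a row of b-rows
            vals = func(a_blk[:, 0, None], a_blk[:, 1, None],
                        b_blk[None, :, 0], b_blk[None, :, 1])
            hist, _ = np.histogram(vals, bins=bins)
            hist_data += hist
    return hist_data

rng = np.random.default_rng(0)
x_edges = np.logspace(-3, 1, num=25)
array_a = rng.uniform(0, 10, size=(103, 2))   # 103 is not divisible by 7
array_b = rng.uniform(0, 10, size=(103, 2))
h_full = hist_chunked(array_a, array_b, x_edges, d=1)
h_blocks = hist_chunked(array_a, array_b, x_edges, d=7)
print(np.array_equal(h_full, h_blocks))
```

Unlike IIII_vec_bis this allocates a fresh block on every iteration rather than reusing one buffer, which trades some speed for simpler code.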
Use vectorized numpy operations. Replace the for-loop over product() with a single newdist() call by creating arguments using meshgrid().
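A minimal sketch of the meshgrid() step, assuming the example func from the question and small random inputs (the sizes and variable names here are chosen for illustration only). The index grids enumerate every (a, b) pair, so func is called once on whole arrays, and the result is compared with the element-by-element loop.

```python
import numpy as np

def func(ar1, ar2, ar3, ar4):
    # example function from the question
    return ar1 * ar2**22 + np.sin(ar3) + ar4

rng = np.random.default_rng(0)
array_a = rng.uniform(0, 10, size=(30, 2))
array_b = rng.uniform(0, 10, size=(30, 2))
x_edges = np.logspace(-3, 1, num=25)

# ia[i, j] = i and ib[i, j] = j, so (ia, ib) enumerates every (a, b) pair
ia, ib = np.meshgrid(np.arange(len(array_a)), np.arange(len(array_b)),
                     indexing='ij')
vals = func(array_a[ia, 0], array_a[ia, 1], array_b[ib, 0], array_b[ib, 1])
hist_vec, _ = np.histogram(vals, bins=x_edges)

# reference: one func call and one histogram per pair, as in the question
hist_loop = np.zeros(len(x_edges) - 1)
for a in array_a:
    for b in array_b:
        hist_loop += np.histogram(func(a[0], a[1], b[0], b[1]),
                                  bins=x_edges)[0]
print(np.array_equal(hist_vec, hist_loop))
```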
To parallelize the problem, compute newdist() on slices of array_a, array_b that correspond to subblocks of meshgrid(). Here's an example using slices and multiprocessing.
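A minimal sketch of that idea with mp.Pool, assuming the example func from the question. The names block_hist, hist_blocks and hist_parallel are hypothetical, and the block size d and process count are arbitrary choices. The module-level check below runs serially; hist_parallel is only defined, not called.

```python
import multiprocessing as mp
import numpy as np

def func(ar1, ar2, ar3, ar4):
    # example function from the question
    return ar1 * ar2**22 + np.sin(ar3) + ar4

def block_hist(args):
    """Worker: histogram of func over one (a-block, b-block) pair."""
    a_blk, b_blk, bins = args
    vals = func(a_blk[:, 0, None], a_blk[:, 1, None],
                b_blk[None, :, 0], b_blk[None, :, 1])
    return np.histogram(vals, bins=bins)[0]

def hist_blocks(array_a, array_b, bins, d, map_func=map):
    """Sum block histograms; map_func decides how blocks are dispatched."""
    tasks = ((a_blk, b_blk, bins)
             for a_blk in np.array_split(array_a, d)
             for b_blk in np.array_split(array_b, d))
    return sum(map_func(block_hist, tasks))

def hist_parallel(array_a, array_b, bins, d, processes):
    """Same computation, with the blocks spread over a process pool."""
    # processes caps how many worker processes run at once
    with mp.Pool(processes=processes) as pool:
        return hist_blocks(array_a, array_b, bins, d,
                           map_func=pool.imap_unordered)

# serial check on small made-up data
rng = np.random.default_rng(0)
array_a = rng.uniform(0, 10, size=(200, 2))
array_b = rng.uniform(0, 10, size=(200, 2))
x_edges = np.logspace(-3, 1, num=25)
hist_serial = hist_blocks(array_a, array_b, x_edges, d=4)
print(hist_serial)
```

To run it in parallel, call for example hist_parallel(array_a, array_b, x_edges, d=8, processes=4) under an if __name__ == "__main__": guard. The d*d blocks are then queued to a fixed number of workers instead of starting one process per block.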
Here's another example to demonstrate the steps: python loop -> vectorized numpy version -> parallel:
#!/usr/bin/env python
from __future__ import division
import math
import multiprocessing as mp
import numpy as np
try:
    from itertools import izip as zip
except ImportError:
    zip = zip # Python 3
def pi_loop(x, y, npoints):
    """Compute pi using Monte-Carlo method."""
    # note: the method converges to pi very slowly.
    return 4 * sum(1 for xx, yy in zip(x, y) if (xx**2 + yy**2) < 1) / npoints
def pi_vectorized(x, y, npoints):
    return 4 * ((x**2 + y**2) < 1).sum() / npoints # or just .mean()
def mp_init(x_shared, y_shared):
    global mp_x, mp_y
    mp_x, mp_y = map(np.frombuffer, [x_shared, y_shared]) # no copy
def mp_pi(args):
    # perform computations on slices of mp_x, mp_y
    start, end = args
    x = mp_x[start:end] # no copy
    y = mp_y[start:end]
    return ((x**2 + y**2) < 1).sum()
def pi_parallel(x, y, npoints):
    # compute pi using multiple processes
    pool = mp.Pool(initializer=mp_init, initargs=[x, y])
    step = 100000
    slices = ((start, start + step) for start in range(0, npoints, step))
    return 4 * sum(pool.imap_unordered(mp_pi, slices)) / npoints
def main():
    npoints = 1000000
    # create shared arrays
    x_sh, y_sh = [mp.RawArray('d', npoints) for _ in range(2)]
    # initialize arrays
    x, y = map(np.frombuffer, [x_sh, y_sh])
    x[:] = np.random.uniform(size=npoints)
    y[:] = np.random.uniform(size=npoints)
    for f, a, b in [(pi_loop, x, y),
                    (pi_vectorized, x, y),
                    (pi_parallel, x_sh, y_sh)]:
        pi = f(a, b, npoints)
        precision = int(math.floor(math.log10(npoints)) / 2 - 1 + 0.5)
        print("%.*f %.1e" % (precision + 1, pi, abs(pi - math.pi)))
if __name__=="__main__":
    main()
Time performance for npoints = 10_000_000:
pi_loop pi_vectorized pi_parallel
32.6 0.159 0.069 # seconds
It shows that the main performance benefit is from converting the python loop to its vectorized numpy analog.