Pandas apply with argument that varies by row - python

I have a data frame that contains 50 rows, for example the BCI data from R.
import pandas.rpy.common as com
varespec = com.load_data('BCI', 'vegan')
I am attempting to apply a function to each row, where the function takes a 'size' argument.
def rare(y, size):
notabs = ~np.isnan(y)
t = y[notabs]
N = np.sum(t)
diff = N - t
rare = np.sum(1 - comb(diff, size)/comb(N, size))
return rare
If size is an integer, it works fine:
varespec.apply(rare, axis=1, args=(20,))
What I would like to do is make size an array of 50 elements that all differ, so that each row has a unique value of size. If I make size a vector of 50, it passes the entire vector and the function doesn't work. How can I make
varespec.apply(rare, axis=1, args=(size,))
use a unique element of size for each row? I can do for loops:
for i in xrange(50):
rare(varespec.iloc[i,:], size[i])
but is there a better way using apply functions?

You could express the result as a calculation on whole NumPy arrays, rather than one done by calling rare once for each row of varespec:
import pandas as pd
import pandas.rpy.common as com
import scipy.misc as misc
import numpy as np
np.random.seed(1)
def rare(y, size):
notabs = ~np.isnan(y)
t = y[notabs]
N = np.sum(t)
diff = N - t
rare = np.sum(1 - misc.comb(diff, size)/misc.comb(N, size))
return rare
def using_rare(size):
return np.array([rare(varespec.iloc[i,:], size[i]) for i in xrange(50)])
def using_arrays(size):
N = varespec.sum(axis='columns', skina=True)
diff = (N[:, np.newaxis] - varespec.values).T
return np.sum(1 - misc.comb(diff, size) / misc.comb(N, size), axis=0)
varespec = com.load_data('BCI', 'vegan')
size = np.random.randint(varespec.shape[1], size=(varespec.shape[0],))
This shows using_rare and using_arrays produce the same result:
expected = using_rare(size)
result = using_arrays(size)
assert np.allclose(result, expected)
In [229]: %timeit using_rare(size)
10 loops, best of 3: 36.2 ms per loop
In [230]: %timeit using_arrays(size)
100 loops, best of 3: 2.89 ms per loop
This takes advantage of the fact that scipy.misc.comb can accept NumPy arrays as input. So you can call comb(diff, size) where diff is an array of shape (225, 50) and size is an array of shape (50,). Since size is only used in the calls to comb, it is possible to perform all the calculations with just two calls to comb. No looping per row required.

You can add that vector as a column to your data frame (remove it later if you wish):
varespec['size'] = size
And then either change your rare function:
def rare(x):
size = x['size']
y = x.values[:-1]
...
Or if you don't want to change rare, wrap it:
def rare_wrapper(x):
size = x['size']
y = x.values[:-1]
return rare(y, size)

Related

Find location where smaller array matches larger array the most

I need to find where a smaller 2d array, array1 matches the closest inside another 2d array, array2.
array1 with have the size of grid_size 46x46 to 96x96.
array2 will be larger (184x184).
I only have access to numpy.
I am currently trying to use the Tversky formula but am not tied to it.
Efficiency is the most important part as this will run many times. My current solution shown below is very slow.
for i in range(array2.shape[0] - grid_size):
for j in range(array2.shape[1] - grid_size):
r[i, j] = np.sum(array2[i:i+grid_size, j:j+grid_size] == array1 ) / (np.sum(array2[i:i+grid_size, j:j+grid_size] != array1 ) + np.sum(Si[i:i+grid_size, j:j+grid_size] == array1 ))
Edit:
The goal is to find the location where a smaller image matches another image.
Here is an FFT/convolution based approach that minimizes Euclidean distance:
import numpy as np
from numpy import fft
N = 184
n = 46
pad = 192
def best_offs(A,a):
A,a = A.astype(float),a.astype(float)
Ap,ap = (np.zeros((pad,pad)) for _ in "Aa")
Ap[:N,:N] = A
ap[:n,:n] = a
sim = fft.irfft2(fft.rfft2(ap).conj()*fft.rfft2(Ap))[:N-n+1,:N-n+1]
Ap[:N,:N] = A*A
ap[:n,:n] = 1
ref = fft.irfft2(fft.rfft2(ap).conj()*fft.rfft2(Ap))[:N-n+1,:N-n+1]
return np.unravel_index((ref-2*sim).argmin(),sim.shape)
# example
# random picture
A = np.random.randint(0,256,(N,N),dtype=np.uint8)
# random offset
offy,offx = np.random.randint(0,N-n+1,2)
# sub pic at random offset
# randomly flip half of the least significant 75% of all bits
a = A[offy:offy+n,offx:offx+n] ^ np.random.randint(0,64,(n,n))
# reconstruct offset
oyrec,oxrec = best_offs(A,a)
assert offy==oyrec and offx==oxrec
# speed?
from timeit import timeit
print(timeit(lambda:best_offs(A,a),number=100)*10,"ms")
# example with zero a
a[...] = 0
# make A smaller in a matching subsquare
A[offy:offy+n,offx:offx+n]>>=1
# reconstruct offset
oyrec,oxrec = best_offs(A,a)
assert offy==oyrec and offx==oxrec
Sample run:
3.458537160186097 ms

Efficient sum of Gaussians in 3D with NumPy using large arrays

I have an M x 3 array of 3D coordinates, coords (M ~1000-10000), and I would like to compute the sum of Gaussians centered at these coordinates over a mesh grid 3D array. The mesh grid 3D array is typically something like 64 x 64 x 64, but sometimes upwards of 256 x 256 x 256, and can go even larger. I’ve followed this question to get started, by converting my meshgrid array into an array of N x 3 coordinates, xyz, where N is 64^3 or 256^3, etc. However, for large array sizes it takes too much memory to vectorize the entire calculation (understandable since it could approach 1e11 elements and consume a terabyte of RAM) so I’ve broken it up into a loop over M coordinates. However, this is too slow.
I’m wondering if there is any way to speed this up at all without overloading memory. By converting the meshgrid to xyz, I feel like I’ve lost any advantage of the grid being equally spaced, and that somehow, maybe with scipy.ndimage, I should be able to take advantage of the even spacing to speed things up.
Here’s my initial start:
import numpy as np
from scipy import spatial
#create meshgrid
side = 100.
n = 64 #could be 256 or larger
x_ = np.linspace(-side/2,side/2,n)
x,y,z = np.meshgrid(x_,x_,x_,indexing='ij')
#convert meshgrid to list of coordinates
xyz = np.column_stack((x.ravel(),y.ravel(),z.ravel()))
#create some coordinates
coords = np.random.random(size=(1000,3))*side - side/2
def sumofgauss(coords,xyz,sigma):
"""Simple isotropic gaussian sum at coordinate locations."""
n = int(round(xyz.shape[0]**(1/3.))) #get n samples for reshaping to 3D later
#this version overloads memory
#dist = spatial.distance.cdist(coords, xyz)
#dist *= dist
#values = 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dist/(2*sigma**2))
#values = np.sum(values,axis=0)
#run cdist in a loop over coords to avoid overloading memory
values = np.zeros((xyz.shape[0]))
for i in range(coords.shape[0]):
dist = spatial.distance.cdist(coords[None,i], xyz)
dist *= dist
values += 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dist[0]/(2*sigma**2))
return values.reshape(n,n,n)
image = sumofgauss(coords,xyz,1.0)
import matplotlib.pyplot as plt
plt.imshow(image[n/2]) #show a slice
plt.show()
M = 1000, N = 64 (~5 seconds):
M = 1000, N = 256 (~10 minutes):
Considering that many of your distance calculations will give zero weight after the exponential, you can probably drop a lot of your distances. Doing big chunks of distance calculations while dropping distances which are greater than a threshhold is usually faster with KDTree:
import numpy as np
from scipy.spatial import cKDTree # so we can get a `coo_matrix` output
def gaussgrid(coords, sigma = 1, n = 64, side = 100, eps = None):
x_ = np.linspace(-side/2,side/2,n)
x,y,z = np.meshgrid(x_,x_,x_,indexing='ij')
xyz = np.column_stack((x.ravel(),y.ravel(),z.ravel()))
if eps is None:
eps = np.finfo('float64').eps
thr = -np.log(eps) * 2 * sigma**2
data_tree = cKDTree(coords)
discr = 1000 # you can tweak this to get best results on your system
values = np.empty(n**3)
for i in range(n**3//discr + 1):
slc = slice(i * discr, i * discr + discr)
grid_tree = cKDTree(xyz[slc])
dists = grid_tree.sparse_distance_matrix(data_tree, thr, output_type = 'coo_matrix')
dists.data = 1./np.sqrt(2*np.pi*sigma**2) * np.exp(-dists.data/(2*sigma**2))
values[slc] = dists.sum(1).squeeze()
return values.reshape(n,n,n)
Now, even if you keep eps = None it'll be a bit faster as you're still returning about 10% your distances, but with eps = 1e-6 or so, you should get a big speedup. On my system:
%timeit out = sumofgauss(coords, xyz, 1.0)
1 loop, best of 3: 23.7 s per loop
%timeit out = gaussgrid(coords)
1 loop, best of 3: 2.12 s per loop
%timeit out = gaussgrid(coords, eps = 1e-6)
1 loop, best of 3: 382 ms per loop

How can I get the minimum ratio between of each pair of rows with distance n, for all n from 0 up to the length of the dataframe (minus 1)?

I'm trying to do an operation on each pair of rows of distance n, and get the minimum (also maximum and mean) of the results for each n from 0 to n-1. For example, if Data=[1,2,3,4] and the operation is addition, Minimum=[2,3,4,5] and Maximum=[8,7,6,5], and Mean=[5,5,5,5].
I have the following code that uses ratio as the operation which works OK for a small data size but takes more than 10 seconds for 10,000 rows. Since I will be working with data that can have 1,000,000 rows, what would be a better way to do this?
import pandas as pd
import numpy as np
low=250
high=5000
length=10
x=pd.DataFrame({'A': np.random.uniform(low, high=high, size=length)})
x['mean']=x['min']=x['max']=x['A'].copy()
for i in range(0,len(x)):
ratio=x['A']/x['A'].shift(i)
x['mean'].iloc[[i]]=ratio.mean()
x['max'].iloc[[i]]=ratio.max()
x['min'].iloc[[i]]=ratio.min()
print (x)
Approach #1 : For efficiency and considering that you might have upto 1,000,000 rows, I would suggest using the underlying array data in a similar-looking loopy solution and using the efficient array-slicing to use a gradually diminishing data to work with and these two together should bring on noticeable performance boost.
Thus, an implementation would be -
a = x['A'].values
N = len(a)
out = np.zeros((N,4))
out[:,0] = a
for i in range(N):
ratio = a[i:]/a[:N-i]
out[i,1] = ratio.mean()
out[i,2] = ratio.min()
out[i,3] = ratio.max()
df_out = pd.DataFrame(out, columns= (('A','mean','min','max')))
Approach #2 : For a smaller datasize, we can use a vectorized solution that would create a square 2D array of shape (N,N) with shifted versions of the input data. Then, we mask out the upper triangular region with NaNs and finally employ numpy.nanmean, numpy.nanmin and numpy.nanmax to perform those pandas equivalent mean, min and max equivalent operations -
a = x['A'].values
N = len(a)
r = np.arange(N)
shifting_idx = (r[:,None] - r)%N
vals = a[:,None]/a[shifting_idx]
upper_tri_mask = r[:,None] < r
vals[upper_tri_mask] = np.nan
out = np.zeros((N,4))
out[:,0] = a
out[:,1] = np.nanmean(vals, 0)
out[:,2] = np.nanmin(vals, 0)
out[:,3] = np.nanmax(vals, 0)
df_out = pd.DataFrame(out, columns= (('A','mean','min','max')))
Runtime test
Approaches -
def org_app(x):
x['mean']=x['min']=x['max']=x['A'].copy()
for i in range(0,len(x)):
ratio=x['A']/x['A'].shift(i)
x['mean'].iloc[[i]]=ratio.mean()
x['max'].iloc[[i]]=ratio.max()
x['min'].iloc[[i]]=ratio.min()
return x
def app1(x):
a = x['A'].values
N = len(a)
out = np.zeros((N,4))
out[:,0] = a
for i in range(N):
ratio = a[i:]/a[:N-i]
out[i,1] = ratio.mean()
out[i,2] = ratio.min()
out[i,3] = ratio.max()
return pd.DataFrame(out, columns= (('A','mean','min','max')))
Timings -
In [3]: low=250
...: high=5000
...: length=10000
...: x=pd.DataFrame({'A': np.random.uniform(low, high=high, size=length)})
...:
In [4]: %timeit app1(x)
1 loop, best of 3: 185 ms per loop
In [5]: %timeit org_app(x)
1 loop, best of 3: 8.59 s per loop
In [6]: 8590.0/185
Out[6]: 46.432432432432435
46x+ speedup on 10,000 rows data!

Sliding standard deviation on a 1D NumPy array

Suppose that you have an array and want to create another array, which's values are equal to standard deviation of first array's 10 elements successively. With the help of for loop, it can be written easily like below code. What I want to do is avoid using for loop for faster execution time. Any suggestions?
Code
a = np.arange(20)
b = np.empty(11)
for i in range(11):
b[i] = np.std(a[i:i+10])
You could create a 2D array of sliding windows with np.lib.stride_tricks.as_strided that would be views into the given 1D array and as such won't be occupying any more memory. Then, simply use np.std along the second axis (axis=1) for the final result in a vectorized way, like so -
W = 10 # Window size
nrows = a.size - W + 1
n = a.strides[0]
a2D = np.lib.stride_tricks.as_strided(a,shape=(nrows,W),strides=(n,n))
out = np.std(a2D, axis=1)
Runtime test
Function definitions -
def original_app(a, W):
b = np.empty(a.size-W+1)
for i in range(b.size):
b[i] = np.std(a[i:i+W])
return b
def vectorized_app(a, W):
nrows = a.size - W + 1
n = a.strides[0]
a2D = np.lib.stride_tricks.as_strided(a,shape=(nrows,W),strides=(n,n))
return np.std(a2D,1)
Timings and verification -
In [460]: # Inputs
...: a = np.arange(10000)
...: W = 10
...:
In [461]: np.allclose(original_app(a, W), vectorized_app(a, W))
Out[461]: True
In [462]: %timeit original_app(a, W)
1 loops, best of 3: 522 ms per loop
In [463]: %timeit vectorized_app(a, W)
1000 loops, best of 3: 1.33 ms per loop
So, around 400x speedup there!
For completeness, here's the equivalent pandas version -
import pandas as pd
def pdroll(a, W): # a is 1D ndarray and W is window-size
return pd.Series(a).rolling(W).std(ddof=0).values[W-1:]
Not so fancy, but the code with no loops would be something like this:
a = np.arange(20)
b = [a[i:i+10].std() for i in range(len(a)-10)]

Converting a nested loop calculation to Numpy for speedup

Part of my Python program contains the follow piece of code, where a new grid
is calculated based on data found in the old grid.
The grid i a two-dimensional list of floats. The code uses three for-loops:
for t in xrange(0, t, step):
for h in xrange(1, height-1):
for w in xrange(1, width-1):
new_gr[h][w] = gr[h][w] + gr[h][w-1] + gr[h-1][w] + t * gr[h+1][w-1]-2 * (gr[h][w-1] + t * gr[h-1][w])
gr = new_gr
return gr
The code is extremly slow for a large grid and a large time t.
I've tried to use Numpy to speed up this code, by substituting the inner loop
with:
J = np.arange(1, width-1)
new_gr[h][J] = gr[h][J] + gr[h][J-1] ...
But the results produced (the floats in the array) are about 10% smaller than
their list-calculation counterparts.
What loss of accuracy is to be expected when converting lists of floats to Numpy array of floats using np.array(pylist) and then doing a calculation?
How should I go about converting a triple for-loop to pretty and fast Numpy code? (or are there other suggestions for speeding up the code significantly?)
If gr is a list of floats, the first step if you are looking to vectorize with NumPy would be to convert gr to a NumPy array with np.array().
Next up, I am assuming that you have new_gr initialized with zeros of shape (height,width). The calculations being performed in the two innermost loops basically represent 2D convolution. So, you can use signal.convolve2d with an appropriate kernel. To decide on the kernel, we need to look at the scaling factors and make a 3 x 3 kernel out of them and negate them to simulate the calculations we are doing with each iteration. Thus, you would have a vectorized solution with the two innermost loops being removed for better performance, like so -
import numpy as np
from scipy import signal
# Get the scaling factors and negate them to get kernel
kernel = -np.array([[0,1-2*t,0],[-1,1,0,],[t,0,0]])
# Initialize output array and run 2D convolution and set values into it
out = np.zeros((height,width))
out[1:-1,1:-1] = signal.convolve2d(gr, kernel, mode='same')[1:-1,:-2]
Verify output and runtime tests
Define functions :
def org_app(gr,t):
new_gr = np.zeros((height,width))
for h in xrange(1, height-1):
for w in xrange(1, width-1):
new_gr[h][w] = gr[h][w] + gr[h][w-1] + gr[h-1][w] + t * gr[h+1][w-1]-2 * (gr[h][w-1] + t * gr[h-1][w])
return new_gr
def proposed_app(gr,t):
kernel = -np.array([[0,1-2*t,0],[-1,1,0,],[t,0,0]])
out = np.zeros((height,width))
out[1:-1,1:-1] = signal.convolve2d(gr, kernel, mode='same')[1:-1,:-2]
return out
Verify -
In [244]: # Inputs
...: gr = np.random.rand(40,50)
...: height,width = gr.shape
...: t = 1
...:
In [245]: np.allclose(org_app(gr,t),proposed_app(gr,t))
Out[245]: True
Timings -
In [246]: # Inputs
...: gr = np.random.rand(400,500)
...: height,width = gr.shape
...: t = 1
...:
In [247]: %timeit org_app(gr,t)
1 loops, best of 3: 2.13 s per loop
In [248]: %timeit proposed_app(gr,t)
10 loops, best of 3: 19.4 ms per loop
#Divakar, I tried a couple of variations on your org_app. The fully vectorized version is:
def org_app4(gr,t):
new_gr = np.zeros((height,width))
I = np.arange(1,height-1)[:,None]
J = np.arange(1,width-1)
new_gr[I,J] = gr[I,J] + gr[I,J-1] + gr[I-1,J] + t * gr[I+1,J-1]-2 * (gr[I,J-1] + t * gr[I-1,J])
return new_gr
While half the speed of your proposed_app, it is closer in style to the original. And thus may help with understanding how nested loops can be vectorized.
An important step is the conversion of I into a column array, so that together I,J index a block of values.

Categories

Resources