Subtract all pairs of values from two arrays - Python

I have two vectors, v1 and v2. I'd like to subtract each value of v2 from each value of v1 and store the results in another vector. I also would like to work with very large vectors (e.g. 1e6 size), so I think I should be using numpy for performance.
Up until now I have:
import numpy
v1 = numpy.random.uniform(-1, 1, size=100)  # size must be an int; 1e2 is a float
v2 = numpy.random.uniform(-1, 1, size=100)
vdiff = []
for value in v1:
    vdiff.append(value - v2)
This creates a list with 100 entries, each entry being an array of size 100. I don't know if this is the most efficient way to do this though.
I'd like to calculate the 1e4 desired values very fast with the smallest object size (memory wise) possible.

You're not going to have much fun with arrays as large as the ones you mentioned: a 1e6 x 1e6 result of float64 values would need roughly 8 TB of memory. But if you have more reasonably-sized matrices (small enough that the result can fit in memory), the best way to do this is with broadcasting.
import numpy as np
a = np.array(range(5, 10))
b = np.array(range(2, 6))
res = a[None, :] - b[:, None]
print(res)
# [[3 4 5 6 7]
#  [2 3 4 5 6]
#  [1 2 3 4 5]
#  [0 1 2 3 4]]

np.subtract.outer
You can use np.ufunc.outer with np.subtract and then transpose:
a = np.array(range(5, 10))
b = np.array(range(2, 6))
res1 = np.subtract.outer(a, b).T
res2 = a[None, :] - b[:, None]
assert np.array_equal(res1, res2)
Performance is comparable between the two methods.
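If you want to check this on your own machine, here is a minimal timing sketch (the sizes and repeat counts are arbitrary choices, not from the original answer):

import timeit
import numpy as np

a = np.random.uniform(-1, 1, size=1000)
b = np.random.uniform(-1, 1, size=1000)

t_outer = timeit.timeit(lambda: np.subtract.outer(a, b).T, number=100)
t_bcast = timeit.timeit(lambda: a[None, :] - b[:, None], number=100)
print(f"subtract.outer: {t_outer:.3f} s  broadcasting: {t_bcast:.3f} s")

Both compute the same 1000 x 1000 result, so any difference comes down to implementation overhead.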

Related

Numpy.outer vs. broadcasting; are these equivalent and calculated the same way with similar expected performance?

This answer about performing outer addition with numpy discusses numpy ufuncs' "outer" method as well as numpy broadcasting, and the examples there are summarized below.
In the case of addition, subtraction, multiplication and division, is the calculation the same under the hood, or will there be performance differences, especially when the array size gets large?
Minimal, 2D example from the linked answer:
import numpy as np
a, b = np.arange(3), np.arange(5)
print(np.add.outer(a, b))
print(a[:, None] + b) # or a[:, np.newaxis] + b
both result in:
[[0 1 2 3 4]
 [1 2 3 4 5]
 [2 3 4 5 6]]
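For the other arithmetic ufuncs mentioned in the question, a quick way to convince yourself the two spellings agree is an equivalence check (a sketch; the + 1.0 is only there to avoid division by zero):

import numpy as np

a, b = np.arange(3) + 1.0, np.arange(5) + 1.0
for ufunc in (np.add, np.subtract, np.multiply, np.divide):
    assert np.allclose(ufunc.outer(a, b), ufunc(a[:, None], b))

Whether the two spellings perform identically on large arrays is best settled by timing them on your own data.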

Masking nested array with value at index with a second array

I have a nested array with some values. I have another array, where the length of both arrays are equal. I'd like to get an output, where I have a nested array of 1's and 0's, such that it is 1 where the value in the second array was equal to the value in that nested array.
I've taken a look on existing stack overflow questions but have been unable to construct an answer.
masks_list = []
for i in range(len(y_pred)):
    mask = (y_pred[i] == y_test.values[i]) * 1
    masks_list.append(mask)
masks = np.array(masks_list)
Essentially, that's the code I currently have and it works, but I think it's probably not the most efficient way of doing it.
YPRED:
[[4 0 1 2 3 5 6]
 [0 1 2 3 5 6 4]]
YTEST:
8    1
5    4
Masks:
[[0 0 1 0 0 0 0]
 [0 0 0 0 0 0 1]]
Another good solution, with fewer lines of code (this treats y_pred and y_test as flat lists and marks each position whose value also occurs in y_test):
a = set(y_pred).intersection(y_test)
f = [1 if j in a else 0 for j in y_pred]
After that you can check the performance, as in this answer, as follows:
from time import perf_counter as pc

t0 = pc()
a = set(y_pred).intersection(y_test)
f = [1 if j in a else 0 for j in y_pred]
t1 = pc() - t0

t0 = pc()
masks_list = []
for i in range(len(y_pred)):
    mask = (y_pred[i] == y_test[i]) * 1
    masks_list.append(mask)
t2 = pc() - t0

val = t1 - t2
If val is positive, the first solution is slower.
If you have an np.array instead of a list, you can convert it as described in this answer:
type(y_pred)
>> numpy.ndarray
y_pred = y_pred.tolist()
type(y_pred)
>> list
Idea (fewest loops): compare the array and the nested array directly, letting broadcasting line up each y_test value with its row of y_pred:
masks = np.equal(y_pred, y_test.values[:, None]).astype(int)
You can also look at these:
np.array_equal(A,B) # test if same shape, same elements values
np.array_equiv(A,B) # test if broadcastable shape, same elements values
np.allclose(A,B,...) # test if same shape, elements have close enough values
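Putting the broadcasting idea together for the sample data (a sketch; y_test is shown here as a plain array rather than a pandas Series):

import numpy as np

y_pred = np.array([[4, 0, 1, 2, 3, 5, 6],
                   [0, 1, 2, 3, 5, 6, 4]])
y_test = np.array([1, 4])

# compare each row of y_pred with its matching y_test value, then cast to 0/1
masks = (y_pred == y_test[:, None]).astype(int)
print(masks)
# [[0 0 1 0 0 0 0]
#  [0 0 0 0 0 0 1]]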

Pandas: vectorization with function on two dataframes

I'm having trouble with implementing vectorization in pandas. Let me preface this by saying I am a total newbie to vectorization so it's extremely likely that I'm getting some syntax wrong.
Let's say I've got two pandas dataframes.
Dataframe one describes the x,y coordinates of some circles with radius R, with unique IDs.
>>> data1 = {'ID': [1, 2], 'x': [1, 10], 'y': [1, 10], 'R': [4, 5]}
>>> df_1=pd.DataFrame(data=data1)
>>>
>>> df_1
ID x y R
1 1 1 4
2 10 10 5
Dataframe two describes the x,y coordinates of some points, also with unique IDs.
>>> data2 = {'ID': [3, 4, 5], 'x': [1, 3, 9], 'y': [2, 5, 9]}
>>> df_2=pd.DataFrame(data=data2)
>>>
>>> df_2
ID x y
3 1 2
4 3 5
5 9 9
Now, imagine plotting the circles and the points on a 2D plane; some of the points will reside inside the circles.
All I want to do is create a new column in df_2 called "host_circle" that indicates the ID of the circle that each point resides in. If the point does not reside in a circle, the value should be "None".
My desired output would be
>>> df_2
ID x y host_circle
3 1 2 1
4 3 5 None
5 9 9 2
First, define a function that checks if a given point (x2, y2) resides inside a given circle (x1, y1, R1, ID_1). If it does, return the ID of the circle; else, return None.
>>> def func(x1, y1, R1, ID_1, x2, y2):
...     dist = np.sqrt((x1 - x2)**2 + (y1 - y2)**2)
...     if dist < R1:
...         return ID_1
...     else:
...         return None
Next, the actual vectorization. I'm sorta lost here. I think it should be something like
df_2['host']=func(df_1['x'],df_1['y'],df_1['R'],df_1['ID'],df_2['x'],df_2['y'])
but that just throws errors. Can someone help me?
One final note: My actual data I'm working with is VERY large; tens of millions of rows. Speed is crucial, hence why I'm trying to make vectorization work.
Numba v1
You might have to install numba with
pip install numba
Then use Numba's JIT compiler via the njit function decorator:
import numpy as np
from numba import njit

@njit
def distances(point, points):
    return ((points - point) ** 2).sum(1) ** .5

@njit
def find_my_circle(point, circles):
    points = circles[:, :2]
    radii = circles[:, 2]
    dist = distances(point, points)
    mask = dist < radii
    i = mask.argmax()
    return i if mask[i] else -1

@njit
def find_my_circles(points, circles):
    n = len(points)
    out = np.zeros(n, np.int64)
    for i in range(n):
        out[i] = find_my_circle(points[i], circles)
    return out

points = df_2[['x', 'y']].values
ids = np.append(df_1.ID.values, np.nan)  # ids[-1] is NaN, the "no circle" marker
i = find_my_circles(points, df_1[['x', 'y', 'R']].values)
df_2['host_circle'] = ids[i]
df_2
   ID  x  y  host_circle
0   3  1  2          1.0
1   4  3  5          NaN
2   5  9  9          2.0
This iterates row by row: one point at a time, it tries to find the host circle. The distance calculation within each row is still vectorized, and with Numba the loop itself should be very fast. The massive benefit is that you don't occupy tons of memory.
Numba v2
This one is more loopy but short-circuits when it finds a host:
import numpy as np
from numba import njit

@njit
def distance(a, b):
    return ((a - b) ** 2).sum() ** .5

@njit
def find_my_circles(points, circles):
    n = len(points)
    m = len(circles)
    out = -np.ones(n, np.int64)
    centers = circles[:, :2]
    radii = circles[:, 2]
    for i in range(n):
        for j in range(m):
            if distance(points[i], centers[j]) < radii[j]:
                out[i] = j
                break
    return out

points = df_2[['x', 'y']].values
ids = np.append(df_1.ID.values, np.nan)
i = find_my_circles(points, df_1[['x', 'y', 'R']].values)
df_2['host_circle'] = ids[i]
df_2
Vectorized
But still problematic for very large data, because it builds the full points-by-circles distance matrix in memory:
c = ['x', 'y']
centers = df_1[c].values
points = df_2[c].values
radii = df_1['R'].values
i, j = np.where(((points[:, None] - centers) ** 2).sum(2) ** .5 < radii)
df_2.loc[df_2.index[i], 'host_circle'] = df_1['ID'].iloc[j].values
df_2
   ID  x  y  host_circle
0   3  1  2          1.0
1   4  3  5          NaN
2   5  9  9          2.0
Explanation
The distance from any point to the center of a circle is
((x1 - x0) ** 2 + (y1 - y0) ** 2) ** .5
I can use broadcasting if I extend one of my arrays into a third dimension
points[:, None] - centers
array([[[ 0,  1],
        [-9, -8]],

       [[ 2,  4],
        [-7, -5]],

       [[ 8,  8],
        [-1, -1]]])
That is all six combinations of vector differences. Now to calculate the distances.
((points[:, None] - centers) ** 2).sum(2) ** .5
array([[ 1.        , 12.04159458],
       [ 4.47213595,  8.60232527],
       [11.3137085 ,  1.41421356]])
That's all six combinations of distances, and I can compare against the radii to see which are within the circles
((points[:, None] - centers) ** 2).sum(2) ** .5 < radii
array([[ True, False],
       [False, False],
       [False,  True]])
Ok, I want to find where the True values are. That is a perfect use case for np.where. It will give me two arrays, the first will be the row positions, the second the column positions of where these True values are. Turns out, the row positions are the points and column positions are the circles.
i, j = np.where(((points[:, None] - centers) ** 2).sum(2) ** .5 < radii)
Now I just have to slice df_2 with i somehow and assign to it values I get from df_1 using j somehow... But I showed that above.
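Putting those pieces together, including the None default the question asks for (a sketch reusing the names defined above):

# start every point with no host circle
df_2['host_circle'] = None
# i indexes points, j indexes circles, only where the point is inside
i, j = np.where(((points[:, None] - centers) ** 2).sum(2) ** .5 < radii)
df_2.loc[df_2.index[i], 'host_circle'] = df_1['ID'].iloc[j].values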
Try this. I have modified your function a bit so that it returns a list, on the assumption that a point may reside in more than one circle; you can modify that if it's not the case. The list will be empty if the point does not reside in any circle. Note that it returns row indices of df_1 rather than circle IDs; with the sample data the result is [0], [], [1].
def func(df, x2, y2):
    val = df.apply(lambda row: np.sqrt((row['x'] - x2)**2 + (row['y'] - y2)**2) < row['R'], axis=1)
    return list(val.index[val])

df_2['host'] = df_2.apply(lambda row: func(df_1, row['x'], row['y']), axis=1)

Do numpy 1D arrays follow row/column rules?

I have just started using numpy and I am getting confused about how to use arrays. I have seen several Stack Overflow answers on numpy arrays but they all deal with how to get the desired result (I know how to do this, I just don't know why I need to do it this way). The consensus that I've seen is that arrays are better than matrices because they are a more basic class and less restrictive. I understand you can transpose an array which to me means there is a distinction between a row and a column, but the multiplication rules all produce the wrong outputs (compared to what I am expecting).
Here is the test code I have written along with the outputs:
a = numpy.array([1,2,3,4])
print(a)
>>> [1 2 3 4]
print(a.T) # Transpose
>>> [1 2 3 4] # No apparent effect
b = numpy.array( [ [1], [2], [3], [4] ] )
print(b)
>>> [[1]
     [2]
     [3]
     [4]] # Column (Expected)
print(b.T)
>>> [[1 2 3 4]] # Row (Expected, transpose seems to work here)
print((b.T).T)
>>> [[1]
     [2]
     [3]
     [4]] # Column (All of these are as expected,
          # unlike for declaring the array as a row vector)
# The following are element wise multiplications of a
print(a*a)
>>> [ 1 4 9 16]
print(a * a.T) # Row*Column
>>> [ 1 4 9 16] # Inner product scalar result expected
print(a.T * a) # Column*Row
>>> [ 1 4 9 16] # Outer product matrix result expected
print(b*b)
>>> [[ 1]
     [ 4]
     [ 9]
     [16]] # Expected result, element wise multiplication in a column
print(b * b.T) # Column * Row (Outer product)
>>> [[ 1  2  3  4]
     [ 2  4  6  8]
     [ 3  6  9 12]
     [ 4  8 12 16]] # Expected matrix result
print(b.T * (b.T)) # Column * Column (Doesn't make much sense, so I expected elementwise multiplication)
>>> [[ 1 4 9 16]]
print(b.T * (b.T).T) # Row * Column, inner product expected
>>> [[ 1  2  3  4]
     [ 2  4  6  8]
     [ 3  6  9 12]
     [ 4  8 12 16]] # Outer product result
I know that I can use numpy.inner() and numpy.outer() to achieve the effect (that is not a problem); I just want to know if I need to keep track of whether my vectors are rows or columns.
I also know that I can create a 1D matrix to represent my vectors and the multiplication works as expected. I'm trying to work out the best way to store my data so that when I look at my code it is clear what is going to happen - right now the maths just looks confusing and wrong.
I only need to use 1D and 2D tensors for my application.
I'll try annotating your code
a = numpy.array([1,2,3,4])
print(a)
>>> [1 2 3 4]
print(a.T) # Transpose
>>> [1 2 3 4] # No apparent effect
a.shape will show (4,). a.T.shape is the same. It kept the same number of dimensions, and performed the only meaningful transpose - no change. Making it (4,1) would have added a dimension, and destroyed the a.T.T round trip.
b = numpy.array( [ [1], [2], [3], [4] ] )
print(b)
>>> [[1]
     [2]
     [3]
     [4]] # Column (Expected)
print(b.T)
>>> [[1 2 3 4]] # Row (Expected, transpose seems to work here)
b.shape is (4,1), b.T.shape is (1,4). Note the extra set of []. If you'd created a as a = numpy.array([[1,2,3,4]]) its shape too would have been (1,4).
The easy way to make b would be b=np.array([[1,2,3,4]]).T (or b=np.array([1,2,3,4])[:,None] or b=np.array([1,2,3,4]).reshape(-1,1))
Compare this to MATLAB
octave:3> a=[1,2,3,4]
a =
1 2 3 4
octave:4> size(a)
ans =
1 4
octave:5> size(a.')
ans =
4 1
Even without the extra [] it has initialized the matrix as 2d.
numpy has a matrix class that imitates MATLAB - back in the time when MATLAB allowed only 2d.
In [75]: m=np.matrix('1 2 3 4')
In [76]: m
Out[76]: matrix([[1, 2, 3, 4]])
In [77]: m.shape
Out[77]: (1, 4)
In [78]: m=np.matrix('1 2; 3 4')
In [79]: m
Out[79]:
matrix([[1, 2],
[3, 4]])
I don't recommend using np.matrix unless it really adds something useful to your code.
Note that MATLAB talks of vectors, but they are really just matrices with only one non-unitary dimension.
# The following are element wise multiplications of a
print(a*a)
>>> [ 1 4 9 16]
print(a * a.T) # Row*Column
>>> [ 1 4 9 16] # Inner product scalar result expected
This behavior follows from the fact that a.T is the same as a. As you noted, * produces element by element multiplication. This is equivalent to the MATLAB .*. np.dot(a, a) gives the dot or matrix product of 2 arrays.
print(a.T * a) # Column*Row
>>> [ 1 4 9 16] # Outer product matrix result expected
No, it is still doing elementwise multiplication.
I'd use broadcasting, a[:,None]*a[None,:] to get the outer product. Octave added this in imitation of numpy; I don't know if MATLAB has it yet.
In the following * is always element by element multiplication. It's broadcasting that produces matrix/outer product results.
print(b*b)
>>> [[ 1]
     [ 4]
     [ 9]
     [16]] # Expected result, element wise multiplication in a column
A (4,1) * (4,1)=>(4,1). Same shapes all around.
print(b * b.T) # Column * Row (Outer product)
>>> [[ 1  2  3  4]
     [ 2  4  6  8]
     [ 3  6  9 12]
     [ 4  8 12 16]] # Expected matrix result
Here a (4,1)*(1,4)=>(4,4) product. The two size-1 dimensions have been replicated, so it becomes, effectively, a (4,4)*(4,4). How would you replicate this in MATLAB - with .*?
print(b.T * (b.T)) # Column * Column (Doesn't make much sense so I expected elementwise multiplication
>>> [[ 1 4 9 16]]
* is elementwise regardless of expectations. Think b' .* b' in MATLAB.
print(b.T * (b.T).T) # Row * Column, inner product expected
>>> [[ 1  2  3  4]
     [ 2  4  6  8]
     [ 3  6  9 12]
     [ 4  8 12 16]] # Outer product result
Again * is elementwise; inner requires a summation in addition to multiplication. Here broadcasting again applies (1,4)*(4,1)=>(4,4).
np.dot(b.T, b) (giving [[30]]), np.trace(b.T * b), and np.sum(b * b) all produce the inner product 30.
When I worked in MATLAB I frequently checked the size, and created test matrices that would catch dimension mismatches (e.g. a 2x3 instead of a 2x2 matrix). I continue to do that in numpy.
The key things are:
numpy arrays may be 1d (or even 0d)
A (4,) array is not exactly the same as a (4,1) or (1,4).
* is elementwise - always.
broadcasting usually accounts for outer-like behavior
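A short sketch illustrating those points:

import numpy as np

a = np.array([1, 2, 3, 4])      # shape (4,): 1d, so a.T is a no-op
b = a[:, None]                  # shape (4, 1): an explicit column

print(a * a)                    # elementwise -> [ 1  4  9 16]
print(np.dot(a, a))             # inner product -> 30
print(a[:, None] * a[None, :])  # broadcasting -> (4, 4) outer product
print((b * b.T).shape)          # (4, 1) * (1, 4) -> (4, 4)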
"Transposing" is, from a numpy perspective, really only a meaningful concept for two-dimensional structures:
>>> import numpy
>>> arr = numpy.array([1,2,3,4])
>>> arr.shape
(4,)
>>> arr.transpose().shape
(4,)
So, if you want to transpose something, you'll have to make it two-dimensional:
>>> arr_2d = arr.reshape((4,1)) ## four rows, one column -> two-dimensional
>>> arr_2d.shape
(4, 1)
>>> arr_2d.transpose().shape
(1, 4)
Also, numpy.array(iterable, **kwargs) has a keyword argument ndmin, which, when set to ndmin=2, prepends your desired shape with as many 1s as necessary:
>>> arr_ndmin = numpy.array([1,2,3,4],ndmin=2)
>>> arr_ndmin.shape
(1, 4)
Yes, they do.
Your question is already answered, though I assume you are a MATLAB user? If so, you may find this guide useful: Moving from MATLAB matrices to NumPy arrays

numpy moving window percent cover

I have a classified raster that I am reading into a numpy array. (n classes)
I want to use a 2d moving window (e.g. 3 by 3) to create an n-dimensional vector that stores the % cover of each class within the window. Because the raster is large, it would be useful to store this information so as not to re-compute it each time... therefore I think the best solution is creating a 3d array to act as the vector. A new raster will be created based on these %/count values.
My idea is to:
1) create a 3d array with n+1 'bands'
2) band 1 = the original classified raster; each other 'band' holds the count of cells of one class within the window (i.e. one band per class). For example:
[[2 0 1 2 1]
 [2 0 2 0 0]
 [0 1 1 2 1]
 [0 2 2 1 1]
 [0 1 2 1 1]]

[[2 2 3 2 2]
 [3 3 3 2 2]
 [3 3 2 2 2]
 [3 3 0 0 0]
 [2 2 0 0 0]]

[[0 1 1 2 1]
 [1 3 3 4 2]
 [1 2 3 4 3]
 [2 3 5 6 5]
 [1 1 3 4 4]]

[[2 3 2 2 1]
 [2 3 3 3 2]
 [2 4 4 3 1]
 [1 3 5 3 1]
 [1 3 3 2 0]]
4) read these bands into a vrt so it only needs to be created once... and can be read in for further modules.
Question: what is the most efficient 'moving window' method to 'count' within the window?
Currently I am trying, and failing, with the following code:
def lcc_binary_vrt(raster, dim, bands):
    footprint = np.zeros(shape=(dim, dim), dtype=int) + 1
    g = gdal.Open(raster)
    data = gdal_array.DatasetReadAsArray(g)
    # loop through the band values
    for i in bands:
        print(i)
        # create a duplicate '0' array of the raster
        a_band = data * 0
        # we create the binary dataset for the band
        a_band = np.where(data == i, 1, a_band)
        count_a_band_fname = raster[:-4] + '_' + str(i) + '.tif'
        # run the moving window (footprint) across the band to create a 'count';
        # pass np.count_nonzero itself, not a call to it
        count_a_band = ndimage.generic_filter(a_band, np.count_nonzero,
                                              footprint=footprint, mode='constant')
        geoTiff.create(count_a_band_fname, g, data, count_a_band, gdal.GDT_Byte, np.nan)
Any suggestions very much appreciated.
Becky
I don't know anything about the spatial sciences stuff, so I'll just focus on the main question :)
what is the most efficient 'moving window' method to 'count' within the window?
A common way to do moving window statistics with Numpy is to use numpy.lib.stride_tricks.as_strided, see for example this answer. Basically, the idea is to make an array containing all the windows, without any increase in memory usage:
from numpy.lib.stride_tricks import as_strided
...
m, n = a_band.shape
newshape = (m - dim + 1, n - dim + 1, dim, dim)
newstrides = a_band.strides * 2  # strides is a tuple; * 2 repeats it
count_a_band = as_strided(a_band, newshape, newstrides).sum(axis=(2, 3))
However, for your use case this method is inefficient, because you're summing the same numbers over and over again, especially if the window size increases. A better way is to use a cumsum trick, like in this answer:
def windowed_sum_1d(ar, ws, axis=None):
    if axis is None:
        ar = ar.ravel()
    else:
        ar = np.swapaxes(ar, axis, 0)
    ans = np.cumsum(ar, axis=0)
    ans[ws:] = ans[ws:] - ans[:-ws]
    ans = ans[ws-1:]
    if axis:
        ans = np.swapaxes(ans, 0, axis)
    return ans

def windowed_sum(ar, ws):
    for axis in range(ar.ndim):
        ar = windowed_sum_1d(ar, ws, axis)
    return ar
...
count_a_band = windowed_sum(a_band, dim)
Note that in both code snippets above it would be tedious to handle the edge cases. Luckily, there is an easy way to include them and get the same efficiency as the second approach:
count_a_band = ndimage.uniform_filter(a_band, size=dim, mode='constant') * dim**2
Though very similar to what you already had, this will be much faster! The downside is that you may need to round to integers to get rid of floating point rounding errors.
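A quick sanity check on a small array (a sketch; uniform_filter keeps the input shape by zero-padding at the edges, while windowed_sum only returns complete windows, so we compare the interior):

import numpy as np
from scipy import ndimage

a = np.random.randint(0, 2, size=(6, 6))

full = np.round(ndimage.uniform_filter(a.astype(float), size=3,
                                       mode='constant') * 9).astype(int)
inner = windowed_sum(a, 3)  # the function defined above

assert np.array_equal(full[1:-1, 1:-1], inner)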
As a final note, your code
# create a duplicate '0' array of the raster
a_band = data*0
# we create the binary dataset for the band
a_band = np.where(data == i, 1, a_band)
is a bit redundant: You can just use a_band = (data == i).
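Since the question ultimately asks for % cover, note that the window mean of that boolean band is already the fraction you want, so a sketch of the whole per-class computation could be as short as:

# fraction of cells of class i within each dim x dim window
pct_cover = ndimage.uniform_filter((data == i).astype(float),
                                   size=dim, mode='constant')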
