Pandas: vectorization with function on two dataframes - python

I'm having trouble with implementing vectorization in pandas. Let me preface this by saying I am a total newbie to vectorization so it's extremely likely that I'm getting some syntax wrong.
Let's say I've got two pandas dataframes.
Dataframe one describes the x,y coordinates of some circles with radius R, with unique IDs.
>>> import pandas as pd
>>> import numpy as np
>>> data1 = {'ID': [1, 2], 'x': [1, 10], 'y': [1, 10], 'R': [4, 5]}
>>> df_1 = pd.DataFrame(data=data1)
>>>
>>> df_1
   ID   x   y  R
0   1   1   1  4
1   2  10  10  5
Dataframe two describes the x,y coordinates of some points, also with unique IDs.
>>> data2 = {'ID': [3, 4, 5], 'x': [1, 3, 9], 'y': [2, 5, 9]}
>>> df_2 = pd.DataFrame(data=data2)
>>>
>>> df_2
   ID  x  y
0   3  1  2
1   4  3  5
2   5  9  9
Now, imagine plotting the circles and the points on a 2D plane; some of the points will reside inside the circles.
All I want to do is create a new column in df_2 called "host_circle" that indicates the ID of the circle that each point resides in. If a point does not reside in any circle, the value should be "None".
My desired output would be
>>> df_2
   ID  x  y host_circle
0   3  1  2           1
1   4  3  5        None
2   5  9  9           2
First, define a function that checks if a given particle (x2,y2) resides inside a given circle (x1,y1,R1,ID_1). If it does, return the ID of the circle; else, return None.
>>> def func(x1, y1, R1, ID_1, x2, y2):
...     dist = np.sqrt((x1 - x2)**2 + (y1 - y2)**2)
...     if dist < R1:
...         return ID_1
...     else:
...         return None
Next, the actual vectorization. I'm sorta lost here. I think it should be something like
df_2['host']=func(df_1['x'],df_1['y'],df_1['R'],df_1['ID'],df_2['x'],df_2['y'])
but that just throws errors. Can someone help me?
One final note: the actual data I'm working with is VERY large (tens of millions of rows), so speed is crucial; hence why I'm trying to make vectorization work.

Numba v1
You might have to install numba with
pip install numba
Then use Numba's JIT compiler via the njit function decorator:
import numpy as np
from numba import njit

@njit
def distances(point, points):
    return ((points - point) ** 2).sum(1) ** .5

@njit
def find_my_circle(point, circles):
    points = circles[:, :2]
    radii = circles[:, 2]
    dist = distances(point, points)
    mask = dist < radii
    i = mask.argmax()
    return i if mask[i] else -1  # -1 is the sentinel for "no host circle"

@njit
def find_my_circles(points, circles):
    n = len(points)
    out = np.zeros(n, np.int64)
    for i in range(n):
        out[i] = find_my_circle(points[i], circles)
    return out
# nan is appended at the end so that the -1 sentinel indexes it
ids = np.append(df_1.ID.values, np.nan)
points = df_2[['x', 'y']].values
i = find_my_circles(points, df_1[['x', 'y', 'R']].values)
df_2['host_circle'] = ids[i]
df_2
   ID  x  y  host_circle
0   3  1  2          1.0
1   4  3  5          NaN
2   5  9  9          2.0
This iterates row by row, finding the host circle one point at a time. The distance calculation within each row is still vectorized, and with numba the loop should be very fast. The massive benefit is that you don't occupy tons of memory.
Numba v2
This one is more loopy but short circuits when it finds a host
import numpy as np
from numba import njit

@njit
def distance(a, b):
    return ((a - b) ** 2).sum() ** .5

@njit
def find_my_circles(points, circles):
    n = len(points)
    m = len(circles)
    out = -np.ones(n, np.int64)  # -1 means "no host circle"
    centers = circles[:, :2]
    radii = circles[:, 2]
    for i in range(n):
        for j in range(m):
            if distance(points[i], centers[j]) < radii[j]:
                out[i] = j
                break
    return out
ids = np.append(df_1.ID.values, np.nan)
points = df_2[['x', 'y']].values
i = find_my_circles(points, df_1[['x', 'y', 'R']].values)
df_2['host_circle'] = ids[i]
df_2
Vectorized
But still problematic for large inputs: it materializes the full points-by-circles distance matrix in memory.
c = ['x', 'y']
centers = df_1[c].values
points = df_2[c].values
radii = df_1['R'].values
i, j = np.where(((points[:, None] - centers) ** 2).sum(2) ** .5 < radii)
df_2.loc[df_2.index[i], 'host_circle'] = df_1['ID'].iloc[j].values
df_2
   ID  x  y  host_circle
0   3  1  2          1.0
1   4  3  5          NaN
2   5  9  9          2.0
Explanation
The distance from any point to the center of a circle is
((x1 - x0) ** 2 + (y1 - y0) ** 2) ** .5
I can use broadcasting if I extend one of my arrays into a third dimension
points[:, None] - centers
array([[[ 0,  1],
        [-9, -8]],

       [[ 2,  4],
        [-7, -5]],

       [[ 8,  8],
        [-1, -1]]])
That is all six combinations of vector differences. Now to calculate the distances.
((points[:, None] - centers) ** 2).sum(2) ** .5
array([[ 1.        , 12.04159458],
       [ 4.47213595,  8.60232527],
       [11.3137085 ,  1.41421356]])
That's all six combinations of distances, and I can compare against the radii to see which points fall within the circles
((points[:, None] - centers) ** 2).sum(2) ** .5 < radii
array([[ True, False],
       [False, False],
       [False,  True]])
Ok, I want to find where the True values are. That is a perfect use case for np.where. It will give me two arrays, the first will be the row positions, the second the column positions of where these True values are. Turns out, the row positions are the points and column positions are the circles.
i, j = np.where(((points[:, None] - centers) ** 2).sum(2) ** .5 < radii)
Now I just have to slice df_2 with i somehow and assign to it values I get from df_1 using j somehow... But I showed that above.
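If you'd rather get None (as in the desired output) than NaN for unmatched points, here is a minimal sketch building on the i and j arrays from above:

df_2['host_circle'] = None
df_2.loc[df_2.index[i], 'host_circle'] = df_1['ID'].iloc[j].values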

Try this. I have modified your function a bit and return a list, assuming a point may fall inside more than one circle; you can modify it if that's not the case. The list will be empty if a point does not reside in any circle.
def func(df, x2, y2):
    val = df.apply(lambda row: np.sqrt((row['x'] - x2)**2 + (row['y'] - y2)**2) < row['R'], axis=1)
    return list(val.index[val])  # df_1 row indices whose circles contain the point

df_2['host'] = df_2.apply(lambda row: func(df_1, row['x'], row['y']), axis=1)

Related

How to Find the Max Plot of Land From a 2D Array?

Sorry if the title was a little confusing; I didn't know what to call it. I'm still new to programming and I'm stuck on a coding problem with no idea where to start.
Here is the summarized version of the problem:
I have a randomized plot of land; let's call its dimensions x and y. This plot of land is a 2D array of numbers that can be negative or positive. Now, there will be another, smaller plot of randomized dimensions; let's call them width and height. With these new variables I need to find the width by height region of the x by y array with the greatest sum.
All numbers will be valid integers.
x ≥ width > 0
y ≥ height > 0
I need to output the largest sum of land in the x by y plot over any region that is width by height in size.
Here is an example:
3 - randomly picked y value
4 - randomly picked x value
2 - randomly picked height
1 - randomly picked width

 1  2  3  4
-1  0 -1  9
-4  1 -2  7
Now, you can see from the example that the output will be 16, because the biggest 1x2 plot in the 4x3 plot sums to 16 (the 9 and 7 in the last column). I was wondering if anyone could point me in the right direction and give me tips on where to start. I have tried researching this, but it has led nowhere because I have no idea what to look up.
A summed-area table seems to be an interesting way to tackle this problem. If I'm not mistaken such an algorithm would be linear in the number of cells (x*y).
The basic idea of a summed-area table is that the sum of a subparcel can be calculated by adding the values for two corners and subtracting the values of the opposite corners, as explained in the Wikipedia article.
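For example, with the cumulative sums for the question's parcel (printed in the output below), the 1x2 block covering the 9 and 7 (rows 1-2 of column 3, zero-based) works out to sums[2, 3] - sums[0, 3] - sums[2, 2] + sums[0, 2] = 19 - 10 - (-1) + 6 = 16.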
Numpy's cumsum helps to quickly create the summed-area table. Maybe there is also a numpy way to calculate the areas?
Here's my sample code (note that numpy indexes the vertical direction first, then the horizontal). The tests inside the loop could be skipped if we added an extra row and column of zeros (but that would make the code slightly harder to understand).
import numpy as np

def find_highest_area_sum(parcel, x, y, width, height):
    sums = np.cumsum(np.cumsum(parcel, axis=0), axis=1)
    areas = np.zeros((y - height + 1, x - width + 1), dtype=sums.dtype)
    print("Given parcel:")
    print(parcel)
    print("Cumulative area sums:")
    print(sums)
    for i in range(x - width + 1):
        for j in range(y - height + 1):
            areas[j, i] = sums[j + height - 1, i + width - 1]
            if i > 0:
                areas[j, i] -= sums[j + height - 1, i - 1]
            if j > 0:
                areas[j, i] -= sums[j - 1, i + width - 1]
            if i > 0 and j > 0:
                areas[j, i] += sums[j - 1, i - 1]
    print("Areas of each subparcel:")
    print(areas)
    ind_highest = np.unravel_index(np.argmax(areas), areas.shape)
    print(f'The highest area sum is {areas[ind_highest]} at pos ({ind_highest[1]}, {ind_highest[0]}) '
          f'to pos ({ind_highest[1] + width - 1}, {ind_highest[0] + height - 1})')

x, y = 4, 3
width, height = 1, 2
parcel = np.array([[1, 2, 3, 4],
                   [-1, 0, -1, 9],
                   [-4, 1, -2, 7]])
find_highest_area_sum(parcel, x, y, width=1, height=2)

x = 12
y = 20
parcel = np.random.randint(-10, 20, (y, x))
find_highest_area_sum(parcel, x, y, width=10, height=12)
Output of the first part:
Given parcel:
[[ 1  2  3  4]
 [-1  0 -1  9]
 [-4  1 -2  7]]
Cumulative area sums:
[[ 1  3  6 10]
 [ 0  2  4 17]
 [-4 -1 -1 19]]
Areas of each subparcel:
[[ 0  2  2 13]
 [-5  1 -3 16]]
The highest area sum is 16 at pos (3, 1) to pos (3, 2)
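As for a numpy-only way to calculate the areas: one option (a sketch; it pads the summed-area table with a zero row and column so the edge tests disappear) is to evaluate the four-corner formula for every window position at once with slicing:

import numpy as np

def window_sums(parcel, width, height):
    # summed-area table with a zero row/column prepended
    s = np.cumsum(np.cumsum(parcel, axis=0), axis=1)
    s = np.pad(s, ((1, 0), (1, 0)))
    # four-corner formula for all window positions at once
    return (s[height:, width:] - s[height:, :-width]
            - s[:-height, width:] + s[:-height, :-width])

For the parcel above, window_sums(parcel, 1, 2) reproduces the "Areas of each subparcel" array.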

How to apply euclidean distance function to a groupby object in pandas dataframe?

I have a set of objects and their positions over time. I would like to get the average distance between objects for each time point. An example dataframe is as follows:
import pandas as pd

time = [0, 0, 0, 1, 1, 2, 2]
x = [216, 218, 217, 280, 290, 130, 132]
y = [13, 12, 12, 110, 109, 3, 56]
car = [1, 2, 3, 1, 3, 4, 5]
df = pd.DataFrame({'time': time, 'x': x, 'y': y, 'car': car})
df
        x    y  car
time
0     216   13    1
0     218   12    2
0     217   12    3
1     280  110    1
1     290  109    3
2     130    3    4
2     132   56    5
The end result I would like to have is:
df2
average distance
between cars
time
0 1.55
1 10.05
2 53.04
Any idea on how to proceed? I've been trying to apply the scipy.spatial.distance functions to the dataframe, but I'm not sure how to apply them to df.groupby('time') and then get the mean value of all those distances.
Any help appreciated!
You could pass an array of the points to scipy.spatial.distance.pdist and it will calculate all pair-wise distances between observations i and j for i < j. Then take the mean.
import numpy as np
from scipy import spatial
df.groupby('time').apply(lambda x: spatial.distance.pdist(np.array(list(zip(x.x, x.y)))).mean())
Outputs:
time
0 1.550094
1 10.049876
2 53.037722
dtype: float64
For me, using apply or a for loop does not make much difference
l1 = []
l2 = []
for y, x in df.groupby('time'):
    v = np.triu(spatial.distance.cdist(x[['x', 'y']].values, x[['x', 'y']].values), k=0)
    v = np.ma.masked_equal(v, 0)  # mask the zeros from the lower triangle and diagonal
    l2.append(np.mean(v))
    l1.append(y)
pd.DataFrame({'ave': l2}, index=l1)
Out[250]:
ave
0 1.550094
1 10.049876
2 53.037722
Building this up from first principles:
For each point at index n, it is necessary to compute the distance to all the points with index > n.
If the distance between two points is given by the formula
np.sqrt((x0 - x1)**2 + (y0 - y1)**2)
then for an array of points in a dataframe, we can get all the distances and then calculate their mean:
distances = []
for i in range(len(df) - 1):
    distances += np.sqrt((df.x[i+1:] - df.x[i])**2 + (df.y[i+1:] - df.y[i])**2).tolist()
np.mean(distances)
Expressing the same logic using pd.concat and a couple of helper functions:
def diff_sq(x, i):
    return (x.iloc[i+1:] - x.iloc[i])**2

def dist_df(x, y, i):
    d_sq = diff_sq(x, i) + diff_sq(y, i)
    return np.sqrt(d_sq)

def avg_dist(df):
    return pd.concat([dist_df(df.x, df.y, i) for i in range(len(df) - 1)]).mean()
Then it is possible to use the avg_dist function with groupby:
df.groupby('time').apply(avg_dist)
# outputs:
time
0 1.550094
1 10.049876
2 53.037722
dtype: float64
You could also use the itertools package to define your own function as follows:
import itertools
import numpy as np

def combinations(series):
    l = list()
    for item in itertools.combinations(series, 2):
        l.append((item[0] - item[1])**2)
    return l

df2 = df.groupby('time').agg(combinations)
df2['avg_distance'] = [np.mean(np.sqrt(pd.Series(df2.iloc[k, 0]) +
                                       pd.Series(df2.iloc[k, 1]))) for k in range(len(df2))]
df2.avg_distance.to_frame()
Then, the output is:
avg_distance
time
0 1.550094
1 10.049876
2 53.037722

Python: how to reduce the dimension of a matrix by doing the sum of the first neighbors?

Suppose we have an N x M matrix and we want to reduce its dimension, preserving the values, by summing first neighbors (2x2 blocks).
Suppose the matrix A is a 4x4 matrix:
A =
 3  4  5  6
 2  3  4  5
 2  2  0  1
 5  2  2  3
we want to reduce it to a 2x2 matrix as follows:
A1 =
 12  20
 11   6
In particular, my matrix represents the number of incident cases in an x-y plane; my actual matrix is 103x159, and plotting it shows the incident counts per cell. What I want to do is aggregate those data over bigger areas (coarser cells).
Assuming you're using a numpy.matrix:
import numpy as np

A = np.matrix([
    [3, 4, 5, 6],
    [2, 3, 4, 5],
    [2, 2, 0, 1],
    [5, 2, 2, 3]
])

N, M = A.shape
assert N % 2 == 0
assert M % 2 == 0

A1 = np.empty((N//2, M//2))
for i in range(N//2):
    for j in range(M//2):
        A1[i, j] = A[2*i:2*i+2, 2*j:2*j+2].sum()
Though these loops can probably be optimized away by proper numpy functions.
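For instance, one numpy way to remove the loops (a sketch, assuming N and M are even): reshape so every 2x2 block gets its own pair of axes, then sum over those axes.

import numpy as np

A = np.array([[3, 4, 5, 6],
              [2, 3, 4, 5],
              [2, 2, 0, 1],
              [5, 2, 2, 3]])

N, M = A.shape
# block A[2i:2i+2, 2j:2j+2] becomes the slice [i, :, j, :], which the sum collapses
A1 = A.reshape(N // 2, 2, M // 2, 2).sum(axis=(1, 3))
print(A1)  # [[12 20]
           #  [11  6]]

(With a numpy.matrix, convert via np.asarray(A) first, since matrix objects are always 2-D.)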
I see that there is a solution using numpy.matrix; maybe you can test my solution too and give your feedback.
It works with an a x b matrix if a and b are even; it may fail if a or b is odd.
Here is my solution:
v = [
    [3, 4, 5, 6],
    [2, 3, 4, 5],
    [2, 2, 0, 1],
    [5, 2, 2, 3]
]

def shape(v):
    return len(v), len(v[0])

def chunks(v, step):
    """
    Chunk each row step by step and sum
    Example: step = 2
    [3,4,5,6] => [7,11]
    [2,3,4,5] => [5,9]
    [2,2,0,1] => [4,1]
    [5,2,2,3] => [7,5]
    """
    for i in v:
        for k in range(0, len(i), step):
            yield sum(j for j in i[k:k+step])

def sum_chunks(k, step):
    """
    Sum near values with step
    Example: step = 2
    [
     [7,11],      [
     [5,9],   =>   [12, 11],
     [4,1],        [20, 6]
     [7,5]        ]
    ]
    """
    a, c = [k[i::step] for i in range(step)], []
    for m in a:
        # sum near values
        c.append([sum(m[j:j+2]) for j in range(0, len(m), 2)])
    return c

rows, columns = shape(v)
chunk_list = list(chunks(v, columns // 2))
final_sum = sum_chunks(chunk_list, rows // 2)
print(final_sum)  # note: the 2x2 blocks come out transposed relative to A1 in the question
Output:
[[12, 11], [20, 6]]

"Slice" a number into three random numbers

I need to generate a file filled with three "random" values per line (10 lines), and the sum of those three values must equal 15.
The structure is: "INDEX A B C".
Example:
1 15 0 0
2 0 15 0
3 0 0 15
4 1 14 0
5 2 13 0
6 3 12 0
7 4 11 0
8 5 10 0
9 6 9 0
10 7 8 0
If you want to avoid needing to create (or iterate through) the full space of satisfying permutations (which, for large N, is important), then you can solve this problem with sequential sampling.
My first approach was to just draw a value uniformly from [0, N], call it x. Then draw a value uniformly from [0, N-x] and call it y, then set z = N - x - y. If you then shuffle these three, you'll get back a reasonable draw from the space of solutions, but it won't be exactly uniform.
As an example, consider N=3. The very first draw gives x=3 with probability 1/4, which forces a permutation of (3, 0, 0), even though under a uniform distribution each of the 10 possible triplets would appear with probability 1/10. So this privileges values that contain a high max.
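A quick simulation makes the bias visible (a sketch of the naive draw described above):

import random
from collections import Counter

def naive_triplet(N):
    x = random.randint(0, N)
    y = random.randint(0, N - x)
    t = [x, y, N - x - y]
    random.shuffle(t)
    return tuple(t)

print(Counter(naive_triplet(3) for _ in range(10**5)))
# each permutation of (3, 0, 0) shows up with probability ~1/8 rather than the uniform 1/10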
You can perfectly counterbalance this effect by sampling the first value x proportionally to how many values will be possible for y conditioned on x. So for example, if x happened to be N, then there is only 1 compatible value for y, but if x is 0, then there are 4 compatible values, namely 0 through 3.
In other words, let Pr(X=x) be (N-x+1)/sum_i(N-i+1) for i from 0 to N. Then let Pr(Y=y | X=x) be uniform on [0, N-x].
This works out to P(X=x, Y=y) = P(Y=y | X=x) * P(X=x) = [1/(N-x+1)] * [(N-x+1)/sum_i(N-i+1)], which is seen to be uniform, 1/sum_i(N-i+1), for each candidate triplet.
Note that sum(N-i+1 for i in range(0, N+1)) gives the number of different ways to sum 3 non-negative integers to get N. I don't know a good proof of this, and would be happy if someone adds one to the comments!
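For what it's worth, the closed form of that sum is (N+1)(N+2)/2, which matches the stars-and-bars count C(N+2, 2) of ways to write N as an ordered sum of 3 non-negative integers. A brute-force sanity check (a sketch):

from itertools import product

N = 15
brute = sum(1 for t in product(range(N + 1), repeat=3) if sum(t) == N)
assert brute == sum(N - i + 1 for i in range(N + 1)) == (N + 1) * (N + 2) // 2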
Here's a solution that will sample this way:
import random
from collections import Counter

def discrete_sample(weights):
    u = random.uniform(0, 1)
    w_t = 0
    for i, w in enumerate(weights):
        w_t += w
        if u <= w_t:
            return i
    return len(weights) - 1

def get_weights(N):
    vals = [(N - i + 1.0) for i in range(0, N + 1)]
    totl = sum(vals)
    return [v / totl for v in vals]

def draw_summing_triplet(N):
    weights = get_weights(N)
    x = discrete_sample(weights)
    y = random.randint(0, N - x)
    triplet = [x, y, N - x - y]
    random.shuffle(triplet)
    return tuple(triplet)
Much credit goes to @DSM in the comments for questioning my original answer and providing good feedback.
In this case, we can test out the sampler like this:
foo = Counter(draw_summing_triplet(3) for i in range(10**6))
print(foo)

Counter({(1, 2, 0): 100381,
         (0, 2, 1): 100250,
         (1, 1, 1): 100027,
         (2, 1, 0): 100011,
         (0, 3, 0): 100002,
         (3, 0, 0): 99977,
         (2, 0, 1): 99972,
         (1, 0, 2): 99854,
         (0, 0, 3): 99782,
         (0, 1, 2): 99744})
If the numbers can be any values that satisfy the sum, just use combinations:
from itertools import combinations

with open("rand.txt", "w") as f:
    combs = [x for x in combinations(range(16), 3) if sum(x) == 15][:10]
    for a, b, c in combs:
        f.write("{} {} {}\n".format(a, b, c))
This seems straightforward to me and it utilizes the random module.
import random

def foo(x):
    a = random.randint(0, x)
    b = random.randint(0, x - a)
    c = x - (a + b)
    return (a, b, c)

for i in range(100):
    print(foo(15))

Subtract all pairs of values from two arrays

I have two vectors, v1 and v2. I'd like to subtract each value of v2 from each value of v1 and store the results in another vector. I also would like to work with very large vectors (e.g. 1e6 size), so I think I should be using numpy for performance.
Up until now I have:
import numpy
v1 = numpy.random.uniform(-1, 1, size=100)
v2 = numpy.random.uniform(-1, 1, size=100)

vdiff = []
for value in v1:
    vdiff.extend([value - v2])
This creates a list with 100 entries, each entry being an array of size 100. I don't know if this is the most efficient way to do this though.
I'd like to calculate the 1e4 desired values very fast with the smallest object size (memory wise) possible.
You're not going to have very much fun with the giant arrays that you mentioned (two 1e6-element vectors imply a result with 10^12 entries). But if you have more reasonably-sized matrices (small enough that the result can fit in memory), the best way to do this is with broadcasting.
import numpy as np
a = np.array(range(5, 10))
b = np.array(range(2, 6))
res = a[None, :] - b[:, None]
print(res)
# [[3 4 5 6 7]
# [2 3 4 5 6]
# [1 2 3 4 5]
# [0 1 2 3 4]]
np.subtract.outer
You can use np.ufunc.outer with np.subtract and then transpose:
a = np.array(range(5, 10))
b = np.array(range(2, 6))
res1 = np.subtract.outer(a, b).T
res2 = a[None, :] - b[:, None]
assert np.array_equal(res1, res2)
Performance is comparable between the two methods.
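If the vectors really are as large as the question suggests, neither form fits in memory at once. One workaround (a sketch; the block size is an assumption to tune) is to produce the table in row blocks and reduce or store each block before computing the next:

import numpy as np

def pairwise_diff_blocks(a, b, block=10_000):
    # yields (block, len(a))-shaped slabs of the full difference table
    for start in range(0, len(b), block):
        yield a[None, :] - b[start:start + block, None]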
