I'm having trouble with implementing vectorization in pandas. Let me preface this by saying I am a total newbie to vectorization so it's extremely likely that I'm getting some syntax wrong.
Let's say I've got two pandas dataframes.
Dataframe one describes the x,y coordinates of some circles with radius R, with unique IDs.
>>> data1 = {'ID': [1, 2], 'x': [1, 10], 'y': [1, 10], 'R': [4, 5]}
>>> df_1=pd.DataFrame(data=data1)
>>>
>>> df_1
ID x y R
1 1 1 4
2 10 10 5
Dataframe two describes the x,y coordinates of some points, also with unique IDs.
>>> data2 = {'ID': [3, 4, 5], 'x': [1, 3, 9], 'y': [2, 5, 9]}
>>> df_2=pd.DataFrame(data=data2)
>>>
>>> df_2
ID x y
3 1 2
4 3 5
5 9 9
Now, imagine plotting the circles and the points on a 2D plane. Some of the points will reside inside the circles. See the image below.
All I want to do is create a new column in df_2 called "host_circle" that indicates the ID of the circle that each point resides in. If the point does not reside in a circle, the value should be "None".
My desired output would be
>>> df_2
ID x y host_circle
3 1 2 1
4 3 5 None
5 9 9 2
First, define a function that checks if a given point (x2,y2) resides inside a given circle (x1,y1,R1,ID_1). If it does, return the ID of the circle; else, return None.
>>> def func(x1,y1,R1,ID_1,x2,y2):
...     dist = np.sqrt( (x1-x2)**2 + (y1-y2)**2 )
...     if dist < R1:
...         return ID_1
...     else:
...         return None
Next, the actual vectorization. I'm sorta lost here. I think it should be something like
df_2['host']=func(df_1['x'],df_1['y'],df_1['R'],df_1['ID'],df_2['x'],df_2['y'])
but that just throws errors. Can someone help me?
One final note: My actual data I'm working with is VERY large; tens of millions of rows. Speed is crucial, hence why I'm trying to make vectorization work.
Numba v1
You might have to install numba with
pip install numba
Then use Numba's JIT compiler via the njit function decorator
import numpy as np
from numba import njit

@njit
def distances(point, points):
    # Euclidean distance from one point to every circle center
    return ((points - point) ** 2).sum(1) ** .5

@njit
def find_my_circle(point, circles):
    points = circles[:, :2]
    radii = circles[:, 2]
    dist = distances(point, points)
    mask = dist < radii
    i = mask.argmax()
    # position of the first circle containing the point, or -1 if there is none
    return i if mask[i] else -1

@njit
def find_my_circles(points, circles):
    n = len(points)
    out = np.zeros(n, np.int64)
    for i in range(n):
        out[i] = find_my_circle(points[i], circles)
    return out

# -1 indexes the NaN appended at the end, so points without a host get NaN
ids = np.append(df_1.ID.values, np.nan)
points = df_2[['x', 'y']].values
i = find_my_circles(points, df_1[['x', 'y', 'R']].values)
df_2['host_circle'] = ids[i]
df_2
ID x y host_circle
0 3 1 2 1.0
1 4 3 5 NaN
2 5 9 9 2.0
This iterates over the points one at a time and looks for each point's host circle. The search over the circles for a single point is still vectorized, and the compiled loop should be very fast. The big benefit is that you never hold the full points-by-circles array of distances in memory.
Numba v2
This one uses explicit loops throughout but short-circuits as soon as it finds a host
import numpy as np
from numba import njit

@njit
def distance(a, b):
    return ((a - b) ** 2).sum() ** .5

@njit
def find_my_circles(points, circles):
    n = len(points)
    m = len(circles)
    # -1 means "no host circle found"
    out = -np.ones(n, np.int64)
    centers = circles[:, :2]
    radii = circles[:, 2]
    for i in range(n):
        for j in range(m):
            if distance(points[i], centers[j]) < radii[j]:
                out[i] = j
                break  # stop at the first circle that contains the point
    return out

ids = np.append(df_1.ID.values, np.nan)
points = df_2[['x', 'y']].values
i = find_my_circles(points, df_1[['x', 'y', 'R']].values)
df_2['host_circle'] = ids[i]
df_2
Vectorized
But still problematic
c = ['x', 'y']
centers = df_1[c].values
points = df_2[c].values
radii = df_1['R'].values
i, j = np.where(((points[:, None] - centers) ** 2).sum(2) ** .5 < radii)
df_2.loc[df_2.index[i], 'host_circle'] = df_1['ID'].iloc[j].values
df_2
ID x y host_circle
0 3 1 2 1.0
1 4 3 5 NaN
2 5 9 9 2.0
Explanation
The distance of a point (x1, y1) from the center (x0, y0) of a circle is
((x1 - x0) ** 2 + (y1 - y0) ** 2) ** .5
I can use broadcasting if I extend one of my arrays into a third dimension
points[:, None] - centers
array([[[ 0, 1],
[-9, -8]],
[[ 2, 4],
[-7, -5]],
[[ 8, 8],
[-1, -1]]])
That is all six combinations of vector differences. Now to calculate the distances.
((points[:, None] - centers) ** 2).sum(2) ** .5
array([[ 1. , 12.04159458],
[ 4.47213595, 8.60232527],
[11.3137085 , 1.41421356]])
That's all 6 combinations of distances, and I can compare them against the radii to see which points are within which circles
((points[:, None] - centers) ** 2).sum(2) ** .5 < radii
array([[ True, False],
[False, False],
[False, True]])
Ok, I want to find where the True values are. That is a perfect use case for np.where. It will give me two arrays, the first will be the row positions, the second the column positions of where these True values are. Turns out, the row positions are the points and column positions are the circles.
i, j = np.where(((points[:, None] - centers) ** 2).sum(2) ** .5 < radii)
Now I just have to slice df_2 with i somehow and assign to it values I get from df_1 using j somehow... But I showed that above.
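The full broadcast builds a points-by-circles array, which is presumably what makes it problematic at tens of millions of rows. The same broadcast can be applied to one slice of the points at a time. The following is a minimal sketch under that assumption, not a tuned implementation: the function name `host_circle_ids` and the `chunk_size` parameter are made up for illustration, radii are assumed non-negative, and a point inside several circles gets the first matching circle.

```python
import numpy as np

def host_circle_ids(points, centers, radii, ids, chunk_size=100_000):
    """Return, for each point, the ID of the first circle containing it, or NaN."""
    out = np.full(len(points), np.nan)
    for start in range(0, len(points), chunk_size):
        block = points[start:start + chunk_size]
        # (chunk, 1, 2) - (circles, 2) broadcasts to (chunk, circles, 2)
        diff = block[:, None, :] - centers
        # compare squared distances with squared radii to skip the square root
        inside = (diff ** 2).sum(axis=2) < radii ** 2
        has_host = inside.any(axis=1)
        first = inside.argmax(axis=1)  # position of the first True in each row
        # the slice is a view, so this writes into `out`
        out[start:start + chunk_size][has_host] = ids[first[has_host]]
    return out

centers = np.array([[1, 1], [10, 10]])
radii = np.array([4, 5])
ids = np.array([1, 2])
points = np.array([[1, 2], [3, 5], [9, 9]])
result = host_circle_ids(points, centers, radii, ids, chunk_size=2)
print(result)
```

With the example frames, the arrays would come from `df_1[['x', 'y']].values`, `df_1['R'].values`, `df_1['ID'].values` and `df_2[['x', 'y']].values`, and the result could be assigned to `df_2['host_circle']`. Peak memory then scales with `chunk_size` times the number of circles rather than with the full product.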
Try this. I have modified your function a bit: it returns a list of circle IDs, on the assumption that more than one circle may contain a given point. You can change that if it is not the case. The list is empty when the point does not lie in any of the circles.
def func(df, x2,y2):
    # boolean Series: True for every circle that contains the point
    val = df.apply(lambda row: np.sqrt((row['x']-x2)**2 + (row['y']-y2)**2) < row['R'], axis=1)
    return list(df.loc[val, 'ID'])
df_2['host'] = df_2.apply(lambda row: func(df_1, row['x'],row['y']), axis=1)
I have a skeletonised image (shown below).
I would like to get the intersections of the lines. I have tried the method below; skeleton is an OpenCV image and the algorithm returns a list of coordinates:
def getSkeletonIntersection(skeleton):
    image = skeleton.copy()
    image = image/255
    intersections = list()
    for y in range(1,len(image)-1):
        for x in range(1,len(image[y])-1):
            if image[y][x] == 1:
                neighbourCount = 0
                neighbours = neighbourCoords(x,y)
                for n in neighbours:
                    if (image[n[1]][n[0]] == 1):
                        neighbourCount += 1
                if(neighbourCount > 2):
                    print(neighbourCount,x,y)
                    intersections.append((x,y))
    return intersections
It finds the coordinates of white pixels that have more than two adjacent white pixels. I thought that this would only return corners but it does not - it returns many more points.
This is the output with the points it detects marked on the image. The extra points appear because it also detects patterns like the three examples below, which are not intersections.
0 0 0    1 1 0    0 1 1
1 1 1    0 1 0    1 1 0
0 0 1    0 0 1    0 0 0
And many more examples. Is there another method I should look at to detect intersections? All input and ideas appreciated, thanks.
I am not sure about OpenCV features, but you should maybe try using Hit and Miss morphology which is described here.
Read up on Line Junctions and see the 12 templates you need to test for:
I received an email recently asking for my eventual solution to the problem. It is posted below so that it can inform others. I make no claim that this code is particularly fast or stable - only that it's what worked for me! The function also filters out duplicates and intersections detected too close together, on the assumption that these are not real intersections but noise introduced by the skeletonisation process.
def neighbours(x,y,image):
    """Return 8-neighbours of image point P1(x,y), in a clockwise order"""
    img = image
    x_1, y_1, x1, y1 = x-1, y-1, x+1, y+1
    return [ img[x_1][y], img[x_1][y1], img[x][y1], img[x1][y1], img[x1][y], img[x1][y_1], img[x][y_1], img[x_1][y_1] ]
def getSkeletonIntersection(skeleton):
    """ Given a skeletonised image, it will give the coordinates of the intersections of the skeleton.

    Keyword arguments:
    skeleton -- the skeletonised image to detect the intersections of

    Returns:
    List of 2-tuples (x,y) containing the intersection coordinates
    """
    # A biiiiiig list of valid intersections             2 3 4
    # These are in the format shown to the right         1 C 5
    #                                                    8 7 6
    validIntersection = [[0,1,0,1,0,0,1,0],[0,0,1,0,1,0,0,1],[1,0,0,1,0,1,0,0],
                         [0,1,0,0,1,0,1,0],[0,0,1,0,0,1,0,1],[1,0,0,1,0,0,1,0],
                         [0,1,0,0,1,0,0,1],[1,0,1,0,0,1,0,0],[0,1,0,0,0,1,0,1],
                         [0,1,0,1,0,0,0,1],[0,1,0,1,0,1,0,0],[0,0,0,1,0,1,0,1],
                         [1,0,1,0,0,0,1,0],[1,0,1,0,1,0,0,0],[0,0,1,0,1,0,1,0],
                         [1,0,0,0,1,0,1,0],[1,0,0,1,1,1,0,0],[0,0,1,0,0,1,1,1],
                         [1,1,0,0,1,0,0,1],[0,1,1,1,0,0,1,0],[1,0,1,1,0,0,1,0],
                         [1,0,1,0,0,1,1,0],[1,0,1,1,0,1,1,0],[0,1,1,0,1,0,1,1],
                         [1,1,0,1,1,0,1,0],[1,1,0,0,1,0,1,0],[0,1,1,0,1,0,1,0],
                         [0,0,1,0,1,0,1,1],[1,0,0,1,1,0,1,0],[1,0,1,0,1,1,0,1],
                         [1,0,1,0,1,1,0,0],[1,0,1,0,1,0,0,1],[0,1,0,0,1,0,1,1],
                         [0,1,1,0,1,0,0,1],[1,1,0,1,0,0,1,0],[0,1,0,1,1,0,1,0],
                         [0,0,1,0,1,1,0,1],[1,0,1,0,0,1,0,1],[1,0,0,1,0,1,1,0],
                         [1,0,1,1,0,1,0,0]]
    image = skeleton.copy()
    image = image/255
    intersections = list()
    for x in range(1,len(image)-1):
        for y in range(1,len(image[x])-1):
            # If we have a white pixel
            if image[x][y] == 1:
                # Named so that it does not shadow the neighbours() function
                neighbourhood = neighbours(x,y,image)
                if neighbourhood in validIntersection:
                    intersections.append((y,x))
    # Remove duplicates, then keep a point only if it is at least 10 pixels
    # away from every point already kept (avoids removing items from a list
    # while iterating over it)
    filtered = []
    for point in sorted(set(intersections)):
        if all((point[0] - kept[0])**2 + (point[1] - kept[1])**2 >= 10**2 for kept in filtered):
            filtered.append(point)
    return filtered
This is also available on github here.
It might help if, for a given pixel, instead of counting the total number of 8-neighbors (neighbors under 8-connectivity), you count the number of 8-neighbors that are not 4-neighbors of each other.
So in your example of false positives
0 0 0    1 1 0    0 1 1
1 1 1    0 1 0    1 1 0
0 0 1    0 0 1    0 0 0
For every case, you have 3 neighbors, but each time, 2 of them are 4-connected. (pixels marked "2" in next snippet)
0 0 0    2 2 0    0 2 2
1 1 2    0 1 0    1 1 0
0 0 2    0 0 1    0 0 0
If you consider only one of these for your counts (instead of both of them in your code right now), you indeed have only 2 total newly-defined "neighbors" and the considered points are not considered intersections.
Other "real intersections" would still be kept, like the following
0 1 0    0 1 0    0 1 0
1 1 1    0 1 0    1 1 0
0 0 0    1 0 1    0 0 1
which still have 3 newly-defined neighbors.
I haven't checked on your image if it works perfectly, but I had implemented something like this for this problem a while back...
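One way to read this suggestion, sketched here as an assumption rather than as the code I actually used: walk the 8 neighbours in a ring and count groups of consecutive set pixels, since neighbours that follow each other in the ring are exactly the ones that are 4-neighbours of each other. The names `RING`, `count_neighbour_groups` and `find_intersections` are made up for this example, and the image is assumed to be a 2D array or nested list where nonzero means "white".

```python
# Offsets (row, col) of the 8 neighbours in clockwise order, starting at the top.
RING = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

def count_neighbour_groups(image, row, col):
    """Count groups of set neighbours, merging neighbours that touch each other."""
    ring = [1 if image[row + dr][col + dc] else 0 for dr, dc in RING]
    if all(ring):
        return 1  # a full ring is one single group
    # A group starts wherever the ring steps from 0 to 1; ring[i - 1] wraps at i = 0.
    return sum(1 for i in range(8) if ring[i] and not ring[i - 1])

def find_intersections(image):
    """Return (x, y) of set pixels that have at least three neighbour groups."""
    found = []
    for row in range(1, len(image) - 1):
        for col in range(1, len(image[row]) - 1):
            if image[row][col] and count_neighbour_groups(image, row, col) >= 3:
                found.append((col, row))
    return found

false_positive = [[0, 0, 0],
                  [1, 1, 1],
                  [0, 0, 1]]
real_junction = [[0, 1, 0],
                 [1, 1, 1],
                 [0, 0, 0]]
print(count_neighbour_groups(false_positive, 1, 1))  # 2 groups: not an intersection
print(count_neighbour_groups(real_junction, 1, 1))   # 3 groups: an intersection
```

Border pixels are skipped because their 3x3 neighbourhood is incomplete.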
Here is my solution:
import itertools

import cv2
import matplotlib.pyplot as plt
import numpy as np

# Functions to generate kernels of curve intersection
def generate_nonadjacent_combination(input_list,take_n):
    """
    It generates combinations of m taken n at a time where there is no adjacent n.
    INPUT:
        input_list = (iterable) List of elements you want to extract the combination
        take_n = (integer) Number of elements that you are going to take at a time in
                 each combination
    OUTPUT:
        all_comb = (np.array) with all the combinations
    """
    all_comb = []
    for comb in itertools.combinations(input_list, take_n):
        comb = np.array(comb)
        d = np.diff(comb)
        fd = np.diff(np.flip(comb))
        if len(d[d==1]) == 0 and comb[-1] - comb[0] != 7:
            all_comb.append(comb)
            print(comb)
    return all_comb
def populate_intersection_kernel(combinations):
    """
    Maps the numbers from 0-7 into the 8 pixels surrounding the center pixel in
    a 3 x 3 matrix clockwise, i.e. up_pixel = 0, right_pixel = 2, etc. And
    generates a kernel that represents a line intersection, where the center
    pixel is occupied and 3 or 4 pixels of the border are occupied too.
    INPUT:
        combinations = (np.array) matrix where every row is a vector of combinations
    OUTPUT:
        kernels = (List) list of 3 x 3 kernels/masks. each element is a mask.
    """
    template = np.array((
        [-1, -1, -1],
        [-1, 1, -1],
        [-1, -1, -1]), dtype="int")
    # (row, col) of border positions 0..7, clockwise starting from the top
    match = [(0,1),(0,2),(1,2),(2,2),(2,1),(2,0),(1,0),(0,0)]
    kernels = []
    for combination in combinations:
        tmp = np.copy(template)
        for position in combination:
            tmp[match[position][0],match[position][1]] = 1
        kernels.append(tmp)
    return kernels
def give_intersection_kernels():
    """
    Generates all the intersection kernels as 3 x 3 matrices.
    INPUT:
        None
    OUTPUT:
        kernels = (List) list of 3 x 3 kernels/masks. each element is a mask.
    """
    input_list = np.arange(8)
    taken_n = [4,3]
    kernels = []
    for taken in taken_n:
        comb = generate_nonadjacent_combination(input_list,taken)
        tmp_ker = populate_intersection_kernel(comb)
        kernels.extend(tmp_ker)
    return kernels
# Find the curve intersections
def find_line_intersection(input_image, show=0):
    """
    Applies morphologyEx with the MORPH_HITMISS operation to look for all the curve
    intersection kernels generated with give_intersection_kernels() function.
    INPUT:
        input_image = (np.array dtype=np.uint8) binarized m x n image matrix
    OUTPUT:
        output_image = (np.array dtype=np.uint8) image where the nonzero pixels
                       are the line intersection.
    """
    kernel = np.array(give_intersection_kernels())
    output_image = np.zeros(input_image.shape)
    for i in np.arange(len(kernel)):
        out = cv2.morphologyEx(input_image, cv2.MORPH_HITMISS, kernel[i,:,:])
        output_image = output_image + out
    if show == 1:
        show_image = np.reshape(np.repeat(input_image, 3, axis=1),(input_image.shape[0],input_image.shape[1],3))*255
        show_image[:,:,1] = show_image[:,:,1] - output_image *255
        show_image[:,:,2] = show_image[:,:,2] - output_image *255
        plt.imshow(show_image)
    return output_image
# finding corners
def find_endoflines(input_image, show=0):
    """
    Applies morphologyEx with the MORPH_HITMISS operation using eight kernels in
    which the center pixel and exactly one of its neighbours are occupied.
    """
    kernel_0 = np.array((
        [-1, -1, -1],
        [-1, 1, -1],
        [-1, 1, -1]), dtype="int")
    kernel_1 = np.array((
        [-1, -1, -1],
        [-1, 1, -1],
        [1,-1, -1]), dtype="int")
    kernel_2 = np.array((
        [-1, -1, -1],
        [1, 1, -1],
        [-1,-1, -1]), dtype="int")
    kernel_3 = np.array((
        [1, -1, -1],
        [-1, 1, -1],
        [-1,-1, -1]), dtype="int")
    kernel_4 = np.array((
        [-1, 1, -1],
        [-1, 1, -1],
        [-1,-1, -1]), dtype="int")
    kernel_5 = np.array((
        [-1, -1, 1],
        [-1, 1, -1],
        [-1,-1, -1]), dtype="int")
    kernel_6 = np.array((
        [-1, -1, -1],
        [-1, 1, 1],
        [-1,-1, -1]), dtype="int")
    kernel_7 = np.array((
        [-1, -1, -1],
        [-1, 1, -1],
        [-1,-1, 1]), dtype="int")
    kernel = np.array((kernel_0,kernel_1,kernel_2,kernel_3,kernel_4,kernel_5,kernel_6, kernel_7))
    output_image = np.zeros(input_image.shape)
    for i in np.arange(8):
        out = cv2.morphologyEx(input_image, cv2.MORPH_HITMISS, kernel[i,:,:])
        output_image = output_image + out
    if show == 1:
        show_image = np.reshape(np.repeat(input_image, 3, axis=1),(input_image.shape[0],input_image.shape[1],3))*255
        show_image[:,:,1] = show_image[:,:,1] - output_image *255
        show_image[:,:,2] = show_image[:,:,2] - output_image *255
        plt.imshow(show_image)
    return output_image#, np.where(output_image == 1)
# 0- Find end of lines
input_image = img.astype(np.uint8) # must be a black and white thin network image
eol_img = find_endoflines(input_image, 0)
# 1- Find curve Intersections
lint_img = find_line_intersection(input_image, 0)
# 2- Put together all the nodes
nodes = eol_img + lint_img
plt.imshow(nodes)
I need to generate a file with 10 lines, each holding three "random" values whose sum must equal 15.
The structure is: "INDEX A B C".
Example:
1 15 0 0
2 0 15 0
3 0 0 15
4 1 14 0
5 2 13 0
6 3 12 0
7 4 11 0
8 5 10 0
9 6 9 0
10 7 8 0
If you want to avoid needing to create (or iterate through) the full space of satisfying permutations (which matters for large N), then you can solve this problem with sequential sampling.
My first approach was to just draw a value uniformly from [0, N], call it x. Then draw a value uniformly from [0, N-x] and call it y, then set z = N - x - y. If you then shuffle these three, you'll get back a reasonable draw from the space of solutions, but it won't be exactly uniform.
As an example, consider N=3. The draw x = 3 alone has probability 1/4 and forces a permutation of (3, 0, 0); adding the cases x = 0 with y = 0 or y = 3 brings the total to 3/8, even though such permutations make up only 3 of the 10 possible triplets. So this privileges triplets that contain a high max.
You can perfectly counterbalance this effect by sampling the first value x proportionally to how many values will be possible for y conditioned on x. For example, with N = 3, if x happens to be N, then there is only 1 compatible value for y, but if x is 0, then there are 4 compatible values, namely 0 through 3.
In other words, let Pr(X=x) be (N-x+1)/sum_i(N-i+1) for i from 0 to N. Then let Pr(Y=y | X=x) be uniform on [0, N-x].
This works out to P(X,Y) = P(Y|X=x) * P(X) = 1/(N-x+1) * [N - x + 1]/sum_i(N-i+1), which is seen to be uniform, 1/sum_i(N-i+1), for each candidate triplet.
Note that sum(N-i+1 for i in range(0, N+1)) gives the number of different ways to sum 3 non-negative integers to get N. I don't know a good proof of this, and would be happy if someone adds one to the comments!
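One informal argument for the count: once x is fixed, y can take any of the N - x + 1 values from 0 to N - x, and z is then forced, so summing over x gives sum_i(N-i+1), which equals (N+1)(N+2)/2. The snippet below is only a brute-force sanity check of that claim for a few values of N, not a proof; the helper names are made up for this example.

```python
from itertools import product

def count_by_enumeration(N):
    """Count ordered triplets of non-negative integers that sum to N."""
    # z = N - x - y is determined, so it is enough to count valid (x, y) pairs
    return sum(1 for x, y in product(range(N + 1), repeat=2) if x + y <= N)

def count_by_formula(N):
    """The normalizing constant sum_i(N-i+1) for i from 0 to N."""
    return sum(N - i + 1 for i in range(N + 1))

for N in (0, 1, 3, 15):
    assert count_by_enumeration(N) == count_by_formula(N) == (N + 1) * (N + 2) // 2

print(count_by_formula(3), count_by_formula(15))  # 10 136
```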
Here's a solution that will sample this way:
import random
from collections import Counter

def discrete_sample(weights):
    u = random.uniform(0, 1)
    w_t = 0
    for i, w in enumerate(weights):
        w_t += w
        if u <= w_t:
            return i
    return len(weights)-1

def get_weights(N):
    vals = [(N-i+1.0) for i in range(0, N+1)]
    totl = sum(vals)
    return [v/totl for v in vals]

def draw_summing_triplet(N):
    weights = get_weights(N)
    x = discrete_sample(weights)
    y = random.randint(0, N-x)
    triplet = [x, y, N - x - y]
    random.shuffle(triplet)
    return tuple(triplet)
Much credit goes to @DSM in the comments for questioning my original answer and providing good feedback.
In this case, we can test out the sampler like this:
foo = Counter(draw_summing_triplet(3) for i in range(10**6))
print(foo)
Counter({(1, 2, 0): 100381,
(0, 2, 1): 100250,
(1, 1, 1): 100027,
(2, 1, 0): 100011,
(0, 3, 0): 100002,
(3, 0, 0): 99977,
(2, 0, 1): 99972,
(1, 0, 2): 99854,
(0, 0, 3): 99782,
(0, 1, 2): 99744})
If the numbers can be anything, just use combinations:
from itertools import combinations
with open("rand.txt","w") as f:
    combs = [x for x in combinations(range(16),3) if sum(x ) == 15 ][:10]
    for a,b,c in combs:
        f.write("{} {} {}\n".format(a,b,c))
This seems straightforward to me and it utilizes the random module.
import random
def foo(x):
    a = random.randint(0,x)
    b = random.randint(0,x-a)
    c = x - (a +b)
    return (a,b,c)
for i in range(100):
    print(foo(15))
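Note that drawing a and then b uniformly in this way does not make every triplet equally likely. As a rough check, the exact probabilities can be enumerated with fractions; `naive_distribution` is a name made up for this sketch, and it mirrors the two draws in `foo` without any shuffling.

```python
from fractions import Fraction

def naive_distribution(x):
    """Exact probability of each (a, b, c) produced by the two uniform draws."""
    probs = {}
    for a in range(x + 1):
        for b in range(x - a + 1):
            # a is one of x + 1 values, then b is one of x - a + 1 values
            probs[(a, b, x - a - b)] = Fraction(1, x + 1) * Fraction(1, x - a + 1)
    return probs

probs = naive_distribution(3)
print(probs[(3, 0, 0)], probs[(1, 1, 1)], probs[(0, 0, 3)])  # 1/4 1/12 1/16
```

For x = 3 a uniform draw would give each of the 10 triplets probability 1/10, so triplets with a large first value are over-represented here.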