I use the following code to return Shannon's Entropy on an array that represents a probability distribution.
import numpy as np

A = np.random.randint(10, size=10)
pA = A / A.sum()
Shannon2 = -np.sum(pA * np.log2(pA))
This works fine if the array doesn't contain any zeros.
Example:
Input: [2 3 3 3 2 1 5 3 3 4]
Output: 3.2240472715
However, if the array does contain zeros, the calculation produces nan.
Example:
Input: [7 6 6 8 8 2 8 3 0 7]
Output: nan
I do get two RuntimeWarnings:
1) RuntimeWarning: divide by zero encountered in log2
2) RuntimeWarning: invalid value encountered in multiply
Is there a way to alter the code to handle zeros? I'm just not sure whether removing them completely would influence the result, specifically whether the variation would be greater due to the higher frequencies in the distribution.
I think you want to use nansum to count nans as zero:
A = np.random.randint(10, size=10)
pA = A / A.sum()
Shannon2 = -np.nansum(pA*np.log2(pA))
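Since the nan comes from evaluating 0 * log2(0), the two warnings can also be silenced locally with np.errstate; a minimal sketch on the zero-containing example from the question:

import numpy as np

A = np.array([7, 6, 6, 8, 8, 2, 8, 3, 0, 7])
pA = A / A.sum()

# log2(0) -> -inf and 0 * -inf -> nan; nansum then treats those terms as 0
with np.errstate(divide='ignore', invalid='ignore'):
    Shannon2 = -np.nansum(pA * np.log2(pA))

print(Shannon2)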
The easiest and most common way is to ignore the zero probabilities and calculate the Shannon entropy on the remaining values. This doesn't change the result: p * log(p) tends to 0 as p tends to 0, so by convention zero-probability entries contribute nothing to the entropy.
Try the following:
import numpy as np
A = np.array([1.0, 2.0, 0.0, 5.0, 0.0, 9.0])
A = A[A != 0]  # drop the zeros; np.array(filter(...)) doesn't work in Python 3, where filter returns an iterator
pA = A / A.sum()
Shannon2 = -np.sum(pA * np.log2(pA))
I couldn't make a better title. Let me explain:
Numpy has the percentile() function, which calculates the Nth percentile of any array:
import numpy as np
arr = np.arange(0, 10)
print(arr)
print(np.percentile(arr, 80))
[0 1 2 3 4 5 6 7 8 9]
7.2
Which is great - 7.2 marks the 80th percentile on that array.
How can I obtain the same percentile type of calculation, but find out the Nth percentile of both extremities of an array (the positive and negative numbers)?
For example, my array may be:
[-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9]
If I place them on a number line, it would go from -10 to 10. I'd like to get the Nth percentile that marks the extremities of that number line. For the 90th percentile, the output could look like -8.1 and 7.5, for example, since 90% of the values in the array fall within that range and the remaining 10% are lower than -8.1 or greater than 7.5.
I made these numbers up of course, just for illustrating what I'm trying to calculate.
Is there any NumPy method for obtaining such boundaries?
Please let me know if I can explain or clarify further, I know this is a complicated question to ask and I'm trying my best to make it clear. Thanks a lot!
Are you looking for something like
import numpy as np
def extremities(array, pct):
    # assert 50 <= pct <= 100
    return np.percentile(array, [100 - pct, pct])
arr = np.arange(-10, 10)
print(extremities(arr, 90)) # [-8.1, 7.1]
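Note that np.percentile accepts a sequence of percentiles and returns one value per entry, so both bounds come back in a single call; a quick check on the example array from the question:

import numpy as np

arr = np.arange(-10, 10)  # the [-10 ... 9] array from the question
print(np.percentile(arr, [10, 90]))  # [-8.1  7.1]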
In Python, is there a way to generate a 2d array using numpy with random integer entries without specifying either the low or high?
I tried mat = np.random.randint(size=(3, 4)) but it did not work.
Assuming you don't want to specify the min or max values of the array, one can use numpy.random.normal
np.random.normal(mean, standard_deviation, (rows, columns))
And then truncate it to integers with astype(int) (np.int is deprecated in recent NumPy; also note that astype truncates toward zero rather than rounding), as
>>> import numpy as np
>>> mat = np.random.normal(1, 3, (3, 4)).astype(int)
>>> print(mat)
[[ 0  0  0 -1]
 [ 0  5  0  0]
 [-5  1  2  2]]
Please note that the output may vary, as the values are random.
If you want to specify the min and max values, there are various ways of doing that, such as
mat = (np.random.random((3, 4)) * 10).astype(int) # Random ints between 0 and 9
or
mat = np.random.randint(1, 5, size=(3, 4)) # Random ints between 1 and 4 (high is exclusive)
And more.
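For newer NumPy (1.17+), the same can be done through the Generator API; a small sketch, not part of the original answer:

import numpy as np

rng = np.random.default_rng()          # preferred over np.random.* in new code
mat = rng.integers(1, 5, size=(3, 4))  # random ints in [1, 5), like randint
print(mat)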
I am working with a 3D matrix in Python. For example, given a matrix like this, with size 2x3x4:
[[[1 2 1 4]
[3 2 1 1]
[4 3 1 4]]
[[2 1 3 3]
[1 4 2 1]
[3 2 3 3]]]
My task is to find the value of entropy for each row in each dimension of the matrix. For example, for row 1 of dimension 1 above, [1,2,1,4], the normalized values (such that the total sum is 1) are [0.125, 0.25, 0.125, 0.5] and the entropy is calculated as -sum(i*log(i)), where i runs over the normalized values. The resulting matrix is a 2x3 matrix: in each dimension there are 3 values of entropy (because there are 3 rows).
Here is the working example of my code using random matrix each time:
from scipy.stats import entropy
import numpy as np
matrix = np.random.randint(low=1, high=5, size=(2,3,4))  # what if size is (200,50,1000)?
entropy_matrix=np.zeros((matrix.shape[0],matrix.shape[1]))
for i in range(matrix.shape[0]):
    normalized = np.array([float(k) / np.sum(j) for j in matrix[i] for k in j]).reshape(matrix.shape[1], matrix.shape[2])
    entropy_matrix[i] = np.array([entropy(m) for m in normalized])
My question is: how do I scale this program up to work with a very large 3D matrix (for example, of size 200x50x1000)?
I am using Python in Windows 10 (with Anaconda distribution).
Using a 3D matrix of size 200x50x1000, I get a running time of 290 s on my computer.
Using the definition of entropy for the second part and broadcasted operation on the first part, one vectorized solution would be -
p1 = matrix/matrix.sum(-1,keepdims=True).astype(float)
entropy_matrix_out = -np.sum(p1 * np.log(p1), axis=-1)
Alternatively, we can use einsum for the second part for further perf. boost -
entropy_matrix_out = -np.einsum('ijk,ijk->ij',p1,np.log(p1),optimize=True)
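As a quick sanity check (my addition, assuming the same random-matrix setup as the question), the vectorized result can be compared against scipy's entropy for a single row:

from scipy.stats import entropy
import numpy as np

matrix = np.random.randint(low=1, high=5, size=(2, 3, 4))

p1 = matrix / matrix.sum(-1, keepdims=True).astype(float)
out = -np.einsum('ijk,ijk->ij', p1, np.log(p1), optimize=True)

# scipy's entropy normalizes its input and uses the natural log by default
assert np.isclose(out[0, 0], entropy(matrix[0, 0]))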
I'm having trouble with implementing vectorization in pandas. Let me preface this by saying I am a total newbie to vectorization so it's extremely likely that I'm getting some syntax wrong.
Let's say I've got two pandas dataframes.
Dataframe one describes the x,y coordinates of some circles with radius R, with unique IDs.
>>> data1 = {'ID': [1, 2], 'x': [1, 10], 'y': [1, 10], 'R': [4, 5]}
>>> df_1=pd.DataFrame(data=data1)
>>>
>>> df_1
   ID   x   y  R
0   1   1   1  4
1   2  10  10  5
Dataframe two describes the x,y coordinates of some points, also with unique IDs.
>>> data2 = {'ID': [3, 4, 5], 'x': [1, 3, 9], 'y': [2, 5, 9]}
>>> df_2=pd.DataFrame(data=data2)
>>>
>>> df_2
   ID  x  y
0   3  1  2
1   4  3  5
2   5  9  9
Now, imagine plotting the circles and the points on a 2D plane; some of the points will reside inside the circles.
All I want to do is create a new column in df_2 called "host_circle" that indicates the ID of the circle each point resides in. If a point does not reside in any circle, the value should be "None".
My desired output would be
>>> df_2
   ID  x  y host_circle
0   3  1  2           1
1   4  3  5        None
2   5  9  9           2
First, define a function that checks if a given particle (x2,y2) resides inside a given circle (x1,y1,R1,ID_1). If it does, return the ID of the circle; else, return None.
>>> def func(x1, y1, R1, ID_1, x2, y2):
...     dist = np.sqrt((x1 - x2)**2 + (y1 - y2)**2)
...     if dist < R1:
...         return ID_1
...     else:
...         return None
Next, the actual vectorization. I'm sorta lost here. I think it should be something like
df_2['host']=func(df_1['x'],df_1['y'],df_1['R'],df_1['ID'],df_2['x'],df_2['y'])
but that just throws errors. Can someone help me?
One final note: My actual data I'm working with is VERY large; tens of millions of rows. Speed is crucial, hence why I'm trying to make vectorization work.
Numba v1
You might have to install numba with
pip install numba
Then use Numba's JIT compiler via the njit function decorator:
import numpy as np
from numba import njit

@njit
def distances(point, points):
    return ((points - point) ** 2).sum(1) ** .5

@njit
def find_my_circle(point, circles):
    points = circles[:, :2]
    radii = circles[:, 2]
    dist = distances(point, points)
    mask = dist < radii
    i = mask.argmax()
    return i if mask[i] else -1  # -1 means no hosting circle

@njit
def find_my_circles(points, circles):
    n = len(points)
    out = np.zeros(n, np.int64)
    for i in range(n):
        out[i] = find_my_circle(points[i], circles)
    return out

points = df_2[['x', 'y']].values
ids = np.append(df_1.ID.values, np.nan)  # ids[-1] is NaN, the "no host" marker
i = find_my_circles(points, df_1[['x', 'y', 'R']].values)
df_2['host_circle'] = ids[i]
df_2
   ID  x  y  host_circle
0   3  1  2          1.0
1   4  3  5          NaN
2   5  9  9          2.0
This iterates row by row... meaning one point at a time it tries to find the host circle. Now, that part is still vectorized. And the loop should be very fast. The massive benefit is that you don't occupy tons of memory.
Numba v2
This one is more loopy, but it short-circuits when it finds a host:
import numpy as np
from numba import njit

@njit
def distance(a, b):
    return ((a - b) ** 2).sum() ** .5

@njit
def find_my_circles(points, circles):
    n = len(points)
    m = len(circles)
    out = -np.ones(n, np.int64)
    centers = circles[:, :2]
    radii = circles[:, 2]
    for i in range(n):
        for j in range(m):
            if distance(points[i], centers[j]) < radii[j]:
                out[i] = j
                break
    return out

points = df_2[['x', 'y']].values
ids = np.append(df_1.ID.values, np.nan)
i = find_my_circles(points, df_1[['x', 'y', 'R']].values)
df_2['host_circle'] = ids[i]
df_2
Vectorized
But still problematic memory-wise, since it builds the full points-by-circles distance matrix:
c = ['x', 'y']
centers = df_1[c].values
points = df_2[c].values
radii = df_1['R'].values
i, j = np.where(((points[:, None] - centers) ** 2).sum(2) ** .5 < radii)
df_2.loc[df_2.index[i], 'host_circle'] = df_1['ID'].iloc[j].values
df_2
   ID  x  y  host_circle
0   3  1  2          1.0
1   4  3  5          NaN
2   5  9  9          2.0
Explanation
The distance from any point to the center of a circle is
((x1 - x0) ** 2 + (y1 - y0) ** 2) ** .5
I can use broadcasting if I extend one of my arrays into a third dimension
points[:, None] - centers
array([[[ 0,  1],
        [-9, -8]],

       [[ 2,  4],
        [-7, -5]],

       [[ 8,  8],
        [-1, -1]]])
That is all six combinations of vector differences. Now to calculate the distances.
((points[:, None] - centers) ** 2).sum(2) ** .5
array([[ 1.        , 12.04159458],
       [ 4.47213595,  8.60232527],
       [11.3137085 ,  1.41421356]])
That's all six combinations of distances, and I can compare against the radii to see which points fall within which circles:
((points[:, None] - centers) ** 2).sum(2) ** .5 < radii
array([[ True, False],
       [False, False],
       [False,  True]])
Ok, I want to find where the True values are. That is a perfect use case for np.where. It will give me two arrays, the first will be the row positions, the second the column positions of where these True values are. Turns out, the row positions are the points and column positions are the circles.
i, j = np.where(((points[:, None] - centers) ** 2).sum(2) ** .5 < radii)
Now I just have to slice df_2 with i and assign to it the values I get from df_1 using j; I showed that above.
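To keep the memory-hungry points[:, None] - centers array bounded on tens of millions of points, one option (a sketch of mine, not part of the original answer) is to run the same vectorized test in chunks:

import numpy as np

def assign_hosts_chunked(points, centers, radii, chunk=100_000):
    # Circle index per point, -1 when no circle hosts the point;
    # only a (chunk, n_circles) distance block is live at any time.
    out = -np.ones(len(points), np.int64)
    for start in range(0, len(points), chunk):
        p = points[start:start + chunk]
        d = ((p[:, None] - centers) ** 2).sum(2) ** .5
        i, j = np.where(d < radii)
        out[start + i] = j  # if a point lies in several circles, the last match wins
    return out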
Try this. I have modified your function a bit: it returns a list, on the assumption that more than one circle may satisfy a point. You can modify it if that's not the case. The list will be empty when a point does not reside in any circle.
def func(df, x2, y2):
    # True for each circle whose center lies within distance R of (x2, y2)
    val = df.apply(lambda row: np.sqrt((row['x'] - x2)**2 + (row['y'] - y2)**2) < row['R'], axis=1)
    return list(val.index[val == True])

df_2['host'] = df_2.apply(lambda row: func(df_1, row['x'], row['y']), axis=1)
I have an array A, say :
import numpy as np
A = np.array([1,2,3,4,5,6,7,8])
And I wish to create a new array B by replacing each element in A by the median of its four nearest neighbors, without taking into account the value at the given position... for example :
B[2] = np.median([A[0], A[1], A[3], A[4]]) (=3)
The thing is that I need to perform this on a gigantic A and I want to optimize times, so I want to avoid for loops or similar. And... I don't care about the result at the edges.
I already tried scipy.ndimage.filters.median_filter but it is not producing the desired output :
import scipy.ndimage
B = scipy.ndimage.filters.median_filter(A,footprint=[1,1,0,1,1],mode='wrap')
which produces B=[7,4,4,5,6,7,6,6], which is clearly not the correct answer.
Any idea is welcome.
One way could be using np.roll to shift the numbers in your array, such as:
A_1 = np.roll(A,1)
# output: array([8, 1, 2, 3, 4, 5, 6, 7])
And then the same thing, rolling by 2, -1 and -2:
A_2 = np.roll(A,2)
A_m1 = np.roll(A,-1)
A_m2 = np.roll(A,-2)
Now you just need to sum your 4 arrays, as for each index you have the 4 neighbors in one of them:
B = (A_1 + A_2 + A_m1 + A_m2)/4.
And as you said you don't care about the edges, I think it works for you!
EDIT: I guess I was so focused on the rolling idea that I mixed up mean and median. The median can be calculated with B = np.median([A_1, A_2, A_m1, A_m2], axis=0)
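Putting the rolls and the median together, a minimal, self-contained version of the corrected approach:

import numpy as np

A = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# Each row of `shifted` holds one of the four neighbors for every index
shifted = np.stack([np.roll(A, s) for s in (2, 1, -1, -2)])
B = np.median(shifted, axis=0)
print(B)  # B[2] == 3.0; the edges wrap around, which the asker doesn't mind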
I'd make a rolling, central window of length 5 in pandas, and apply the median function to the values of the window, the middle one masked away:
import numpy as np
A = np.array([1,2,3,4,5,6,7,8])
mask = np.ones(5, dtype=bool)
mask[5 // 2] = False  # mask away the middle element of the window
import pandas as pd
df = pd.DataFrame(A)
r5 = df.rolling(5, center=True)
result = r5.apply(lambda x: np.median(x[mask]))
result
     0
0  NaN
1  NaN
2  3.0
3  4.0
4  5.0
5  6.0
6  NaN
7  NaN
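A pure-NumPy alternative (my sketch, assuming NumPy 1.20+ for sliding_window_view) avoids pandas entirely:

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

A = np.array([1, 2, 3, 4, 5, 6, 7, 8])

windows = sliding_window_view(A, 5)                  # shape (len(A) - 4, 5)
inner = np.median(windows[:, [0, 1, 3, 4]], axis=1)  # drop the center column

B = np.full(A.shape, np.nan)  # edges left as NaN, as in the pandas version
B[2:-2] = inner
print(B)  # [nan nan  3.  4.  5.  6. nan nan]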