I am looking for an efficient way to do the following calculations on millions of arrays. For the values in each array, I want to calculate the mean of the values in the bin with most frequency as demonstrated below. Some of the arrays might contain nan values and other values are float. The loop for my actual data takes too long to finish.
import numpy as np
array = np.array([np.random.uniform(0, 10) for i in range(800,)])
# adding nan values
mask = np.random.choice([1, 0], array.shape, p=[.7, .3]).astype(bool)
array[mask] = np.nan
array = array.reshape(50, 16)
bin_values=np.linspace(0, 10, 21)
f = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_values)[0], 1, array)
bin_start = np.apply_along_axis(lambda a: bin_values[np.argmax(a)], 1, f).reshape(array.shape[0], -1)
bin_end = bin_start + abs(bin_values[1] - bin_values[0])
values = np.zeros(array.shape[0])
for i in range(array.shape[0]):
    values[i] = np.nanmean(array[i][(array[i]>=bin_start[i])*(array[i]<bin_end[i])])
Also, when I run the above code I get three warnings. The first is 'RuntimeWarning: Mean of empty slice' for the line where I calculate the values variable. I set a condition to skip that line when a row is all nan values, but the warning did not go away; I was wondering what the reason is. The other two warnings come from the less and greater_equal comparisons, which makes sense to me since they involve nan values.
The arrays that I want to run this algorithm on are independent, but I am already processing them with 12 separate scripts. Running the code in parallel would be an option; however, for now I am looking to improve the algorithm itself.
The reason I am using a lambda function is to run numpy.histogram over an axis, since the histogram function does not seem to take an axis argument. I was able to use a mask and remove the loop from the code (see the updated version below). The code is about 2 times faster now, but I think it can still be improved.
I can explain what I want to do in more detail with an example if that helps. Imagine I have 36 numbers that are greater than 0 and smaller than 20, and bins of equal width 0.5 over the same interval (0.0_0.5, 0.5_1.0, 1.0_1.5, …, 19.5_20.0). If I place the 36 numbers into their corresponding bins, I want the mean of the numbers in the bin that contains the most of them.
Please post your solution if you can think of a faster algorithm.
import numpy as np
# creating an array to test the algorithm
array = np.array([np.random.uniform(0, 10) for i in range(800,)])
# adding nan values
mask = np.random.choice([1, 0], array.shape, p=[.7, .3]).astype(bool)
array[mask] = np.nan
array = array.reshape(50, 16)
# the algorithm
bin_values=np.linspace(0, 10, 21)
# calculating the frequency of each bin
f = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_values)[0], 1, array)
bin_start = np.apply_along_axis(lambda a: bin_values[np.argmax(a)], 1, f).reshape(array.shape[0], -1)
bin_end = bin_start + (abs(bin_values[1]-bin_values[0]))
# creating a mask to get the mean over the bin with maximum frequency
mask = (array>=bin_start) * (array<bin_end)
mask_nan = np.tile(np.nan, (mask.shape[0], mask.shape[1]))
mask_nan[mask] = 1
v = np.nanmean(array * mask_nan, axis = 1)
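For reference, here is a rough sketch of one further direction I have been considering (not benchmarked; the function name mean_of_modal_bin is just for illustration): because the bins are equal-width, every value can be digitized once and all per-row histograms can be built with a single bincount, avoiding apply_along_axis entirely. All-nan rows simply come out as nan.
import numpy as np

def mean_of_modal_bin(arr, bin_values):
    # Per row: mean of the values that fall in that row's most populated bin
    n_rows, n_bins = arr.shape[0], len(bin_values) - 1
    nan_mask = np.isnan(arr)
    # Bin index of every element (nans temporarily replaced so digitize is safe)
    idx = np.digitize(np.where(nan_mask, bin_values[0], arr), bin_values) - 1
    idx = np.clip(idx, 0, n_bins - 1)   # fold the right edge into the last bin
    idx[nan_mask] = n_bins              # park nans in an extra overflow bin
    # One bincount over row-offset indices yields all per-row histograms at once
    offsets = np.arange(n_rows)[:, None] * (n_bins + 1)
    counts = np.bincount((idx + offsets).ravel(), minlength=n_rows * (n_bins + 1))
    counts = counts.reshape(n_rows, n_bins + 1)[:, :n_bins]
    top = counts.argmax(axis=1)         # most frequent bin per row
    in_top = idx == top[:, None]        # nans never match, so they are excluded
    with np.errstate(invalid='ignore', divide='ignore'):
        return np.where(in_top, arr, 0).sum(axis=1) / in_top.sum(axis=1)

values = mean_of_modal_bin(array, bin_values)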
I'm working with some position vectors. I am operating on each position with every other position, and I am using matrices to do it as efficiently as I can. I encountered a problem with my most recent version, where it gives me a warning: RuntimeWarning: invalid value encountered in sqrt
return sqrt(add.reduce(s, axis=axis, keepdims=keepdims))
An example of some code that gives me this warning is below.
This warning is caused by np.linalg.norm and only happens when I specify a data type for the array; it also only happens in the example code below when I have more than 90 vectors.
Is this a NumPy bug, a known limitation in NumPy, or am I doing something wrong?
x = np.full((100, 3), 1) # Create an array of vectors, in this case all [1, 1, 1]
ps, qs = np.broadcast_arrays(x, np.expand_dims(x, 1)) # Created so that I can operate each vector on each other vector.
z = np.subtract(ps, qs, dtype=np.float32) # Get the difference between them.
np.linalg.norm(z, axis=2) # Get the magnitude of the difference.
You should make sure that z doesn't contain any negative values!
Test whether you have negative values:
print(np.count_nonzero(z < 0))
Prompt:
Given a 2D integer matrix M representing the gray scale of an image, you need to design a smoother to make the gray scale of each cell become the average gray scale (rounding down) of all the 8 surrounding cells and itself. If a cell has less than 8 surrounding cells, then use as many as you can.
Example:
Input:
[[1,1,1],
[1,0,1],
[1,1,1]]
Output:
[[0, 0, 0],
[0, 0, 0],
[0, 0, 0]]
Explanation:
For the point (0,0), (0,2), (2,0), (2,2) -> floor(3/4) = floor(0.75) = 0
For the point (0,1), (1,0), (1,2), (2,1) -> floor(5/6) = floor(0.83333333) = 0
For the point (1,1): floor(8/9) = floor(0.88888889) = 0
Solution:
class Solution:
    def imageSmoother(self, grid):
        """
        :type M: List[List[int]]
        :rtype: List[List[int]]
        """
        rows, cols = len(grid), len(grid[0])
        # Go through each cell
        for r in range(rows):
            for c in range(cols):
                # Metrics for calculating the average; starting values are zero
                # since the loop includes the current cell, grid[r][c]
                total = 0
                n = 0
                # Checking the neighbors
                for ri in [-1, 0, 1]:
                    for ci in [-1, 0, 1]:
                        if (r + ri >= 0 and r + ri <= rows - 1 and c + ci >= 0 and c + ci <= cols - 1):
                            total += grid[r + ri][c + ci]
                            n += 1
                # Now we convert the cell value to the average
                grid[r][c] = int(total / n)
        return grid
My solution is incorrect. It passes some test cases, but for this one I fail.
Input: [[2,3,4],[5,6,7],[8,9,10],[11,12,13],[14,15,16]]
Output: [[4,4,5],[6,6,6],[8,9,9],[11,11,12],[12,12,12]]
Expected: [[4,4,5],[5,6,6],[8,9,9],[11,12,12],[13,13,14]]
As you can see, my solution is really close. I'm not sure where I'm messing up since when I changed the parameters around I started failing other basic test cases. The solutions I see online use other packages which I'd prefer not to use since I want to approach this problem more intuitively.
How do you check where you're going wrong with 2D array problems? Thanks!
Leetcode solution:
import itertools

def imageSmoother(self, M):
    R, C = len(M), len(M[0])
    M2 = [[0] * C for i in range(R)]
    for i in range(R):
        for j in range(C):
            temp = [M[i + x][j + y] for x, y in itertools.product([-1, 0, 1], [-1, 0, 1])
                    if 0 <= i + x < R and 0 <= j + y < C]
            M2[i][j] = sum(temp) // len(temp)
    return M2
The problem with your code is that you're modifying grid as you go along. So, for each cell, you're using the input values for the down/right neighbors, but the output values for the up/left neighbors.
So, for your given example, when you're computing the neighbors of grid[1][0], you've already replaced two of the neighbors, grid[0][0] and grid[0][1], so they're now 4, 4 instead of 2, 3. Which means you're averaging 4, 4, 5, 6, 8, 9 instead of 2, 3, 5, 6, 8, 9. So, instead of getting a 5.5 that you round down to 5, you get a 6.0 that you round down to 6.
The simplest fix is to just build up a new output grid as you go along, then return that:
rows, cols = len(grid), len(grid[0])
outgrid = []
# Go through each cell
for r in range(rows):
    outrow = []
    for c in range(cols):
        # … same code as before, but instead of the grid[r][c] =
        outrow.append(int(total / n))
    outgrid.append(outrow)
return outgrid
If you need to modify the grid in place, you can instead copy the original grid, and iterate over that copy:
rows, cols = len(grid), len(grid[0])
ingrid = [list(row) for row in grid]
# Go through each cell
for r in range(rows):
    for c in range(cols):
        # … same code as before, but instead of total += grid[r+ri][c+ci]
        total += ingrid[r+ri][c+ci]
If you used a 2D NumPy array instead of a list of lists, you could solve this at a higher level.
NumPy lets you add entire arrays all at once, divide them by scalars, etc., so you can get rid of those loops over r and c and just do the work array-wide. But you still have to think about your boundaries. You can't just add arr and arr[:-1] and arr[1:] and so on; you need to pad them out to the same size. And if you just pad with 0s, you'll end up averaging 0, 2, 3, 0, 5, 6, 0, 8, 9, which is no good. But if you pad them with NaN values, so you're averaging NaN, 2, 3, NaN, 5, 6, NaN, 8, 9, then you can use the nanmean function, which ignores those NaN values and averages only the 6 real values.
So, this is still a few lines of code to iterate over the 9 directions, pad the 9 arrays, and nanmean the results. (Or you could cram it into a giant expression with product, like the leetcode answer, but that isn't exactly more readable or easier to understand.)
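For instance, here is a rough sketch of that pad-with-NaN idea (not taken verbatim from any answer; the function name smooth is just for illustration, and it assumes converting the grid to float is acceptable):
import numpy as np

def smooth(grid):
    a = np.asarray(grid, dtype=float)
    # Pad one cell of NaN around the border so every cell has a full 3x3 window
    padded = np.pad(a, 1, mode='constant', constant_values=np.nan)
    # The 9 shifted views, one per neighbor direction (including the cell itself)
    shifted = [padded[i:i + a.shape[0], j:j + a.shape[1]]
               for i in range(3) for j in range(3)]
    # nanmean ignores the NaN padding; floor to match the "round down" rule
    return np.floor(np.nanmean(np.stack(shifted), axis=0)).astype(int)
For the 3x3 all-ones example above, smooth([[1,1,1],[1,0,1],[1,1,1]]) gives the all-zero grid, matching the expected output.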
But if you can drag in SciPy, a collection of algorithms for almost anything you'd ever want to build on top of NumPy, it has a function in its ndimage library called generic_filter that can do every conceivable variation of "gather the N neighbors, padding like X, and run function Y on the resulting arrays".
In our case, we want to gather the 3-per-axis neighbors, pad with the constant value NaN, and run the nanmean function, so this one-liner will do everything you need:
scipy.ndimage.generic_filter(grid, function=np.nanmean, size=3, mode='constant', cval=np.nan)
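A small usage sketch, again not part of the original one-liner: generic_filter returns the plain means, so you would still floor and cast to match the problem's rounding-down rule.
import numpy as np
import scipy.ndimage

grid = np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
means = scipy.ndimage.generic_filter(grid, function=np.nanmean, size=3,
                                     mode='constant', cval=np.nan)
smoothed = np.floor(means).astype(int)  # all zeros for this example input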
I want to solve a difference equation using python.
y(n) = x(n-1) - 0.5*(x(n-2) + x(n))
x here is a long array of values. I want to plot y with respect to another time sequence array t using Plotly. I can plot x with t, but I am not able to generate the filtered signal y. I have tried the code below, but it seems I'm missing something. I am not getting the desired output.
from scipy import signal
from plotly.offline import plot, iplot
x = array(...)
t = array(...) # x and t are big arrays
b = [-0.5, 1, -0.5]
a = 0
y = signal.lfilter(b, a, x, axis=-1, zi=None)
iplot([{"x": t, "y": y}])
However, the output is something like this.
>>>y
>>> array([-inf, ..., nan])
Therefore, I am getting a blank graph.
UPDATE with examples of x and t (9 values each):
x = [3.1137561664814495,
-1.4589810840917137,
-0.12631870857936914,
-1.2695030212226599,
2.7600637824592158,
-1.7810937909691049,
0.050527483431747656,
0.27158522344564368,
0.48001109260160274]
t = [0.0035589523041146265,
0.011991765409288035,
0.020505576424579175,
0.028935389041247817,
0.037447199517441021,
0.045880011487565042,
0.054462819797731044,
0.062835632533346342,
0.071347441874490158]
It appears that your problem is defining a = 0. When running your example, you get the following warning from SciPy:
/usr/local/lib/python2.7/site-packages/scipy/signal/signaltools.py:1353: RuntimeWarning:
divide by zero encountered in true_divide
[-inf inf nan nan nan inf -inf nan nan]
This division by zero comes from the value of a. If you look at the documentation of scipy.signal.lfilter, it points out the following:
a : array_like
The denominator coefficient vector in a 1-D sequence. If a[0] is not 1, then both a and b are normalized by a[0].
If you change a = 0 to a = 1 you should get the output you desire, although do consider that you might want to normalize the data by a different factor.
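To make the fix concrete, a minimal sketch of the corrected call (using your b and the x values above; with a = 1 this is a plain FIR filter, so no division is involved):
from scipy import signal

b = [-0.5, 1, -0.5]
a = 1                        # denominator of 1: y[n] = -0.5*x[n] + x[n-1] - 0.5*x[n-2]
y = signal.lfilter(b, a, x)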
I think I missed something somewhere. I filled a NumPy array using two for loops (over x and y) and a function based on the x, y position. The only problem is that the last row and column of the array always stay zero, regardless of the size of the array.
import numpy

thetamap = numpy.zeros(36, dtype=float)
thetamap.shape = (6, 6)
for y in range(0, 5):
    for x in range(0, 5):
        thetamap[x][y] = x + y
print(thetamap)
range(0, 5) produces 0, 1, 2, 3, 4. The endpoint is always omitted. You want simply range(6).
Better yet, use the awesome power of NumPy to make the array in one line:
thetamap = np.arange(6) + np.arange(6)[:,None]
This makes a row vector and a column vector, then adds them together using NumPy broadcasting to make a matrix.
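For instance (purely illustrative):
import numpy as np

thetamap = np.arange(6) + np.arange(6)[:, None]
# thetamap[x, y] == x + y everywhere, including the last row and column;
# e.g. the first row is [0, 1, 2, 3, 4, 5] and the last row is [5, 6, 7, 8, 9, 10]
print(thetamap)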