Weighted smoothing of a 1D array in Python

I am quite new to Python and I have an array of detected parameter values. Some of the values were detected incorrectly (like 4555555):
array = [1, 20, 55, 33, 4555555, 1]
And I want to somehow smooth it. Right now I'm doing that with a weighted mean:
def smoothify(array):
    for i in range(1, len(array) - 2):
        array[i] = 0.7 * array[i] + 0.15 * (array[i - 1] + array[i + 1])
    return array
But it works pretty badly. Of course, we can take a weighted mean of more than 3 elements, but that results in copy-pasting... I tried to find a built-in function for this, but I failed.
Could you please help me with that?
P.S. Sorry if it's a noob question :(
Thanks for your time,
Best regards, Anna

For weighted smoothing, you are basically looking to perform convolution. Since we are dealing with 1D arrays, we can simply use NumPy's 1D convolution function, np.convolve, for a vectorized solution. The only important thing to remember is that the weights must be reversed, because convolution slides a reversed version of the kernel across the input array. Thus, the solution would be -
import numpy as np
weights = [0.7,0.15,0.15]
out = np.convolve(array,np.array(weights)[::-1],'same')
If you were looking to get a weighted mean, you could get it with out/sum(weights). In our case, the sum of the given weights is already 1, so the output would stay the same as out.
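To reproduce the exact formula from the question (0.7 for the current sample, 0.15 for each neighbour), the kernel would be [0.15, 0.7, 0.15]; because it is symmetric, reversing it changes nothing. The following is a minimal sketch under that assumption. The names kernel, smoothed and expected are illustrative, and the data is the array from the question.

```python
import numpy as np

# Sample data from the question; the large value is the suspected outlier.
array = np.array([1, 20, 55, 33, 4555555, 1], dtype=float)

# Kernel ordered as (previous, current, next). It is symmetric,
# so the flip that convolution applies to the kernel has no effect here.
kernel = np.array([0.15, 0.7, 0.15])

# mode="same" keeps the output length equal to the input length;
# the two edge values are computed as if the array were zero-padded.
smoothed = np.convolve(array, kernel, mode="same")

# Interior points match the hand-written formula from the question.
expected = 0.7 * array[1:-1] + 0.15 * (array[:-2] + array[2:])
print(np.allclose(smoothed[1:-1], expected))
```

A longer window only needs a longer kernel, so no copy-pasting of terms is required.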
Let's plot the output along with the input for graphical debugging -
import numpy as np
import matplotlib.pyplot as plt
# Input array and weights
array = [1, 20, 55, 33, 455, 200, 100, 20 ]
weights = [0.7,0.15,0.15]
out = np.convolve(array,np.array(weights)[::-1],'same')
# Plot the input and the smoothed output in two stacked panels
x = np.arange(len(array))
f, axarr = plt.subplots(2, sharex=True, sharey=True)
axarr[0].plot(x,array)
axarr[0].set_title('Original and smoothened arrays')
axarr[1].plot(x,out)
plt.show()
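Output - [Figure: two stacked panels with shared axes, titled 'Original and smoothened arrays'; the original array on top and the smoothed array below.]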

I would suggest numpy.average to help you with this. The trick is getting the weights calculated: below I zip up three lists, one the same as the original array, the next shifted one step ahead, and the last shifted one step behind. Once we have the weights, we feed them into the np.average function.
import numpy as np
array = [1, 20, 55, 33, 4555555, 1]
arrayCompare = zip(array, array[1:] + [0], [0] + array)
weights = [.7 * x + .15 * (y + z) for x, y, z in arrayCompare]
avg = np.average(array, weights=weights)
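As a point of reference, np.average with a weights argument reduces a 1D array to a single weighted mean by default, rather than returning a smoothed array of the same length. A small sketch with made-up numbers (values, weights and manual are illustrative names, not part of the answer above):

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0])
weights = np.array([1.0, 1.0, 2.0])

# Weighted mean: sum(values * weights) / sum(weights)
avg = np.average(values, weights=weights)
manual = (values * weights).sum() / weights.sum()

print(avg, manual)
```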

Maybe you want to have a look at numpy and in particular at numpy.average.
Also, did you see this question Weighted moving average in python? Might be helpful, too.

Since you tagged this with numpy I wrote how I would do this with numpy:
import numpy as np
def smoothify(thisarray):
    """
    returns moving average of input using:
    out(n) = .7*in(n) + 0.15*( in(n-1) + in(n+1) )
    """
    # make sure we got a numpy array, else make it one
    if isinstance(thisarray, list):
        thisarray = np.array(thisarray)
    # do the moving average by adding three slices of the original array
    # returns a numpy array,
    # could be modified to return whatever type we put in...
    return 0.7 * thisarray[1:-1] + 0.15 * (thisarray[2:] + thisarray[:-2])
myarray = [1, 20, 55, 33, 4555555, 1]
smootharray = smoothify(myarray)
Instead of looping through the original array, with numpy you can get "slices" by indexing. The output array will be two items shorter than the input array. The central points (n) are thisarray[1:-1]: "from item index 1 until the last item (not inclusive)". The other slices are thisarray[2:], "from index 2 until the end", and thisarray[:-2], "everything except the last two".
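If the output should keep the input length, one option is to pad the array before slicing. The sketch below assumes that repeating the first and last sample once is an acceptable edge treatment, which the answer above does not specify; smoothify_same_length is a hypothetical name.

```python
import numpy as np

def smoothify_same_length(values):
    """Three-point weighted average that keeps the input length.

    Assumption: the first and last samples are repeated once so that
    every point has two neighbours.
    """
    values = np.asarray(values, dtype=float)
    padded = np.pad(values, 1, mode="edge")
    # Same three slices as before, now taken from the padded array.
    return 0.7 * padded[1:-1] + 0.15 * (padded[2:] + padded[:-2])

myarray = [1, 20, 55, 33, 4555555, 1]
smooth = smoothify_same_length(myarray)
print(len(myarray), len(smooth))
```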

Related

Numpy function, adding the log of the exponential. Python

I am a new user of Python.
I want to add many exponential functions, and then take (and store in memory) the logarithm of the result. (Side note: I am doing this because the sum of the exponential functions is very large, so storing the log of this result is a workaround.) Can anyone help me use this numpy function: https://numpy.org/doc/stable/reference/generated/numpy.logaddexp.html
In the code below I have a 2 x 2 matrix M and a 2-dimensional vector v. I first want to add v to each row of M, so in the code below the result should be
[[11, 22], [13, 24]]
Then I want to take the exponential of each value and sum across the rows (ending up with a vector of length 2), storing the logarithm of the result. However, the code below outputs a matrix, and I can't work out how to use the "out=None" input of the logaddexp function.
import numpy as np
M = np.array([[1, 2], [3, 4]])
v = np.array([10, 20])
result = np.logaddexp(M, v[None, :])
The function np.logaddexp() performs an elementwise operation. In your case, you need the addition to be performed along a given axis. Using some basic functions, you can try the following.
import numpy as np
M = np.array([[1, 2], [3, 4]]) # '2 x 2' array
v = np.array([[10, 20]]) # '1 x 2' array
sum_Mv = M + v # '2 x 2' array
result = np.log(np.sum(np.exp(sum_Mv), axis=1))
Change the 'axis' parameter if needed.
If you still want to use np.logaddexp(), you can split the summed matrix into two halves and perform the operation as shown below.
result = np.logaddexp(sum_Mv[:, 0], sum_Mv[:, 1])
TLDR:
import numpy as np
M = np.array([[1, 2], [3, 4]])
v = np.array([10, 20])
result = np.logaddexp.reduce(M + v, axis=___)
Fill in ___ depending on what "sum across the rows" means
Consider the difference between np.add and np.sum.
np.add, much like the + operator, always takes in 2 arguments, x1 and x2, and adds them together. np.add is a numpy ufunc. If x1 or x2 is an array_like, then the arguments are broadcast together.
np.sum always takes in 1 argument, typically an array_like of items, and performs a summation of all of the elements in the array_like. This is essentially equivalent to iteratively taking an element from the array_like and repeatedly calling np.add with that element on a running result variable. The running result variable is initialized with 0.
Similarly, what np.sum is to np.add, np.prod is to np.multiply (with running result initialized as 1).
Every np.ufunc (such as np.add and np.multiply, but also np.logaddexp), comes with a reduce method and an accompanying identity property that is used as initialization for the running result.
np.add.reduce is essentially equivalent to np.sum, and np.multiply.reduce to np.prod. One difference is the default axis: reduce defaults to axis=0, while np.sum and np.prod default to axis=None (all elements).
What you're looking to do is a log-sum-exp; but numpy only offers np.logaddexp. As such, you can use np.logaddexp.reduce to get the required functionality. Confusion arises from the fact that you're adding M and v as well as adding exponential terms together. You can simply perform the M + v operation first, and pass the resulting array (the intermediate result in your question), to np.logaddexp.reduce. Note that M + v is equivalent to M + v[None, :] in this case due to numpy's broadcasting rules.
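A minimal sketch of this, assuming that "sum across the rows" means one value per row (axis=1). The names total, direct and big are illustrative, and the cast to float is only a precaution.

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])
v = np.array([10, 20])

# Broadcasting adds v to every row: [[11, 22], [13, 24]]
total = (M + v).astype(float)

# Assumption: one log-sum-exp value per row, hence axis=1.
result = np.logaddexp.reduce(total, axis=1)

# For moderate values this agrees with the direct formula.
direct = np.log(np.sum(np.exp(total), axis=1))
print(np.allclose(result, direct))

# With large inputs the direct formula would overflow inside exp,
# while the reduce version stays finite:
# log(e^1000 + e^1000) = 1000 + log(2).
big = np.logaddexp.reduce(np.array([1000.0, 1000.0]))
print(np.isclose(big, 1000.0 + np.log(2.0)))
```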

What is an efficient way to calculate the mean of values in the bin with maximum frequency for large number of numpy arrays?

I am looking for an efficient way to do the following calculations on millions of arrays. For the values in each array, I want to calculate the mean of the values in the bin with most frequency as demonstrated below. Some of the arrays might contain nan values and other values are float. The loop for my actual data takes too long to finish.
import numpy as np
array = np.array([np.random.uniform(0, 10) for i in range(800,)])
# adding nan values
mask = np.random.choice([1, 0], array.shape, p=[.7, .3]).astype(bool)
array[mask] = np.nan
array = array.reshape(50, 16)
bin_values=np.linspace(0, 10, 21)
f = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_values)[0], 1, array)
bin_start = np.apply_along_axis(lambda a: bin_values[np.argmax(a)], 1, f).reshape(array.shape[0], -1)
bin_end = bin_start + abs(bin_values[1] - bin_values[0])
values = np.zeros(array.shape[0])
for i in range(array.shape[0]):
    values[i] = np.nanmean(array[i][(array[i] >= bin_start[i]) * (array[i] < bin_end[i])])
Also, when I run the above code I get three warnings. The first is 'RuntimeWarning: Mean of empty slice' for the line where I calculate the values variable. I added a condition to skip this line when a row contains only nan values, but the warning did not go away, and I was wondering what the reason is. The other two warnings come from the less and greater_equal comparisons, which makes sense to me since the compared values might be nan.
The arrays that I want to run this algorithm on are independent, but I am already processing them with 12 separate scripts. Running the code in parallel would be an option, however, for now I am looking to improve the algorithm itself.
The reason that I am using lambda function is to run numpy.histogram over an axis since it seems the histogram function does not take an axis as an option. I was able to use a mask and remove the loop from the code. The code is 2 times faster now, but I think it still can be improved more.
I can explain what I want to do in more detail with an example if it clarifies it. Imagine I have 36 numbers which are greater than 0 and smaller than 20. Also, I have bins with an equal width of 0.5 over the same interval (0.0_0.5, 0.5_1.0, 1.0_1.5, … , 19.5_20.0). I want to know, if I put the 36 numbers into their corresponding bins, what the mean of the numbers in the bin containing the most numbers would be.
Please post your solution if you can think of a faster algorithm.
import numpy as np
# creating an array to test the algorithm
array = np.array([np.random.uniform(0, 10) for i in range(800,)])
# adding nan values
mask = np.random.choice([1, 0], array.shape, p=[.7, .3]).astype(bool)
array[mask] = np.nan
array = array.reshape(50, 16)
# the algorithm
bin_values=np.linspace(0, 10, 21)
# calculating the frequency of each bin
f = np.apply_along_axis(lambda a: np.histogram(a, bins=bin_values)[0], 1, array)
bin_start = np.apply_along_axis(lambda a: bin_values[np.argmax(a)], 1, f).reshape(array.shape[0], -1)
bin_end = bin_start + (abs(bin_values[1]-bin_values[0]))
# creating a mask to get the mean over the bin with maximum frequency
mask = (array>=bin_start) * (array<bin_end)
mask_nan = np.tile(np.nan, (mask.shape[0], mask.shape[1]))
mask_nan[mask] = 1
v = np.nanmean(array * mask_nan, axis = 1)
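One possible direction for removing np.apply_along_axis is to compute every bin index with a single np.digitize call and to count all rows at once with np.bincount. This is only a sketch under stated assumptions: mean_of_modal_bin is a hypothetical name, values equal to the last bin edge are ignored (np.histogram would count them in the last bin), and ties go to the first most populated bin, as np.argmax does in the code above.

```python
import numpy as np

def mean_of_modal_bin(array, bin_edges):
    """Per row: mean of the values that fall in the most populated bin."""
    array = np.asarray(array, dtype=float)
    n_rows = array.shape[0]
    n_bins = len(bin_edges) - 1

    # NaN compares False, so NaN entries are excluded by this mask.
    with np.errstate(invalid="ignore"):
        valid = (array >= bin_edges[0]) & (array < bin_edges[-1])

    # Bin index of every element; invalid entries get a dummy index 0.
    idx = np.where(valid, np.digitize(array, bin_edges) - 1, 0)

    # Offset the indices by row so one bincount call counts all rows.
    flat = idx + n_bins * np.arange(n_rows)[:, None]
    counts = np.bincount(flat[valid], minlength=n_rows * n_bins)
    counts = counts.reshape(n_rows, n_bins)

    # First most populated bin per row.
    modal = counts.argmax(axis=1)
    in_modal = valid & (idx == modal[:, None])

    sums = np.where(in_modal, array, 0.0).sum(axis=1)
    n = in_modal.sum(axis=1)
    # Rows without any valid value yield NaN instead of a warning.
    return np.divide(sums, n, out=np.full(n_rows, np.nan), where=n > 0)

bin_values = np.linspace(0, 10, 21)
demo = np.array([[0.1, 0.2, 3.0, np.nan],
                 [np.nan, np.nan, np.nan, np.nan],
                 [5.6, 5.7, 5.9, 1.0]])
print(mean_of_modal_bin(demo, bin_values))
```

Whether this is actually faster on the real data would have to be measured; the idea is simply to replace per-row Python calls with whole-array operations.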

Suggestion to vectorize a Python function

I wrote the following function, which takes as inputs three 1D arrays (namely int_array, x, and y) and a number lim. The output is a number as well.
def integrate_to_lim(int_array, x, y, lim):
    if lim >= np.max(x):
        res = 0.0
    elif lim <= np.min(x):
        res = int_array[0]
    else:
        index = np.argmax(x > lim)  # To find the first element of x larger than lim
        partial = int_array[index]
        slope = (y[index-1] - y[index]) / (x[index-1] - x[index])
        rest = (x[index] - lim) * (y[index] + (lim - x[index]) * slope / 2.0)
        res = partial + rest
    return res
Basically, apart from the limit cases lim>=np.max(x) and lim<=np.min(x), the idea is that the function finds the index of the first value of the array x larger than lim and then uses it to make some simple calculations.
In my case, however, lim can also be a fairly big 2D array (shape ~2000 x ~1000 elements).
I would like to rewrite it such that it makes the same calculations for the case that lim is a 2D array.
Obviously, the output should also be a 2D array of the same shape of lim.
I am having a real struggle figuring out how to vectorize it.
I would like to stick only to the numpy package.
PS I want to vectorize my function because efficiency is important and as I understand using for loops is not a good choice in this regard.
Edit: my attempt
I was not aware of the function np.take, which made the task way easier.
Here is my brutal attempt that seems to work (suggestions on how to clean up or to make the code faster are more than welcome).
def integrate_to_lim_vect(int_array, x, y, lim_mat):
    lim_mat = np.asarray(lim_mat)  # Make sure that it is an array
    shape_3d = list(lim_mat.shape) + [1]
    x_3d = np.ones(shape_3d) * x  # 3 dimensional version of x
    lim_3d = np.expand_dims(lim_mat, axis=2) * np.ones(x_3d.shape)  # also 3d
    # I use np.argmax on the 3d matrices (is there a simpler way?)
    index_mat = np.argmax(x_3d > lim_3d, axis=2)
    # Silly calculations
    partial = np.take(int_array, index_mat)
    y1_mat = np.take(y, index_mat)
    y2_mat = np.take(y, index_mat - 1)
    x1_mat = np.take(x, index_mat)
    x2_mat = np.take(x, index_mat - 1)
    slope = (y1_mat - y2_mat) / (x1_mat - x2_mat)
    rest = (x1_mat - lim_mat) * (y1_mat + (lim_mat - x1_mat) * slope / 2.0)
    res = partial + rest
    # Make the cases with np.select
    condlist = [lim_mat >= np.max(x), lim_mat <= np.min(x)]
    choicelist = [0.0, int_array[0]]  # Should these options be a 2d matrix?
    output = np.select(condlist, choicelist, default=res)
    return output
I am aware that if the limit is larger than the maximum value in the array np.argmax returns the index zero (leading to wrong results). This is why I used np.select to check and correct for these cases.
Is it necessary to define the three dimensional matrices x_3d and lim_3d, or is there a simpler way to find the 2D matrix of the indices index_mat?
Suggestions, especially to improve the way I expanded the dimension of the arrays, are welcome.
I think you can solve this using two tricks. First, a 2d array can be easily flattened to a 1d array, and then your answers can be converted back into a 2d array with reshape.
Next, your use of argmax suggests that your array is sorted. Then you can find your full set of indices using digitize. Thus instead of a single index, you will get a complete array of indices. All the calculations you are doing are intrinsically supported as array operations in numpy, so that should not cause any problems.
You will have to specifically look at the limiting cases. If those are rare enough, then it might be okay to let the answers be derived by the default formula (they will be garbage values), and then replace them with the actual values you desire.
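The digitize idea can be sketched as follows, assuming x is sorted in strictly increasing order. np.digitize accepts a lim array of any shape, so flattening is optional here. The function name integrate_to_lim_digitize and the test data are hypothetical.

```python
import numpy as np

def integrate_to_lim_digitize(int_array, x, y, lim):
    """Vectorized variant; assumes x is sorted in strictly increasing order."""
    int_array, x, y = (np.asarray(a, dtype=float) for a in (int_array, x, y))
    lim = np.asarray(lim, dtype=float)

    # Index of the first x larger than lim, for every element of lim.
    index = np.digitize(lim, x)
    # Keep indices usable for the limiting cases; those results are
    # overwritten below.
    index = np.clip(index, 1, len(x) - 1)

    slope = (y[index - 1] - y[index]) / (x[index - 1] - x[index])
    rest = (x[index] - lim) * (y[index] + (lim - x[index]) * slope / 2.0)
    res = int_array[index] + rest

    # Limiting cases, with the same values as in the scalar function.
    res = np.where(lim <= x[0], int_array[0], res)
    res = np.where(lim >= x[-1], 0.0, res)
    return res

# Hypothetical test data, only for illustration.
x = np.linspace(0.0, 1.0, 11)
y = x ** 2
int_array = np.cumsum(y[::-1])[::-1]
lim = np.array([[-1.0, 0.05, 0.33], [0.5, 0.97, 2.0]])
print(integrate_to_lim_digitize(int_array, x, y, lim).shape)
```

This avoids building the 3D helper arrays, since the index lookup works directly on the 2D lim.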

How to efficiently populate a numpy 2D array?

I want to create a 2D numpy array of size (N_r * N_z).
Across columns, the elements for 1 specific column (say j) shall be created based on the value r_thresh[j].
So 1 column (say j) out of the total of N_z columns in the numpy 2D array is created as:
(np.arange(N_r) + 0.5) * r_thresh[j] # this gives an array of size (1, N_r)
Of course, the column j + 1 shall be created as:
(np.arange(N_r) + 0.5) * r_thresh[j+1] # this gives an array of size (1, N_r)
r_thresh is a numpy array of size (1, N_z), already populated with values before I want to create the 2D array.
I want to ask you how do I go further and use this ''rule'' of creating each element of the numpy 2D array and actually create the whole array, in the most efficient way possible (speed-wise).
I initially wrote all the code using 2 nested for loops and plain python lists and the code worked, but took forever to run.
More experienced programmers told me to avoid for loops and use numpy because it's the best.
I now understand how to create 1D arrays using numpy np.arange() instruction, but I lack the knowledge on how to extrapolate this to 2 Dimensions.
Thanks!
The easiest way is to use einsum. In the case of r_thresh with the shape of (N_z,), you can use this code:
res = np.einsum("i,j->ij", np.arange(N_r) + 0.5, r_thresh)
Also, you can reshape np.arange(N_r) + 0.5 to the shape (N_r,1) and r_thresh to the shape (1,N_z). Then you can use the matrix product operator @ (available in Python 3.5 and later):
res = (np.arange(N_r) + 0.5).reshape(N_r,1) @ r_thresh.reshape(1,N_z)
or, following the comment of hpaulj:
res = (np.arange(N_r) + 0.5)[:,None] @ r_thresh[None,:]
EDIT1
The comment of hpaulj is also very helpful (I pasted this into my answer to see better):
res = (np.arange(N_r) + 0.5)[:,None] * r_thresh
res = np.outer(np.arange(N_r) + 0.5, r_thresh)
IN ADDITION
You can also use tensordot:
res = np.tensordot((np.arange(N_r) + 0.5)[:,None], r_thresh[:,None], axes=[[-1],[-1]])
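A quick way to convince yourself that these variants build the same array is to compare them on a small example. The sizes and threshold values below are made up, and r_thresh is assumed to have shape (N_z,); an array of shape (1, N_z) could be flattened with r_thresh.ravel() first.

```python
import numpy as np

# Hypothetical sizes and thresholds, only for illustration.
N_r, N_z = 4, 3
r_thresh = np.array([1.0, 2.0, 5.0])   # shape (N_z,)

base = np.arange(N_r) + 0.5            # shape (N_r,)

by_broadcast = base[:, None] * r_thresh
by_outer = np.outer(base, r_thresh)
by_einsum = np.einsum("i,j->ij", base, r_thresh)
by_matmul = base[:, None] @ r_thresh[None, :]

# Column j equals (np.arange(N_r) + 0.5) * r_thresh[j]
print(by_broadcast.shape)
print(np.allclose(by_broadcast[:, 1], base * r_thresh[1]))
```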

Python Numpy error : setting an array element with a sequence

I'm quite new to Python and Numpy, so I apologize if I'm missing something obvious here.
I have a function that solves a system of 2 differential equations :
import numpy as np
import numpy.linalg as la
def solve_ode(x0, a0, beta, t):
    At = np.array([[0.23*t, (-10**5)*t], [0, -beta*t]], dtype=np.float32)
    # get eigenvalues and eigenvectors
    evals, V = la.eig(At)
    Vi = la.inv(V)
    # get e^At coeff
    eAt = V @ np.exp(evals) @ Vi
    xt = eAt*x0
    return xt
However, running it with this code :
import matplotlib.pyplot as plt
# initial values
x0 = 10**6
a0 = 2.5
beta = 0.05
t = np.linspace(0, 3600, 360)
plt.semilogy(t, solve_ode(x0, a0, beta, t))
... throws this error :
ValueError: setting an array element with a sequence.
At this line :
At = np.array([[0.23*t, (-10**5)*t], [0, -beta*t]], dtype=np.float32)
Note that t and beta are supposed to be floats. I think Python might not be able to infer this but I don't know how I could do this...
Thx in advance for your help.
You are supplying t as a numpy array of shape (360,) from linspace and not simply a float. The resulting At numpy array you are trying to create is then ill-formed, as all columns must be the same length. In Python there is an important difference between lists and numpy arrays. For example, you could do what you have here as a list of lists, e.g.
At = [[0.23*t, (-10**5)*t], [0, -beta*t]]
with dimensions [[360 x 360] x [1 x 360]].
Alternatively, if all elements of At are the length of t the array would work,
At = np.array([[0.23*t, (-10**5)*t], [t, -beta*t]], dtype=np.float32)
with shape [2, 2, 360].
When you give a list, a list of lists, or in this case a list of lists of lists, all of them should have the same length, so that numpy can automatically infer the dimensions (shape) of the resulting array.
In your example, it's all put together correctly, except for the part where you put 0 as a column, I guess. Not sure what to call it though, because your expected output is a cube, I suppose.
You can fix it by giving the correct number of zeros as below:
At = np.array([[0.23*t, (-10**5)*t], [np.zeros(len(t)), -beta*t]], dtype=np.float32)
But check the .shape of the resulting array, and make sure it's what you want.
As others note the problem is the 0 in the inner list. It doesn't match the 360 length arrays generated by the other expressions. np.array can make an object dtype array from that (2x2), but can't make a float one.
At = np.array([[0.23*t, (-10**5)*t], [0*t, -beta*t]])
produces a (2,2,360) array. But I suspect the rest of that function is built around the assumption that At is (2,2) - a 2d square array with eig, inv etc.
What is the return xt supposed to be?
Does this work?
S = np.array([solve_ode(x0, a0, beta, i) for i in t])
giving a 1d array with the same number of values as in t?
I'm not suggesting this is the fastest way of solving the problem, but it's the simplest, especially if you are only generating 360 values.
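A sketch of that per-value approach, under two assumptions that go beyond the original code: the matrix exponential e^(At) is what is intended, so the exponentials of the eigenvalues are placed on a diagonal with np.diag, and a short time grid is used so that exp() does not overflow. solve_ode_scalar is a hypothetical name. Since the function returns a 2 x 2 matrix, the stacked result has shape (len(t), 2, 2) rather than being 1D.

```python
import numpy as np
import numpy.linalg as la

def solve_ode_scalar(x0, a0, beta, t):
    """Evaluate x0 * e^(At) for a single float t (a0 is unused, as in the question)."""
    At = np.array([[0.23 * t, -1e5 * t], [0.0, -beta * t]])
    evals, V = la.eig(At)
    # Assumption: the matrix exponential is intended, which needs the
    # exponentials of the eigenvalues on a diagonal matrix.
    eAt = V @ np.diag(np.exp(evals)) @ la.inv(V)
    return eAt * x0

# A short time grid keeps exp() from overflowing in this illustration.
t = np.linspace(0.0, 1.0, 5)
S = np.array([solve_ode_scalar(10**6, 2.5, 0.05, ti) for ti in t])
print(S.shape)
```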
