What is an efficient way to iterate over simple moving average (SMA) window lengths when filtering a modest dataset (<10,000 elements)?
I'm trying to remove vertical tangents and extreme peaks from my dataset while retaining as much resolution as possible. My plan was to use SciPy's Simpson's rule integration to compare the area under the original noisy curve to the area under the SMA-smoothed curve; this works well for my process due to the inherent data properties. I'm using pandas to calculate the SMA. I'd like to iteratively change the window length (a fixed integer) until the error is minimized, where error = (area under original curve - area under SMA curve)**2.0.
Unfortunately, pandas does not accept an array of window lengths. To make sure I've hit an acceptable target, I plan to compare the error calculated at each window value and select the window with the smallest error. What would be a code-efficient way to do this iterative comparison?
This is an example of what I have currently.
import numpy as np
import pandas as pd
import scipy as sci
import scipy.integrate  # makes sci.integrate available

noise_data_x = [1, 1.1, 1, 1.2, 1.3, 1.4, 1.5, 1.6, ..., 100]
noise_data_y = [2.1, 3.4, 3.2, 4.7, ..., 2.1, 5.7]

# Rolling mean with a fixed window length
SMA_data_y = pd.DataFrame(noise_data_y).rolling(window=4).mean()

# Drop the leading NaNs produced by the rolling window, keeping x and y aligned
SMA_data_y_array = []
SMA_data_x_array = []
for i in range(len(SMA_data_y)):
    if not np.isnan(SMA_data_y.iloc[i, 0]):
        SMA_data_x_array.append(noise_data_x[i])
        SMA_data_y_array.append(SMA_data_y.iloc[i, 0])

# Compare the area under the smoothed curve to the area under the original
data_cleaned = sci.integrate.simpson(SMA_data_y_array, x=SMA_data_x_array)
print(data_cleaned)
data_original = sci.integrate.simpson(noise_data_y, x=noise_data_x)
error = (data_cleaned - data_original) ** 2.0
This code works as a one-off, but how would you go about iteratively evaluating windows in range(2, 200) with this kind of error minimization?
Surely there's a better way than duplicating the array hundreds of times.
I've looked at using a for loop to pass an array of arrays built with np.tile(), but have not had success. Thoughts?
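One straightforward way to structure the sweep, sketched below under the assumption that noise_data_x and noise_data_y are the full arrays from the snippet above, is to wrap the smoothing-and-integration step in a small function, evaluate it for each candidate window, and keep the window with the smallest error. For fewer than 10,000 points, a plain Python loop over ~200 windows is usually fast enough that nothing like np.tile() is needed:

import numpy as np
import pandas as pd
from scipy import integrate

# The area under the original noisy curve never changes, so compute it once
area_orig = integrate.simpson(noise_data_y, x=noise_data_x)

def sma_error(window):
    """Squared difference between the areas under the SMA and original curves."""
    sma = pd.Series(noise_data_y).rolling(window=window).mean()
    mask = sma.notna().to_numpy()  # drop the leading NaNs
    area_sma = integrate.simpson(sma.to_numpy()[mask],
                                 x=np.asarray(noise_data_x)[mask])
    return (area_sma - area_orig) ** 2.0

errors = {w: sma_error(w) for w in range(2, 200)}
best_window = min(errors, key=errors.get)
print(best_window, errors[best_window])

This is only a sketch of the loop structure, not a claim about the best smoothing strategy; the point is that the data never needs to be duplicated, only re-smoothed per candidate window.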
I have two signals, and I'm trying to find their correlation lag:
It looks like they are synced, so I expect the correlate function to give a minimum at zero (because they have anti-correlation every ~100 timesteps).
However, using this code:
import numpy as np
import matplotlib.pyplot as plt

yhat1 = np.load('cor1.npy')
yhat2 = np.load('cor2.npy')

# Cross-correlate the mean-subtracted signals
corr = np.correlate(yhat1 - np.mean(yhat1),
                    yhat2 - np.mean(yhat2),
                    mode='same')
plt.plot(corr)
plt.show()
I'm getting the following (I tried both 'full' and 'same' as the mode and got the same result):
Why is the minimum not at 0 as expected, but at 250?
Why does it seem like there are other significant peaks on both sides of the minimum?
data is here
NumPy's correlation function returns the auto- or cross-correlation, depending on the inputs you give it. Correlation is the same as convolution, except that you don't apply time reversal to one of the signals; in other words, it is a sliding dot product between the signals.
At t=0 it's normal to get zero correlation, since one signal is zero at t=0. As the lag increases, however, the signals fluctuate in both magnitude and sign, and because their (relatively) extreme peaks line up at different times, the correlation fluctuates as well. The huge peak is at t=500 because full overlap between the two signals occurs there: their extreme peaks are aligned at that moment. After t=500 the overlapping region shrinks again, and the behaviour mirrors what you see for t<500.
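As a small, self-contained illustration of that point (not tied to the cor1/cor2 data), correlation is just convolution with the second signal left un-reversed:

import numpy as np

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([0.0, 1.0, 0.5])

# Sliding dot product (cross-correlation) ...
corr = np.correlate(a, b, mode='full')
# ... equals convolution with the second signal reversed in time
conv = np.convolve(a, b[::-1], mode='full')

print(np.allclose(corr, conv))  # True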
I want to make sure I am using numpy's correlate correctly; it is not giving me the answer I expect. Perhaps I am misunderstanding the correlate function. Here is a code snippet with comments:
import numpy as np

ref = np.sin(np.linspace(-2*np.pi, 2*np.pi, 10000))  # make some data
fragment = ref[2149:7022]           # create a fragment of data from ref
corr = np.correlate(ref, fragment)  # find the correlation between the two
maxLag = np.argmax(corr)            # the lag of maximum correlation; this should be the offset chosen above, 2149
print(maxLag)
# Output: 2167  -- I expected this to be 2149.
Isn't the index in the corr array where the correlation is maximum the lag between these two datasets? I would think the starting index I chose for the smaller dataset would be the offset with the greatest correlation.
Why is there a discrepancy between what I expect, 2149, and the result, 2167?
Thanks
That looks like a precision issue to me. Cross-correlation is an integral, and it will always have problems when represented in discrete space; I suspect the problem arises when the values are close to 0. Maybe if you increase the magnitudes or the precision the difference will disappear, but I don't think it is really necessary, since you are already dealing with an approximation by using the discrete cross-correlation. Below is a plot of the correlation so you can see that the values around the peak are indeed close:
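A quick way to check this yourself (assuming the ref, fragment and corr variables from the snippet above) is to inspect the correlation values around the peak and see how small the spread is there; a nearly flat top is what makes the exact argmax index sensitive to tiny numerical differences:

import numpy as np

peak = np.argmax(corr)
neighbourhood = corr[peak - 25:peak + 25]

# Relative spread of the correlation over 50 lags around the peak
print(peak, (neighbourhood.max() - neighbourhood.min()) / neighbourhood.max())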
I have a numpy array which is basically a data column from an Excel sheet. The data was acquired through a DAS with a 10 Hz low-pass filter, but due to some ambiguity it contains square-wave-like artifacts. The data now has to be filtered with a 0.4 Hz highpass Butterworth filter, which I do through scipy.signal. But after applying the highpass filter, the square-wave-like artifacts turn into spikes. When I then apply a median filter (scipy), I am not able to remove the spikes. What should I try?
The following pic shows the original data.
The following pic shows the 0.4 Hz highpass filter applied, followed by a median filter of order 3.
Even a median filter of order 51 is not useful.
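For reference, a minimal sketch of the filtering chain described in the question might look like the following. The sample rate fs, the filter order, and the file name are all assumptions, since the post does not state them:

import numpy as np
from scipy import signal

fs = 100.0                     # Hz -- assumed sample rate
data = np.loadtxt('data.txt')  # assumed file/column layout

# 0.4 Hz highpass Butterworth (order assumed), applied forward-backward
# with filtfilt to avoid introducing a phase shift
b, a = signal.butter(N=4, Wn=0.4, btype='highpass', fs=fs)
highpassed = signal.filtfilt(b, a, data)

# Median filter to attack the spikes left by the step artifacts
despiked = signal.medfilt(highpassed, kernel_size=51)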
If your input is always expected to have a significant outlier, I would recommend an iterative filtering approach.
Here is your data plotted along with the mean, 1-sigma, 2-sigma and 3-sigma lines:
I would start by removing everything above and below 2-sigma from the mean. Since that will tighten the distribution, I would repeat the process until the size of the un-trimmed data stops changing. I would also recommend increasing the threshold geometrically on each pass, to avoid trimming "good" data. Finally, you can fill in the missing points with the mean of the remainder, or something like that.
Here is a sample implementation, with no attempt at optimization whatsoever:
import numpy as np

data = np.loadtxt('data.txt', skiprows=1)
x = np.arange(data.size)

loop_data = data
prev_size = 0
nsigma = 2
while prev_size != loop_data.size:
    mean = loop_data.mean()
    std = loop_data.std()
    mask = (loop_data < mean + nsigma * std) & (loop_data > mean - nsigma * std)
    prev_size = loop_data.size
    loop_data = loop_data[mask]
    x = x[mask]
    # Constantly expanding sigma guarantees fast loop termination
    nsigma *= 2

# Reconstruct the mask over the full dataset
mask = np.zeros_like(data, dtype=bool)
mask[x] = True

# This destroys the original data somewhat
data[~mask] = data[mask].mean()
This approach may not be optimal in all situations, but I have found it to be fairly robust most of the time. There are lots of tweakable parameters: you may want to change the growth factor from 2, or even use a linear instead of a geometric increase (although I tried the latter and it really didn't work well). You could also use the IQR instead of sigma, since it is more robust against outliers; a sketch of that variant follows the plots below.
Here is an image of the resulting dataset (with the removed portion in red and the original dotted):
Another artifact of interest: here are plots of the data showing the trimming progression and how it affects the cutoff points. The plots show the data, with the cut portions in red, and the n-sigma line for the remainder. The titles show how much sigma shrinks:
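For completeness, here is a hypothetical sketch of the IQR-based variant mentioned above (it is not part of the original answer): points outside [Q1 - k*IQR, Q3 + k*IQR] are trimmed, and k is widened geometrically on each pass, mirroring the sigma version.

import numpy as np

def iqr_trim(data, k=1.5, grow=2.0):
    """Return a boolean mask of points to keep, using iterative IQR clipping."""
    x = np.arange(data.size)
    loop_data, loop_x = data.copy(), x
    prev_size = 0
    while prev_size != loop_data.size:
        q1, q3 = np.percentile(loop_data, [25, 75])
        iqr = q3 - q1
        mask = (loop_data > q1 - k * iqr) & (loop_data < q3 + k * iqr)
        prev_size = loop_data.size
        loop_data, loop_x = loop_data[mask], loop_x[mask]
        k *= grow  # widen the fences so the loop terminates quickly
    keep = np.zeros(data.size, dtype=bool)
    keep[loop_x] = True
    return keep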
I'm using mean-shift clustering to remove unwanted noise from my input data.
The data can be found here. Here is what I have tried so far:
import numpy as np
from sklearn.cluster import MeanShift

data = np.loadtxt('model.txt', unpack=True)
# data shape is (3, 500)

ms = MeanShift()
ms.fit(data)
After trying several different bandwidth values I am still getting only one cluster, but the outliers and noise shown in the picture are supposed to be in a different cluster.
When I decrease the bandwidth a little more I end up with this ... which is again not what I was looking for.
Can anyone help me with this?
You can remove outliers before using mean shift.
Statistical removal
For example, fix a number of neighbors to analyze for each point (e.g. 50) and a standard-deviation multiplier (e.g. 1). All points whose mean distance to their neighbors is more than one standard deviation above the overall mean distance are marked as outliers and removed. This technique is used in libpcl, in the class pcl::StatisticalOutlierRemoval, and a tutorial can be found here.
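Sketched with scikit-learn's NearestNeighbors (libpcl itself is a C++ library, so this is only an approximation of the same idea), the statistical criterion might look like this; points is assumed to be an (n_samples, n_features) array:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def statistical_outlier_mask(points, n_neighbors=50, std_mult=1.0):
    """True = keep; False = treat as outlier."""
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(points)
    dist, _ = nn.kneighbors(points)
    mean_dist = dist[:, 1:].mean(axis=1)  # skip the zero distance to the point itself
    threshold = mean_dist.mean() + std_mult * mean_dist.std()
    return mean_dist <= threshold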
Deterministic removal (radius based)
A simpler technique consists of specifying a radius R and a minimum number of neighbors N. All points that have fewer than N neighbours within a radius of R are marked as outliers and removed. This technique is also used in libpcl, in the class pcl::RadiusOutlierRemoval, and a tutorial can be found here.
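And the radius-based variant, again sketched with scikit-learn rather than libpcl; R and N are the parameters described above:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def radius_outlier_mask(points, radius, min_neighbors):
    """True = keep points that have at least min_neighbors within radius."""
    nn = NearestNeighbors(radius=radius).fit(points)
    neighbors = nn.radius_neighbors(points, return_distance=False)
    counts = np.array([len(idx) - 1 for idx in neighbors])  # exclude the point itself
    return counts >= min_neighbors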
Mean-shift is not meant to remove low-density areas.
It tries to move all data to the most dense areas.
If there is one single most dense point, then everything should move there, and you get only one cluster.
Try a different method. Maybe remove the outliers first.
Set the cluster_all parameter to False (cluster_all : bool, default=True):
If true, then all points are clustered, even those orphans that are not within any kernel. Orphans are assigned to the nearest kernel. If false, then orphans are given cluster label -1.
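A minimal sketch of that suggestion; note that scikit-learn expects samples as rows, so the (3, 500) array from the question is transposed first:

import numpy as np
from sklearn.cluster import MeanShift

# Reshape to (n_samples, n_features) for scikit-learn
data = np.loadtxt('model.txt', unpack=True).T

ms = MeanShift(cluster_all=False)
labels = ms.fit_predict(data)   # orphans (points outside every kernel) get label -1

denoised = data[labels != -1]   # keep only the clustered points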
I have some discrete data values that, taken together, form some sort of distribution.
This is one of them, but they differ, with the peak appearing in all possible locations, from 0 to the end.
So, I want to use its quantiles (percentiles) in Python. I think I could write some sort of function that sums up the values starting from zero until it reaches the desired percent, but perhaps there is a better solution? For example, creating an empirical distribution of some sort in SciPy and then using SciPy's methods for calculating percentiles?
In the very end I need the x-coordinates of a left percentile and a right percentile. One could use the 20% and 80% percentiles as an example; I will have to find the best numbers for my case later.
Thank you in advance!
EDIT:
Some example code for almost what I want:
import numpy as np

np.random.seed(0)
distribution = np.random.normal(0, 1, 1000)
left, right = np.percentile(distribution, [20, 80])
print(left, right)
This returns the percentile values themselves; I need to get their x-coordinates somehow. For a normal distribution this is obviously possible, but I have a distribution of unknown shape, so if a percentile isn't equal to one of the values (which is the most common case), it becomes much more complicated.
If you are looking for the empirical CDF, you can use statsmodels' ECDF. For percentiles/quantiles you can use numpy's percentile.
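A minimal sketch of both suggestions, reusing the sample array from the question's EDIT:

import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

np.random.seed(0)
distribution = np.random.normal(0, 1, 1000)

ecdf = ECDF(distribution)
print(ecdf(0.0))                               # empirical P(X <= 0)
print(np.percentile(distribution, [20, 80]))   # 20th and 80th percentile values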
OK, for now I have written the following function and am using it:
def percentile(distribution, percent):
    percent = 1.0 * percent / 100
    cum_percent = 0
    i = 0
    while cum_percent <= percent:
        cum_percent = cum_percent + distribution[i]
        i = i + 1
    return i
It is a little rough, because it returns the index of the closest value to the left of the required percentile. For my purposes it works as a temporary solution, but I would love to see a working solution for determining the precise percentile x-coordinate.
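One possible refinement (an interpretation of "precise", not a canonical solution): build the cumulative distribution and interpolate the x-coordinate with np.interp. Here x is the grid of x-coordinates and weights holds the discrete distribution values; both names, and the example data, are illustrative.

import numpy as np

def percentile_x(x, weights, percent):
    """Interpolated x-coordinate at which the cumulative distribution reaches percent."""
    cdf = np.cumsum(weights) / np.sum(weights)
    return np.interp(percent / 100.0, cdf, x)

# Example: left and right cut points at the 20th and 80th percentiles
x = np.linspace(0.0, 10.0, 101)
weights = np.exp(-0.5 * (x - 4.0) ** 2)   # some peaked discrete distribution
print(percentile_x(x, weights, 20), percentile_x(x, weights, 80))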