Smoothing / noise filtering data in Python

I have a table with data as follows:

article   price   wished outcome
horse     10      10
duck      15      15
child     9       15 - 21
panda     21      21
lamb      24      22
gorilla   23      23
I want to smooth the Price column toward the wished outcome and then put the result back into the dataframe so that I can see the values.
Is there a built-in library or method that smooths data in this format?
I found the Savitzky-Golay filter, moving averages, etc., but I can't get them to work on this kind of data, where the x axis is a product name rather than a numeric value.
Can you help? Thanks!
import pandas as pd
d = {'Price': [10, 15, 9, 21, 24, 23],
     'Animal': ['horse', 'duck', 'child', 'panda', 'lamb', 'gorilla']}
df = pd.DataFrame(d)
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
from scipy.interpolate import interp1d
from scipy.signal import savgol_filter
import numpy as np
x = np.arange(1,len(df)+1)
y = df['Price']
xx = np.linspace(x.min(),x.max(), 1001)
# interpolate + smooth
itp = interp1d(x, y, kind='quadratic')  # kind = 'linear', 'nearest' (good results), 'slinear' (also OK); avoid 'cubic' and 'quadratic'
window_size, poly_order = 1001, 1
yy_sg = savgol_filter(itp(xx), window_size, poly_order)
# or fit to a global function
# the same thing as scipy.optimize.curve_fit
def func(x, A, B, x0, sigma):
    return A + B * np.tanh((x - x0) / sigma)
fit, _ = curve_fit(func, x, y)
yy_fit = func(xx, *fit)
fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(x, y, 'r.', label= 'Unsmoothed curve')
ax.plot(xx, yy_fit, 'b--', label=r"$f(x) = A + B \tanh\left(\frac{x-x_0} {\sigma}\right)$")
ax.plot(xx, yy_sg, 'k', label= "Smoothed curve")
plt.legend(loc='best')
I am getting: AttributeError: 'range' object has no attribute 'min'.
Savitzky-Golay is producing very strange values with a window length of 1000.
When I set the window to len(df) + 1 (so that it is odd), I get these data:

You're getting that error because of the following line:
x = range(1, len(df))
As the error tells you, a range object has no attribute min. However, numpy arrays do, so if you change that line to
x = np.arange(1, len(df))
then this error (at least) will disappear.
EDIT:
In order for the function to do what you want it to do, you should change it to x = np.arange(1, len(df)+1)
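Putting the fix together, here is a minimal end-to-end sketch of the corrected setup. It assumes the dataframe from the question; the linear interpolation kind and the window length of 101 are hypothetical choices (any odd window no larger than the interpolated grid works):
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d
from scipy.signal import savgol_filter

d = {'Price': [10, 15, 9, 21, 24, 23],
     'Animal': ['horse', 'duck', 'child', 'panda', 'lamb', 'gorilla']}
df = pd.DataFrame(d)

x = np.arange(1, len(df) + 1)        # numeric positions stand in for the product names
y = df['Price'].to_numpy()

xx = np.linspace(x.min(), x.max(), 1001)
itp = interp1d(x, y, kind='linear')  # interpolate onto a dense grid first

yy_sg = savgol_filter(itp(xx), window_length=101, polyorder=1)  # window must be odd

df['Smoothed'] = np.interp(x, xx, yy_sg)  # sample the smoothed curve back at the products
print(df)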

Related

Recover the time shift from numpy.correlate result in Python

This is not a duplicate question since other answers only explain how to plot the cross-correlation function and do not explain how you can get the time difference.
Given a sine signal and a shifted version of it, we should be able to get the time delay between them.
I have created a sine signal and shifted it by t_d = 0.05. The following is my code and its output:
import numpy as np
import matplotlib.pyplot as plt
fs = 1000
x = np.linspace(0, 1, fs)
f = 5
t_shift = 0.05
y = np.sin(2*np.pi*f*x)
y_shifted = np.sin(2*np.pi*f*(x-t_shift))
fig, ax = plt.subplots()
ax.plot(x, y, x, y_shifted)
plt.show()
By normalizing the signals and applying numpy.correlate, we get the following:
y_norm = (y-y.mean())/y.std()
y_shifted_norm = (y_shifted - y_shifted.mean())/y_shifted.std()
cc = np.correlate(y_norm, y_shifted_norm, 'full')
fig, ax = plt.subplots()
ax.plot(range(len(cc)), cc)
plt.show()
Question
From the indices of cross-correlation function, how can I get t_shift=0.05?
@Sepide. It seems to me as if you are trying to maximise the correlation between the signal y and its shifted version y_shifted. This might be accomplished using np.correlate(), but it seems nontrivial indeed to recover the time shift from that result. In the solution below I manually shift the time series and compute the correlation coefficient using np.corrcoef. As soon as this Pearson correlation coefficient equals 1, the two signals are aligned.
import numpy as np
import matplotlib.pyplot as plt
# Setting
fs = 1000
x = np.linspace(0, 1, fs)
f = 5
t_shift = 0.05
t_step = 1/fs
# Data
y = np.sin(2*np.pi*f*x)
y_shifted = np.sin(2*np.pi*f*(x-t_shift))
# Compute correlation
MaxTimeShift = 200
CorrelationList = np.empty((MaxTimeShift, 1))
CorrelationList[:] = np.nan
# Compute correlation for various shifts
for i in range(MaxTimeShift):
    CorrelationList[i] = np.corrcoef(y[0:801].T, y_shifted[i:(801 + i)].T)[0, 1]
# Plot 1
plt.figure(1)
plt.plot(x, y, x, y_shifted)
plt.show()
# Plot 2
plt.figure(2)
ShiftList = t_step*np.arange(MaxTimeShift)
plt.plot(ShiftList, CorrelationList)
plt.title("Correlation coefficient")
plt.show()
print("The time shift between the signals is: ", ShiftList[np.argmax(CorrelationList)])

How to plot a cumulative distribution function? [duplicate]

I have a disordered list named d that looks like:
[0.0000, 123.9877, 0.0000, 9870.9876, ...]
I simply want to plot a CDF graph based on this list using Matplotlib in Python, but I don't know whether there's a function I can use.
from bisect import bisect_left
import matplotlib.pyplot as plt

d = []
d_sorted = []
for line in fd.readlines():
    (addr, videoid, userag, usertp, timeinterval) = line.split()
    d.append(float(timeinterval))
d_sorted = sorted(d)

class discrete_cdf:
    def __init__(data):  # note: self is missing here, which causes the TypeError below
        self._data = data  # must be sorted
        self._data_len = float(len(data))
    def __call__(point):  # same problem here
        return (len(self._data[:bisect_left(self._data, point)]) /
                self._data_len)

cdf = discrete_cdf(d_sorted)
xvalues = range(0, max(d_sorted))
yvalues = [cdf(point) for point in xvalues]
plt.plot(xvalues, yvalues)
Now I am using this code, but the error message is:
Traceback (most recent call last):
File "hitratioparea_0117.py", line 43, in <module>
cdf = discrete_cdf(d_sorted)
TypeError: __init__() takes exactly 1 argument (2 given)
I know I'm late to the party, but there is a simpler way if you just want the CDF for your plot and not for future calculations:
plt.hist(put_data_here, density=True, cumulative=True, label='CDF',
         histtype='step', alpha=0.8, color='k')  # density=True replaces normed=True from older matplotlib
As an example,
plt.hist(dataset, bins=bins, density=True, cumulative=True, label='CDF DATA',
         histtype='step', alpha=0.55, color='purple')
# bins and (lognormal / normal) datasets are pre-defined
EDIT: This example from the matplotlib docs may be more helpful.
As mentioned, cumsum from numpy works well. Make sure that your data is a proper PDF (i.e. sums to one), otherwise the CDF won't end at unity as it should. Here is a minimal working example:
import numpy as np
import matplotlib.pyplot as plt
# Create some test data
dx = 0.01
X = np.arange(-2, 2, dx)
Y = np.exp(-X ** 2)
# Normalize the data to a proper PDF
Y /= (dx * Y).sum()
# Compute the CDF
CY = np.cumsum(Y * dx)
# Plot both
plt.plot(X, Y)
plt.plot(X, CY, 'r--')
plt.show()
The numpy function to compute cumulative sums, cumsum, can be useful here:
In [1]: from numpy import cumsum
In [2]: cumsum([.2, .2, .2, .2, .2])
Out[2]: array([ 0.2, 0.4, 0.6, 0.8, 1. ])
Nowadays, you can just use seaborn's kdeplot function with cumulative as True to generate a CDF.
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
X1 = np.arange(100)
X2 = (X1 ** 2) / 100
sns.kdeplot(data = X1, cumulative = True, label = "X1")
sns.kdeplot(data = X2, cumulative = True, label = "X2")
plt.legend()
plt.show()
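If you want the exact empirical CDF rather than a smoothed kernel estimate, newer seaborn versions (0.11+) also provide ecdfplot; a sketch with the same data:
sns.ecdfplot(data=X1, label="X1")
sns.ecdfplot(data=X2, label="X2")
plt.legend()
plt.show()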
For an arbitrary collection of values, x:
import numpy as np
import matplotlib.pyplot as plt

def cdf(x, plot=True, *args, **kwargs):
    x, y = sorted(x), np.arange(len(x)) / len(x)
    return plt.plot(x, y, *args, **kwargs) if plot else (x, y)
(If you're new to Python, the *args and **kwargs allow you to pass positional and named arguments without declaring and managing them explicitly.)
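For instance, a hypothetical call with random data:
data = np.random.randn(1000)
cdf(data)                       # plots the empirical CDF
xs, ys = cdf(data, plot=False)  # or just get the coordinates back
plt.show()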
What works best for me is the quantile function of pandas.
Say I have 71 participants, and each participant has a certain number of interruptions. I want to compute the CDF plot of the number of interruptions per participant. The goal is to be able to tell what percentage of participants have at least 30 interruptions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

step = 0.05
indices = np.arange(0, 1 + step, step)
num_interruptions_per_participant = [32,70,52,52,39,20,37,31,60,57,31,71,24,23,38,4,77,37,79,43,63,43,75,13,
                                     45,31,57,28,61,29,30,52,65,11,76,37,65,28,33,73,65,43,50,33,45,40,50,44,
                                     33,49,24,69,55,47,22,45,54,11,30,13,32,52,31,50,10,46,10,25,47,51,83]
CDF = pd.DataFrame({'dummy': num_interruptions_per_participant})['dummy'].quantile(indices)
plt.plot(CDF, indices, linewidth=9, label='#interruptions', color='blue')
According to the graph, almost 25% of the participants have fewer than 30 interruptions.
You can use this statistic for your further analysis. For instance, in my case I need at least 30 interruptions for each participant in order to meet the minimum sample size required for leave-one-subject-out evaluation. The CDF tells me that I have a problem with 25% of the participants.
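To get that single number directly (the fraction of participants with at least 30 interruptions), a one-line sketch with the same list:
frac = (np.array(num_interruptions_per_participant) >= 30).mean()
print(f"{frac:.0%} of participants have at least 30 interruptions")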
import matplotlib.pyplot as plt

X = sorted(data)  # 'data' is your collection of values
l = len(X)
Y = [float(1) / l]
for i in range(2, l + 1):
    Y.append(float(1) / l + Y[i - 2])
plt.plot(X, Y, color='blue', marker='o', label='xyz')  # the original used an undefined color variable c
I guess this would do; for the procedure, refer to http://www.youtube.com/watch?v=vcoCVVs0fRI

Plotting and modeling data with lmfit - Fit doesn't match data. What am I doing wrong?

I have some data I'm trying to model with lmfit's Model.
Specifically, I'm measuring superconducting resistors. I'm trying to fit the experimental data (resistance vs. temperature) to a model which incorporates the critical temperature Tc (material dependent), the resistance below Tc (nominally 0), and the resistance above Tc (structure dependent).
Here's a simplified version (with simulated data) of the code I'm using to plot my data, along with the output plot.
I'm not getting any errors but, as you can see, I'm also not getting a fit that matches my data.
What am I doing wrong? This is my first time using lmfit and Model, so I may be making a newbie mistake. I thought I was following the lmfit example but, as I said, I'm obviously doing something wrong.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lmfit import Model

def main():
    x = np.linspace(0, 12, 50)
    x_ser = pd.Series(x)  # Simulated temperature data
    y1 = [0] * 20
    y2 = [10] * 30
    y1_ser = pd.Series(y1)  # Simulated resistance data below Tc
    y2_ser = pd.Series(y2)  # Simulated resistance data above Tc
    # note: Series.append was removed in pandas 2.0; pd.concat([y1_ser, y2_ser], ignore_index=True) is the modern equivalent
    y_ser = y1_ser.append(y2_ser, ignore_index=True)
    xcrit_model = Model(data_equation)
    params = xcrit_model.make_params(y1_guess=0, y2_guess=12, xcrit_guess=9)
    print('params: {}'.format(params))
    result = xcrit_model.fit(y_ser, params, x=x_ser)
    print(result.fit_report())
    plt.plot(x_ser, y_ser, 'bo', label='simulated data')
    plt.plot(x_ser, result.init_fit, 'k.', label='initial fit')
    plt.plot(x_ser, result.best_fit, 'r:', label='best fit')
    plt.legend()
    plt.show()

def data_equation(x, y1_guess, y2_guess, xcrit_guess):
    x_lt_xcrit = x[x < xcrit_guess]
    x_ge_xcrit = x[x >= xcrit_guess]
    y1 = [y1_guess] * x_lt_xcrit.size
    y1_ser = pd.Series(data=y1)
    y2 = [y2_guess] * x_ge_xcrit.size
    y2_ser = pd.Series(data=y2)
    y = y1_ser.append(y2_ser, ignore_index=True)
    return y

if __name__ == '__main__':
    main()
lmfit (and basically all similar solvers) works with continuous variables: it investigates how they alter the result by making tiny changes in the parameter values and seeing how that affects the fit.
But your xcrit_guess parameter is used only as a discrete variable: if its value changes from 9.0000 to 9.00001, the fit will not change at all.
So, basically, don't do:
x_lt_xcrit = x[x < xcrit_guess]
x_ge_xcrit = x[x >= xcrit_guess]
Instead, you should use a smoother sigmoidal step function. In fact, lmfit has one of these built-in. So you might try something like this (note, there is no point in converting numpy.arrays to pandas.Series - the code will just turn these back to numpy arrays anyway):
import numpy as np
from lmfit.models import StepModel
import matplotlib.pyplot as plt
x = np.linspace(0, 12, 50)
y = 9.5*np.ones(len(x))
y[:26] = 0.0
y = y + np.random.normal(size=len(y), scale=0.0002)
xcrit_model = StepModel(form='erf')
params = xcrit_model.make_params(amplitude=4, center=5, sigma=1)
result = xcrit_model.fit(y, params, x=x)
print(result.fit_report())
plt.plot(x, y, 'bo', label='simulated data')
plt.plot(x, result.init_fit, 'k', label='initial fit')
plt.plot(x, result.best_fit, 'r:', label='best fit')
plt.legend()
plt.show()
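A usage note: assuming the step's center corresponds to the physical Tc in this model, the critical temperature estimate can be read off the fitted parameters after the fit:
tc_estimate = result.params['center'].value
print('Estimated Tc:', tc_estimate)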

Identifying Outliers with Quantile Regression and Python

I am trying to identify outliers in a dataset using the 5th and 95th percentiles of a regression line, so I'm using quantile regression in Python with statsmodels, matplotlib, and pandas. Based on this answer from blokeley, I can create a scatterplot of my data and show the best-fit line and the lines for the 5th and 95th percentiles based on quantile regression. But how do I identify the points that fall above and below those lines and then save them out to a pandas dataframe?
My data looks like this (there are 95 values in total):
   Month  Year         LST      NDVI
0   June  1984  310.550975  0.344335
1   June  1985  310.495331  0.320504
2   June  1986  306.820900  0.369494
3   June  1987  308.945602  0.369946
4   June  1988  308.694022  0.318632
and the script I have so far is this:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

excel = my_excel  # path to the Excel file
df = pd.read_excel(excel)
df.head()

model = smf.quantreg('NDVI ~ LST', df)
quantiles = [0.05, 0.95]
fits = [model.fit(q=q) for q in quantiles]
figure, axes = plt.subplots()
x = df['LST']
y = df['NDVI']
axes.scatter(x, df['NDVI'], c='green', alpha=0.3, label='data point')
fit = np.polyfit(x, y, deg=1)
axes.plot(x, fit[0] * x + fit[1], color='grey', label='best fit')
_x = np.linspace(x.min(), x.max())
for index, quantile in enumerate(quantiles):
    _y = fits[index].params['LST'] * _x + fits[index].params['Intercept']
    axes.plot(_x, _y, label=quantile)
title = 'LST/NDVI Jun-Aug'
plt.title(title)
axes.legend()
axes.set_xticks(np.arange(298, 320, 4))
axes.set_yticks(np.arange(0.25, 0.5, .05))
axes.set_xlabel('LST')
axes.set_ylabel('NDVI')
And the chart I get out of that is this:
So I can definitely see data points above the 95th-percentile line and below the 5th-percentile line that I would classify as outliers, but I want to identify those in my original dataframe and maybe plot them on the chart or highlight them in some way to mark them as "outliers".
I have been searching for a method but coming up empty and could use some help.
You need to figure out whether certain points are above the 95% quantile line or below the 5% quantile line. You can do this using the cross product; see this answer for a straightforward implementation.
In your example, you would need to combine the points above and below the quantile lines, possibly in a mask.
Here's an example:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
df = pd.DataFrame(np.random.normal(0, 1, (100, 2)))
df.columns = ['LST', 'NDVI']
model = smf.quantreg('NDVI ~ LST', df)
quantiles = [0.05, 0.95]
fits = [model.fit(q=q) for q in quantiles]
figure, axes = plt.subplots()
x = df['LST']
y = df['NDVI']
fit = np.polyfit(x, y, deg=1)
_x = np.linspace(x.min(), x.max(), num=len(y))
# fit lines
_y_005 = fits[0].params['LST'] * _x + fits[0].params['Intercept']
_y_095 = fits[1].params['LST'] * _x + fits[1].params['Intercept']
# start and end coordinates of fit lines
p = np.column_stack((x, y))
a = np.array([_x[0], _y_005[0]]) #first point of 0.05 quantile fit line
b = np.array([_x[-1], _y_005[-1]]) #last point of 0.05 quantile fit line
a_ = np.array([_x[0], _y_095[0]])
b_ = np.array([_x[-1], _y_095[-1]])
#mask based on if coordinates are above 0.95 or below 0.05 quantile fitlines using cross product
mask = lambda p, a, b, a_, b_: (np.cross(p-a, b-a) > 0) | (np.cross(p-a_, b_-a_) < 0)
mask = mask(p, a, b, a_, b_)
axes.scatter(x[mask], df['NDVI'][mask], facecolor='r', edgecolor='none', alpha=0.3, label='data point')
axes.scatter(x[~mask], df['NDVI'][~mask], facecolor='g', edgecolor='none', alpha=0.3, label='data point')
axes.plot(x, fit[0] * x + fit[1], label='best fit', c='lightgrey')
axes.plot(_x, _y_095, label=quantiles[1], c='orange')
axes.plot(_x, _y_005, label=quantiles[0], c='lightblue')
axes.legend()
axes.set_xlabel('LST')
axes.set_ylabel('NDVI')
plt.show()
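To answer the "save them out to a pandas dataframe" part of the question: once the boolean mask exists, selecting the outlier rows is a one-liner. A minimal sketch, reusing df and mask from the example above:
outliers = df[mask]  # rows above the 0.95 line or below the 0.05 line
inliers = df[~mask]
print(outliers)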

2d fft numpy/python confusion

I have data in the form x-y-z and want to create a power spectrum along x-y. Here is a basic example I am posting to check where I might be going wrong with my actual data:
import numpy as np
from matplotlib import pyplot as plt
fq = 10; N = 20
x = np.linspace(0,8,N); y = x
space = x[1] -x[0]
xx, yy = np.meshgrid(x,y)
fnc = np.sin(2*np.pi*fq*xx)
ft = np.fft.fft2(fnc)
ft = np.fft.fftshift(ft)
freq_x = np.fft.fftfreq(ft.shape[0], d=space)
freq_y = np.fft.fftfreq(ft.shape[1], d=space)
plt.imshow(
    abs(ft),
    aspect='auto',
    extent=(freq_x.min(), freq_x.max(), freq_y.min(), freq_y.max())
)
plt.figure()
plt.imshow(fnc)
This results in the following function and frequency figures, with the incorrect frequency. Thanks.
One of your problems is that matplotlib's imshow uses a different coordinate system to what you expect. Provide an origin='lower' argument, and the peaks now appear at y=0, as expected.
Another problem is that fftfreq needs to be told your timestep, which in your case is 8 / (N - 1).
import numpy as np
from matplotlib import pyplot as plt
fq = 10; N = 20
x = np.linspace(0,8,N); y = x
xx, yy = np.meshgrid(x,y)
fnc = np.sin(2*np.pi*fq*xx)
ft = np.fft.fft2(fnc)
ft = np.fft.fftshift(ft)
freq_x = np.fft.fftfreq(ft.shape[0], d=8 / (N - 1)) # this takes an argument for the timestep
freq_y = np.fft.fftfreq(ft.shape[1], d=8 / (N - 1))
plt.imshow(
    abs(ft),
    aspect='auto',
    extent=(freq_x.min(), freq_x.max(), freq_y.min(), freq_y.max()),
    origin='lower',           # this fixes your problem
    interpolation='nearest',  # this makes it easier to see what is happening
    cmap='viridis'            # let's use a better color map too
)
plt.grid()
plt.show()
You may say "but the frequency is 10, not 0.5!" However, if you want to resolve a frequency of 10, you need to sample much faster than once every 8/19 units: Nyquist's theorem says the sampling rate must exceed 20 samples per unit to have any hope at all.
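To make the arithmetic concrete, a quick sketch of the sampling and Nyquist rates implied by the grid used above:
N = 20
dx = 8 / (N - 1)        # sample spacing used above, about 0.421
fs = 1 / dx             # sampling rate, about 2.375 samples per unit
nyquist = fs / 2        # highest resolvable frequency, about 1.19
print(dx, fs, nyquist)  # a frequency of 10 aliases badly at this rate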
