How to perform sigma clipping on data set?

How to perform sigma clipping on data set? - python

*Context: I am looking at the change of velocity of an object periodically, where the period is 1.846834days and I am expecting a sinusoidal fit to my set of data.
Suppose I have a set of data which look like this:
#days vel error
5725.782701 0.195802 0.036312
5729.755560 -0.006370 0.041495
5730.765352 -0.071253 0.030760
5745.710214 0.092082 0.036094
5745.932853 0.238030 0.040097
5749.705307 0.196649 0.037140
5741.682112 0.186664 0.028075
5742.681765 -0.262104 0.038049
6186.729146 -0.243796 0.031687
6187.742803 -0.009394 0.054541
6190.717317 -0.001821 0.033684
6192.716356 0.117557 0.037807
6196.704736 0.093935 0.032336
6203.683879 0.076051 0.033085
6204.679898 -0.301463 0.033483
6208.659585 -0.409340 0.036002
6209.669701 0.180807 0.041666
There are only one or two data points observed in each cycle, so what I wanted to do is to phase fold my data, plot them and fit my data using chi-square minimization. This is what I have so far:
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize as op
import csv
with open('testdata.dat') as fin:
with open('testdata_out.dat', 'w') as fout: #write output in a new .dat file
o=csv.writer(fout)
for line in fin:
o.writerow(line.split())
#create a 2d array and give names to columns
data = np.genfromtxt('testdata_out.dat',delimiter=',',dtype=('f4,f4,f4'))
data.dtype.names = ('bjd','rv','err')
# parameters
x = data['bjd']
y = data['rv']
e = data['err']
P = 1.846834 #orbital period
T = 3763.85112 #time from ephemeris
q = (data['bjd']-T)%P
#print(q)
def niceplot():
plt.xlabel('BJD')
plt.ylabel('RV (km/s)')
plt.tight_layout()
def model(q,A,V):
return A*np.cos(np.multiply(np.divide((q),P),np.multiply(2,np.pi))-262) + V
def residuals((A,V),q,y,e): #for least square
return (y-model(q,A,V)) / e
def chisq((A,V),q,y,e):
return ((residuals((A,V),q,y,e))**2).sum()
result_min = op.minimize(chisq, (0.3,0), args=(q,y,e))
print(result_min)
A,V = result_min.x
xf = np.arange(q.min(), q.max(), 0.1)
yf = model(xf,A,V)
print(xf)
print(yf)
plt.errorbar(q, y, e, fmt='ok')
plt.plot(xf,yf)
niceplot()
plt.show()
In my plot, the shape of the sine curve seems to fit my data, but it does not go through all the data points.
My question is: How can I perform sigma clipping such that I can get a better fit for my set of data? I know the curve_fit module in Scipy may do the job. But I would like to know if it is possible to perform sigma clipping while using minimize module??
Many thanks in advance!

Related

How to determine multiple time-periods present in a waveform using findpeaks() in Python?

I'm trying to determine the different time-periods present in a waveform (shown below) using findpeaks().
The dataset for which I'm finding the peaks and eventually time-period, is an Autocorrelation Plot (though it could have been any given plot). I generated it using the following code (which you can use as it is for your reference):
import json
import sys, os
import numpy as np
import pandas as pd
import glob
import pickle
from statsmodels.tsa.stattools import adfuller, acf, pacf
from scipy.signal import find_peaks, square
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
#GENERATION OF A FUNCTION WITH DUAL SEASONALITY & NOISE
def white_noise(mu, sigma, num_pts):
""" Function to generate Gaussian Normal Noise
Args:
sigma: std value
num_pts: no of points
mu: mean value
Returns:
generated Gaussian Normal Noise
"""
noise = np.random.normal(mu, sigma, num_pts)
return noise
def signal_line_plot(input_signal: pd.Series, title: str = "", y_label: str = "Signal"):
""" Function to plot a time series signal
Args:
input_signal: time series signal that you want to plot
title: title on plot
y_label: label of the signal being plotted
Returns:
signal plot
"""
plt.plot(input_signal)
plt.title(title)
plt.ylabel(y_label)
plt.show()
t_week = np.linspace(1,480, 480)
t_weekend=np.linspace(1,192,192)
T=96 #Time Period
x_weekday = 10*square(2*np.pi*t_week/T, duty=0.7)+10 + white_noise(0, 1,480)
x_weekend = 2*square(2*np.pi*t_weekend/T, duty=0.7)+2 + white_noise(0,1,192)
x_daily_weekly = np.concatenate((x_weekday, x_weekend))
x_daily_weekly_long = np.concatenate((x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly))
signal_line_plot(x_daily_weekly_long)
signal_line_plot(x_daily_weekly_long[0:1000])
#x_daily_weekly_long is the final waveform on which I'm carrying out Autocorrelation
#PERFORMING AUTOCORRELATION:
import scipy.signal as signal
autocorr = signal.correlate(x_daily_weekly_long, x_daily_weekly_long, mode = "same")
lags = signal.correlation_lags(len(x_daily_weekly_long), len(x_daily_weekly_long), mode = "same")
#VISUALIZATION:
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)
My approach to determining the time-period is as follows:
#1) Finding peak indices
indices = find_peaks(autocorr.flatten())[0]
#2) Determining Period
diff = [(indices[i - 1] - x) for i, x in enumerate(indices)][1:]
short_period = abs(np.mean(diff))/fs
short_period #is the distance between all the individual peaks.
In this way, I'm able to only find the time-period (distance) between individual peaks. But my need is to determine the time-period for all the multiple periods that the data may contain.
Can anyone please help, how to determine all possible time-periods present in the data?

How to De-Trend a waveform which is piece-wise linear or non-linear?

I'm trying to remove the trend present in the waveform which looks like the following:
For doing so, I use scipy.signal.detrend() as follows:
autocorr = scipy.signal.detrend(autocorr)
But I don't see any significant flattening in trend. I get the following:
My objective is to have the trend completely eliminated from the waveform. And I need to also generalize it so that it can detrend any kind of waveform - be it linear, piece-wise linear, polynomial, etc.
Can you please suggest a way to do the same?
Note: In order to replicate the above waveform, you can simply run the following code that I used to generate it:
#Loading Libraries
import warnings
warnings.filterwarnings("ignore")
import json
import sys, os
import numpy as np
import pandas as pd
import glob
import pickle
from statsmodels.tsa.stattools import adfuller, acf, pacf
from scipy.signal import find_peaks, square
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt
#Generating a function with Dual Seasonality:
def white_noise(mu, sigma, num_pts):
""" Function to generate Gaussian Normal Noise
Args:
sigma: std value
num_pts: no of points
mu: mean value
Returns:
generated Gaussian Normal Noise
"""
noise = np.random.normal(mu, sigma, num_pts)
return noise
def signal_line_plot(input_signal: pd.Series, title: str = "", y_label: str = "Signal"):
""" Function to plot a time series signal
Args:
input_signal: time series signal that you want to plot
title: title on plot
y_label: label of the signal being plotted
Returns:
signal plot
"""
plt.plot(input_signal)
plt.title(title)
plt.ylabel(y_label)
plt.show()
# Square with two periodicities of daily and weekly. With #15min sampling frequency it means 4*24=96 samples and 4*24*7=672
t_week = np.linspace(1,480, 480)
t_weekend=np.linspace(1,192,192)
T=96 #Time Period
x_weekday = 10*square(2*np.pi*t_week/T, duty=0.7)+10 + white_noise(0, 1,480)
x_weekend = 2*square(2*np.pi*t_weekend/T, duty=0.7)+2 + white_noise(0,1,192)
x_daily_weekly = np.concatenate((x_weekday, x_weekend))
x_daily_weekly_long = np.concatenate((x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly))
signal_line_plot(x_daily_weekly_long)
signal_line_plot(x_daily_weekly_long[0:1000])
#Finding Autocorrelation & Lags for the signal [WHICH THE FINAL PARAMETERS WHICH ARE TO BE PLOTTED]:
#Determining Autocorrelation & Lag values
import scipy.signal as signal
autocorr = signal.correlate(x_daily_weekly_long, x_daily_weekly_long, mode="same")
#Normalize the autocorr values (such that the hightest peak value is at 1)
autocorr = (autocorr-min(autocorr))/(max(autocorr)-min(autocorr))
lags = signal.correlation_lags(len(x_daily_weekly_long), len(x_daily_weekly_long), mode = "same")
#Visualization
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)
#DETRENDING:
autocorr = scipy.signal.detrend(autocorr)
#Visualization
f = plt.figure()
f.set_figwidth(40)
f.set_figheight(10)
plt.plot(lags, autocorr)

Since it's an auto-correlation, it will always be even; so detrending with a breakpoint at lag=0 should get you part of the way there.
An alternative way to detrend is to use a high-pass filter; you could do this in two ways. What will be tricky is deciding what the cut-off frequency should be.
Here's a possible way to do this:
#Loading Libraries
import numpy as np
from scipy import signal
import matplotlib.pyplot as plt
#Generating a function with Dual Seasonality:
def white_noise(mu, sigma, num_pts):
""" Function to generate Gaussian Normal Noise
Args:
sigma: std value
num_pts: no of points
mu: mean value
Returns:
generated Gaussian Normal Noise
"""
noise = np.random.normal(mu, sigma, num_pts)
return noise
# High-pass filter via discrete Fourier transform
# Drop all components from 0th to dropcomponent-th
def dft_highpass(x, dropcomponent):
fx = np.fft.rfft(x)
fx[:dropcomponent] = 0
return np.fft.irfft(fx)
# Square with two periodicities of daily and weekly. With #15min sampling frequency it means 4*24=96 samples and 4*24*7=672
t_week = np.linspace(1,480, 480)
t_weekend=np.linspace(1,192,192)
T=96 #Time Period
x_weekday = 10*signal.square(2*np.pi*t_week/T, duty=0.7)+10 + white_noise(0, 1,480)
x_weekend = 2*signal.square(2*np.pi*t_weekend/T, duty=0.7)+2 + white_noise(0,1,192)
x_daily_weekly = np.concatenate((x_weekday, x_weekend))
x_daily_weekly_long = np.concatenate((x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly,x_daily_weekly))
#Finding Autocorrelation & Lags for the signal [WHICH THE FINAL PARAMETERS WHICH ARE TO BE PLOTTED]:
#Determining Autocorrelation & Lag values
autocorr = signal.correlate(x_daily_weekly_long, x_daily_weekly_long, mode="same")
#Normalize the autocorr values (such that the hightest peak value is at 1)
autocorr = (autocorr-min(autocorr))/(max(autocorr)-min(autocorr))
lags = signal.correlation_lags(len(x_daily_weekly_long), len(x_daily_weekly_long), mode = "same")
# detrend w/ breakpoints
dautocorr = signal.detrend(autocorr, bp=len(lags)//2)
# detrend w/ high-pass filter
# use `filtfilt` to get zero-phase
b, a = signal.butter(1, 1e-3, 'high')
fautocorr = signal.filtfilt(b, a, autocorr)
# detrend with DFT HPF
rautocorr = dft_highpass(autocorr, len(autocorr) // 1000)
#Visualization
fig, ax = plt.subplots(3)
for i in range(3):
ax[i].plot(lags, autocorr, label='orig')
ax[0].plot(lags, dautocorr, label='detrend w/ bp')
ax[1].plot(lags, fautocorr, label='HPF')
ax[2].plot(lags, rautocorr, label='DFT')
for i in range(3):
ax[i].legend()
ax[i].set_ylabel('autocorr')
ax[-1].set_xlabel('lags')
giving

Gaussian Mixture Model with discrete data

I have 136 numbers which have an overlapping distribution of 8 Gaussian distributions. I want to find it's means, and variances with each Gaussian distribution! Can you find any mistakes with my code?
file = open("1.txt",'r') #data is in 1.txt like 0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194...
y=[int (i) for i in list((file.read()).split(','))] # I want to make list which element is above data
x=list(range(1,len(y)+1)) # it is x values
z=list(zip(x,y)) # z elements consist as (1, 0), (2, 0), ...
Therefore, through the above process, for the 136 points (x,y) on the xy plane having the first given data as y values, a list z using this as an element was obtained.
Now I want to obtain each Gaussian distribution's mean, variance. At this time, the basic assumption is that the given data consists of overlapping 8 Gaussian distributions.
import numpy as np
from sklearn.mixture import GaussianMixture
data = np.array(z).reshape(-1,1)
model = GaussianMixture(n_components=8).fit(data)
print(model.means_)
file.close()
Actually, I don't know how to make it's code to print 8 means and variances... Anyone can help me?

You can use this, I have made a sample code for your visualizations -
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy
import matplotlib.pyplot as plt
%matplotlib inline
#Sample data
x = [0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194]
num_components = 2
#Fit a model onto the data
data = np.array(x).reshape(-1,1)
model = GaussianMixture(n_components=num_components).fit(data)
#Get list of means and variances
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))
#Plotting
extend_window = 50 #this is for zooming into or out of the graph, higher it is , more zoom out
x_values = np.arange(data.min()-extend_window, data.max()+extend_window, 0.1) #For plotting smooth graphs
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize = 10.0, marker='o') #plot the data on x axis
#plot the different distributions (in this case 2 of them)
for i in range(num_components):
y_values = scipy.stats.norm(mu[i], sd[i])
plt.plot(x_values, y_values.pdf(x_values))

How to validate the downsampling is as intended

how to validate whether the down sampled output is correct. For example, I had make some example, however, I am not sure whether the output is correct or not?
Any idea on the validation
Code
import numpy as np
import matplotlib.pyplot as plt # For ploting
from scipy import signal
import mne
fs = 100 # sample rate
rsample=50 # downsample frequency
fTwo=400 # frequency of the signal
x = np.arange(fs)
y = [ np.sin(2*np.pi*fTwo * (i/fs)) for i in x]
f_res = signal.resample(y, rsample)
xnew = np.linspace(0, 100, f_res.size, endpoint=False)
#
# ##############################
#
plt.figure(1)
plt.subplot(211)
plt.stem(x, y)
plt.subplot(212)
plt.stem(xnew, f_res, 'r')
plt.show()

Plotting the data is a good first take at a verification. Here I made regular plot with the points connected by lines. The lines are useful since they give a guide for where you expect the down-sampled data to lie, and also emphasize what the down-sampled data is missing. (It would also work to only show lines for the original data, but lines, as in a stem plot, are too confusing, imho.)
import numpy as np
import matplotlib.pyplot as plt # For ploting
from scipy import signal
fs = 100 # sample rate
rsample=43 # downsample frequency
fTwo=13 # frequency of the signal
x = np.arange(fs, dtype=float)
y = np.sin(2*np.pi*fTwo * (x/fs))
print y
f_res = signal.resample(y, rsample)
xnew = np.linspace(0, 100, f_res.size, endpoint=False)
#
# ##############################
#
plt.figure()
plt.plot(x, y, 'o')
plt.plot(xnew, f_res, 'or')
plt.show()
A few notes:
If you're trying to make a general algorithm, use non-rounded numbers, otherwise you could easily introduce bugs that don't show up when things are even multiples. Similarly, if you need to zoom in to verify, go to a few random places, not, for example, only the start.
Note that I changed fTwo to be significantly less than the number of samples. Somehow, you need at least more than one data point per oscillation if you want to make sense of it.
I also remove the loop for calculating y: in general, you should try to vectorize calculations when using numpy.

The spectrum of the resampled signal should have a tone at the same frequency as the input signal just in a smaller nyquist bandwidth.
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
import scipy.fftpack as fft
fs = 100 # sample rate
rsample=50 # downsample frequency
fTwo=10 # frequency of the signal
n = np.arange(1024)
y = np.sin(2*np.pi*fTwo/fs*n)
y_res = signal.resample(y, len(n)/2)
Y = fft.fftshift(fft.fft(y))
f = -fs*np.arange(-512, 512)/1024
Y_res = fft.fftshift(fft.fft(y_res, 1024))
f_res = -fs/2*np.arange(-512, 512)/1024
plt.figure(1)
plt.subplot(211)
plt.stem(f, abs(Y))
plt.subplot(212)
plt.stem(f_res, abs(Y_res))
plt.show()
The tone is still at 10.

IF you down sample a signal both signals will still have the exact same value and a given time , so just loop through "time" and check that the values are the same. In your case you go from a sample rate of 100 to 50. Assuming you have 1 seconds worth of data from building your x from fs, then just loop through t = 0 to t=1 in 1/50'th increments and make sure that Yd(t) = Ys(t) where Yd d is the down sampled f and Ys is the original sampled frequency. Or to say it simply Yd(n) = Ys(2n) for n = 1,2,3,...n=total_samples-1.

Python: two normal distribution

I have two data sets where two values where measured. I am interested in the difference between the value and the standard deviation of the difference. I made a histogram which I would like to fit two normal distributions. To calculate the difference between the maxima. I also would like to evaluate the effect that in on data set I have much less data on one value. I've already looked at this link but it is not really what I need:
Python: finding the intersection point of two gaussian curves
for ii in range(2,8):
# Kanal = ii - 1
file = filepath + '\Mappe1.txt'
data = np.loadtxt(file, delimiter='\t', skiprows=1)
data = data[:,ii]
plt.hist(data,bins=100)
plt.xlabel("bins")
plt.ylabel("Counts")
plt.tight_layout()
plt.grid()
plt.figure()
plt.show()

Quick and dirty fitting can be readily achieved using scipy:
from scipy.optimize import curve_fit #non linear curve fitting tool
from matplotlib import pyplot as plt
def func2fit(x1,x2,m_1,m_2,std_1,std_2,height1, height2): #define a simple gauss curve
return height1*exp(-(x1-m_1)**2/2/std_1**2)+height2*exp(-(x2-m_2)**2/2/std_2**2)
init_guess=(-.3,.3,.5,.5,3000,3000)
#contains the initial guesses for the parameters (m_1, m_2, std_1, std_2, height1, height2) using your first figure
#do the fitting
fit_pars, pcov =curve_fit(func2fit,xdata,ydata,init_guess)
#fit_pars contains the mean, the heights and the SD values, pcov contains the estimated covariance of these parameters
plt.plot(xdata,func2fit(xdata,*fit_pars),label='fit') #plot the fit
For further reference consult the scipy manual page:
curve_fit

Assuming that the two samples are independent there is no need to handle this problem using curve fitting. It's basic statistics. Here's some code that does the calculations required, with the source attributed in a comment.
## adapted from http://onlinestatbook.com/2/estimation/difference_means.html
from random import gauss
from numpy import sqrt
sample_1 = [ gauss(0,1) for _ in range(10) ]
sample_2 = [ gauss(1,.5) for _ in range(20) ]
n_1 = len(sample_1)
n_2 = len(sample_2)
mean_1 = sum(sample_1)/n_1
mean_2 = sum(sample_2)/n_2
SSE = sum([(_-mean_1)**2 for _ in sample_1]) + sum([(_-mean_2)**2 for _ in sample_2])
df = (n_1-1) + (n_2-1)
MSE = SSE/df
n_h = 2 / ( 1/n_1 + 1/n_2 )
s_mean_diff = sqrt( 2* MSE / n_h )
print ( 'difference between means', abs(n_1-n_2))
print ( 'std dev of this difference', s_mean_diff )

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to perform sigma clipping on data set? - python

Related

How to determine multiple time-periods present in a waveform using findpeaks() in Python?

How to De-Trend a waveform which is piece-wise linear or non-linear?

Gaussian Mixture Model with discrete data

How to validate the downsampling is as intended

Python: two normal distribution

Categories

Resources