Genextreme fit not working for some datasets

Genextreme fit not working for some datasets - python

I'm trying to fit a GEV distribution to temperature data to help identify extreme values. I have data sets for different regions - for some regions the fit works fine but for others it breaks down. It appears that it is setting the location parameter close to the maximum of the distribution range. All data sets are large, of the same size, complete and have no particularly strange values.
Could you please suggest what might be happening or how I can investigate the genextreme function process to work out what the problem is?
Here's the relevant bits of code (values are read in from NetCDF without any problem):
import pandas as pd
import numpy as np
import netCDF4 as nc
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import genextreme as gev
# calculate GEV fit
fit = gev.fit(season_temp)
# GEV parameters from fit
c, loc, scale = fit
fit_mean= loc
min_extreme,max_extreme = gev.interval(0.99,c,loc,scale)
# evenly spread x axis values for pdf plot
x = np.linspace(min(season_temp),max(season_temp),200)
# plot distribution
fig,ax = plt.subplots(1, 1)
plt.plot(x, gev.pdf(x, *fit))
plt.hist(season_temp,30,normed=True,alpha=0.3)
And here are two examples of outputs from different regions, successful and not:
Successful fit
Unsuccessful fit
The successfully fitted distribution has mean location parameter of 1.066 compared to data mean of 2.395. The one that failed has calculated a location parameter of 12.202 compared to data mean of 2.138.
Thanks in advance for your help!

Related

When to use line=r and line=45 in qqplot

Can anyone tell me when do we use line='r'??
sm.qqplot(df['Delivery_time'], line='r')
and when do we use line=45?
sm.qqplot(df['Delivery_time'], line='45')

According to the documentation stats.model ,line can be {None, “45”, “s”, “r”, “q”}
“45” - 45-degree line
“r” - A regression line is fit
Let me add codes and output of the the two cases for better understanding.
Case 1:, if we are comparing the distribution of a sample of data to a theoretical normal distribution, we might use line='r' to plot a regression line that shows how well the sample data fits the normal distribution.
import numpy as np
import statsmodels.api as sm
import pylab as py
data_points = np.random.normal(0, 1, 100)
sm.qqplot(data_points, line ='r')
py.show()
Case 2: If the line='45' would refer to a 45-degree line, which is a line that has a slope of 1 and passes through the origin. This line is used as a reference for comparing the distribution of the data being plotted to a uniform distribution.
import numpy as np
import statsmodels.api as sm
import pylab as py
data_points = np.random.normal(0, 1, 100)
sm.qqplot(data_points, line ='45')
py.show()
For example, in the above case we are compare the distribution of a sample of data to a uniform distribution, we might use line='45' to plot a reference line that shows how well the sample data fits the uniform distribution.
Note :The line parameter can be used to customize the reference line on the Q-Q plot to better compare the distribution of the data being plotted to a theoretical distribution.

Clustering data- Poor results, feature extraction

I have measured data (vibrations) from a wind turbine running under different operating conditions. My dataset consists of operating conditions as well as measurement features I have extracted from the measured data.
Dataset shape: (423, 15). Each of the 423 data points represent a measurement on a day, chronologically over 423 days.
I now want to cluster the data to see if there is any change in the measurements. Specifically, I want to examine if the vibrations change over time (which could indicate a fault in the turbine gearbox).
What I have currently done:
Scale the data between 0,1 ->
Perform PCA (reduce from 15 to 5)
Cluster using db scan since I do not know the number of clusters. I am using this code to find the optimal epsilon (eps) in dbscan:
# optimal Epsilon (distance):
X_pca = principalDf.values
neigh = NearestNeighbors(n_neighbors=2)
nbrs = neigh.fit(X_pca)
distances, indices = nbrs.kneighbors(X_pca)
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.plot(distances,color="#0F215A")
plt.grid(True)
The result so far are not giving any clear indication that the data is changing over time:
Of course, the case could be that the data is not changing over these data points. Howver, what are some other things I could try? Kind of an open question, but I am running out of ideas.

First of all, with KMeans, if the dataset is not naturally partitioned, you may end up with some very weird results! As KMeans is unsupervised, you basically dump in all kinds of numeric variables, set the target variable, and let the machine do the lift for you. Here is a simple example using the canonical Iris dataset. You can EASILY modify this to fit your specific dataset. Just change the 'X' variables (all but the target variable) and 'y' variable (just one target variable). Try that and feedback.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import urllib.request
import random
# seaborn is a layer on top of matplotlib which has additional visualizations -
# just importing it changes the look of the standard matplotlib plots.
# the current version also shows some warnings which we'll disable.
import seaborn as sns
sns.set(style="white", color_codes=True)
import warnings
warnings.filterwarnings("ignore")
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, 0:4] # we only take the first two features.
y = iris.target
from sklearn import preprocessing
scaler = preprocessing.StandardScaler()
scaler.fit(X)
X_scaled_array = scaler.transform(X)
X_scaled = pd.DataFrame(X_scaled_array)
X_scaled.sample(5)
# try clustering on the 4d data and see if can reproduce the actual clusters.
# ie imagine we don't have the species labels on this data and wanted to
# divide the flowers into species. could set an arbitrary number of clusters
# and try dividing them up into similar clusters.
# we happen to know there are 3 species, so let's find 3 species and see
# if the predictions for each point matches the label in y.
from sklearn.cluster import KMeans
nclusters = 3 # this is the k in kmeans
seed = 0
km = KMeans(n_clusters=nclusters, random_state=seed)
km.fit(X_scaled)
# predict the cluster for each data point
y_cluster_kmeans = km.predict(X_scaled)
y_cluster_kmeans
# use seaborn to make scatter plot showing species for each sample
sns.FacetGrid(data, hue="species", size=4) \
.map(plt.scatter, "sepal_length", "sepal_width") \
.add_legend();

Fit a cosine squared function to a set of data

I took a set of data from an experimental set up and am trying to model an interference pattern for a diffraction grating. Currently my function does not overlay itself well onto the data and was wondering whether there is a way to improve the fitting.
[EDIT] From the comments I have received I am now trying to fit the sin(x)**2 function onto the graph, where in this case amplitude is a position of another/several functions, or fit a superposition of sine waves. Additionally whilst trying to overlay a Gaussian distribution over the peaks of the data points see image 2
import scipy as sp
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit
x=sp.array([5.5,5.75,5.95,6,6.1,6.2,6.2,6.4,6.5,6.75,7,7.05,7.1,7.15,7.2,
7.3,7.35,7.5,7.55,7.7,8,8.2,8.3,8.35,8.4,8.6,8.7])
V=sp.array([0.048,0.1,0.617,0.557,0.258,0.112,0.098,0.08,0.075,0.093,0.253,
0.486,0.98,1.391,1.58,0.964,0.71,0.166,0.152,0.11,0.121,0.256,0.591,1.186,
1.552,0.787,0.283])
def my_sin(t,peroid,amplitude,phase):
return (amplitude*(sp.sin(t*2*sp.pi/peroid+phase)))**2
guess_peroid= 2
guess_amplitude = 0.8
guess_phase = (sp.pi)/2
p0 =[guess_peroid, guess_amplitude, guess_phase]
fit = curve_fit(my_sin,x, V, p0=p0)
print ('The fit paramters are:', fit[0])
x1 = sp.linspace(5.5,8.7,100000)
data_fit = my_sin(x1,*fit[0])
print(x1)
plt.xlabel('Distance Between Minima(m)')
plt.ylabel('Voltage')
plt.plot(x1,data_fit)
plt.errorbar(x,V,fmt='x')
plt.show()
Here is a picture of the output of my code
I would like to replicate this kind of fitting for the three Maxima of Light Intensity shown in my data set
Any help would be greatly appreciated.
An Image of the formula i am using

applying "tighter" bounds in scipy.optimize.curve_fit

I have a dataset that I am trying to fit with parameters (a,b,c,d) that are within +/- 5% of the true fitting parameters. However, when I do this with scipy.optimize.curve_fit I get the error "ValueError: 'x0' is infeasible." inside the least squares package.
If I relax my bounds, then optimize.curve_fit seems to work as desired. I have also noticed that my parameters that are larger seem to be more flexible in applying bounds (i.e. I can get these to work with tighter constraints, but never below 20%). The following code is an MWE and has two sets of bounds (variable B), one that works and one that returns an error.
# %% import modules
import IPython as IP
IP.get_ipython().magic('reset -sf')
import matplotlib.pyplot as plt
import os as os
import numpy as np
import scipy as sp
import scipy.io as sio
plt.close('all')
#%% Load the data
capacity = np.array([1.0,9.896265560165975472e-01,9.854771784232364551e-01,9.823651452282157193e-01,9.797717842323651061e-01,9.776970954356846155e-01,9.751037344398340023e-01,9.735477178423235234e-01,9.714730290456431439e-01,9.699170124481327759e-01,9.683609958506222970e-01,9.668049792531120401e-01,9.652489626556015612e-01,9.636929460580913043e-01,9.621369294605808253e-01,9.610995850622406911e-01,9.595435684647302121e-01,9.585062240663900779e-01,9.574688796680497216e-01,9.559128630705394647e-01,9.548755186721991084e-01,9.538381742738588631e-01,9.528008298755185068e-01,9.517634854771783726e-01,9.507261410788381273e-01,9.496887966804978820e-01,9.486514522821576367e-01,9.476141078838172804e-01,9.460580912863070235e-01,9.450207468879666672e-01,9.439834024896265330e-01,9.429460580912862877e-01,9.419087136929459314e-01,9.408713692946057972e-01,9.393153526970953182e-01,9.382780082987551840e-01,9.372406639004148277e-01,9.356846473029045708e-01,9.346473029045642145e-01,9.330912863070539576e-01,9.320539419087136013e-01,9.304979253112033444e-01,9.289419087136928654e-01,9.273858921161826085e-01,9.258298755186721296e-01,9.242738589211617617e-01,9.227178423236513938e-01,9.211618257261410259e-01,9.196058091286306579e-01,9.180497925311202900e-01,9.159751037344397995e-01,9.144190871369294316e-01,9.123443983402489410e-01,9.107883817427384621e-01,9.087136929460579715e-01,9.071576763485477146e-01,9.050829875518671130e-01,9.030082987551866225e-01,9.009336099585061319e-01,8.988589211618257524e-01,8.967842323651451508e-01,8.947095435684646603e-01,8.926348547717841697e-01,8.905601659751035681e-01,8.884854771784231886e-01,8.864107883817426980e-01,8.843360995850622075e-01,8.817427385892115943e-01,8.796680497925309927e-01,8.775933609958505022e-01,8.749999999999998890e-01,8.729253112033195094e-01,8.708506224066390189e-01,8.682572614107884057e-01,8.661825726141078041e-01,8.635892116182571909e-01,8.615145228215767004e-01,8.589211618257260872e-01,8.563278008298754740e-01,8.542531120331948724e-01,8.516597510373442592e-01,8.490663900414936460e-01,8.469917012448132665e-01,8.443983402489626533e-01,8.418049792531120401e-01,8.397302904564315496e-01,8.371369294605809364e-01,8.345435684647303232e-01,8.324688796680497216e-01,8.298755186721991084e-01,8.272821576763484952e-01,8.246887966804978820e-01,8.226141078838173915e-01,8.200207468879667783e-01,8.174273858921160540e-01,8.153526970954355635e-01,8.127593360995849503e-01,8.101659751037343371e-01,8.075726141078837239e-01,8.054979253112033444e-01,8.029045643153527312e-01,8.003112033195021180e-01,7.977178423236515048e-01,7.956431535269707922e-01,7.930497925311201790e-01,7.904564315352695658e-01,7.883817427385891863e-01,7.857883817427385731e-01,7.831950207468879599e-01,7.811203319502073583e-01,7.785269709543567451e-01,7.759336099585061319e-01,7.738589211618256414e-01,7.712655601659750282e-01,7.686721991701244150e-01,7.665975103734440355e-01,7.640041493775934223e-01,7.619294605809127097e-01,7.593360995850620965e-01,7.567427385892114833e-01,7.546680497925311037e-01,7.520746887966804906e-01,7.499999999999998890e-01,7.474066390041492758e-01,7.453319502074687852e-01,7.427385892116181720e-01,7.406639004149377925e-01,7.380705394190871793e-01,7.359958506224064667e-01,7.339211618257260872e-01,7.313278008298754740e-01,7.292531120331949834e-01,7.266597510373443702e-01,7.245850622406637687e-01,7.225103734439833891e-01,7.199170124481327759e-01,7.178423236514521744e-01,7.157676348547717948e-01,7.136929460580911933e-01,7.110995850622405801e-01,7.090248962655600895e-01,7.069502074688797100e-01,7.048755186721989974e-01,7.022821576763483842e-01,7.002074688796680046e-01,6.981327800829875141e-01,6.960580912863069125e-01,6.939834024896265330e-01,6.919087136929459314e-01,6.898340248962655519e-01,6.877593360995849503e-01])
cycles = np.arange(0,151)
#%% fit the capacity data
# define the empicrial model to be fitted
def He_model(k,a,b,c,d):
return a*np.exp(b*k)+c*np.exp(d*k)
# Fit the entire data set with the function
P0 = [-40,0.005,40,-0.005]
fit, tmp = sp.optimize.curve_fit(He_model, cycles,capacity, p0=P0,maxfev=10000000)
capacity_fit = He_model(cycles, fit[0], fit[1],fit[2], fit[3])
# track all four data points with a 5% bound from the best fit
b_lim = np.zeros((4,2))
for i in range(4):
b_lim[i,0] = fit[i]-np.abs(0.2*fit[i]) # these should be 0.05
b_lim[i,1] = fit[i]+np.abs(0.2*fit[i])
# bounds that work
B = ([b_lim[0,0],-np.inf,b_lim[2,0],-np.inf],[b_lim[0,1], np.inf, b_lim[2,1], np.inf])
# bounds that do not work, but are closer to what I want.
#B = ([b_lim[0,0],b_lim[1,0],b_lim[2,0],b_lim[3,0]],[b_lim[0,1], b_lim[1,1], b_lim[2,1], b_lim[3,1]])
fit_4_5per, tmp = sp.optimize.curve_fit(He_model, cycles,capacity , p0=P0,bounds=B)
capacity_4_5per = He_model(cycles, fit_4_5per[0], fit_4_5per[1], fit_4_5per[2], fit_4_5per[3])
#%% plot the results
plt.figure()
plt.plot(cycles,capacity,'o',fillstyle='none',label='data',markersize=4)
plt.plot(cycles,capacity_fit,'--',label='best fit')
plt.plot(cycles,capacity_4_5per,'-.',label='5 percent bounds')
plt.ylabel('capacity')
plt.xlabel('charge cycles')
plt.legend()
plt.ylim([0.70,1])
plt.tight_layout()
I understand that optimize.curve_fit may need some space to explore the data set to find the optimum spot, however, I feel that 5% should be enough for it. Maybe I am wrong? Is there maybe a better way to apply bounds to a parameter?

The ValueError of "x0 is infeasible" is coming because your initial values violate the bounds. Printing out the parameters values and bounds will show this.
Basically, you're setting the bounds too cleverly, based on the first refined values. But the refined values are different enough from your starting values that the bounds for the second call to curve_fit mean the initial values fall outside the bounds.
More importantly, what leads you to "feel that 5% should be enough"? Primarily, you should apply bounds to make sure the model makes sense, and secondarily to help the fit avoid false solutions. You're calculating the bounds based on an initial fit, so I doubt there's a strong physical justification for those bounds. Why not let the fit do its job?

scipy/numpy FFT on data from file

I looked into many examples of scipy.fft and numpy.fft. Specifically this example Scipy/Numpy FFT Frequency Analysis is very similar to what I want to do. Therefore, I used the same subplot positioning and everything looks very similar.
I want to import data from a file, which contains just one column to make my first test as easy as possible.
My code writes like this:
import numpy as np
import scipy as sy
import scipy.fftpack as syfp
import pylab as pyl
# Read in data from file here
array = np.loadtxt("data.csv")
length = len(array)
# Create time data for x axis based on array length
x = sy.linspace(0.00001, length*0.00001, num=length)
# Do FFT analysis of array
FFT = sy.fft(array)
# Getting the related frequencies
freqs = syfp.fftfreq(array.size, d=(x[1]-x[0]))
# Create subplot windows and show plot
pyl.subplot(211)
pyl.plot(x, array)
pyl.subplot(212)
pyl.plot(freqs, sy.log10(FFT), 'x')
pyl.show()
The problem is that I will always get my peak at exactly zero, which should not be the case at all. It really should appear at around 200 Hz.
With smaller range:
Still biggest peak at zero.

As already mentioned, it seems like your signal has a DC component, which will cause a peak at f=0. Try removing the mean with, e.g., arr2 = array - np.mean(array).
Furthermore, for analyzing signals, you might want to try plotting power spectral density.:
import matplotlib.pylab as plt
import matplotlib.mlab as mlb
Fs = 1./(d[1]- d[0]) # sampling frequency
plt.psd(array, Fs=Fs, detrend=mlb.detrend_mean)
plt.show()
Take a look at the documentation of plt.psd(), since there a quite a lot of options to fiddle with. For investigating the change of the spectrum over time, plt.specgram() comes in handy.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Genextreme fit not working for some datasets - python

Related

When to use line=r and line=45 in qqplot

Clustering data- Poor results, feature extraction

Fit a cosine squared function to a set of data

applying "tighter" bounds in scipy.optimize.curve_fit

scipy/numpy FFT on data from file

Categories

Resources