Can anyone tell me when we use line='r'?
sm.qqplot(df['Delivery_time'], line='r')
and when do we use line='45'?
sm.qqplot(df['Delivery_time'], line='45')
According to the statsmodels documentation, line can be {None, “45”, “s”, “r”, “q”}:
“45” - 45-degree line
“r” - A regression line is fit
Let me add the code and output for the two cases for better understanding.
Case 1: If we are comparing the distribution of a sample of data to a theoretical normal distribution, we might use line='r' to plot a regression line that shows how well the sample data fits the normal distribution.
import numpy as np
import statsmodels.api as sm
import pylab as py

# draw 100 samples from a standard normal distribution
data_points = np.random.normal(0, 1, 100)
sm.qqplot(data_points, line='r')  # fit a regression line through the quantile pairs
py.show()
Case 2: line='45' refers to a 45-degree line, i.e. a line with slope 1 that passes through the origin. It is a fixed reference: points fall on it only when the sample quantiles directly match the theoretical quantiles, so it is most useful when the data are already on the same scale as the theoretical distribution (for example, standardized data against a standard normal).
import numpy as np
import statsmodels.api as sm
import pylab as py

# draw 100 samples from a standard normal distribution
data_points = np.random.normal(0, 1, 100)
sm.qqplot(data_points, line='45')  # draw the fixed y = x reference line
py.show()
In the above case the sample is drawn from a standard normal, so its quantiles already match the theoretical standard-normal quantiles and the points lie close to the 45-degree line. Data with a different mean or scale would drift away from this line even if they were perfectly normal.
Note: The line parameter customizes the reference line on the Q-Q plot, to better compare the distribution of the data being plotted against a theoretical distribution.
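To see when the two options actually differ, here is a minimal sketch (my own illustration, not from the original answer) using normal data that is not standardized, say mean 5 and standard deviation 3: line='r' still tracks the points, while line='45' misses them because the sample is on a different scale than the theoretical standard normal.
import numpy as np
import statsmodels.api as sm
import pylab as py

# normal data that is NOT standardized: mean 5, standard deviation 3
data_points = np.random.normal(5, 3, 100)
sm.qqplot(data_points, line='r')   # regression line still fits the quantile pairs
sm.qqplot(data_points, line='45')  # y = x reference misses: wrong location and scale
py.show()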
I want to create an array with unequally spaced values. The spacing should be determined by the superposition of (for example) two normal distributions with different mean and width values. For a single (normal) distribution I managed to get what I want with the help of this post: python, weighted linspace
Using this code:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

dist = stats.norm(loc=1.2, scale=0.6)
bounds = dist.cdf([0, 2])          # CDF values at the interval ends
pp = np.linspace(*bounds, num=21)  # equally spaced probabilities
vals = dist.ppf(pp)                # map back through the inverse CDF
plt.plot(vals, [1]*vals.size, 'o')
plt.show()
I get the result I want for a single distribution:
However, I need exactly the same for a superposition of two normal distributions like:
dist1 = stats.norm(loc=3, scale=2)
dist2 = stats.norm(loc=1.2, scale=0.6)
This is what a histogram of the superimposed distributions looks like:
As a temporary solution I created the arrays for each distribution individually and added them together. However, this is not exactly what I want, because merging the two individual arrays leads to fluctuating step sizes; for example, two values from the two different (individual) arrays can end up almost or exactly identical. Roughly the approach sketched below.
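A hypothetical reconstruction of that workaround (the exact code was not shown; dist1 and dist2 as defined above):
import numpy as np
from scipy import stats

dist1 = stats.norm(loc=3, scale=2)
dist2 = stats.norm(loc=1.2, scale=0.6)

# space points according to each distribution separately, then merge
vals1 = dist1.ppf(np.linspace(*dist1.cdf([0, 7]), num=11))
vals2 = dist2.ppf(np.linspace(*dist2.cdf([0, 3]), num=11))
vals = np.sort(np.concatenate([vals1, vals2]))
# neighbouring values from the two arrays can be nearly identical,
# so the step sizes of the merged array fluctuate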
I also tried to define a new distribution that inherits from rv_continuous class from scipy.stats, but I failed to implement two different mean/width parameters.
I am pretty sure it should work by adding the individual probability density functions, but unfortunately I also failed with this approach.
Thanks in advance for any help and/or comment!
You could subclass rv_continuous and provide a pdf that is the mean of the two given pdfs.
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

class sum_gaussians_gen(stats.rv_continuous):
    def _pdf(self, x):
        # mean of the two pdfs, so the mixture still integrates to 1
        return (stats.norm.pdf(x, loc=3, scale=2) + stats.norm.pdf(x, loc=1.2, scale=0.6)) / 2

dist = sum_gaussians_gen()
bounds = dist.cdf([0, 7])
pp = np.linspace(*bounds, num=21)
vals = dist.ppf(pp)  # inverse CDF spaces the points by equal probability mass

plt.plot(vals, [0.5] * vals.size, 'o')
xs = np.linspace(0, 7, 500)
plt.plot(xs, dist.pdf(xs))
plt.ylim(bottom=0)
plt.show()
I have been using np.trapz for integration over arrays for a while and have not had any problems with it, until now. I have obtained a distribution which clearly has an area of less than 1: its maximum is 0.16 and its width is roughly 6, yet np.trapz reports that the area underneath it is greater than 60.
Here is my code:
import numpy as np
import matplotlib.pyplot as plt
data = np.load('dist.npy')
thetavals=np.linspace(0,2*np.pi,1000)
plt.xlabel(r'$\theta$')
plt.ylabel(r'$P(\theta)$')
plt.plot(thetavals,data[0:1000])
plt.show()
integralvalue=np.trapz(data)  # no x given, so unit spacing is assumed
print('The integral of this distribution results in: ',integralvalue)
When np.trapz is called without the x parameter, the sample points are assumed to be spaced one unit apart. Here, however, they should be spaced according to the theta values that produced the distribution in the first place; since those span 0 to 2π over 1000 steps, the true spacing is 2π/1000 ≈ 0.0063, and the unit-spacing result is too large by a factor of about 159. Using the following code:
import numpy as np
import matplotlib.pyplot as plt
data = np.load('dist.npy')
thetavals=np.linspace(0,2*np.pi,1001)
plt.xlabel(r'$\theta$')
plt.ylabel(r'$P(\theta)$')
plt.plot(thetavals,data)
plt.show()
integralvalue=np.trapz(data,thetavals)  # pass the actual sample positions
print('The integral of this distribution results in: ',integralvalue)
a number less than 1 is obtained, as expected.
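Since the theta values are evenly spaced, an equivalent fix (a minimal sketch, not part of the original post) is to pass the spacing via the dx argument instead of the full x array:
dx = thetavals[1] - thetavals[0]  # = 2*pi/1000
integralvalue = np.trapz(data, dx=dx)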
Suppose I have a grid given by
import numpy as np
grid = np.linspace(0,20,1000)
I want to get a 1000-by-1 vector p so that if one were to plot points
(grid[i], p[i]) the graph would look like the density of a lognormal distribution.
Use scipy.stats for obtaining pdfs of probability distributions!
NumPy, in most (all?) cases, only supports sampling methods, not pdf calculations. What's needed surely depends on the use case.
Often the pdf plays no role in practical sampling-only implementations, as in this case, where lognormal sampling is reduced to normal-distribution sampling (itself often reduced to uniform sampling combined with other functions) followed by the exponential function (code):
double rk_lognormal(rk_state *state, double mean, double sigma)
{
return exp(rk_normal(state, mean, sigma));
}
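As a sketch, the same recipe in NumPy (using the shape parameter 0.954 from the example below as sigma):
import numpy as np

rng = np.random.default_rng()
samples = np.exp(rng.normal(loc=0.0, scale=0.954, size=10000))  # exp of a normal sample
# equivalently: rng.lognormal(mean=0.0, sigma=0.954, size=10000)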
Make sure to read the docs above to learn how to use these!
Example code:
import numpy as np
import scipy.stats as spt
import matplotlib.pyplot as plt
rv = spt.lognorm(0.954) # "frozen" RV (shape-param fixed)
x_points = np.linspace(0.01, 20, 1000)  # start just above 0; the lognormal pdf lives on x > 0
plt.scatter(x_points, rv.pdf(x_points))
plt.show()
Output:
I'm trying to fit a GEV distribution to temperature data to help identify extreme values. I have data sets for different regions - for some regions the fit works fine but for others it breaks down. It appears that it is setting the location parameter close to the maximum of the distribution range. All data sets are large, of the same size, complete and have no particularly strange values.
Could you please suggest what might be happening or how I can investigate the genextreme function process to work out what the problem is?
Here are the relevant bits of code (values are read in from NetCDF without any problem):
import pandas as pd
import numpy as np
import netCDF4 as nc
import matplotlib.pyplot as plt
from scipy import stats
from scipy.stats import genextreme as gev
# calculate GEV fit
fit = gev.fit(season_temp)

# GEV parameters from fit
c, loc, scale = fit
fit_mean = loc
min_extreme, max_extreme = gev.interval(0.99, c, loc, scale)

# evenly spread x axis values for pdf plot
x = np.linspace(min(season_temp), max(season_temp), 200)

# plot distribution: fitted pdf over a normalized histogram of the data
fig, ax = plt.subplots(1, 1)
plt.plot(x, gev.pdf(x, *fit))
plt.hist(season_temp, 30, density=True, alpha=0.3)
And here are two examples of outputs from different regions, successful and not:
Successful fit
Unsuccessful fit
The successfully fitted distribution has a location parameter of 1.066 compared to a data mean of 2.395. The one that failed produced a location parameter of 12.202 compared to a data mean of 2.138.
Thanks in advance for your help!
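One way to start investigating (a sketch of my own, assuming season_temp is the 1-D array from the question): refit with the location pinned near the data mean and compare negative log-likelihoods; a large gap suggests the unconstrained fit ended in a degenerate optimum.
import numpy as np
from scipy.stats import genextreme as gev

fit_free = gev.fit(season_temp)                              # unconstrained fit
fit_floc = gev.fit(season_temp, floc=np.mean(season_temp))   # location fixed at the mean
print('free  NLL:', gev.nnlf(fit_free, season_temp))
print('fixed NLL:', gev.nnlf(fit_floc, season_temp))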
I looked into many examples of scipy.fft and numpy.fft. Specifically, this example, Scipy/Numpy FFT Frequency Analysis, is very similar to what I want to do. Therefore, I used the same subplot positioning and everything looks very similar.
I want to import data from a file, which contains just one column to make my first test as easy as possible.
My code writes like this:
import numpy as np
import scipy.fftpack as syfp
import pylab as pyl

# Read in data from file here
array = np.loadtxt("data.csv")
length = len(array)

# Create time data for x axis based on array length (sample spacing: 10 microseconds)
x = np.linspace(0.00001, length*0.00001, num=length)

# Do FFT analysis of array
FFT = syfp.fft(array)

# Get the related frequencies
freqs = syfp.fftfreq(array.size, d=(x[1] - x[0]))

# Create subplot windows and show plot
pyl.subplot(211)
pyl.plot(x, array)
pyl.subplot(212)
pyl.plot(freqs, np.log10(np.abs(FFT)), 'x')  # log magnitude spectrum
pyl.show()
The problem is that I always get my biggest peak at exactly zero, which should not be the case at all. It really should appear at around 200 Hz.
With a smaller range:
Still the biggest peak is at zero.
As already mentioned, it seems like your signal has a DC component, which will cause a peak at f=0. Try removing the mean with, e.g., arr2 = array - np.mean(array).
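A minimal sketch of that fix, reusing the names from the question (array, freqs, syfp, pyl as above):
arr2 = array - np.mean(array)                  # remove the DC component
FFT2 = syfp.fft(arr2)
pyl.plot(freqs, np.log10(np.abs(FFT2)), 'x')   # the spike at f = 0 should be gone
pyl.show()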
Furthermore, for analyzing signals, you might want to try plotting the power spectral density:
import matplotlib.pyplot as plt
import matplotlib.mlab as mlb

Fs = 1. / (x[1] - x[0])  # sampling frequency, with x the time axis from the question
plt.psd(array, Fs=Fs, detrend=mlb.detrend_mean)
plt.show()
Take a look at the documentation of plt.psd(), since there are quite a lot of options to fiddle with. For investigating how the spectrum changes over time, plt.specgram() comes in handy.
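For example (a sketch under the same assumptions, with Fs as computed above):
plt.specgram(array, Fs=Fs)  # spectrogram: spectrum as a function of time
plt.xlabel('time [s]')
plt.ylabel('frequency [Hz]')
plt.show()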