qqplot not working with different sample sizes - python

I'm trying to get started with the statsmodels package to make Q-Q plots. I installed it from source from the master branch, with Python 3.6. I want to make a qqplot comparing two data distributions with different sample sizes. I'm just running the example code from the documentation, but it throws an error about the different sample sizes.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import statsmodels.api as sm
from statsmodels.graphics.gofplots import qqplot
# example 6
x = np.random.normal(loc=8.25, scale=2.75, size=37)
y = np.random.normal(loc=8.75, scale=3.25, size=57)
pp_x = sm.ProbPlot(x, fit=True)
pp_y = sm.ProbPlot(y, fit=True)
fig = pp_x.qqplot(line='45', other=pp_y)
title = 'Ex. 6 - qqplot - compare different sample sizes'
h = plt.title(title)
plt.show()
I get this error:
ValueError: x and y must have same first dimension, but have shapes (57,) and (37,)
Has anyone gotten this feature to work?

You should write
x = np.random.normal(loc=8.25, scale=2.75, size=(37, value-of-a-dimension-for-your-code))
for example:
x = np.random.normal(loc=0, scale=1, size=(3, 3))
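Independently of how statsmodels handles this, a two-sample Q-Q comparison for unequal sample sizes can be sketched with NumPy alone: evaluate both samples at the same probabilities, so that both quantile arrays have equal length. The helper name `matched_quantiles` and the plotting positions `(i - 0.5) / n` are assumptions made for illustration, not part of the statsmodels API.

```python
import numpy as np

def matched_quantiles(x, y):
    """Return quantiles of x and y evaluated at the same probabilities.

    Hypothetical helper: probabilities are taken from the smaller sample
    using the plotting positions (i - 0.5) / n (an assumed convention).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = min(x.size, y.size)
    probs = (np.arange(1, n + 1) - 0.5) / n
    return np.quantile(x, probs), np.quantile(y, probs)

rng = np.random.default_rng(0)
x = rng.normal(loc=8.25, scale=2.75, size=37)
y = rng.normal(loc=8.75, scale=3.25, size=57)
qx, qy = matched_quantiles(x, y)
print(qx.shape, qy.shape)  # both arrays have 37 entries
```

The two arrays can then be drawn with `plt.scatter(qx, qy)` and compared against the 45-degree line.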

Related

How do I increase the size of my plot when using control.bode_plot, and see the log lines?

I want to use Python's control module for my transfer functions. It works like a charm; however, I want to increase the size of the Bode plot that the module produces, and see the log-scale lines inside the plot.
This is the code I use in JupyterLab:
import matplotlib.pyplot as plt
import numpy as np
import time
plt.rcParams['font.size'] = 14
import os
import control
f = np.logspace(0,6,1000)
f.min()
f.max()
len(f)
w = 2*np.pi*f
# s = 1.0j*w
s = control.TransferFunction.s
R1=31600
R2=5230
R3=4420
R4=5110
C1=10e-9
C2=1.8e-9
C3=150e-12
num = ((s*C2*(R1+R3)+1))*(s*C1*R2+1)
den = (s*R1*C1)*(s*C2*R3+1)*(s*R2*C3+1)
Hs = control.tf((num/den))
print(Hs)
plt.figure()
out = control.bode_plot(Hs,w,dB=1,Hz=1,deg=1,plot=1,margins=1)
It shows the following:
[image: bodeplot]
How can I increase the size to make it more readable, and how do I see the log lines?

Error when using shading in pcolormesh

I am facing a problem while using shading for pcolormesh in a filled contour plot. As soon as I pass shading='gouraud', I get this error: "TypeError: Dimensions of C (73, 144) are incompatible with X (145) and/or Y (74); see help(pcolormesh)". Any help would be much appreciated. The code I am using is below.
import os
os.environ["PROJ_LIB"] = "C:\\Utilities\\Python\\Anaconda\\Library\\share"; #fixr
import numpy as np
import xarray as xr
import proplot as plot
import matplotlib.pyplot as plt
import pandas as pd
# --- read netcdf file
dset = xr.open_dataset(r'E:\DATA_SETS\OLR_NCEP_REANALYSIS\olr.daily.1974.2020.nc')
# --- select an area and time (optional)
#dset = dset.sel(lat=slice(15, -60), lon=slice(270, 330))
plot.rc.reso = 'lo'
#--- plotting
f, ax = plot.subplots(ncols=1,figsize=[6.4, 5.0],tight=True,
proj='cyl', proj_kw={'lon_0': 0})
# format options
ax.format(land=True, landcolor='mushroom',coast=True, innerborders=False, borders=False,
labels=True,
latlim=(0, 30), lonlim=(50, 100),linewidth=1,
gridlinewidth=0,latlines=5, lonlines=10,
abc=True, abcloc='ll', abcstyle='(a)')
levels=list(np.arange(120,260,20))
map = ax.pcolormesh(dset['lon'], dset['lat'], dset['olr'][16202, :,:],shading='gouraud',
cmap='Blues_r',levels=levels,vmin=np.inf,vmax=np.inf,extend='neither')
f.colorbar(map,length=0.6,loc='b',extendrect=True)
It seems the dimensions of your coordinates and the 2D variable do not match. With shading='gouraud', matplotlib requires X and Y to have the same shape as C.
The simplest solution is to use the central coordinates of x and y:
x = dset['lon'].values
y = dset['lat'].values
xnew = 0.5*(x[0:-1]+x[1:])
ynew = 0.5*(y[0:-1]+y[1:])
pcolormesh(xnew, ynew, ...)
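The midpoint step can be checked on synthetic data. This is only a sketch: it assumes the coordinate arrays reaching pcolormesh are cell edges with the lengths reported in the error message (145 and 74), and the 2.5-degree grid values below are made up rather than read from the file.

```python
import numpy as np

# Synthetic cell edges on a 2.5-degree grid (assumed values, not read from the file).
lon_edges = np.linspace(-1.25, 358.75, 145)
lat_edges = np.linspace(91.25, -91.25, 74)

# Midpoints of consecutive edges give one coordinate per cell.
lon_centers = 0.5 * (lon_edges[:-1] + lon_edges[1:])
lat_centers = 0.5 * (lat_edges[:-1] + lat_edges[1:])

# Placeholder field with the shape reported for C in the error message.
olr = np.zeros((73, 144))

# Gouraud shading needs len(y) == C.shape[0] and len(x) == C.shape[1].
print(lat_centers.shape, lon_centers.shape, olr.shape)
```

If the coordinate arrays already have the same lengths as the data dimensions, taking midpoints would make them one element too short, so it is worth printing the shapes before applying the step.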

Gaussian Mixture Model with discrete data

I have 136 numbers that form an overlapping distribution of 8 Gaussian distributions. I want to find the mean and variance of each Gaussian. Can you find any mistakes in my code?
file = open("1.txt",'r') #data is in 1.txt like 0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194...
y=[int (i) for i in list((file.read()).split(','))] # I want to make list which element is above data
x=list(range(1,len(y)+1)) # it is x values
z=list(zip(x,y)) # z elements consist as (1, 0), (2, 0), ...
Through the above process I obtain a list z whose elements are the 136 points (x, y) in the xy plane, with the given data as the y values.
Now I want to obtain each Gaussian distribution's mean and variance. The basic assumption is that the given data consist of 8 overlapping Gaussian distributions.
import numpy as np
from sklearn.mixture import GaussianMixture
data = np.array(z).reshape(-1,1)
model = GaussianMixture(n_components=8).fit(data)
print(model.means_)
file.close()
Actually, I don't know how to write the code to print the 8 means and variances. Can anyone help me?
You can use this; I have written some sample code for visualization:
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.stats
import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
%matplotlib inline
#Sample data
x = [0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194]
num_components = 2
#Fit a model onto the data
data = np.array(x).reshape(-1,1)
model = GaussianMixture(n_components=num_components).fit(data)
#Get list of means and variances
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))
#Plotting
extend_window = 50 #this is for zooming into or out of the graph, higher it is , more zoom out
x_values = np.arange(data.min()-extend_window, data.max()+extend_window, 0.1) #For plotting smooth graphs
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize = 10.0, marker='o') #plot the data on x axis
#plot the different distributions (in this case 2 of them)
for i in range(num_components):
    y_values = scipy.stats.norm(mu[i], sd[i])
    plt.plot(x_values, y_values.pdf(x_values))
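One thing worth checking in the question's code is the array handed to GaussianMixture: `np.array(z).reshape(-1,1)` flattens the (x, y) pairs into a single column, so index values and data values end up mixed as one feature. A small sketch using the 20 values quoted in the question as stand-in data; whether the intended input is the y values alone, as in the answer's code, is an assumption.

```python
import numpy as np

# First 20 values quoted in the question, used here as stand-in data.
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 4, 4, 6, 14, 25, 43, 71, 93, 123, 194]
x = list(range(1, len(y) + 1))
z = list(zip(x, y))

pairs = np.array(z)                        # shape (20, 2): one row per (x, y) pair
stacked = pairs.reshape(-1, 1)             # shape (40, 1): x and y interleaved in one column
values_only = np.array(y).reshape(-1, 1)   # shape (20, 1): one feature per sample

print(pairs.shape, stacked.shape, values_only.shape)
print(stacked[:4].ravel())  # [1 0 2 0]
```

After fitting, `model.means_` and `model.covariances_` hold one entry per component, which is what the answer's `mu` and `sd` are derived from.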

QQplot for discrete distribution

I have a data set whose samples are discrete values (in particular, the size of a queue over time). I'd like to find out which distribution they follow. To do that, I'd proceed the same way as for the other quantities, i.e. by drawing a qqplot with
import statsmodels.api as sm
sm.qqplot(df, dist = 'geom', sparams = (.5,), line ='s', alpha = 0.3, marker ='.')
This works if dist is not a discrete random variable (e.g. 'exp' or 'norm'), and indeed I got some results that way, but when the distribution is discrete (say, 'geom'), I get
AttributeError: 'geom_gen' object has no attribute 'fit'
I searched the Internet for how to make a qqplot (or something similar) to identify which distribution my samples belong to, but I found nothing.
def discreteQQ(x_sample):
    p_test = np.array([])
    for i in range(0, 1001):
        p_test = np.append(p_test, i/1000)
        i = i + 1
    x_sample = np.sort(x_sample)
    x_theor = stats.geom.rvs(.5, size=len(x_sample))
    ecdf_sample = np.arange(1, len(x_sample) + 1)/(len(x_sample)+1)
    x_theor = stats.geom.ppf(ecdf_sample, p=0.5)
    for p in p_test:
        plt.scatter(np.quantile(x_theor, p), np.quantile(x_sample, p), c = 'blue')
    plt.xlabel('Theoretical quantiles')
    plt.ylabel('Sample quantiles')
    plt.show()
Generate a sample from a theoretical geometric distribution using scipy.stats.geom, wrap the sample and the theoretical data in statsmodels' ProbPlot, and pass these to statsmodels' qqplot_2samples.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
from statsmodels.graphics.gofplots import ProbPlot
from statsmodels.graphics.gofplots import qqplot_2samples
p_theor = 1/4 # The probability we check for
p_sample = 1/5 # The true probability of the sample distribution
# The experimental data
x_sample = stats.geom.rvs(p_sample, size=50)
# The model data
x_theor = stats.geom.rvs(p_theor, size=100)
qqplot_2samples(ProbPlot(x_sample), ProbPlot(x_theor), line='45')
plt.show()
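An alternative that avoids drawing a random theoretical sample is to evaluate the geometric quantile function directly at the plotting positions. The sketch below uses NumPy only; the helper `geom_ppf` is hypothetical and assumes the geometric distribution on {1, 2, ...} (the convention of scipy.stats.geom), and the parameter p=0.5 is just an example value.

```python
import numpy as np

def geom_ppf(q, p):
    """Quantile function of the geometric distribution on {1, 2, ...}.

    Derived from the CDF F(k) = 1 - (1 - p)**k; hypothetical helper.
    """
    q = np.asarray(q, dtype=float)
    return np.ceil(np.log1p(-q) / np.log1p(-p))

rng = np.random.default_rng(0)
x_sample = np.sort(rng.geometric(p=0.5, size=200))

# Plotting positions i / (n + 1), as in the question's ecdf_sample.
n = x_sample.size
probs = np.arange(1, n + 1) / (n + 1)
x_theor = geom_ppf(probs, p=0.5)

# Pairs (x_theor[i], x_sample[i]) are the points of the Q-Q plot.
print(x_theor[:5], x_sample[:5])
```

Because both axes are integer-valued, many points coincide; plotting with some transparency or jitter makes the pattern easier to read.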

Python: scikit-learn isomap results seem random, but no possibility to set random_state

I am using Isomap from scikit-learn manifold learning. I reduce to two dimensions and observe that, with every run of the algorithm on the same data set without any changes, the resulting vectors change. I assume some random numbers are used in the algorithm, but there is no way to set a seed: random_state is not a parameter of Isomap. Am I missing something?
The randomness you've seen is in the sign of your result. The sign is not (in my opinion) 100% random. Signs within each component are consistent, so the relative relations in your result are preserved. Signs between components are random; in other words, which components get multiplied by -1 or 1 is random. This behavior comes from the KernelPCA function used by Isomap when the arpack eigensolver is used.
To give you a solution first: you can use eigen_solver='dense' when using Isomap. That may slow down your algorithm but should remove this randomness. I know the explanation above might be confusing, so let me give more details and show this with a plot.
First, what is the visible consequence of the "sign randomness"? Using the following code (modified from this official example) with eigen_solver = 'arpack', you can see that two fit_transform calls on the same Isomap instance may (or may not) give you different results. However, as you can see in the plot, the relative locations are maintained; the whole plot just gets flipped. If you use eigen_solver='dense' and run the code multiple times, you won't see this randomness:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import (manifold, datasets, decomposition, ensemble,
discriminant_analysis, random_projection)
digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30
def plot_embedding(X, ax, title=None):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)
    for i in range(X.shape[0]):
        ax.text(X[i, 0], X[i, 1], str(digits.target[i]),
                color=plt.cm.Set1(y[i] / 10.),
                fontdict={'weight': 'bold', 'size': 9})
eigen_solver = 'arpack'
#eigen_solver = 'dense'
iso = manifold.Isomap(n_neighbors=n_neighbors, n_components=2, eigen_solver=eigen_solver)
X_iso1 = iso.fit_transform(X)
X_iso2 = iso.fit_transform(X)
fig = plt.figure(figsize=(16, 6))
ax1 = fig.add_subplot(121)
plot_embedding(X_iso1, ax1)
ax2 = fig.add_subplot(122)
plot_embedding(X_iso2, ax2)
plt.show()
Secondly, is there a way to set a seed to "stabilize" the random state? No, there is currently no way to set a seed for the KernelPCA used inside Isomap. KernelPCA itself, however, has a keyword argument random_state, which is "A pseudo random number generator used for the initialization of the residuals when eigen_solver == 'arpack'". Play with the following code (modified from this official test code) and you can see that the randomness is gone (blue dots cover red dots) even with eigen_solver = 'arpack':
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import KernelPCA
X_fit = np.random.rand(100, 4)
X = np.dot(X_fit, X_fit.T)
eigen_solver = 'arpack'
#eigen_solver = 'dense'
#random_state = None
random_state = 0
kpca = KernelPCA(n_components=2, kernel='precomputed',
eigen_solver=eigen_solver, random_state=random_state)
X_kpca1 = kpca.fit_transform(X)
X_kpca2 = kpca.fit_transform(X)
plt.plot(X_kpca1[:,0], X_kpca1[:,1], 'ro')
plt.plot(X_kpca2[:,0], X_kpca2[:,1], 'bo')
plt.show()
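If switching solvers is not an option, two embeddings that differ only by per-component sign flips can be made comparable afterwards by imposing a sign convention. The sketch below uses a hypothetical helper `fix_signs` (not part of scikit-learn) and assumes the only difference between runs is the sign of each component; the convention chosen, making the largest-magnitude entry of each column positive, is arbitrary.

```python
import numpy as np

def fix_signs(embedding):
    """Flip each column so its largest-magnitude entry is positive.

    Hypothetical helper; the convention itself is an arbitrary choice.
    """
    embedding = np.asarray(embedding, dtype=float)
    idx = np.argmax(np.abs(embedding), axis=0)
    signs = np.sign(embedding[idx, np.arange(embedding.shape[1])])
    signs[signs == 0] = 1.0  # leave all-zero columns unchanged
    return embedding * signs

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 2))
flipped = emb * np.array([1.0, -1.0])  # same embedding with the second component mirrored

print(np.allclose(fix_signs(emb), fix_signs(flipped)))  # True
```

Applied to the output of each fit_transform call, this would map mirrored embeddings onto the same orientation before they are compared or plotted.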
