PDF estimation in Scikit-Learn KDE - python

I am trying to compute a PDF estimate from a KDE fitted with the scikit-learn module. I have seen two variants of scoring and I am trying both: statements A and B below.
Statement A results in the following error:
AttributeError: 'KernelDensity' object has no attribute 'tree_'
Statement B results in the following error:
ValueError: query data dimension must match training data dimension
It seems like a silly error, but I cannot figure it out. Please help. The code is below.
from sklearn.neighbors import KernelDensity
import numpy
# d is my 1-D array data
xgrid = numpy.linspace(d.min(), d.max(), 1000)
density = KernelDensity(kernel='gaussian', bandwidth=0.08804).fit(d)
# statement A
density_score = KernelDensity(kernel='gaussian', bandwidth=0.08804).score_samples(xgrid)
# statement B
density_score = density.score_samples(xgrid)
density_score = numpy.exp(density_score)
If it helps, I am using 0.15.2 version of scikit-learn. I've tried this successfully with scipy.stats.gaussian_kde so there is no problem with data.

With statement B, I had the same issue with this error:
ValueError: query data dimension must match training data dimension
The issue here is that you have 1-D array data, but when you feed it to the fit() function, it assumes that you have only 1 data point with many dimensions. For example, if your training data has 100000 points, d should be treated as 100000x1, but fit() takes it as 1x100000.
So you should reshape d before fitting with d.reshape(-1,1), and do the same for the grid with xgrid.reshape(-1,1):
density = KernelDensity(kernel='gaussian', bandwidth=0.08804).fit(d.reshape(-1,1))
density_score = density.score_samples(xgrid.reshape(-1,1))
Note: The issue with statement A is that you are calling score_samples on a new KernelDensity object that has not been fitted yet.

You need to call the fit() function before you can score samples against the estimated distribution.
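Putting the two answers together, a minimal sketch might look like the following. The data here is synthetic (a random normal sample standing in for d), and the bandwidth of 0.3 is an arbitrary choice for this toy sample, not the value from the question.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

# Synthetic stand-in for the 1-D data d (assumption: not the asker's data)
rng = np.random.default_rng(0)
d = rng.normal(size=500)

# Grid padded slightly beyond the data so the tails are covered
xgrid = np.linspace(d.min() - 1, d.max() + 1, 1000)

# Fit first, with the data reshaped to (n_samples, 1)
kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(d.reshape(-1, 1))

# score_samples returns log-density values, so exponentiate to get the PDF
log_pdf = kde.score_samples(xgrid.reshape(-1, 1))
pdf = np.exp(log_pdf)

# Rough sanity check: the estimated density should integrate to about 1
area = np.sum(pdf) * (xgrid[1] - xgrid[0])
print(pdf.shape, area)
```

The same fitted object is used for both fit() and score_samples(), which avoids the error from statement A, and both arrays are two-dimensional, which avoids the error from statement B.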

Related

Data shape to match statsmodels

I am trying to use statsmodels for panel data and have an issue with the shape of my data. My model is a TVP-VAR for a panel, in a normal linear state space model composed of a state equation and a measurement equation, which I have managed to write as in eq. 33 of Canova and Ciccarelli (2013).
The key model equation, where X t = Xt and ut = Xt′+ut with UtN = 0 (I + 2 Xt′ Xt), is attached.
Key Model Equation
I use exactly this class of models from your site : TVP-VAR, MCMC, and sparse simulation smoothing.
https://www.statsmodels.org/devel/examples/notebooks/generated/statespace_tvpvar_mcmc_cfa.html
When I run the local model, I get the attached graph for 'Simulations based on KFS approach, MLE parameters' and 'Simulations based on CFA approach, MLE parameters', where some countries and years appear in an unexpected format.
KFS and CFA unexpected outcome format
I suspect it has to do with the data shape I am using. You can see my actual data shape in the attached local screenshot.
When I run 'Simulations with alternative parameterization yielding a smoother trend', one of the errors I get is
"'value' must be an instance of str or bytes, not a tuple."
In addition to an earlier
"An unsupported index was provided and will be ignored when, e.g. forecasting. self._init_dates(dates, freq) "
I suspect that has to do with my data shape and index. My dataset is in long format.
A screenshot here
Data shape
My question is a bit naive. How do I reshape my data in order to be compatible with statsmodels? How do I rewrite my code in order to bring my data into an acceptable shape to run the TVP-VAR, MCMC, and sparse simulation smoothing?
I hope it is clear what I am looking for. The code I am now using to import data is:
%matplotlib inline
from importlib import reload
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy.stats import invwishart, invgamma
#1
import pyreadstat
dtafile = 'panel.dta'
dta, meta = pyreadstat.read_dta(dtafile)
dta.tail()
labels=list(meta.column_labels)
column=list(meta.column_names)
# Panel data settings
year = dta.year
year = pd.Categorical(dta.year)
dta = dta.set_index([ "country", "year"])
dta["year"] = year
dta.head()
I would appreciate help setting up a shape and format that statsmodels accepts.

sktime ARIMA invalid frequency

I am trying to fit an ARIMA model from the sktime package. I import a dataset and convert it to a pandas Series. Then I fit the model on the training sample, and when I try to predict, the error occurs.
from sktime.forecasting.base import ForecastingHorizon
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.forecasting.arima import ARIMA
import numpy as np, pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/selva86/datasets/master/a10.csv',
parse_dates=['date']).set_index('date').T.iloc[0]
p, d, q = 3, 1, 2
y_train, y_test = temporal_train_test_split(df, test_size=24)
model = ARIMA((p, d, q))
results = model.fit(y_train)
fh = ForecastingHorizon(y_test.index, is_relative=False,)
# the error is here !!
y_pred_vals, y_pred_int = results.predict(fh, return_pred_int=True)
The error message is the following:
ValueError: Invalid frequency. Please select a frequency that can be converted to a regular
`pd.PeriodIndex`. For other frequencies, basic arithmetic operation to compute durations
currently do not work reliably.
I tried to use .asfreq("M") while reading the dataset, however, all the values in the series become NaN.
What is interesting is that this code works with the default load_airline dataset from sktime.datasets but not with my dataset from github.
I get a different error: ValueError: ``unit`` missing, possibly due to version difference. Anyhow, I'd say it is better to have your dataframe's index as pd.PeriodIndex instead of pd.DatetimeIndex. The former is I think more explicit (e.g. monthly series has its time-steps as periods not exact dates) and works more smoothly. So after reading the csv,
df.index = pd.PeriodIndex(df.index, freq="M")
should clear the error (it does in my version, 0.5.1).
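As a small illustration of that conversion in isolation, here is a sketch using only pandas. The series values are made up, and only the month-start date pattern is meant to resemble the dataset in the question.

```python
import pandas as pd

# Hypothetical monthly series with month-start timestamps (values are made up)
dates = pd.to_datetime(["1991-07-01", "1991-08-01", "1991-09-01", "1991-10-01"])
y = pd.Series([3.5, 3.2, 3.3, 3.6], index=dates)

# Replace the DatetimeIndex with a monthly PeriodIndex; the values are untouched
y.index = pd.PeriodIndex(y.index, freq="M")

print(y.index)
print(y)
```

Because only the index labels change, no values are dropped or turned into NaN along the way.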

Python - fitting data with exponential function

I am aware that there are a few questions about a similar subject, although I couldn't find a proper answer.
I would like to fit some data with a function (called Bastenaire) and get the parameter values. Here is the code:
import numpy as np
from matplotlib import pyplot as plt
from scipy import optimize
def bastenaire(s, A, B, C, sd):
    logNB = np.log(A) - C*(s - sd) - np.log(s - sd)
    return np.exp(logNB) - B
S=np.array([659,646,634,623,613,595,580,565,551,535,515,493,473,452,432,413,394,374,355,345])
N=np.array([46963,52934,59975,65522,74241,87237,101977,116751,133665,157067,189426,233260,281321,355558,428815,522582,630257,768067,902506,1017280])
fitmb,fitmob=optimize.curve_fit(bastenaire,S,N,p0=(30000,2000000000,0.2,250))
plt.scatter(N,S)
plt.plot(bastenaire(S,*fitmb),S,label='bastenaire')
plt.legend()
plt.show()
However, the curve fit cannot identify the correct parameters and I get: OptimizeWarning: Covariance of the parameters could not be estimated.
Same results when I give no input parameters values.
Is there any way to tweak something and get results? Should my dataset cover a wider range and values?
Thank you!
Broc
Fitting is tough: you need to restrict the parameter space using bounds and (often) check your initial values.
To make it work, I searched for initial values where the function had the right shape, then estimated some constraints:
bounds = np.array([(1e4, 1e12), (-np.inf, np.inf), (1e-20, 1e-2), (-2000., 20000)]).T
fitmb, fitmob = optimize.curve_fit(bastenaire,S, N,p0=(1e7,-100.,1e-5,250.), bounds=bounds)
returns
(array([ 1.00000000e+10, 1.03174824e+04, 7.53169772e-03, -7.32901325e+01]), array([[ 2.24128391e-06, 6.17858390e+00, -1.44693602e-07,
-5.72040842e-03],
[ 6.17858390e+00, 1.70326029e+07, -3.98881486e-01,
-1.57696515e+04],
[-1.44693602e-07, -3.98881486e-01, 1.14650323e-08,
4.68707940e-04],
[-5.72040842e-03, -1.57696515e+04, 4.68707940e-04,
1.93358414e+01]]))
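For reference, a self-contained sketch of a bounded fit on the data from the question is given below. It deviates from the call above in two ways, both of which are my own assumptions: the upper bound on sd is capped just below the smallest S so that s - sd stays positive inside the logarithm, and max_nfev is raised to give the solver more room. The fitted values will therefore not necessarily match the ones printed above.

```python
import numpy as np
from scipy import optimize

def bastenaire(s, A, B, C, sd):
    # N = A * exp(-C * (s - sd)) / (s - sd) - B, evaluated in log space
    log_nb = np.log(A) - C * (s - sd) - np.log(s - sd)
    return np.exp(log_nb) - B

S = np.array([659, 646, 634, 623, 613, 595, 580, 565, 551, 535,
              515, 493, 473, 452, 432, 413, 394, 374, 355, 345], dtype=float)
N = np.array([46963, 52934, 59975, 65522, 74241, 87237, 101977, 116751,
              133665, 157067, 189426, 233260, 281321, 355558, 428815,
              522582, 630257, 768067, 902506, 1017280], dtype=float)

# Bounds in the order (A, B, C, sd).
# Assumption: sd is capped below S.min() so that log(s - sd) is always defined.
lower = [1e4, -np.inf, 1e-20, -2000.0]
upper = [1e12, np.inf, 1e-2, S.min() - 5.0]
p0 = (1e7, -100.0, 1e-5, 250.0)

# Passing bounds makes curve_fit use a bounded least-squares method ("trf")
popt, pcov = optimize.curve_fit(bastenaire, S, N, p0=p0,
                                bounds=(lower, upper), max_nfev=20000)
print(popt)
```

It is worth plotting bastenaire(S, *popt) against N afterwards, since a fit that stops on a bound can still be a poor description of the data.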

Scipy.cluster kmeans2

I am trying to apply the kmeans2 algorithm in Scipy. The following code applies the algorithm correctly.
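from scipy.cluster.vq import kmeans2, vq
import numpy as np
import pandas as pd
df = pd.read_csv("123.csv")
# X was not defined in the original snippet; assumed here to be the numeric values of df
X = df.to_numpy(dtype=float)
km, _ = kmeans2(X, 2)
idx, _ = vq(X, km)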
How would I observe the cluster centers? I have tried print(centers), print(centroids) etc but nothing works.
How would I observe the cluster labels? For example, in the sklearn KMeans this is given by labels_.
I have tried print(labels) and all variations of it, which I found on the Scipy Reference Guide, but none seem to work.
Also, under the initialization methods, it states that a matrix is an available method within minit. I cannot get minit to recognise any matrices I put in.
I usually get an error message saying either "data type not understood" or "unhashable type: 'list'".
The reason I am trying to do this is because I want to run a KMeans Clustering Algorithm where I can manually select each cluster center and then categorize each point to the closest center.
Am I just not understanding how "minit" works, or am I simply not passing my matrix in the right form?
km should contain the cluster centers. Try
print(km)
As for the labels, that should be the second variable returned by kmeans2.
Here is a working example:
df = [[1.,2.,3.], [7.,8.,9.], [2.,2.,2.], [7.,8.,6.]]
centers,labels = kmeans2(df,2)
print(centers)
print(labels)
The result:
[[1.5 2. 2.5]
[7. 8. 7.5]]
[0 1 0 1]
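On the minit part of the question, a minimal sketch of supplying hand-picked starting centers is shown below. It assumes the centers are passed as a NumPy array in place of the integer k, together with minit='matrix'. The data is the toy data from the example above, and the starting centers are simply two of its points.

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

data = np.array([[1., 2., 3.], [7., 8., 9.], [2., 2., 2.], [7., 8., 6.]])

# Hand-picked initial centers, one row per cluster (hypothetical choice)
init_centers = np.array([[1., 2., 3.], [7., 8., 9.]])

# Pass the array where k would normally go and select minit='matrix'
centers, labels = kmeans2(data, init_centers, minit='matrix')
print(centers)
print(labels)

# To only assign each point to its nearest given center, without iterating,
# vq can be used directly; it returns the codes and the distances
codes, dists = vq(data, init_centers)
print(codes)
```

Because the starting centers are fixed rather than random, the result is reproducible from run to run.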

How to get scipy.stats.chisquare to function properly

I have two input files of identical size and shape, but the data they contain have different resolutions, and I want to perform a chi-squared test on them.
The input files are 500 lines long and contain 4 columns delimited by spaces. I am trying to test the second column of each input file against the other.
My code is as follows:
# Import statements
C = pl.loadtxt("input_1.txt")
D = pl.loadtxt("input_2.txt")
col2_C = C[:,1]
col2_D = D[:,1]
f_obs = np.array([col2_C])
f_exp = np.array([col2_D])
chisquare(f_obs, f_exp)
This gives me an error saying:
ValueError: df <= 0
I don't even understand what it is complaining about here.
I have tried several other syntaxes within the script, each of which also resulted in various errors:
This one was found here.
chisquare = f_obs=[col2_C], f_exp=[col2_D])
TypeError: chisquare() takes at least one positional argument
Then I tried
chisquare = f_obs(col2_C), F_exp=[col2_D)
NameError: name 'f_obs' is not defined
I also tried several other syntactical tweaks but nothing to any avail. If anybody could please help me get this running I would appreciate it greatly.
Thanks in advance.
First, be sure you are importing chisquare from scipy.stats. Numpy has the function numpy.random.chisquare, but that does not do a statistical test. It generates samples from a chi-square probability distribution.
So be sure you use:
from scipy.stats import chisquare
There is a second problem.
As slices of the two-dimensional array returned by loadtxt, col2_C and col2_D are one-dimensional numpy arrays, so there is no need to use, for example, np.array([col2_C]) when you pass these to chisquare. Just use col2_C and col2_D directly:
chisquare(col2_C, col2_D)
Wrapping the arrays with np.array as you did is what causes the problem. chisquare accepts multidimensional arrays and an axis argument. When you do f_obs = np.array([col2_C]) (with the extra square brackets), f_obs is actually a two-dimensional array with shape (1, 500). Similarly, f_exp has shape (1, 500). The default axis argument of chisquare is 0, so when you called chisquare(f_obs, f_exp), you were asking chisquare to perform 500 chi-square tests, each with a single observed and expected value. A test on a single value has zero degrees of freedom, which is what the error df <= 0 is complaining about.
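The shape difference can be seen on a small made-up example. The counts below are illustrative only, chosen so that the observed and expected totals match, which recent SciPy versions check.

```python
import numpy as np
from scipy.stats import chisquare

# Toy observed and expected counts (made up; both sum to 88)
observed = np.array([16, 18, 16, 14, 12, 12], dtype=float)
expected = np.array([16, 16, 16, 16, 16, 8], dtype=float)

# 1-D inputs: one test over all six categories
stat, p = chisquare(observed, expected)
print(stat, p)

# The extra brackets add a leading axis of length 1
observed_2d = np.array([observed])
expected_2d = np.array([expected])
print(observed.shape, observed_2d.shape)

# With 2-D input, the test has to run along axis 1 to cover the six categories
stat_rows, p_rows = chisquare(observed_2d, expected_2d, axis=1)
print(stat_rows, p_rows)
```

Passing the one-dimensional arrays directly is the simpler option. The axis=1 call is only there to show what the extra dimension changes.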
