How to combine the outputs of multiple Naive Bayes classifiers? - python

I am new to this.
I have a set of weak classifiers constructed using the Naive Bayes Classifier (NBC) in the sklearn toolkit.
My problem is how to combine the output of each NBC to make a final decision. I want my decision to be in probabilities and not labels.
I made the following program in Python. I assume a 2-class problem built from the iris dataset in sklearn. For demo/learning purposes, say I make 4 NBCs as follows.
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
import numpy as np
import cPickle
import math
iris = datasets.load_iris()
gnb1 = GaussianNB()
gnb2 = GaussianNB()
gnb3 = GaussianNB()
gnb4 = GaussianNB()
#Actual dataset is of 3 class I just made it into 2 class for this demo
target = np.where(iris.target, 2, 1)
gnb1.fit(iris.data[:, 0].reshape(150,1), target)
gnb2.fit(iris.data[:, 1].reshape(150,1), target)
gnb3.fit(iris.data[:, 2].reshape(150,1), target)
gnb4.fit(iris.data[:, 3].reshape(150,1), target)
#y_pred = gnb.predict(iris.data)
index = 0
y_prob1 = gnb1.predict_proba(iris.data[index,0].reshape(1,1))
y_prob2 = gnb2.predict_proba(iris.data[index,1].reshape(1,1))
y_prob3 = gnb3.predict_proba(iris.data[index,2].reshape(1,1))
y_prob4 = gnb4.predict_proba(iris.data[index,3].reshape(1,1))
#print y_prob1, "\n", y_prob2, "\n", y_prob3, "\n", y_prob4
# I just added it over all for each class
pos = y_prob1[:,1] + y_prob2[:,1] + y_prob3[:,1] + y_prob4[:,1]
neg = y_prob1[:,0] + y_prob2[:,0] + y_prob3[:,0] + y_prob4[:,0]
print pos
print neg
As you will notice, I simply added the probabilities from each NBC as the final score. I wonder if this is correct?
If I have done it wrong, can you please suggest some ideas so I can correct myself?

First of all - why are you doing this? You should have one Naive Bayes model here, not one per feature. It looks like you do not understand the idea of the classifier. What you did is actually what Naive Bayes does internally - it treats each feature independently, but as these are probabilities you should multiply them, or add their logarithms, so:
You should just have one NB, gnb.fit(iris.data, target)
If you insist on having many NBs, you should merge them through multiplication or addition of logarithms (which is the same from a mathematical perspective, but multiplication is less stable numerically)
pos = y_prob1[:,1] * y_prob2[:,1] * y_prob3[:,1] * y_prob4[:,1]
or
pos = np.exp(np.log(y_prob1[:,1]) + np.log(y_prob2[:,1]) + np.log(y_prob3[:,1]) + np.log(y_prob4[:,1]))
You can also directly predict the logarithm through gnb.predict_log_proba instead of gnb.predict_proba.
However, this approach has one error - Naive Bayes will also include the prior in each of your probabilities, so you will have very skewed distributions. So you have to manually normalize:
pos_prior = gnb1.class_prior_[1] # all models have the same prior so we can use the one from gnb1
pos = pos_prior * (y_prob1[:,1]/pos_prior) * (y_prob2[:,1]/pos_prior) * (y_prob3[:,1]/pos_prior) * (y_prob4[:,1]/pos_prior)
which simplifies to
pos = y_prob1[:,1] * y_prob2[:,1] * y_prob3[:,1] * y_prob4[:,1] / pos_prior**3
and for log to
pos = ... - 3 * np.log(pos_prior)
So once again - you should use the "1" option.
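If you do keep the per-feature classifiers, here is a minimal sketch (assuming gnb1..gnb4, iris and index from the question) that ties the pieces above together: sum the per-feature log-probabilities, remove the extra copies of the log-prior, and renormalize so the class probabilities sum to one.
import numpy as np
log_probs = [
    gnb.predict_log_proba(iris.data[index, i].reshape(1, -1))
    for i, gnb in enumerate([gnb1, gnb2, gnb3, gnb4])
]
# Each of the 4 models multiplied its likelihood by the class prior once,
# so subtract the 3 extra copies of the log-prior (both classes at once).
log_joint = sum(log_probs) - 3 * np.log(gnb1.class_prior_)
# Renormalize so the two class probabilities sum to 1 (see the next answer).
joint = np.exp(log_joint - log_joint.max())
posterior = joint / joint.sum(axis=1, keepdims=True)
print(posterior)  # columns: [P(class 1 | x), P(class 2 | x)]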

The answer by lejlot is almost correct. The one thing missing is that you need to normalize his pos result (the product of the probabilities, divided by the prior) by the sum of this pos result over both classes. Otherwise, the sum of the probabilities of all classes will not be equal to one.
Here is sample code that tests the result of this procedure for a dataset with 6 features:
# Use one Naive Bayes for all 6 features (X has 6 feature columns, y holds the labels):
gaus = GaussianNB(var_smoothing=0)
gaus.fit(X, y)
y_prob1 = gaus.predict_proba(X)
# Use one Naive Bayes on each half of the features and multiply the results:
gaus1 = GaussianNB(var_smoothing=0)
gaus1.fit(X[:, :3], y)
y_log_prob1 = gaus1.predict_log_proba(X[:, :3])
gaus2 = GaussianNB(var_smoothing=0)
gaus2.fit(X[:, 3:], y)
y_log_prob2 = gaus2.predict_log_proba(X[:, 3:])
pos = np.exp(y_log_prob1 + y_log_prob2 - np.log(gaus1.class_prior_))
y_prob2 = pos / pos.sum(axis=1)[:,None]
y_prob1 should be equal to y_prob2 apart from numerical errors (var_smoothing=0 helps reduce the error).
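As a quick sanity check (assuming the arrays computed above), you can verify that the two results agree:
print(np.allclose(y_prob1, y_prob2))  # should print True up to numerical tolerance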

Related

Is there a way to compute the explained variance of PCA on a test set?

I want to see how well PCA worked with my data.
I applied PCA on a training set and used the returned pca object to transform a test set. The pca object has an attribute pca.explained_variance_ratio_ which tells me the percentage of variance explained by each of the selected components for the training set. After applying the PCA transform, I want to see how well this worked on the test set. I tried inverse_transform(), which returned what the original values would look like, but I have no way to compare how it worked on the train set vs the test set.
pca = PCA(0.99)
pca.fit(train_df)
tranformed_test = pca.transform(test_df)
inverse_test = pca.inverse_transform(tranformed_test)
npt.assert_almost_equal(test_arr, inverse_test, decimal=2)
This returns:
Arrays are not almost equal to 2 decimals
Is there something like pca.explained_variance_ratio_ after transform()?
Variance explained for each component
You can compute it manually.
If the components are orthogonal (which is the case in PCA), the variance of X explained by component i is: 1 - ||X_i - X||^2 / ||X - X_mean||^2, where X_i is the reconstruction of X using only component i.
Hence the following example:
import numpy as np
from sklearn.decomposition import PCA
X_train = np.random.randn(200, 5)
X_test = np.random.randn(100, 5)
model = PCA(n_components=5).fit(X_train)
def explained_variance(X):
    result = np.zeros(model.n_components)
    for ii in range(model.n_components):
        X_trans = model.transform(X)
        X_trans_ii = np.zeros_like(X_trans)
        X_trans_ii[:, ii] = X_trans[:, ii]
        X_approx_ii = model.inverse_transform(X_trans_ii)
        result[ii] = 1 - (np.linalg.norm(X_approx_ii - X) /
                          np.linalg.norm(X - model.mean_)) ** 2
    return result
print(model.explained_variance_ratio_)
print(explained_variance(X_train))
print(explained_variance(X_test))
# [0.25335711 0.23100201 0.2195476 0.15717412 0.13891916]
# [0.25335711 0.23100201 0.2195476 0.15717412 0.13891916]
# [0.17851083 0.199134 0.24198887 0.23286815 0.14749816]
Total variance explained
Alternatively, if you only care about the total variance explained, you can use r2_score:
from sklearn.metrics import r2_score
model = PCA(n_components=2).fit(X_train)
print(model.explained_variance_ratio_.sum())
print(r2_score(X_train, model.inverse_transform(model.transform(X_train)),
               multioutput='variance_weighted'))
print(r2_score(X_test, model.inverse_transform(model.transform(X_test)),
               multioutput='variance_weighted'))
# 0.46445451252373826
# 0.46445451252373815
# 0.4470229486590848

How do I properly train and predict a value like biomass using a GRU RNN?

This is my first time trying to train, using a GRU RNN, on a dataset containing 8 variables in a time series of 20 years or so. The biomass value is what I'm trying to predict based on the other variables. I'm trying first with a 1-layer GRU. I'm not using softmax for the output layer. MSE is used for my cost function.
It is a basic GRU with forward propagation and backward gradient update. Here are the main functions I defined:
'x_t is the input training dataset with a dimension of 7572x8. So T = 7572, input_dim = 8, hidden_dim = 128. y_train is my training label.'
def forward_prop_step(self, x_t, y_train, s_t1_prev, V, U, W, b, c, learning_rate):
    T = x_t.shape[0]
    z_t1 = np.zeros((T, self.hidden_dim))
    r_t1 = np.zeros((T, self.hidden_dim))
    h_t1 = np.zeros((T, self.hidden_dim))
    s_t1 = np.zeros((T+1, self.hidden_dim))
    o_s = np.zeros((T, self.input_dim))
    for i in xrange(T):
        x_e = x_t[i].T
        z_t1[i] = sigmoid(U[0].dot(x_e) + W[0].dot(s_t1[i]) + b[0])  # 128x1
        r_t1[i] = sigmoid(U[1].dot(x_e) + W[1].dot(s_t1[i]) + b[1])  # 128x1
        h_t1[i] = np.tanh(U[2].dot(x_e) + W[2].dot(s_t1[i] * r_t1[i]) + b[2])  # 128x1
        s_t1[i+1] = (np.ones_like(z_t1[i]) - z_t1[i]) * h_t1[i] + z_t1[i] * s_t1[i]  # 128x1
        o_s[i] = np.dot(V, s_t1[i+1]) + c  # 8x1
    return [o_s, z_t1, r_t1, h_t1, s_t1]

def bptt(self, x, y_train, o, z_t1, r_t1, h_t1, s_t1, V, U, W, b, c):
    bptt_truncate = 360
    T = x.shape[0]  # length of time scale of input data (train)
    dLdU = np.zeros(U.shape)
    dLdV = np.zeros(V.shape)
    dLdW = np.zeros(W.shape)
    dLdb = np.zeros(b.shape)
    dLdc = np.zeros(c.shape)
    y_train_sp = np.repeat(y_train, self.input_dim)
    for t in np.arange(T)[::-1]:
        dLdy = 2 * (o[t] - y_train_sp[t])
        dydV = s_t1[t]
        dydc = 1.0
        dLdV += np.outer(dLdy, dydV)
        dLdc += dLdy * dydc
        for i in np.arange(max(0, t - bptt_truncate), t + 1)[::-30]:  # every month in the past year
            s_t1_pre = s_t1[i]
            dydst1 = V  # 8x128
            dst1dzt1 = -h_t1[i] + s_t1_pre  # 128x1
            dst1dht1 = np.ones_like(z_t1[i]) - z_t1[i]  # 128x1
            dzt1dU = np.outer(z_t1[i] * (1.0 - z_t1[i]), x[i])  # 128x8
            #print dzt1dU.shape
            dzt1dW = np.outer(z_t1[i] * (1.0 - z_t1[i]), s_t1_pre)  # 128x128
            dzt1db = z_t1[i] * (1.0 - z_t1[i])  # 128x1
            dht1dU = np.outer((1.0 - h_t1[i] ** 2), x[i])  # 128x8
            dht1dW = np.outer((1.0 - h_t1[i] ** 2), s_t1_pre * r_t1[i])  # 128x128
            dht1db = 1.0 - h_t1[i] ** 2  # 128x1
            dht1drt1 = (1.0 - h_t1[i] ** 2) * (W[2].dot(s_t1_pre))  # 128x1
            drt1dU = np.outer((r_t1[i] * (1.0 - r_t1[i])), x[i])  # 128x8
            drt1dW = np.outer((r_t1[i] * (1.0 - r_t1[i])), s_t1_pre)  # 128x128
            drt1db = (r_t1[i] * (1.0 - r_t1[i]))  # 128x1
            dLdW[0] += np.outer(dydst1.T.dot(dLdy), dzt1dW.dot(dst1dzt1))  # 128x128
            dLdU[0] += np.outer(dydst1.T.dot(dLdy), dst1dzt1.dot(dzt1dU))  # 128x8
            dLdb[0] += (dydst1.T.dot(dLdy)) * dst1dzt1 * dzt1db  # 128x1
            dLdW[1] += np.outer(dydst1.T.dot(dLdy), dst1dht1 * dht1drt1).dot(drt1dW)  # 128x128
            dLdU[1] += np.outer(dydst1.T.dot(dLdy), dst1dht1 * dht1drt1).dot(drt1dU)  # 128x8
            dLdb[1] += (dydst1.T.dot(dLdy)) * dst1dht1 * dht1drt1 * drt1db  # 128x1
            dLdW[2] += np.outer(dydst1.T.dot(dLdy), dht1dW.dot(dst1dht1))  # 128x128
            dLdU[2] += np.outer(dydst1.T.dot(dLdy), dst1dht1.dot(dht1dU))  # 128x8
            dLdb[2] += (dydst1.T.dot(dLdy)) * dst1dht1 * dht1db  # 128x1
    return [dLdV, dLdU, dLdW, dLdb, dLdc]

def predict(self, x):
    pred = np.amax(x, axis=1)
    pred_f = relu(pred)
    return pred_f
Parameters V, U, W, b, c are updated by the gradients dLdV, dLdU, dLdW, dLdb, dLdc calculated by bptt.
I have tried different weight initializations (Xavier or just random) and different time truncations, but all lead to the same outcome. Probably the weight update wasn't right? The network setup seems simple, though. I also really struggle with understanding the prediction and translating it to an actual biomass value. The function predict is what I defined to translate the output layer of the GRU network to a biomass value by taking the maximum value, but the output layer gives a similar value for almost all time iterations. I'm not sure of the best way to do the job. Thanks for any help or suggestions in advance.
I doubt anyone on stackoverflow is going to debug a custom implementation of GRU for you. If you were using Tensorflow or another high level library, I might take a stab at it, or if it was a simple fully connected network, but as it is all I can do is give some advice on how to proceed with debugging.
First, it sounds like you're running a brand new implementation on your own data set right off the bat. Instead, try starting out by testing your network on trivial, synthetic data sets first. Can it learn an identity function? A response which is simply the weighted average of the three previous time steps? And so on. It's easier to debug small, simple problems. Once you know your implementation can learn things that a GRU-based recurrent network should be able to learn, then you can start using your own data.
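As an illustration, a minimal synthetic sanity check could look like the sketch below (names and shapes are illustrative; it assumes your network consumes an input matrix of shape T x input_dim and a target vector of length T):
import numpy as np
np.random.seed(0)
T, input_dim = 2000, 8
X = np.random.randn(T, input_dim)
# Target: a weighted average of the first feature over the three previous
# time steps - something a GRU-based network should learn easily.
w = np.array([0.5, 0.3, 0.2])
y = np.zeros(T)
for t in range(3, T):
    y[t] = w @ X[t-3:t, 0][::-1]  # most recent step gets the largest weight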
Second, this comment of yours was very insightful:
Probably the weight update wasn't right?
While it's impossible to say for sure, this is a very common - perhaps the most common - source of bugs in backprop implementations. Andrew Ng recommends gradient checking to debug an implementation like this. Essentially, this involves numerically approximating the gradient. It's computationally inefficient but relies only on a correct implementation of forward propagation, which makes it very useful for debugging. For one, if the algorithm converges when the numerically approximated gradient is used, you can be more sure that your forward prop is correct and focus on debugging backprop. (On the other hand, if it still does not succeed, the issue is likely in your forward prop function.) For another, once the algorithm is working with the numerically approximated gradient, you can compare the output of your analytic gradient function with it and debug any discrepancies. This makes it a lot easier because you now know the correct answer that it should return.
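As a hedged sketch of what gradient checking looks like (it assumes you can wrap your forward pass into a scalar loss(params) function over a flat parameter vector; the names are illustrative, not part of your code):
import numpy as np
def numerical_gradient(loss, params, eps=1e-5):
    # Central-difference approximation of d(loss)/d(params).
    grad = np.zeros_like(params)
    for i in range(params.size):
        orig = params.flat[i]
        params.flat[i] = orig + eps
        loss_plus = loss(params)
        params.flat[i] = orig - eps
        loss_minus = loss(params)
        params.flat[i] = orig  # restore the original value
        grad.flat[i] = (loss_plus - loss_minus) / (2 * eps)
    return grad
# Compare against the analytic gradient from bptt; a relative error around
# 1e-7 to 1e-4 is usually fine, while much larger values suggest a bug:
# rel_err = np.linalg.norm(num_grad - ana_grad) / (np.linalg.norm(num_grad) + np.linalg.norm(ana_grad))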

How can I increase the accuracy of my Linear Regression model? (machine learning with Python)

I have a machine learning project in Python using the scikit-learn library. I have two separate datasets for training and testing, and I am trying to do linear regression. I use the code block shown below:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
from pylab import rcParams
import urllib
import sklearn
from sklearn.linear_model import LinearRegression
df =pd.read_csv("TrainingData.csv")
df2=pd.read_csv("TestingData.csv")
df['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df['Development_platform']]
df['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df['Language_Type']]
df2['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df2['Development_platform']]
df2['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df2['Language_Type']]
X_train = df[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_train = df['Effort']
X_test=df2[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_test=df2['Effort']
lr = LinearRegression().fit(X_train, Y_train)
print("lr.coef_: {}".format(lr.coef_))
print("lr.intercept_: {}".format(lr.intercept_))
print("Training set score: {:.2f}".format(lr.score(X_train, Y_train)))
print("Test set score: {:.7f}".format(lr.score(X_test, Y_test)))
My results are:
lr.coef_: [ 2.32088001e+00 2.07441948e-12 -4.73338567e-05 6.79658129e+02]
lr.intercept_: 2166.186033098048
Training set score: 0.63
Test set score: 0.5732999
What do you suggest? How can I increase my accuracy (by adding code, parameters, etc.)?
My datasets are here: https://yadi.sk/d/JJmhzfj-3QCV4V
I'll elaborate a bit on @GeorgiKaradjov's answer with some examples. Your question is very broad, and there are multiple ways to gain improvements. In the end, having domain knowledge (context) will give you the best possible chance of getting improvements.
Normalise your data, i.e., shift it to have a mean of zero, and a spread of 1 standard deviation
Turn categorical data into variables via, e.g., OneHotEncoding
Do feature engineering:
Are my features collinear?
Do any of my features have cross terms/higher-order terms?
Regularisation of the features to reduce possible overfitting
Look at alternative models given the underlying features and the aim of the project
1) Normalise data
from sklearn.preprocessing import StandardScaler
std = StandardScaler()
afp = np.append(X_train['AFP'].values, X_test['AFP'].values)
std.fit(afp.reshape(-1, 1))  # StandardScaler expects a 2D array
X_train[['AFP']] = std.transform(X_train[['AFP']])
X_test[['AFP']] = std.transform(X_test[['AFP']])
Gives
0 0.752395
1 0.008489
2 -0.381637
3 -0.020588
4 0.171446
Name: AFP, dtype: float64
2) Categorical Feature Encoding
def feature_engineering(df):
    dev_plat = pd.get_dummies(df['Development_platform'], prefix='dev_plat')
    df[dev_plat.columns] = dev_plat
    df = df.drop('Development_platform', axis=1)
    lang_type = pd.get_dummies(df['Language_Type'], prefix='lang_type')
    df[lang_type.columns] = lang_type
    df = df.drop('Language_Type', axis=1)
    resource_level = pd.get_dummies(df['Resource_Level'], prefix='resource_level')
    df[resource_level.columns] = resource_level
    df = df.drop('Resource_Level', axis=1)
    return df
X_train = feature_engineering(X_train)
X_train.head(5)
Gives
AFP dev_plat_077070 dev_plat_077082 dev_plat_077117108116105 dev_plat_080067 lang_type_051071076 lang_type_052071076 lang_type_065112071 resource_level_1 resource_level_2 resource_level_4
0 0.752395 1 0 0 0 1 0 0 1 0 0
1 0.008489 0 0 1 0 0 1 0 1 0 0
2 -0.381637 0 0 1 0 0 1 0 1 0 0
3 -0.020588 0 0 1 0 1 0 0 1 0 0
3) Feature Engineering; collinearity
import seaborn as sns
corr = X_train.corr()
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool), cmap=sns.diverging_palette(220, 10, as_cmap=True), square=True)
You want the red line for y=x because values should be correlated with themselves. However, any other red or blue columns show there's a strong correlation/anti-correlation that requires more investigation. For example, resource_level_1 and resource_level_4 might be strongly anti-correlated in the sense that if an observation has 1 there is less chance it has 4, etc. Regression assumes that the parameters used are independent of one another.
3) Feature engineering; higher-order terms
Maybe your model is too simple; you could consider adding higher-order and cross terms:
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(2, interaction_only=True)
output_nparray = poly.fit_transform(df)
target_feature_names = ['x'.join(['{}^{}'.format(pair[0],pair[1]) for pair in tuple if pair[1]!=0]) for tuple in [zip(df.columns, p) for p in poly.powers_]]
output_df = pd.DataFrame(output_nparray, columns=target_feature_names)
I had a quick try at this, and I don't think the higher-order terms help out much. It's also possible your data is non-linear; a quick logarithm of the Y-output gives a worse fit, suggesting it's linear. You could also look at the actuals, but I was too lazy....
4) Regularisation
Try using sklearn's ridge regression (RidgeCV) and playing with alpha:
lr = RidgeCV(alphas=np.arange(70,100,0.1), fit_intercept=True)
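A minimal sketch of how that could be used with the features prepared above (assuming X_train, Y_train, X_test, Y_test as before):
import numpy as np
from sklearn.linear_model import RidgeCV
lr = RidgeCV(alphas=np.arange(70, 100, 0.1), fit_intercept=True)
lr.fit(X_train, Y_train)
print("Chosen alpha: {:.1f}".format(lr.alpha_))
print("Test set score: {:.2f}".format(lr.score(X_test, Y_test)))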
5) Alternative models
Sometimes linear regression is not always suited. For example, Random Forest Regressors can perform very well, and are usually insensitive to data being standardised, and being categorical/continuous. Other models include XGBoost, and Lasso (Linear regression with L1 regularisation).
lr = RandomForestRegressor(n_estimators=100)
Putting it all together
I got carried away and started looking at your problem, but couldn't improve it too much without knowing all the context of the features:
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
from pylab import rcParams
import urllib
import sklearn
from sklearn.linear_model import RidgeCV, LinearRegression, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.model_selection import GridSearchCV
def feature_engineering(df):
    dev_plat = pd.get_dummies(df['Development_platform'], prefix='dev_plat')
    df[dev_plat.columns] = dev_plat
    df = df.drop('Development_platform', axis=1)
    lang_type = pd.get_dummies(df['Language_Type'], prefix='lang_type')
    df[lang_type.columns] = lang_type
    df = df.drop('Language_Type', axis=1)
    resource_level = pd.get_dummies(df['Resource_Level'], prefix='resource_level')
    df[resource_level.columns] = resource_level
    df = df.drop('Resource_Level', axis=1)
    return df
df = pd.read_csv("TrainingData.csv")
df2 = pd.read_csv("TestingData.csv")
df['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df['Development_platform']]
df['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df['Language_Type']]
df2['Development_platform']= ["".join("%03d" % ord(c) for c in s) for s in df2['Development_platform']]
df2['Language_Type']= ["".join("%03d" % ord(c) for c in s) for s in df2['Language_Type']]
X_train = df[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_train = df['Effort']
X_test = df2[['AFP','Development_platform','Language_Type','Resource_Level']]
Y_test = df2['Effort']
std = StandardScaler()
afp = np.append(X_train['AFP'].values, X_test['AFP'].values)
std.fit(afp.reshape(-1, 1))  # StandardScaler expects a 2D array
X_train[['AFP']] = std.transform(X_train[['AFP']])
X_test[['AFP']] = std.transform(X_test[['AFP']])
X_train = feature_engineering(X_train)
X_test = feature_engineering(X_test)
lr = RandomForestRegressor(n_estimators=50)
lr.fit(X_train, Y_train)
y_pred = lr.predict(X_test)
print("Training set score: {:.2f}".format(lr.score(X_train, Y_train)))
print("Test set score: {:.2f}".format(lr.score(X_test, Y_test)))
fig = plt.figure()
ax = fig.add_subplot(111)
ax.errorbar(Y_test, y_pred, fmt='o')
ax.errorbar([1, Y_test.max()], [1, Y_test.max()])
Resulting in:
Training set score: 0.90
Test set score: 0.61
You can look at the importance of the variables (higher value, more important).
Importance
AFP 0.882295
dev_plat_077070 0.020817
dev_plat_077082 0.001162
dev_plat_077117108116105 0.016334
dev_plat_080067 0.004077
lang_type_051071076 0.012458
lang_type_052071076 0.021195
lang_type_065112071 0.001118
resource_level_1 0.012644
resource_level_2 0.006673
resource_level_4 0.021227
You could start looking at the hyperparameters to get improvements on this also: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
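For example, a minimal GridSearchCV sketch over a couple of RandomForestRegressor hyperparameters might look like this (the grid values are just illustrative):
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
}
search = GridSearchCV(RandomForestRegressor(), param_grid, cv=5)
search.fit(X_train, Y_train)
print(search.best_params_)
print("Test set score: {:.2f}".format(search.score(X_test, Y_test)))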
Here are some tips:
Data preparation (exploration) is one of the most important steps in a machine learning project; you need to start with it.
Did you clean your data? If not, start with that step!
As said in this tutorial :
There are no shortcuts for data exploration. If you are in a state of mind that machine learning can sail you away from every data storm, trust me, it won't. After some point of time, you'll realize that you are struggling at improving model's accuracy. In such situation, data exploration techniques will come to your rescue.
Here are some steps for data exploration:
missing values treatment,
outlier removal
feature engineering
After that, try to perform univariate and bivariate analysis with your features.
Use one-hot encoding to transform your categorical features into numeric ones.
This is what you need, according to what we talked about in the comments.
Here is a tutorial on how to deal with categorical variables; one-hot encoding from scikit-learn is the best technique for your problem (see the sketch at the end of this answer).
Using ASCII representation is not the best practice for handling categorical features.
You can find more about data exploration here.
Follow the suggestions I gave you and thank me later.
Normalize your data.
Depending on the type of input features, you can extract different features from them (feature combinations are possible too).
If your data is not linearly separable, you won't be able to predict it well. You may need to use another model - logistic regression, SVR, NN / whatever.
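As a sketch of the one-hot encoding suggestion above (assuming the df/df2 frames and column names from the question; the exact column handling is illustrative):
from sklearn.preprocessing import OneHotEncoder
cat_cols = ['Development_platform', 'Language_Type', 'Resource_Level']
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(df[cat_cols])                                 # fit on the training frame only
X_train_cat = enc.transform(df[cat_cols]).toarray()   # encoded training categories
X_test_cat = enc.transform(df2[cat_cols]).toarray()   # same encoding applied to the test frame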

GaussianMixture initialization using component parameters - sklearn

I want to use sklearn.mixture.GaussianMixture to store a Gaussian mixture model so that I can later use it to generate samples or evaluate a value at a sample point using the score_samples method. Here is an example where the components have the following weights, means and covariances:
import numpy as np
weights = np.array([0.6322941277066596, 0.3677058722933399])
mu = np.array([[0.9148052872961359, 1.9792961751316835],
               [-1.0917396392992502, -0.9304220945910037]])
sigma = np.array([[[2.267889129267119, 0.6553245618368836],
                   [0.6553245618368835, 0.6571014653342457]],
                  [[0.9516607767206848, -0.7445831474157608],
                   [-0.7445831474157608, 1.006599716443763]]])
Then I initialised the mixture as follows:
from sklearn import mixture
gmix = mixture.GaussianMixture(n_components=2, covariance_type='full')
gmix.weights_ = weights # mixture weights (n_components,)
gmix.means_ = mu # mixture means (n_components, 2)
gmix.covariances_ = sigma # mixture cov (n_components, 2, 2)
Finally, I tried to generate a sample based on the parameters, which resulted in an error:
x = gmix.sample(1000)
NotFittedError: This GaussianMixture instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
As I understand, GaussianMixture is intended to fit a sample using a mixture of Gaussians, but is there a way to provide it with the final values and continue from there?
You rock, J.P.Petersen!
After seeing your answer, I compared the change introduced by using the fit method. It seems the initial instantiation does not create all the attributes of gmix. Specifically, it is missing the following attributes:
covariances_
means_
weights_
converged_
lower_bound_
n_iter_
precisions_
precisions_cholesky_
The first three are introduced when the given inputs are assigned. Among the rest, for my application the only attribute that I need is precisions_cholesky_, which is the Cholesky decomposition of the inverse covariance matrices. As a minimum requirement, I added it as follows:
gmix.precisions_cholesky_ = np.linalg.cholesky(np.linalg.inv(sigma)).transpose((0, 2, 1))
It seems that it has a check that makes sure the model has been trained. You could trick it by training the GMM on a very small data set before setting the parameters. Like this:
gmix = mixture.GaussianMixture(n_components=2, covariance_type='full')
gmix.fit(np.random.rand(10, 2)) # Now it thinks it is trained
gmix.weights_ = weights # mixture weights (n_components,)
gmix.means_ = mu # mixture means (n_components, 2)
gmix.covariances_ = sigma # mixture cov (n_components, 2, 2)
x = gmix.sample(1000) # Should work now
To understand what is happening, note that GaussianMixture first checks that it has been fitted:
self._check_is_fitted()
Which triggers the following check:
def _check_is_fitted(self):
    check_is_fitted(self, ['weights_', 'means_', 'precisions_cholesky_'])
And finally the last function call:
def check_is_fitted(estimator, attributes, msg=None, all_or_any=all):
which only checks that the classifier already has the attributes.
So in short, the only thing you are missing to have it working (without having to fit it) is to set the precisions_cholesky_ attribute:
gmix.precisions_cholesky_ = 0
should do the trick (can't try it so not 100% sure :P)
However, if you want to play it safe and have a consistent solution in case scikit-learn updates its constraints, the solution of @J.P.Petersen is probably the best way to go.
As a slight alternative to @hashmuke's answer, you can use the precision computation that is used inside GaussianMixture directly:
import numpy as np
from scipy.stats import invwishart as IW
from sklearn.mixture import GaussianMixture as GMM
from sklearn.mixture._gaussian_mixture import _compute_precision_cholesky
n_dims = 5
mu1 = np.random.randn(n_dims)
mu2 = np.random.randn(n_dims)
Sigma1 = IW.rvs(n_dims, 0.1 * np.eye(n_dims))
Sigma2 = IW.rvs(n_dims, 0.1 * np.eye(n_dims))
gmm = GMM(n_components=2)
gmm.weights_ = np.array([0.2, 0.8])
gmm.means_ = np.stack([mu1, mu2])
gmm.covariances_ = np.stack([Sigma1, Sigma2])
gmm.precisions_cholesky_ = _compute_precision_cholesky(gmm.covariances_, 'full')
X, y = gmm.sample(1000)
And depending on your covariance type, you should change 'full' accordingly as input to _compute_precision_cholesky (it will be one of 'full', 'diag', 'tied', 'spherical').

How to code a hierarchical mixture model of multivariate normals using PYMC

I successfully implemented a mixture of 3 normals using PyMC (shown at https://drive.google.com/file/d/0Bwnmbh6ueWhqSkUtV1JFZDJwLWc, and similar to the question asked at How to model a mixture of 3 Normals in PyMC?)
My next step is to try and code mixtures of multivariate normals.
There is, however, an additional complexity to the data - a hierarchy, with sets of observations belonging to a parent observation. The clustering is done on the parent observations, not on the individual observations themselves. This first step generates the data (60 parents, with 50 observations per parent), and works fine.
import numpy as np
import pymc as mc
n = 3 #mixtures
B = 5 #Bias between those at different mixtures
tau = 3 #Variances
nprov = 60 #number of parent observations
mu = [[0,0],[0,B],[-B,0]]
true_cov0 = np.array([[1.,0.],[0.,1.]])
true_cov1 = np.array([[1.,0.],[0.,tau**(2)]])
true_cov2 = np.array([[tau**(-2),0],[0.,1.]])
trueprobs = [.4, .3, .3] #probability of being in each of the three mixtures
prov = np.random.multinomial(1, trueprobs, size=nprov)
v = prov[:,1] + (prov[:,2])*2
numtoeach = 50
n_obs = nprov*numtoeach
vAll = np.tile(v,numtoeach)
ndata = numtoeach*nprov
p1 = range(nprov)
prov1 = np.tile(p1,numtoeach)
data = (vAll==0)*(np.random.multivariate_normal(mu[0],true_cov0,ndata)).T \
+ (vAll==1)*(np.random.multivariate_normal(mu[1],true_cov1,ndata)).T \
+ (vAll==2)*(np.random.multivariate_normal(mu[2],true_cov2,ndata)).T
data=data.T
However, when I try to use PyMC to do the sampling, I run into trouble ('error: failed in converting 3rd argument `tau' of flib.prec_mvnorm to C/Fortran array'):
p = 2 #covariates
prior_mu1=np.ones(p)
prior_mu2=np.ones(p)
prior_mu3=np.ones(p)
post_mu1 = mc.Normal("returns1",prior_mu1,1,size=p)
post_mu2 = mc.Normal("returns2",prior_mu2,1,size=p)
post_mu3 = mc.Normal("returns3",prior_mu3,1,size=p)
post_cov_matrix_inv1 = mc.Wishart("cov_matrix_inv1",n_obs,np.eye(p) )
post_cov_matrix_inv2 = mc.Wishart("cov_matrix_inv2",n_obs,np.eye(p) )
post_cov_matrix_inv3 = mc.Wishart("cov_matrix_inv3",n_obs,np.eye(p) )
#Combine prior means and variance matrices
meansAll= np.array([post_mu1,post_mu2,post_mu3])
precsAll= np.array([post_cov_matrix_inv1,post_cov_matrix_inv2,post_cov_matrix_inv3])
dd = mc.Dirichlet('dd', theta=(1,)*n)
category = mc.Categorical('category', p=dd, size=nprov)
#This step accounts for the hierarchy: observations' means are equal to their parent's mean
#Parent is labeled prov1
@mc.deterministic
def mean(category=category, meansAll=meansAll):
    lat = category[prov1]
    new = meansAll[lat]
    return new

@mc.deterministic
def prec(category=category, precsAll=precsAll):
    lat = category[prov1]
    return precsAll[lat]
obs = mc.MvNormal( "observed returns", mean, prec, observed = True, value = data)
I know the problem is not with the format of the simulated observed data, because this step would work fine, in place of the above:
obs = mc.MvNormal( "observed returns", post_mu3, post_cov_matrix_inv3, observed = True, value = data )
As a result, I think the issue is how the mean vector ('mean') and the covariance matrix ('prec') are entered; I just don't know how. Like I said, this worked fine with mixtures of normal distributions, but mixtures of multivariate normals add a complexity I can't figure out.
This is a good example of the difficulty PyMC has with vectors of multivariate variables. Not that it's difficult - it's just not as elegant as it should be. You should create a list comprehension of the MVN nodes and wrap that as an observed stochastic.
@mc.observed
def obs(value=data, mean=mean, prec=prec):
    return sum(mc.mv_normal_like(v, m, T) for v, m, T in zip(data, mean, prec))
Here is the IPython notebook
