I used the Boruta package in R and Python on the same dataset. All the steps and other methods I applied are the same, but the Boruta feature selection results differ between Python and R: 46 features are selected in R but only 20 in Python. What is the reason?
R
M_boruta <- Boruta::Boruta(is_churn ~ . -cust_id, data = Mobile, doTrace = 2)
print(M_boruta)
plot(M_boruta, xlab = "", xaxt = "n")
lz_2 <- lapply(1:ncol(M_boruta$ImpHistory),function(i)
M_boruta$ImpHistory[is.finite(M_boruta$ImpHistory[,i]),i])
names(lz_2) <- colnames(M_boruta$ImpHistory)
Labels_2 <- sort(sapply(lz_2,median))
axis(side = 1,las=2,labels = names(Labels_2),
at = 1:ncol(M_boruta$ImpHistory), cex.axis = 0.7)
M_boruta_attr <- getSelectedAttributes(M_boruta, withTentative = F)
M_boruta_df <- Mobile[ ,(names(Mobile) %in% M_boruta_attr)]
str(M_boruta_df)
Python
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
rfc = RandomForestClassifier(n_estimators=1000, n_jobs=-1, class_weight='balanced', max_depth=50)
boruta_selector = BorutaPy(rfc, n_estimators='auto', verbose=2)
churn_gsm_bor_x = churn_gsm_bor.iloc[:,1:].values
churn_gsm_bor_y = churn_gsm_bor.iloc[:,0].values.ravel()
boruta_selector.fit(churn_gsm_bor_x, churn_gsm_bor_y)
print("=============BORUTA==============")
print(boruta_selector.n_features_)
print(boruta_selector.support_)
print(boruta_selector.ranking_)
churn_gsm_bor_x_filter=boruta_selector.transform(churn_gsm_bor_x)
print(churn_gsm_bor_x_filter)
This might be because the parameters you specify for the Random Forest classifier in Python differ from the default parameters you use in R (cf. https://cran.r-project.org/web/packages/randomForest/randomForest.pdf, or ranger in more recent versions of Boruta: https://cran.r-project.org/web/packages/ranger/ranger.pdf). I'd point out that you are also setting the maximum depth of the trees in the Python implementation higher than recommended (cf. https://github.com/scikit-learn-contrib/boruta_py) - personally I've found that this can have a large effect on how many features are selected.
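To get a feel for how much the forest settings matter, here is a minimal sketch of a single Boruta-style round written with scikit-learn only. It is not the full Boruta algorithm (there are no repeated iterations and no statistical test), and the synthetic dataset, the helper name count_hits and the chosen depths are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def count_hits(X, y, max_depth, seed=0):
    """One Boruta-style round: count real features that beat the best shadow feature."""
    rng = np.random.default_rng(seed)
    # Shadow features: every column shuffled on its own, which breaks any link to y.
    shadows = np.column_stack([rng.permutation(col) for col in X.T])
    forest = RandomForestClassifier(
        n_estimators=200,
        max_depth=max_depth,
        class_weight="balanced",
        random_state=seed,
        n_jobs=-1,
    )
    forest.fit(np.hstack([X, shadows]), y)
    importances = forest.feature_importances_
    n_real = X.shape[1]
    threshold = importances[n_real:].max()  # importance of the strongest shadow
    return int((importances[:n_real] > threshold).sum())


# Synthetic stand-in for the churn data (assumption).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, n_redundant=2, random_state=0
)
hits_by_depth = {depth: count_hits(X, y, depth) for depth in (5, 50)}
print(hits_by_depth)
```

Running the same comparison on your own data with the R defaults on one side and your Python settings on the other may show whether the hyperparameters alone account for the gap.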
I am porting some distribution fitting code from R to Python and I noticed that the shape and scale parameter estimation in R and Python are different after 3 decimal places, and I wondered why this would be the case.
R code:
library(fitdistrplus)
library(ADGofTest)
set.seed(66)
weibull_sample <- rweibull(150, shape = 0.75, scale = 1)
weibull_fit <- fitdist(weibull_sample,"weibull",method="mle")
summary(weibull_fit) # shape = 0.888309653152, scale = 1.065783323933
gofstat(weibull_fit) #AD 0.9755906963522
plot(weibull_fit)
ad.test(weibull_sample, pweibull, shape = 0.888309653152, scale = 1.065783323933)
# AD = 0.9755906964
write.csv(weibull_sample, file = "weibull_sample.csv", row.names = FALSE, col.names = FALSE)
Here is the Python code:
# Import libraries
import pandas as pd
import numpy as np
import math
from scipy import stats
import statistics as stat
# Read in a dataset from disk (n)
file_path = "weibull_sample.csv"
weibull_df = pd.read_csv(file_path)
weibull_df = weibull_df.sort_values(by=['Wait_Times'], ascending=True)
weibull_df = weibull_df.reset_index(drop=True)
# Fit a Weibull distribution to the dataset (location fixed at 0)
weibull_fit = stats.weibull_min.fit(weibull_df['Wait_Times'], floc=0)
# Extract parameters to individual values
weibull_shape, weibull_unused, weibull_scale = weibull_fit
# Print the values
print("Weibull Shape Parameter for Dataset", weibull_shape)
print("Weibull Scale Parameter for Dataset", weibull_scale)
#shape R = 0.888309653152
#shape Py = 0.8883784856851663
#scale R = 1.065783323933
#scale Py = 1.0659294522799554
Note that I am using options(digits = 12) in R to print 12 significant digits; Python appears to print about 16 by default.
Of course, if I round to 3 decimal places I am none the wiser but I am wondering why there is a difference in the first place.
But more importantly how do I know which set of parameters is "correct"?
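One possible way to compare two candidate parameter sets is to evaluate the log-likelihood of the same data under each: for a maximum likelihood fit, the pair with the higher log-likelihood is closer to the optimum. The sketch below assumes a two-parameter Weibull with the location fixed at 0; the helper weibull_loglik and the synthetic sample are made up for illustration and do not reproduce the R sample.

```python
import numpy as np
from scipy import stats


def weibull_loglik(sample, shape, scale):
    """Log-likelihood of a two-parameter Weibull (location fixed at 0)."""
    return float(np.sum(stats.weibull_min.logpdf(sample, shape, loc=0, scale=scale)))


# Synthetic stand-in for weibull_sample.csv (assumption: same true shape and scale as in the R call).
sample = stats.weibull_min.rvs(0.75, loc=0, scale=1.0, size=150, random_state=66)

# Fit with the location fixed at 0, as in the question.
shape_hat, _, scale_hat = stats.weibull_min.fit(sample, floc=0)

ll_fit = weibull_loglik(sample, shape_hat, scale_hat)
ll_off = weibull_loglik(sample, shape_hat + 0.05, scale_hat)  # deliberately worse shape

print("log-likelihood at fitted parameters:", ll_fit)
print("log-likelihood with shape shifted by 0.05:", ll_off)
```

On the real data you would call weibull_loglik once with the R estimates and once with the SciPy estimates. Differences that only show up around the fourth decimal place are often down to the stopping tolerance of the numerical optimizer rather than to a different model.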
I am playing around with this code which is for Univariate linear mixed effects modelling. The data set denotes:
students as s
instructors as d
departments as dept
service as service
In the syntax of R's lme4 package (Bates et al., 2015), the model implemented can be summarized as:
y ~ 1 + (1|students) + (1|instructor) + (1|dept) + service
where 1 denotes an intercept term, (1|x) denotes a random effect for x, and x denotes a fixed effect.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import edward as ed
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
from edward.models import Normal
from observations import insteval
data = pd.DataFrame(data, columns=metadata['columns'])
train = data.sample(frac=0.8)
test = data.drop(train.index)
train.head()
s_train = train['s'].values
d_train = train['dcodes'].values
dept_train = train['deptcodes'].values
y_train = train['y'].values
service_train = train['service'].values
n_obs_train = train.shape[0]
s_test = test['s'].values
d_test = test['dcodes'].values
dept_test = test['deptcodes'].values
y_test = test['y'].values
service_test = test['service'].values
n_obs_test = test.shape[0]
n_s = max(s_train) + 1 # number of students
n_d = max(d_train) + 1 # number of instructors
n_dept = max(dept_train) + 1 # number of departments
n_obs = train.shape[0] # number of observations
# Set up placeholders for the data inputs.
s_ph = tf.placeholder(tf.int32, [None])
d_ph = tf.placeholder(tf.int32, [None])
dept_ph = tf.placeholder(tf.int32, [None])
service_ph = tf.placeholder(tf.float32, [None])
# Set up fixed effects.
mu = tf.get_variable("mu", [])
service = tf.get_variable("service", [])
sigma_s = tf.sqrt(tf.exp(tf.get_variable("sigma_s", [])))
sigma_d = tf.sqrt(tf.exp(tf.get_variable("sigma_d", [])))
sigma_dept = tf.sqrt(tf.exp(tf.get_variable("sigma_dept", [])))
# Set up random effects.
eta_s = Normal(loc=tf.zeros(n_s), scale=sigma_s * tf.ones(n_s))
eta_d = Normal(loc=tf.zeros(n_d), scale=sigma_d * tf.ones(n_d))
eta_dept = Normal(loc=tf.zeros(n_dept), scale=sigma_dept * tf.ones(n_dept))
yhat = (tf.gather(eta_s, s_ph) +
tf.gather(eta_d, d_ph) +
tf.gather(eta_dept, dept_ph) +
mu + service * service_ph)
y = Normal(loc=yhat, scale=tf.ones(n_obs))
#Inference
q_eta_s = Normal(
loc=tf.get_variable("q_eta_s/loc", [n_s]),
scale=tf.nn.softplus(tf.get_variable("q_eta_s/scale", [n_s])))
q_eta_d = Normal(
loc=tf.get_variable("q_eta_d/loc", [n_d]),
scale=tf.nn.softplus(tf.get_variable("q_eta_d/scale", [n_d])))
q_eta_dept = Normal(
loc=tf.get_variable("q_eta_dept/loc", [n_dept]),
scale=tf.nn.softplus(tf.get_variable("q_eta_dept/scale", [n_dept])))
latent_vars = {
eta_s: q_eta_s,
eta_d: q_eta_d,
eta_dept: q_eta_dept}
data = {
y: y_train,
s_ph: s_train,
d_ph: d_train,
dept_ph: dept_train,
service_ph: service_train}
inference = ed.KLqp(latent_vars, data)
This works fine in the univariate case for Linear mixed effects modelling. I am trying to extend this approach to the multivariate case. Any ideas are more than welcome.
There are a number of ways to conduct linear mixed effects models in Python. It looks like you've adapted the Tensorflow approach but if that is not a hard requirement then there are several other potentially more convenient options.
You can use the Statsmodels implementation of LMER which is conveniently contained in Python but the syntax is a bit different from traditional formulaic expressions from R's LMER. It looks like you are using python to split your data to training and test sets so you can also write a loop to call the
You can also install R and rpy2 on your local machine and call the LMER packages from your Python environment. This lets you keep working with the R tools you are familiar with while doing everything else in Python. All you have to do is use the rmagic %%R (or %R for inline) in your cell block in Jupyter Notebooks to pass variables and models between Python and R. The latter would be useful if you are passing the train/test data you split in Python to R to run lmer and retrieve the parameters back in a loop.
Lastly, another option is to use Pymer4 which is a wrapper for rpy2 allowing you to directly call LMER in R but without having to deal with rmagic.
I wrote a tutorial on how to use LMER with each of these methods which also works on Cloud setups like Google Colab. These methods will all allow you to run the multivariate approach like you asked for using the LMER in R but from a Python environment.
I want to use sklearn.mixture.GaussianMixture to store a Gaussian mixture model so that I can later use it to generate samples or compute a value at a sample point using the score_samples method. Here is an example where the components have the following weights, means and covariances:
import numpy as np
weights = np.array([0.6322941277066596, 0.3677058722933399])
mu = np.array([[0.9148052872961359, 1.9792961751316835],
[-1.0917396392992502, -0.9304220945910037]])
sigma = np.array([[[2.267889129267119, 0.6553245618368836],
[0.6553245618368835, 0.6571014653342457]],
[[0.9516607767206848, -0.7445831474157608],
[-0.7445831474157608, 1.006599716443763]]])
Then I initialised the mixture as follows:
from sklearn import mixture
gmix = mixture.GaussianMixture(n_components=2, covariance_type='full')
gmix.weights_ = weights # mixture weights (n_components,)
gmix.means_ = mu # mixture means (n_components, 2)
gmix.covariances_ = sigma # mixture cov (n_components, 2, 2)
Finally I tried to generate a sample based on the parameters which resulted in an error:
x = gmix.sample(1000)
NotFittedError: This GaussianMixture instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.
As I understand GaussianMixture is intended to fit a sample using a mixture of Gaussian but is there a way to provide it with the final values and continue from there?
You rock, J.P.Petersen!
After seeing your answer I compared the change introduced by using fit method. It seems the initial instantiation does not create all the attributes of gmix. Specifically it is missing the following attributes,
covariances_
means_
weights_
converged_
lower_bound_
n_iter_
precisions_
precisions_cholesky_
The first three are introduced when the given inputs are assigned. Of the rest, the only attribute my application needs is precisions_cholesky_, the Cholesky factor of each precision (inverse covariance) matrix, stored so that P.dot(P.T) equals the inverse covariance. As a minimum requirement I added it as follows:
gmix.precisions_cholesky_ = np.linalg.inv(np.linalg.cholesky(sigma)).transpose((0, 2, 1))
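A quick way to check that a hand-built precisions_cholesky_ is consistent is to compare score_samples with a mixture density computed independently with SciPy. The sketch below assumes the convention that the stored factor is the transposed inverse of the lower Cholesky factor of each covariance, which is my understanding of what scikit-learn uses internally for covariance_type='full'. The parameter values are rounded from the question, and the evaluation points are arbitrary.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# Parameters rounded from the question (illustrative values).
weights = np.array([0.632, 0.368])
mu = np.array([[0.915, 1.979], [-1.092, -0.930]])
sigma = np.array([[[2.268, 0.655], [0.655, 0.657]],
                  [[0.952, -0.745], [-0.745, 1.007]]])

gmix = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmix.weights_ = weights
gmix.means_ = mu
gmix.covariances_ = sigma
# Factor P with P @ P.T == inv(sigma): transpose of the inverse lower Cholesky factor.
gmix.precisions_cholesky_ = np.linalg.inv(np.linalg.cholesky(sigma)).transpose((0, 2, 1))

points = np.array([[0.0, 0.0], [1.0, 2.0], [-1.0, -1.0]])  # arbitrary test points
density_sklearn = np.exp(gmix.score_samples(points))
density_scipy = sum(
    w * multivariate_normal(m, c).pdf(points) for w, m, c in zip(weights, mu, sigma)
)

samples, labels = gmix.sample(1000)
print(density_sklearn, density_scipy)
```

If the two densities disagree, the factor was most likely built with the wrong orientation.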
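It seems that it has a check that makes sure that the model has been trained. You could trick it by training the GMM on a very small data set before setting the parameters (sampling then works; for score_samples you would also need to update precisions_cholesky_ to match the new covariances). Like this: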
gmix = mixture.GaussianMixture(n_components=2, covariance_type='full')
gmix.fit(np.random.rand(10, 2)) # Now it thinks it is trained
gmix.weights_ = weights # mixture weights (n_components,)
gmix.means_ = mu # mixture means (n_components, 2)
gmix.covariances_ = sigma # mixture cov (n_components, 2, 2)
x = gmix.sample(1000) # Should work now
To understand what is happening: GaussianMixture first checks that it has been fitted:
self._check_is_fitted()
Which triggers the following check:
def _check_is_fitted(self):
check_is_fitted(self, ['weights_', 'means_', 'precisions_cholesky_'])
And finally the last function call:
def check_is_fitted(estimator, attributes, msg=None, all_or_any=all):
which only checks that the classifier already has the attributes.
So in short, the only thing missing to make it work (without having to fit it) is the precisions_cholesky_ attribute:
gmix.precisions_cholesky_ = 0
should do the trick for sample (can't try it so not 100% sure :P). Keep in mind that score_samples actually reads precisions_cholesky_, so a placeholder value is only enough to get past the check.
However, if you want to play it safe and have a consistent solution in case scikit-learn updates its constraints, the solution of @J.P.Petersen is probably the best way to go.
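The presence-only nature of that check can be seen in a small sketch using the public check_is_fitted helper from sklearn.utils.validation. The REQUIRED list mirrors the attribute names quoted above, and the helper passes_check is a name invented for this example.

```python
from sklearn.exceptions import NotFittedError
from sklearn.mixture import GaussianMixture
from sklearn.utils.validation import check_is_fitted

# Attribute names quoted in the answer above.
REQUIRED = ["weights_", "means_", "precisions_cholesky_"]


def passes_check(estimator):
    """Return True if check_is_fitted accepts the estimator, False otherwise."""
    try:
        check_is_fitted(estimator, REQUIRED)
        return True
    except NotFittedError:
        return False


gmix = GaussianMixture(n_components=2, covariance_type="full")
before = passes_check(gmix)

# Placeholder values, deliberately meaningless: the check only looks for the attributes.
for name in REQUIRED:
    setattr(gmix, name, None)
after = passes_check(gmix)

print(before, after)
```

Passing the check says nothing about whether the values are usable, which is why real parameters still have to be assigned before calling sample or score_samples.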
As a slight alternative to @hashmuke's answer, you can use the precision computation that is used inside GaussianMixture directly:
import numpy as np
from scipy.stats import invwishart as IW
from sklearn.mixture import GaussianMixture as GMM
from sklearn.mixture._gaussian_mixture import _compute_precision_cholesky
n_dims = 5
mu1 = np.random.randn(n_dims)
mu2 = np.random.randn(n_dims)
Sigma1 = IW.rvs(n_dims, 0.1 * np.eye(n_dims))
Sigma2 = IW.rvs(n_dims, 0.1 * np.eye(n_dims))
gmm = GMM(n_components=2)
gmm.weights_ = np.array([0.2, 0.8])
gmm.means_ = np.stack([mu1, mu2])
gmm.covariances_ = np.stack([Sigma1, Sigma2])
gmm.precisions_cholesky_ = _compute_precision_cholesky(gmm.covariances_, 'full')
X, y = gmm.sample(1000)
Depending on your covariance type, change 'full' accordingly in the call to _compute_precision_cholesky (it will be one of full, diag, tied, spherical).
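For the diagonal case the factor can also be written down by hand without the private helper. The sketch below assumes that for covariance_type='diag' the stored factor is simply one over the square root of each variance (my reading of the scikit-learn convention), and checks the result against SciPy. All parameter values are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

# Made-up parameters for a 2-component mixture in 3 dimensions.
weights = np.array([0.2, 0.8])
means = np.array([[0.0, 0.0, 0.0], [1.0, -1.0, 2.0]])
variances = np.array([[1.0, 0.5, 2.0], [0.3, 1.5, 0.8]])  # per-dimension variances

gmm = GaussianMixture(n_components=2, covariance_type="diag", random_state=0)
gmm.weights_ = weights
gmm.means_ = means
gmm.covariances_ = variances  # shape (n_components, n_features) for 'diag'
gmm.precisions_cholesky_ = 1.0 / np.sqrt(variances)

points = np.array([[0.0, 0.0, 0.0], [1.0, -1.0, 2.0], [0.5, 0.5, 0.5]])
density_sklearn = np.exp(gmm.score_samples(points))
density_scipy = sum(
    w * multivariate_normal(m, np.diag(v)).pdf(points)
    for w, m, v in zip(weights, means, variances)
)

X, labels = gmm.sample(500)
print(density_sklearn, density_scipy)
```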
I am new to this.
I have a set of weak classifiers built with the Naive Bayes classifier (NBC) in the sklearn toolkit.
My problem is how to combine the output of each NBC to make a final decision. I want the decision to be expressed as probabilities, not labels.
I wrote the following program in Python. I assume a 2-class problem based on the iris dataset in sklearn. For demo/learning purposes, say I build 4 NBCs as follows.
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB
import numpy as np
import math
iris = datasets.load_iris()
gnb1 = GaussianNB()
gnb2 = GaussianNB()
gnb3 = GaussianNB()
gnb4 = GaussianNB()
#Actual dataset is of 3 class I just made it into 2 class for this demo
target = np.where(iris.target, 2, 1)
gnb1.fit(iris.data[:, 0].reshape(150,1), target)
gnb2.fit(iris.data[:, 1].reshape(150,1), target)
gnb3.fit(iris.data[:, 2].reshape(150,1), target)
gnb4.fit(iris.data[:, 3].reshape(150,1), target)
#y_pred = gnb.predict(iris.data)
index = 0
y_prob1 = gnb1.predict_proba(iris.data[index,0].reshape(1,1))
y_prob2 = gnb2.predict_proba(iris.data[index,1].reshape(1,1))
y_prob3 = gnb3.predict_proba(iris.data[index,2].reshape(1,1))
y_prob4 = gnb4.predict_proba(iris.data[index,3].reshape(1,1))
#print y_prob1, "\n", y_prob2, "\n", y_prob3, "\n", y_prob4
# I just added it over all for each class
pos = y_prob1[:,1] + y_prob2[:,1] + y_prob3[:,1] + y_prob4[:,1]
neg = y_prob1[:,0] + y_prob2[:,0] + y_prob3[:,0] + y_prob4[:,0]
print(pos)
print(neg)
As you will notice, I simply added the probabilities of each NBC to get the final score. I wonder if this is correct?
If I have done it wrong, can you please suggest some ideas so I can correct myself?
First of all, why do you do this? You should have one Naive Bayes here, not one per feature. It looks like you do not understand the idea of the classifier. What you did is actually what Naive Bayes does internally: it treats each feature independently. But as these are probabilities, you should multiply them, or add their logarithms, so:
1. You should just have one NB: gnb.fit(iris.data, target)
2. If you insist on having many NBs, you should merge them through multiplication or addition of logarithms (which is the same from a mathematical perspective, but multiplication is less stable in the numerical sense)
pos = y_prob1[:,1] * y_prob2[:,1] * y_prob3[:,1] * y_prob4[:,1]
or
pos = np.exp(np.log(y_prob1[:,1]) + np.log(y_prob2[:,1]) + np.log(y_prob3[:,1]) + np.log(y_prob4[:,1]))
You can also directly predict the logarithm through gnb.predict_log_proba instead of gnb.predict_proba.
However, this approach has one flaw: Naive Bayes also includes the prior in each of your probabilities, so you will get very skewed distributions. So you have to normalize manually:
pos_prior = gnb1.class_prior_[1] # all models have the same prior so we can use the one from gnb1
pos = pos_prior * (y_prob1[:,1]/pos_prior) * (y_prob2[:,1]/pos_prior) * (y_prob3[:,1]/pos_prior) * (y_prob4[:,1]/pos_prior)
which simplifies to
pos = y_prob1[:,1] * y_prob2[:,1] * y_prob3[:,1] * y_prob4[:,1] / pos_prior**3
and for log to
pos = ... - 3 * np.log(pos_prior)
So once again: you should use option 1, a single Naive Bayes.
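The division by the prior can be written out as a short derivation. This is a sketch assuming K per-feature classifiers that share the same class prior P(c), with features that are conditionally independent given the class (the Naive Bayes assumption). The symbols K, x_i and c are notation introduced here.

```latex
\begin{align*}
P(c \mid x_1, \dots, x_K)
  &\propto P(c) \prod_{i=1}^{K} P(x_i \mid c) \\
  &= P(c) \prod_{i=1}^{K} \frac{P(c \mid x_i)\, P(x_i)}{P(c)} \\
  &\propto \frac{\prod_{i=1}^{K} P(c \mid x_i)}{P(c)^{K-1}}
\end{align*}
```

With K = 4 this gives the division by the prior cubed used above. The factors P(x_i) are dropped because they do not depend on the class, so the result is only proportional to the posterior and still has to be divided by its sum over all classes.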
The answer by lejlot is almost correct. The one thing missing is that you need to normalize his pos result (the product of the probabilities, divided by the prior) by the sum of this pos result for both classes. Otherwise, the sum of the probabilities of all classes will not be equal to one.
Here is sample code that tests the result of this procedure for a dataset with 6 features:
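import numpy as np
from sklearn.naive_bayes import GaussianNB
# X: feature array of shape (n_samples, 6); y: class labels (both assumed to be defined already)
# Use one Naive Bayes for all 6 features: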
gaus = GaussianNB(var_smoothing=0)
gaus.fit(X, y)
y_prob1 = gaus.predict_proba(X)
# Use one Naive Bayes on each half of the features and multiply the results:
gaus1 = GaussianNB(var_smoothing=0)
gaus1.fit(X[:, :3], y)
y_log_prob1 = gaus1.predict_log_proba(X[:, :3])
gaus2 = GaussianNB(var_smoothing=0)
gaus2.fit(X[:, 3:], y)
y_log_prob2 = gaus2.predict_log_proba(X[:, 3:])
pos = np.exp(y_log_prob1 + y_log_prob2 - np.log(gaus1.class_prior_))
y_prob2 = pos / pos.sum(axis=1)[:,None]
y_prob1 should be equal to y_prob2 apart from numerical errors (var_smoothing=0 helps reduce the error).
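The same procedure can be applied to the four per-feature classifiers from the question. The following is a minimal self-contained sketch on the two-class iris setup; the helper combine_log_posteriors is a name invented here, and it assumes that all classifiers were trained on the same labels and therefore share one class prior.

```python
import numpy as np
from scipy.special import logsumexp
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB


def combine_log_posteriors(log_posteriors, class_prior):
    """Merge log posteriors of per-feature NB models that share one class prior."""
    k = len(log_posteriors)
    # Every posterior already contains the prior once, so remove it k - 1 times.
    joint = np.sum(log_posteriors, axis=0) - (k - 1) * np.log(class_prior)
    # Normalize across classes so each row sums to one in probability space.
    return joint - logsumexp(joint, axis=1, keepdims=True)


iris = load_iris()
X = iris.data
y = np.where(iris.target, 2, 1)  # two-class version, as in the question

# One Gaussian NB per feature.
per_feature = [GaussianNB(var_smoothing=0).fit(X[:, [j]], y) for j in range(X.shape[1])]
log_posts = [model.predict_log_proba(X[:, [j]]) for j, model in enumerate(per_feature)]
combined = np.exp(combine_log_posteriors(log_posts, per_feature[0].class_prior_))

# Reference: a single Gaussian NB on all features.
single = GaussianNB(var_smoothing=0).fit(X, y).predict_proba(X)
print(np.abs(combined - single).max())
```

Working in log space avoids the underflow that a direct product of many small probabilities can cause.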
I was curious if there is an as_formula specifier (like in statsmodels) for sklearn.tree.DecisionTreeClassifier in Python, or some way to hack one in. Currently, I must use
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)
but I would prefer to have something like
clf = clf.fit(formula='Y ~ X', data=df)
The reason is that I would like to specify more than one X without having to do a lot of array shaping. Thanks.
Thanks for the information. Although there is no current Patsy interface for sklearn, Patsy easily provides the functionality I need. As an example...
from sklearn import tree
from patsy import dmatrix
red = [1,0,0,0,0,1,1,0,0,1,1,0]
green = [0,0,0,1,0,1,1,0,0,1,1,0]
blue = [0,0,1,1,0,0,0,1,0,0,0,0]
y = [0,0,0,0,0,1,1,0,0,1,1,0]
X = dmatrix('red + green + blue + 0')
dt_clf = tree.DecisionTreeClassifier()
dt_clf = dt_clf.fit(X, y)
pred_r = [1,1,0,0,1,1,0,0,0,0,0,0]
pred_g = [1,1,0,0,1,1,0,0,0,0,0,0]
pred_b = [0,0,1,1,0,0,0,1,0,0,0,0]
test = dmatrix('pred_r + pred_g + pred_b + 0')
dt_clf.predict(test)
Perhaps even more convenient is the fact that sklearn plays well with pandas. Using the same data as above...
import pandas as pd
df = pd.DataFrame()
df['red'] = red
df['green'] = green
df['blue'] = blue
df['y'] = y
dt_clf = dt_clf.fit(df[['red','green','blue']], df['y'])
dt_clf.predict(test)
Hopefully this helps someone in the same situation as me.
Note: be very careful that the sequence of Xs remains the same. For example, don't train on df[['red','green','blue']] and then predict on df[['blue','green','red']]. It may seem obvious, but it is an easy way to mess things up.
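One simple way to guard against this is to keep the column list in a single variable and always select with it, whatever order the incoming frame happens to have. A minimal sketch using the data from above; the names feature_cols and shuffled are invented for this example.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Single source of truth for the column order used at fit and predict time.
feature_cols = ["red", "green", "blue"]

train = pd.DataFrame({
    "red":   [1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
    "green": [0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    "blue":  [0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0],
    "y":     [0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
})

clf = DecisionTreeClassifier(random_state=0).fit(train[feature_cols], train["y"])

# A frame whose columns arrive in a different order.
shuffled = train[["blue", "green", "red"]]

# Re-selecting with feature_cols restores the training order before predicting.
pred = clf.predict(shuffled[feature_cols])
print(pred)
```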
It's currently not possible, but it would be great to have a patsy interface for scikit-learn. I don't think anyone is working on it at the moment, though.