Trying to find how similar one column is to another in dataframe - python

I am trying to calculate accuracy rate.
I have a pandas dataframe with numerous columns of data.
I have one column of predicted churns and one column of true churns for every customer.
Is there a way to calculate the accuracy metric and other metrics just between the two columns? Both columns are binary, with 0 meaning no churn and 1 meaning churn.

There are obviously many ways to measure the accuracy of a prediction against known answers. Since you tagged this with machine learning and python, I suggest using a confusion matrix (aka error matrix) as a first pass. The scikit-learn python library has a function you can use:
from sklearn.metrics import confusion_matrix

y_true = ...  # your column of true churn labels
y_pred = ...  # your column of predicted churn labels
confusion_matrix(y_true, y_pred)
source: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
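If you also want the accuracy itself and the other per-class metrics in one shot, the same sklearn.metrics module covers that. A minimal sketch, assuming the dataframe is called df and the two columns are named true_churn and pred_churn (placeholder names for your actual columns):
from sklearn.metrics import accuracy_score, classification_report

# df, true_churn and pred_churn are placeholders for your own names
print(accuracy_score(df['true_churn'], df['pred_churn']))

# precision, recall, F1 and support for both classes (0 = no churn, 1 = churn)
print(classification_report(df['true_churn'], df['pred_churn']))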

Related

How do I properly compare performance of machine learning models, in the case of a multi-output classification problem?

TL;DR: I'm looking for a good way to compare the output of different scikit-learn ML models on a multi-output classification problem: labelling social media messages according to the different disaster response categories they might fall into. I'm currently just using precision_recall_fscore_support on each label and then averaging the results, but I'm not convinced that this is a good solution.
In detail: As part of an exercise I'm doing for an online data science course, I'm looking at a dataset of social media messages that occurred during natural disasters. The goal of the exercise is to train a machine learning model to classify these messages according to the various emergency departments they relate to, such as: aid_related, medical_help, weather_related, floods, etc...
So for example the following message: "UN reports Leogane 80-90 destroyed. Only Hospi..." is classed in my training data as 'medical_products', 'aid_related' and 'request'.
I've started off using scikit-learn's KNeighborsClassifier and MultiOutputClassifier. I'm also using grid search to compare parameters inside the model:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(KNeighborsClassifier()))
])
parameters = {'clf__estimator__n_neighbors': [5, 7]}
cv = GridSearchCV(pipeline, parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)  # predict with the fitted grid search, not the unfitted pipeline
When I finally get the model output (it takes forever with just two parameter values to compare), I use the following function to pull out a matrix with the average precision, recall and f-score for each column:
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

def classify_model_output(y_test, y_pred):
    classification_scores = []
    for i, column in enumerate(y_test.columns):
        classification_scores.append(precision_recall_fscore_support(y_test[column], y_pred[:, i]))
    df_classification = pd.DataFrame(classification_scores)
    df_classification.columns = ['precision', 'recall', 'fscore', 'support']
    df_classification.set_index(y_test.columns, inplace=True)
    # split each metric column in two: one for the negative class (0), one for the positive class (1)
    for column in ['precision', 'recall', 'fscore', 'support']:
        column_0 = df_classification[column].apply(lambda x: x[0]).rename(column + '0')
        column_1 = df_classification[column].apply(lambda x: x[1]).rename(column + '1')
        df_classification.drop([column], axis=1, inplace=True)
        df_classification = pd.concat([df_classification, column_0, column_1], axis=1)
    # finally, take the average over all labels to get one row of scores for the model
    df_classification_avg = df_classification.mean(axis=0)
    return df_classification_avg
The resulting df_classification table looks like this (top 5 rows):
And here's what I get when I compare the average classification tables (produced by the previous method) for knn with 5 neighbors (avg_knn), knn with 7 neighbors (knn_avg_2), and random forest (rf) - yellow cells represent the max for that row:
But I'm not sure how to interpret this. On the face of it, it looks like Random Forest (rf) performed best. But I'm not sure if this is the best way to achieve this, or if using the average even makes sense here.
Does anyone have any advice on the best way to accurately and transparently compare my models, in the case of a multioutput problem like this one?
Edit: updated code block with easier to read function, and added comparison of three models
If your training data is not biased towards any particular output label, then you can go with the accuracy score, i.e. when there is a balanced amount of training data corresponding to every label.
However, if your data is imbalanced, i.e. the training data leans heavily towards one or two particular output labels, then go for precision and recall.
Between precision and recall, the choice depends on your need. If missing a positive is the costly mistake, go for recall: e.g. at an airport there is only a minimal chance that any given piece of luggage contains a bomb, but you check all bags anyway. That is recall.
When you care more about how many of your positive predictions are actually correct, go for precision.
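To make that trade-off concrete, here is a minimal sketch with made-up labels showing how the two scores diverge on imbalanced data:
from sklearn.metrics import precision_score, recall_score

# imbalanced toy data: only 2 positives out of 10
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# a cautious classifier that flags extra samples so it never misses a positive
y_pred = [0, 0, 0, 0, 1, 1, 0, 0, 1, 1]

print(recall_score(y_true, y_pred))     # 1.0 - no positive missed (the airport case)
print(precision_score(y_true, y_pred))  # 0.5 - half of the flagged samples are false alarms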

Different F1 scores for different preprocessing techniques- sklearn

I am building a classification model using sklearn's GradientBoostingClassifier. For the same model, I tried different preprocessing techniques on the same data: StandardScaler, scale, and Normalizer. But I get a different f1_score each time: highest for StandardScaler and lowest for Normalizer. Why is that? Is there any other technique for which I can get an even higher score?
The difference lies in their respective definitions:
StandardScaler: Standardize features by removing the mean and scaling to unit variance
Normalizer: Normalize samples individually to unit norm.
Scale: Standardize a dataset along any axis. Center to the mean and component wise scale to unit variance.
The data used to fit your model changes with each preprocessor, and so does the F1 score.
Here is a useful link comparing the different scalers: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py
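To see this directly, a small sketch with made-up data showing that the three preprocessors hand genuinely different matrices to the classifier:
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler, scale

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

print(StandardScaler().fit_transform(X))  # per feature: zero mean, unit variance
print(scale(X))                           # the same standardization as a plain function
print(Normalizer().fit_transform(X))      # per sample: each row rescaled to unit norm
Since the classifier is fitted on a different input in each case, a different F1 score is expected; the usual way to pick between them is to treat the preprocessing step as another hyperparameter and cross-validate it.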

Getting probability values for random forest and Gradient Boosting in python

I have been learning about classification techniques and have studied random forest, gradient boosting, etc. Based on some code available online, I tried to write Python 3 code for random forest and GBM. My objective is to get the probability values from the models, not just to look at accuracy, as I intend to use the probability values to build a KS table later on.
I used the readily available Titanic data set to start practicing.
Following are some of the steps I did:
# load train data
train_df = pd.read_csv('***/classification/titanic/train.csv')
# load test data
test_df = pd.read_csv('***/Desktop/classification/titanic/test.csv')
# drop some variables in train data
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
# drop some variables in test data
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
# I calculated the Title variable (again based on multiple threads on Kaggle)
train_df = pd.get_dummies(train_df, columns=['Pclass', 'Sex', 'Title'], drop_first=True)
test_df = pd.get_dummies(test_df, columns=['Pclass', 'Sex', 'Title'], drop_first=True)
# I checked for missing and IV values next (not including that code here)
predictors = [x for x in train_df.columns if x not in ['Survived', 'PassengerID']]
predictors

# create classifier object (GBM)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=10)
# fit the classifier with x and y data
clf.fit(train_df[predictors], train_df.Survived)
prob = pd.DataFrame({'prob': clf.predict_proba(train_df[predictors])[:, 1]})
prob['prob'].value_counts()

# create classifier object (RF)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=10)
# fit the classifier with x and y data
clf.fit(train_df[predictors], train_df.Survived)
prob = pd.DataFrame({'prob': clf.predict_proba(train_df[predictors])[:, 1]})
prob['prob'].value_counts()
Now when I check the probability values from the two different models, I notice that for the random forest output a significant chunk has a probability score of exactly 0, whereas that is not the case for the GBM model.
I understand that the techniques are different, but how can the results be so far off? Am I missing something?
With a large chunk of the population tagged with 0 as the probability score, my KS table goes for a toss.
Welcome to SO! Since you don't seem to be having an issue with code execution specifically, or with totally incorrect outputs, this looks like it is more appropriate for Cross Validated, where you can find answers to questions about statistical concerns.
In fact, I'd suggest that answers to this question might give you some good insight into why you are seeing very different values from the predict_proba method. In short: while GradientBoostingClassifier and RandomForestClassifier both use tree methods, what they do with those trees is very different, so a direct comparison of their raw outputs is not necessarily appropriate.
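As a rough illustration of why the shapes differ (a sketch on synthetic data, not your Titanic setup): a random forest's predict_proba is an average over the per-tree probabilities, so it returns exactly 0.0 whenever every tree assigns the positive class zero probability, while gradient boosting passes a summed score through a sigmoid and in practice never lands on exactly 0:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=10)

rf_prob = RandomForestClassifier(random_state=10).fit(X, y).predict_proba(X)[:, 1]
gbm_prob = GradientBoostingClassifier(random_state=10).fit(X, y).predict_proba(X)[:, 1]

print((rf_prob == 0.0).sum())   # typically a sizeable count: all trees agreed on class 0
print((gbm_prob == 0.0).sum())  # typically 0: the sigmoid output is never exactly zero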

What factors will lead to extremely high RMSE value in a regression model?

I was trying to build a regression model to predict movie box office. The dataset was acquired from the Kaggle TMDB 5000 Movie Dataset, along with another dataset that contains some social-media-related attributes.
After merging and cleaning, the final dataset consists of 183 observations and 53 features. Two categorical features, genre and production_countries, were expanded into indicator columns: for example, new columns like 'Action', 'Drama', 'Comedy', etc., where the value is 1 if the movie belongs to that genre, and the same for production_countries.
I used the dataset to build my regression model, but I encountered a problem. No matter whether I use a train-test split or cross-validation, or try different regression models, the RMSE I get is insanely high.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import cross_val_predict

X = movie.drop('Gross', axis=1)
y = movie['Gross']
# reuse one set of out-of-fold predictions for both metrics
y_pred = cross_val_predict(RandomForestRegressor(), X, y, cv=10)
print('R2:', r2_score(y, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y, y_pred)))
output:
R2: 0.344831741145
RMSE: 76169019.1588
I don't know what factors are resulting in this kind of situation. Can anyone help me out here? Many thanks.
It seems that you are calculating the RMSE properly; I would double-check the R2 measure and compute it manually just to be sure.
But I would review the concept in a bit more detail. If we check the following reference: http://brenocon.com/rsquared_is_mse_rescaled.pdf
we can observe that
r2 = 1 - MSE(x, y) / VAR(y)
If MSE -> 0, then RMSE -> 0 and r2 -> 1. Conversely, a huge RMSE just means a huge MSE, yet as long as MSE <= VAR(y) the r2 still lands between 0 and 1, e.g. 0.2, 0.3, etc. RMSE is expressed in the units of the target, and box-office gross is on the scale of tens or hundreds of millions, so VAR(y) is enormous and an RMSE around 7.6e7 is not inconsistent with your r2 of about 0.34. So your results might well place your analysis on the right track.
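A quick numeric check of that identity, with made-up numbers on the scale of box-office gross:
import numpy as np

y = np.array([1e8, 2e8, 5e7, 3e8, 1.5e8])            # hypothetical true gross values
y_pred = np.array([1.5e8, 1.2e8, 9e7, 2.4e8, 2e8])   # hypothetical predictions

mse = np.mean((y - y_pred) ** 2)
r2 = 1 - mse / np.var(y)

print(np.sqrt(mse))  # RMSE is in the units of y, so it lands in the tens of millions
print(r2)            # yet r2 is still a plausible ~0.55, because VAR(y) is also huge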

Semi supervised learning with sklearn

I have a large, multi-dimensional, unlabelled dataset of cars (price, mileage, horsepower, ...) in which I want to find outliers. I decided to use sklearn's OneClassSVM to build a decision boundary, and I have two main issues with my approach:
My dataset contains a lot of missing values. Is there a way to make the SVM classify a record with missing features as an inlier if every possible value for the missing features would make it an inlier?
I now want to add a feedback loop of manual moderated outliers. The manually moderated data should improve the classification of the SVM. I've read about the LabelSpreading model for semi-supervised learning. Would it be feasible to feed the classification output of the OneClassSVM to the LabelSpreading model and retrain this model when a sufficient amount of records are manually validated?
For the first question: you could use sklearn.preprocessing.Imputer to impute the missing values with the mean or median:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
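A minimal sketch of that imputation step (note that in recent scikit-learn versions Imputer has been replaced by sklearn.impute.SimpleImputer, which offers the same fit/transform interface):
import numpy as np
import pandas as pd
from sklearn.preprocessing import Imputer  # from sklearn.impute import SimpleImputer in newer versions

# toy car data with missing values (made-up feature names)
X = pd.DataFrame({'price': [20000, np.nan, 15000], 'mileage': [50000, 80000, np.nan]})

imp = Imputer(strategy='median')
X_imputed = imp.fit_transform(X)  # NaNs replaced by the per-column median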
You could also add boolean features that record whether each of the original features was NaN. So if you have features X_1 and X_2, you add the boolean features
X_1_was_NaN and X_2_was_NaN
that are 1 where X_1 or X_2 is NaN, respectively. If X is your original pd.DataFrame, you can create them like this:
import pandas as pd

X = pd.DataFrame()
# Create your features here

# Get the locations of the NaNs (1.0 where a value was missing)
X_2 = 1.0 * X.isnull()
# Rename the indicator columns
X_2.rename(columns=lambda x: str(x) + "_was_NaN", inplace=True)
# Paste them together
X = pd.concat([X, X_2], axis=1)
