I am currently doing an assignment for my data analysis course at uni. I managed to do the first two parts (EDA and text processing) without many problems. I now need to do this:
Build a regression model that will predict the rating score of each product based on
attributes which correspond to some very common words used in the reviews
(selection of how many words is left to you as a decision). So, for each product you
will have a long(ish) vector of attributes based on how many times each word appears
in reviews of this product. Your target variable is the rating.
I find myself a bit lost on how to tackle this problem. Here is a link to the dataset I am using. Review2 is the lemmatized version of Review.
Any insight on how to solve this would be greatly appreciated!
P.S: I'm not posting here to get a full solution... Just a push in the right direction
EDIT:
This is the code I wrote for my Ordinal regression (would it be possible to have some feedback):
import pandas as pd
import matplotlib.pyplot as plt
import mord as m
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split, cross_val_score

# Create word matrix (bag of words from the lemmatized reviews)
bow = df.Review2.str.split().apply(pd.Series.value_counts)
bow = bow.join(df['Rating'])
# Keep rows that have a rating and columns whose total count is at least 80
bow = bow.loc[bow['Rating'].notna(), ~(bow.sum(0) < 80)]
# Divide into train - validation - test (60:20:20)
bow.fillna(0, inplace=True)
rating = bow['Rating']
bow = bow.drop('Rating', axis=1)
x_train, x_rest, y_train, y_rest = train_test_split(bow, rating, test_size=0.4, random_state=0)
x_test, x_validate, y_test, y_validate = train_test_split(x_rest, y_rest, test_size=0.5, random_state=0)
# Run regression
regr = m.OrdinalRidge()
regr.fit(x_train, y_train)
y_pred = regr.predict(x_validate)
scores = cross_val_score(regr, bow, rating, cv=5, scoring='accuracy')
# Plot: project the validation features onto their first principal component
pca = PCA(n_components=1)
pca.fit(x_validate)
x_validate_1d = pca.transform(x_validate)
plt.scatter(x_validate_1d, y_validate, color='black', label='actual')
plt.scatter(x_validate_1d, y_pred, color='blue', label='predicted')
plt.legend()
plt.show()
This is what the plot looks like (plot image not shown):
Would it be possible to have some feedback on the code and, possibly, a better, more informative way to plot the results? (I don't really understand whether the regression is performing well or not.)
Build a regression model that will predict the rating score of each
product based on attributes which correspond to some very common words
used in the reviews (selection of how many words is left to you as a
decision). So, for each product you will have a long(ish) vector of
attributes based on how many times each word appears in reviews of
this product. Your target variable is the rating.
Let's pull this apart into several pieces!
So, for each product you will have a long(ish) vector of
attributes based on how many times each word appears in reviews of
this product.
This is a bag-of-words model, meaning you will have to create a matrix representation (still held in a pd.DataFrame) of your Review2 column, and there is a question asking how to do that here:
How to create a bag of words from a pandas dataframe
Below is a minimal example of how you can create that matrix with your Review2 column:
In [12]: import pandas as pd
In [13]: df = pd.DataFrame({"Review2":['banana apple mango', 'apple apple strawberry']})
In [14]: df
Out[14]:
Review2
0 banana apple mango
1 apple apple strawberry
In [15]: df.Review2.str.split()
Out[15]:
0 [banana, apple, mango]
1 [apple, apple, strawberry]
Name: Review2, dtype: object
In [16]: df = df.Review2.str.split().apply(pd.Series.value_counts) # this will produce the word count matrix
In [17]: df
Out[17]:
apple banana mango strawberry
0 1.0 1.0 1.0 NaN
1 2.0 NaN NaN 1.0
The bag-of-words model simply counts how often each word occurs in a text, with no regard for word order, and represents a set of texts as a matrix in which each row is one text and each column holds the counts for one word.
[...] based on attributes which correspond to some very common words
used in the reviews (selection of how many words is left to you as a
decision).
Now that you have your matrix representation (rows are the products, columns are the counts for each unique word), you can filter the matrix down to the most common words. I would encourage you to take a look at how the distribution of word counts looks. We will use seaborn for that and import it like so:
import seaborn as sns
Given that your pd.DataFrame holding the word-count matrix is called df, sns.distplot(df.sum()) should do the trick (in recent seaborn versions distplot is deprecated, so use sns.histplot(df.sum()) instead). Choose some cutoff that seems to preserve a good chunk of the counts but doesn't include many words with low counts. It can be arbitrary and it doesn't really matter for now.
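A minimal sketch of that filtering step (assuming the word-count matrix is called df and using a cutoff of 80 total occurrences, the same threshold as in your own code; both names are just examples):
# Total count of each word across all products
word_totals = df.sum()
# Keep only the columns (words) that occur at least `cutoff` times overall
cutoff = 80
common_words = word_totals[word_totals >= cutoff].index
df_common = df[common_words]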
Your word count matrix is your input data, or also called the predictor variable. In machine learning this is often called the input matrix or vector X.
Your target variable is the rating.
The output variable or target variable is the rating column. In machine learning this is often called the output vector y (note that this can sometimes also be an output matrix but most commonly one outputs a vector).
This means our model tries to adjust its parameters to map the word-count data from each row to the corresponding rating value.
Scikit-learn offers a lot of machine learning models such as logistic regression which take an X and y for training and have a very unified interface. Jake Vanderplas's Python Data Science Handbook explains the Scikit-learn interface nicely and shows you how to use it for a regression problem.
EDIT: We are using logistic regression here but as correctly pointed out by fuglede, logistic regression ignores that ratings are on a set scale. For this you can use mord.OrdinalRidge, the API of which works very similarly to that of scikit-learn.
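A minimal sketch of what using it could look like (assuming X_train and y_train hold the word counts and ratings of your training rows, as produced by the split described below; mord is installed separately):
import mord

# OrdinalRidge from the mord package follows the scikit-learn fit/predict interface
model = mord.OrdinalRidge(alpha=1.0)  # alpha is the usual ridge regularisation strength
model.fit(X_train, y_train)
predictions = model.predict(X_test)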
Before you train your model, you should split your data set into a training, a test and a validation set - a good ratio is probably 60:20:20.
This way you will be able to train your model on your training set and evaluate how well it predicts your test set, which helps you tune your model parameters: you will know when your model is overfitting to your training data and when it is genuinely producing a good general model for this task. The problem with this approach is that you can still overfit to your training data if you adjust the model parameters often enough.
This is why we have a validation set: it makes sure we are not accidentally overfitting our model's parameters to both the training and the test set without knowing it. Typically we only test on the validation set once, so as not to overfit to it as well; it is only used in the final model evaluation step.
Scikit-learn has a function for that, too: train_test_split
train_test_split, however, only makes one split, so you would first split your data set 60:40 and then split the remaining 40 in half (50:50) into a test and a validation set.
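A rough sketch of that two-step split (assuming X is your word-count matrix and y your ratings):
from sklearn.model_selection import train_test_split

# First split off 60% for training and 40% for everything else
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
# Then split the remaining 40% in half: 20% test, 20% validation
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)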
You can now train different models on your training data and test them using the predict function of your model on your test set. Once you think you have done a good job and your model is good enough, you now test it on your validation set.
Related
TL;DR: I'm looking for a good way to compare the output of different scikit-learn ML models on a multi-output classification problem: labelling social media messages according to the different disaster response categories they might fall into. I'm currently just using precision_recall_fscore_support on each label and then averaging the results, but I'm not convinced that this is a good solution.
In detail: As part of an exercise I'm doing for an online data science course, I'm looking at a dataset of social media messages that occurred during natural disasters. The goal of the exercise is to train a machine learning model to classify these messages according to the various emergency departments they relate to, such as: aid_related, medical_help, weather_related, floods, etc...
So for example the following message: "UN reports Leogane 80-90 destroyed. Only Hospi..." is classed in my training data as 'medical_products', 'aid_related' and 'request'.
I've started off using scikit-learn's KNeighborsClassifier, and MultiOutputClassifier. I'm also using gridsearch to compare parameters inside the model:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(KNeighborsClassifier()))
])
parameters = {'clf__estimator__n_neighbors': [5, 7]}
cv = GridSearchCV(pipeline, parameters)
cv.fit(X_train, y_train)
y_pred = cv.predict(X_test)  # predict with the fitted grid search, not the unfitted pipeline
When I finally get the model output (it takes forever with just two parameters to compare), I use the following function to pull out a matrix with the average precision, recall and f-score for each column:
def classify_model_output(y_test, y_pred):
    classification_scores = []
    for i, column in enumerate(y_test.columns):
        classification_scores.append(precision_recall_fscore_support(y_test[column], y_pred[:, i]))
    df_classification = pd.DataFrame(classification_scores)
    df_classification.columns = ['precision', 'recall', 'fscore', 'support']
    df_classification.set_index(y_test.columns, inplace=True)

    # below loop splits the precision, recall and f-score columns into two,
    # one for negatives and one for positives (0 and 1)
    for column in df_classification.columns:
        column_1 = df_classification[column].apply(lambda x: x[0]).rename(column + str(0))
        column_2 = df_classification[column].apply(lambda x: x[1]).rename(column + str(1))
        df_classification.drop([column], axis=1, inplace=True)
        df_classification = pd.concat([df_classification, column_1, column_2], axis=1)

    # finally, take the average of the dataframe to get a single score vector for the model
    df_classification_avg = df_classification.mean(axis=0)
    return df_classification_avg
The df_classification table looks like this (top 5 rows; table image not shown):
And here's what I get when I compare the average classification tables (produced by the previous method) for KNN with 5 neighbors (avg_knn), KNN with 7 neighbors (knn_avg_2), and random forest (rf); yellow cells represent the max for that row (comparison image not shown):
But I'm not sure how to interpret this. On the face of it, it looks like Random Forest (rf) performed best. But I'm not sure if this is the best way to compare the models, or if using the average even makes sense here.
Does anyone have any advice on the best way to accurately and transparently compare my models, in the case of a multioutput problem like this one?
Edit: updated the code block with an easier-to-read function, and added a comparison of three models
If your training data is not biased towards any particular output label (i.e. there is a balanced amount of training data for every label), then you can go for the accuracy score.
However, if your data is imbalanced, i.e. the training data leans heavily towards one or two particular output labels, then go for precision and recall.
Between precision and recall, the choice depends on your need. If you are not too concerned about false alarms, go for recall; e.g. at an airport there is only a minimal chance that a bomb will be found in any piece of luggage, but you still check every bag. That is recall.
When you care more about how many of the predictions you make are actually correct, go for precision.
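As a rough sketch of how such a comparison could look in code (assuming y_test and y_pred are the multi-label arrays from the question above), scikit-learn's averaged scores give one number per model without the manual loop:
from sklearn.metrics import precision_score, recall_score, f1_score

# 'macro' treats every label equally, 'weighted' weights labels by their support;
# with imbalanced labels the two can differ a lot, so it is worth reporting both
for avg in ('macro', 'weighted'):
    p = precision_score(y_test, y_pred, average=avg, zero_division=0)
    r = recall_score(y_test, y_pred, average=avg, zero_division=0)
    f = f1_score(y_test, y_pred, average=avg, zero_division=0)
    print(avg, p, r, f)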
I'm trying to set up my data for a linear regression, and I want to make sure my results will be meaningful in the end.
I work for a small insurance company, and I'm trying to do a linear regression analysis on the effect of rainfall amounts on number of auto claims (a "claim" is an insurance term for an accident). Lets say I'm focusing specifically on Houston, Texas.
Imagine that the x axis is rainfall; let's say the buckets will be 0 cm, 1-5 cm, 6-10 cm, 11-15 cm and so on.
The y axis will be the number of claims: 100, 125, 140, 150, etc.
I have easy access to this information from my company's database, but my question is...
Do I need to have a roughly equal number of observations for each rainfall "bucket"?
You can imagine that it would be very easy to find days with 0 rainfall, so this bucket could potentially have many more observations than the others. To put it another way, days with >30 cm of rain are going to be rare, so I may not have very many observations for that rainfall amount.
Should I try to get a roughly equal number of observations for each bucket, or should I just get as much data as possible? Thank you for the help!
I'll be working in python, but this is obviously more of a theory question.
In order to get a good ML model, the sample chosen to train the model should be a close representation of the population; otherwise the model will be biased. It would provide accurate predictions for the categories that have sufficient instances in the training set, but not for those categories where the number of samples is very limited. I assume that you will be splitting the data into a training and a test set.
I would recommend using stratified sampling instead of purely randomized sampling. The class below would be helpful; before using it, make sure to check the different rainfall bins (see the sketch after the code).
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(df, df[<<column name>>]):
    strat_train_set = df.loc[train_index]
    strat_test_set = df.loc[test_index]
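StratifiedShuffleSplit needs a categorical column to stratify on, and rainfall is continuous, so you would first bin it. A rough sketch under that assumption (the column names rainfall_cm and rain_bucket are made up for illustration):
import pandas as pd

# Bin continuous rainfall into the buckets described in the question
bins = [-0.1, 0, 5, 10, 15, 30, float('inf')]
labels = ['0', '1-5', '6-10', '11-15', '16-30', '>30']
df['rain_bucket'] = pd.cut(df['rainfall_cm'], bins=bins, labels=labels)

# Then stratify on the bucket column:
# for train_index, test_index in split.split(df, df['rain_bucket']): ...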
I'm using scikit-learn's RandomForestRegressor and I can't seem to make it work properly.
The data that I am using has categorical features which I encoded with the LabelBinarizer, so my data looks like this:
Id    Cat1  Cat2  Cat3  ...  Cat50
123   0     1     0     ...  0
...
Each row can only have one of the given categories.
Now I train my model on the given rating for each item, which is numerical, with scikit-learn's RandomForestRegressor.
My y is a rating.
My X are the features of the item containing the categories.
So my y and X looks something like this:
y = [0, 1, 1, 4, 3, 7, 8, 1, 9]
X = [[0, 1, 0, ..., 0],
     [0, 0, 1, 0, ..., 0],
     ...]
I want to predict the rating y of new items based on the item data arrays in X. For this I use the RandomForestRegressor like this:
regressor = RandomForestRegressor(n_estimators=60000, random_state=0, max_depth=100)
regressor.fit(X_train, y_train)
theta[user_id] = regressor.feature_importances_
I chose max_depth=100 as there are 100 item features, and n_estimators=60000 as I have around 30000 items. But I am not quite sure whether n_estimators is chosen wisely, and even if I choose n_estimators very low, the results remain the same.
For each user I store the feature importances in theta, and I multiply each item's features by that user's theta entry.
The result for the best fitting items for the user looks as follows:
Id Name Category
12 example Cat1
34 example Cat1
56 example Cat1
..
So every prediction has the same category, although there are, for example, 50 different categories and the training data contains a lot more than just Cat1 samples. In fact, Cat1 is a small part of the sample.
My question is: how do I determine where my error is? Should I consider this to be an error, as this result cannot be reasonable in my case? Which next step should I take to determine where the error lies?
What are your input features?
You should check if it is an imbalanced dataset:
df['Cat1'].sum(), df['Cat2'].sum()
Probably it will be:
Cat1    Cat2   ...   Cat50
10000   4      ...   3
This would mean that your data is imbalanced. You then really need to check what techniques you can use; some names are under- and oversampling, or isolation forest.
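A small sketch of how to run that check over all 50 category columns at once (assuming they are literally named Cat1 ... Cat50):
# Sum each one-hot category column to see how many rows fall into it
cat_cols = ['Cat{}'.format(i) for i in range(1, 51)]
counts = df[cat_cols].sum().sort_values(ascending=False)
print(counts)  # a large gap between the top and bottom counts indicates imbalance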
Are you sure you want to use regression, and not classification? Check this out: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
And this should give you an idea what problem you want to solve: https://en.wikipedia.org/wiki/Supervised_learning
If unbalanced data is your problem, I suggest checking out the imblearn library. It has some functions that allow balancing the data; for example, Synthetic Minority Over-sampling Technique (SMOTE) allows 'upsampling' the data by creating extra instances of minority classes for training, and it's super simple to use.
from imblearn.over_sampling import SMOTE
sm = SMOTE()
X_train, y_train = sm.fit_resample(X_train, y_train)
Then you can validate your model on the original data.
If I run a simple decision tree regression model, splitting the data via the train_test_split function, I get nice R2 scores and low MSE values.
training_data = pandas.read_csv('data.csv',usecols=['y','x1','x2','x3'])
y = training_data.iloc[:,0]
x = training_data.iloc[:,1:]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
regressor = DecisionTreeRegressor(random_state = 0)
# fit the regressor with X and Y data
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
Yet if I split the data file manually into two files, 2/3 train and 1/3 test (there is a column called human which gives a value from 1 to 9 indicating which human it is; I use humans 1-6 for training and 7-9 for testing), I get negative R2 scores and high MSE.
training_data = pandas.read_csv("train"+".csv",usecols=['y','x1','x2','x3'])
testing_data = pandas.read_csv("test"+".csv", usecols=['y','x1','x2','x3'])
y_train = training_data.iloc[:,training_data.columns.str.contains('y')]
X_train = training_data.iloc[:,training_data.columns.str.contains('|'.join(['x1','x2','x3']))]
y_test = testing_data.iloc[:,testing_data.columns.str.contains('y')]
X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(l_vars))]
y_train = pandas.Series(y_train['y'], index=y_train.index)
y_test = pandas.Series(y_test['y'], index=y_test.index)
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
I was expecting more or less the same results, and all the data types seem the same for both calls.
What am I missing?
I'm assuming that both methods here actually do what you intend and that the shapes of your X_train/X_test and y_train/y_test are the same coming from both methods. You can either plot the underlying distributions of your datasets or compare your second implementation against a cross-validated model (for better rigour).
Plot the distributions (i.e. make bar charts/density plots) of the labels (y) in the initial train and test sets versus those in the second ones (from the manual implementation); a sketch is given below. You can dive deeper and also plot the other columns in the data to see whether anything about their distributions differs between the resulting sets of the two implementations. If the distributions are different, then it makes sense that you get discrepancies between your two models. If the discrepancy is huge, it could be that your labels (or other columns) are actually sorted in your manual implementation, so you end up with very different distributions in the datasets you're comparing.
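A minimal sketch of that check (y_train/y_test are from your train_test_split snippet; y_train_manual/y_test_manual are placeholder names for the series from your manual split):
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, sharey=True)
axes[0].hist([y_train, y_test], label=['train', 'test'])
axes[0].set_title('train_test_split')
axes[1].hist([y_train_manual, y_test_manual], label=['train', 'test'])  # placeholder names
axes[1].set_title('manual split')
for ax in axes:
    ax.legend()
plt.show()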
Also, if you want to make sure that your manual split results in a 'representative' set (one that would generalise well) based on model results rather than on the underlying data distributions, I would compare it against the results of a cross-validated model, not one single set of results.
Essentially, although the probability is small and train_test_split does some shuffling, you could get a train/test pair that performs well just out of luck. (To reduce the chance of that without doing cross-validation, I'd suggest making use of the stratify argument of the train_test_split function; then at least you're sure the first implementation 'tries harder' to get balanced train/test pairs.)
If you decide to cross-validate (e.g. with cross_val_score rather than a single train_test_split), you get an average model score across the folds and a confidence interval around it, and you can check whether your second model's results fall within that interval. If they don't, it again just means your split is 'corrupted' somehow (e.g. by having sorted values).
P.S. I'd also add that decision trees are models that are known to overfit massively [1]. Maybe use a random forest instead? (You should get more stable results due to bootstrapping/bagging, which acts similarly to cross-validation in reducing the chance of overfitting.)
1 - http://cv.znu.ac.ir/afsharchim/AI/lectures/Decision%20Trees%203.pdf
The train_test_split function from scikit-learn uses sklearn.model_selection.ShuffleSplit as per the documentation, and this means the method randomizes your data when splitting.
When you split manually, you didn't randomize it, so if your labels are not spread evenly throughout your dataset, you'll of course have performance issues, since your model won't generalize well when the training data doesn't contain enough samples of the other labels.
If my suspicion is correct, you should get a similarly poor result by passing shuffle=False into train_test_split, which reproduces an unshuffled split like your manual one.
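A quick sketch of that check, reusing the x and y variables from your first snippet:
from sklearn.model_selection import train_test_split

# shuffle=False takes the first 67% of rows as train and the rest as test,
# which should mimic the manual file split (and, presumably, its poor scores)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33, shuffle=False)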
Suppose your dataset contains this data:
1 + 1 = 2
2 + 2 = 4
4 - 4 = 0
2 - 2 = 0
Now suppose you want a 50% train split. train_test_split shuffles it, so the training set might look like this, which generalizes better:
1 + 1 = 2
2 - 2 = 0
So it knows what to do when it sees this test data:
2 + 2
4 - 4   # since it learned both addition and subtraction
But when you split it manually, without shuffling, the training set is:
1 + 1 = 2
2 + 2 = 4   # it only learned addition
It doesn't know what to do when it sees this test data:
2 - 2
4 - 4   # the test data is subtraction
Hope this answers your question.
It may sound like a simple check, but...
In the first example you are reading data from 'data.csv'; in the second example you are reading from 'train.csv' and 'test.csv'. Since you say you split the file manually, I have a question about how that was done. If you simply cut the file at the 2/3 mark and saved the first part as 'train.csv' and the remainder as 'test.csv', then you have unknowingly made an assumption about the uniformity of the data in the file. Data files can have an ordered structure which would skew the training or testing, which is why train_test_split randomizes the rows. If you haven't already done so, try randomizing the rows first and then writing them out to your train and test csv files to ensure you have a homogeneous dataset; a rough sketch is given below.
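A short sketch of shuffling before writing the files (assuming the full data is in data.csv, as in your first snippet):
import pandas

data = pandas.read_csv('data.csv', usecols=['y', 'x1', 'x2', 'x3'])
shuffled = data.sample(frac=1, random_state=0).reset_index(drop=True)  # shuffle all rows
cut = int(len(shuffled) * 2 / 3)  # 2/3 train, 1/3 test
shuffled.iloc[:cut].to_csv('train.csv', index=False)
shuffled.iloc[cut:].to_csv('test.csv', index=False)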
The other line that might be out of place is line 6:
X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(l_vars))]
Perhaps l_vars contains something other than what you expect. Maybe the line should read as follows, to be more consistent with the training data:
X_test = testing_data.iloc[:,testing_data.columns.str.contains('|'.join(['x1','x2','x3']))]
Good luck, and let us know if this helps.
I have a large, multi-dimensional, unlabelled dataset of cars (price, mileage, horsepower, ...) in which I want to find outliers. I decided to use sklearn's OneClassSVM to build a decision boundary, and I have two main issues with my approach:
My dataset contains a lot of missing values. Is there a way to make the SVM classify a record with missing features as an inlier if any possible values for the missing features would make it an inlier?
I now want to add a feedback loop of manually moderated outliers. The manually moderated data should improve the classification of the SVM. I've read about the LabelSpreading model for semi-supervised learning. Would it be feasible to feed the classification output of the OneClassSVM to the LabelSpreading model and retrain this model once a sufficient number of records have been manually validated?
For the first question: you could use scikit-learn's imputer (sklearn.preprocessing.Imputer in older versions, sklearn.impute.SimpleImputer in current ones) to impute the missing values with the mean or median:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
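A minimal sketch of median imputation with the newer API (X here stands for your numeric car features):
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')  # replace NaNs with the column median
X_imputed = imputer.fit_transform(X)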
You could also add some boolean features that record whether any of the other features had NaNs. So if you have features X_1 and X_2, you add the boolean features
X_1_has_NaN and X_2_has_NaN
that are 1 if X_1 or X_2, respectively, is NaN. If X is your original pd.DataFrame, you can create them as follows:
X = pd.DataFrame()
# Create your features here
# Get the locations of the NaNs
X_2 = 1.0 * X.isnull()
# Rename columns
X_2.rename(columns=lambda x: str(x)+"_has_NaN", inplace=True)
# Paste them together
X = pd.concat([X, X_2], axis=1)