How to deal with dataset that contains both discrete and continuous data - python

I was training a model that contains 8 features that allows us to predict the probability of a room been sold.
Region: The region the room belongs to (an integer, taking value between 1 and 10)
Date:The date of stay (an integer between 1‐365, here we consider only one‐day
request)
Weekday: Day of week (an integer between 1‐7)
Apartment: Whether the room is a whole apartment (1) or just a room (0)
#beds:The number of beds in the room (an integer between 1‐4)
Review: Average review of the seller (a continuous variable between 1 and 5)
Pic Quality: Quality of the picture of the room (a continuous variable between 0 and 1)
Price: he historic posted price of the room (a continuous variable)
Accept:Whether this post gets accepted (someone took it, 1) or not (0) in the end
Column Accept is the "y". Hence, this is a binary classification.
We have plot the data and some of the data were skewed so we applied power transform.
We tried a neural network, ExtraTrees, XGBoost, Gradient boost, Random forest. They all gave about 0.77 AUC. However, when we tried them on the test set, the AUC dropped to 0.55 with a precision of 27%.
I am not sure where when wrong but my thinking was that the reason may due to the mixing of discrete and continuous data. Especially some of them are either 0 or 1.
Can anyone help?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
df_train = pd.read_csv('case2_training.csv')
X, y = df_train.iloc[:, 1:-1], df_train.iloc[:, -1]
y = y.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
transform_list = ['Pic Quality', 'Review', 'Price']
X_train[transform_list] = pt.fit_transform(X_train[transform_list])
X_test[transform_list] = pt.transform(X_test[transform_list])
for i in transform_list:
df = X_train[i]
ax = df.plot.hist()
ax.set_title(i)
plt.show()
# Normalization
sc = MinMaxScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
from torch import nn
from skorch import NeuralNetBinaryClassifier
import torch
model = nn.Sequential(
nn.Linear(8,64),
nn.BatchNorm1d(64),
nn.GELU(),
nn.Linear(64,32),
nn.BatchNorm1d(32),
nn.GELU(),
nn.Linear(32,16),
nn.BatchNorm1d(16),
nn.GELU(),
nn.Linear(16,1),
# nn.Sigmoid()
)
net = NeuralNetBinaryClassifier(
model,
max_epochs=100,
lr=0.1,
# Shuffle training data on each epoch
optimizer=torch.optim.Adam,
iterator_train__shuffle=True,
)
net.fit(X_train, y_train)
from xgboost.sklearn import XGBClassifier
clf = XGBClassifier(silent=0,
learning_rate=0.01,
min_child_weight=1,
max_depth=6,
objective='binary:logistic',
n_estimators=500,
seed=1000)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
Here is an attachment of a screenshot of the data.
Sample data

This is the fundamental first step of Data Analytics. You need to do two things here:
Data understanding - do the data fields in their current format make sense (data types, value range etc.)
Data preparation - what should I do to update these data fields before passing them to our model? Also which inputs do you think will be useful for your model and which will provide little benefit? Are there outliers I need to consider/handle?
A good book if you're starting in the field of data analytics is Fundamentals of Machine Learning for Predictive Data Analytics (I have no affiliation with this book).
Looking at your dataset there's a couple of things you could try to see how it influences your prediction results:
Unless region order is actually ranked in importance/value I would change this to a one hot encoded feature, you can do this in sklearn. Otherwise you run the risk of your model thinking that regions with a higher number (say 10) are more important than regions with a lower value (say 1).
You could attempt to normalise certain categories if they are much larger than some of your other data fields Why Data Normalization is necessary for Machine Learning models
Consider looking at the Kaggle competition House Prices: Advanced Regression Techniques. It's doing a similar thing to what you're attempting to do, and it might have some pointers for how you should approach the problem in the Notebooks and Discussion tabs.

Without deeply exploring all the data you are using it is hard to say for certain what is causing the drop in accuracy (or AUC) when moving from your training set to the testing set. It is unlikely to be caused by the mixed discrete/continuous data.
The drop just suggests that your models are over-fitting to your training data (and therefore not transferring well). This could be caused by too many learned parameters (given the amount of data you have)--more often a problem with neural networks than with some of the other methods you mentioned. Or, the problem could be with the way the data was split into training/testing. If the distribution of the data has a significant difference (that's maybe not obvious) then you wouldn't expect the testing performance to be as good. If it were me, I'd look carefully at how the data was split into training/testing (assuming you have a reasonably large set of data). You may try repeating your experiments with a number of random training/testing splits (search k-fold cross validation if you're not familiar with it).

your model is overfit. try to make a simple model first and use a lower parameter value. for tree-based classification, scaling does not have any impact on the model.

Related

Why is this accuracy of this Random forest sentiment classification so low?

I want to use RandomForestClassifier for sentiment classification. The x contains data in string text, so I used LabelEncoder to convert strings. Y contains data in numbers. And my code is this:
import pandas as pd
import numpy as np
from sklearn.model_selection import *
from sklearn.ensemble import *
from sklearn import *
from sklearn.preprocessing.label import LabelEncoder
data = pd.read_csv('data.csv')
x = data['Reviews']
y = data['Ratings']
le = LabelEncoder()
x_encoded = le.fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(x_encoded,y, test_size = 0.2)
x_train = x_train.reshape(-1,1)
x_test = x_test.reshape(-1,1)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
Then I printed out the accuracy like below:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
And here's the output:
Accuracy: 0.5975
I have read that Random forests has high accuracy, because of the number of decision trees participating in the process. But I think that the accuracy is much lower than it should be. I have looked for some similar questions on Stack Overflow, but I couldn't find a solution for my problem.
Is there any problem in my code using Random Forest library? Or is there any exceptions of cases when using Random forest?
It is not a problem regarding Random Forests or the library, it is rather a problem how you transform your text input into a feature or feature vector.
What LabelEncoding does is; given some labels like ["a", "b", "c"] it transforms those labels into numeric values between 0 and n-1 with n-being the number of distinct input labels. However, I assume Reviews contain texts and not pure labels so to say. This means, all your reviews (if not 100% identical) are transformed into different labels. Eventually, this leads to your classifier doing random stuff. give that input. This means you need something different to transform your textual input into a numeric input that Random Forests can work on.
As a simple start, you can try something like TfIDF or also some simple count vectorizer. Those are available from sklearn https://scikit-learn.org/stable/modules/feature_extraction.html section 6.2.3. Text feature extraction. There are more sophisticated ways of transforming texts into numeric vectors but that should be a good start for you to understand what has to happen conceptually.
A last important note is that you fit those vectorizers only on the training set and not on the full dataset. Otherwise, you might leak information from training to evaluation/testing. A good way of doing this would be to build a sklearn pipeline that consists of a feature transformation step and the classifier.

Looking for ways and experiences to improve Keras model accuracy

I am an electrical engineer and I am looking for a solution to calculate the DC current of a permanent synchronous motor. So I decided to check the ANN solutions with Keras and so on.Long story short, I'll show you a screenshot of some measured signals.
The first 5 signals are the measured signals. The last one is the DC current, which I will estimate. Here the value was recorded with the help of a current clamp. Okay, I started building a model in Python and tried some things that I assume will increase the accuracy of the model. But after all that, I am not getting that good results from the model and my hope is that maybe I am choosing wrong parameters or not an ideal model for this purpose.
Here is my code:
import numpy as np
from keras.layers import Dense, LSTM
from keras.models import Sequential
from keras.callbacks import EarlyStopping
import pandas as pd
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from matplotlib import pyplot as plt
import seaborn as sns
# Import input (x) and output (y) data, and asign these to df1 and df1
df = pd.read_csv('train_data.csv')
df = df[['rpm','iq','uq','udc','idc']]
X = df[df.columns[:-1]]
Y = df.idc
plt.figure()
sns.heatmap(df.corr(),annot=True)
plt.show()
# Split the data into input (x) training and testing data, and ouput (y) training and testing data,
# with training data being 80% of the data, and testing data being the remaining 20% of the data
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)#, shuffle=True)
# Scale both training and testing input data
X_train = preprocessing.maxabs_scale(X_train)
X_test = preprocessing.maxabs_scale(X_test)
model = Sequential()
model.add(Dense(4, input_shape=(4,)))
model.add(Dense(4, input_shape=(4,)))
model.add(Dense(1, input_shape=(4,)))
model.compile(optimizer="adam", loss="msle", metrics=['mean_squared_logarithmic_error','accuracy'])
# Pass several parameters to 'EarlyStopping' function and assign it to 'earlystopper'
earlystopper = EarlyStopping(monitor='val_loss', min_delta=0, patience=15, verbose=1, mode='auto')
model.summary()
history = model.fit(X_train, y_train, epochs = 2000, validation_split = 0.3, verbose = 2, callbacks = [earlystopper])
# Runs model (the one with the activation function, although this doesn't really matter as they perform the same)
# with its current weights on the training and testing data
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Calculates and prints r2 score of training and testing data
print("The R2 score on the Train set is:\t{:0.3f}".format(r2_score(y_train, y_train_pred)))
print("The R2 score on the Test set is:\t{:0.3f}".format(r2_score(y_test, y_test_pred)))
df = pd.read_csv('test_two_data.csv')
df = df[['rpm','iq','uq','udc','idc']]
X = df[df.columns[:-1]]
Y = df.idc
X_validate = preprocessing.maxabs_scale(X)
y_pred = model.predict(X_validate)
plt.plot(Y)
plt.plot(y_pred)
plt.show()
(weight_0,bias_0) = model.layers[0].get_weights()
(weight_1,bias_1) = model.layers[1].get_weights()
One limitation is that I can't use LSTM layers or other complex algorithms because I need to implement the trained model in a microcontroller on a motor application later.
I guess you could find some words for me to make my model a little better in accuracy.
At the end here is a figure where I show you the worse prediction performance. Orange is the prediction and blue is the measured current.
The training dataset was this one.
The correlation between the individual values can be found here. Since the values of id and ud have no correlation to idc, I decided to delete them.
The most important thing to keep in mind when trying to improve the accuracy of the model is ALWAYS Normalise the input data which basically means rescaling real-valued numeric attributes into the range 0 and 1. I am not able to understand the way you are providing the training data to the model. Could you please explain that. It would be better in understanding and identifying the scope of higher accuracy.
Now if we talk about parameters, I would suggest you the addition of a Tuning Algorithm for the parameters to get the optimized value of each parameter.
It is always a good parctice to include hidden layers which could provide better feature extract.

Limitations of Regression in Machine Learning?

I've been learning some of the core concepts of ML lately and writing code using the Sklearn library. After some basic practice, I tried my hand at the AirBnb NYC dataset from kaggle (which has around 40000 samples) - https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data#New_York_City_.png
I tried to make a model that could predict the price of a room/apt given the various features of the dataset. I realised that this was a regression problem and using this sklearn cheat-sheet, I started trying the various regression models.
I used the sklearn.linear_model.Ridge as my baseline and after doing some basic data cleaning, I got an abysmal R^2 score of 0.12 on my test set. Then I thought, maybe the linear model is too simplistic so I tried the 'kernel trick' method adapted for regression (sklearn.kernel_ridge.Kernel_Ridge) but they would take too much time to fit (>1hr)! To counter that, I used the sklearn.kernel_approximation.Nystroem function to approximate the kernel map, applied the transformation to the features prior to training and then used a simple linear regression model. However, even that took a lot of time to transform and fit if I increased the n_components parameter which I had to to get any meaningful increase in the accuracy.
So I am thinking now, what happens when you want to do regression on a huge dataset? The kernel trick is extremely computationally expensive while the linear regression models are too simplistic as real data is seldom linear. So are neural nets the only answer or is there some clever solution that I am missing?
P.S. I am just starting on Overflow so please let me know what I can do to make my question better!
This is a great question but as it often happens there is no simple answer to complex problems. Regression is not a simple as it is often presented. It involves a number of assumptions and is not limited to linear least squares models. It takes couple university courses to fully understand it. Below I'll write a quick (and far from complete) memo about regressions:
Nothing will replace proper analysis. This might involve expert interviews to understand limits of your dataset.
Your model (any model, not limited to regressions) is only as good as your features. If home price depends on local tax rate or school rating, even a perfect model would not perform well without these features.
Some features cannot be included in the model by design, so never expect a perfect score in real world. For example, it is practically impossible to account for access to grocery stores, eateries, clubs etc. Many of these features are also moving targets, as they tend to change over time. Even 0.12 R2 might be great if human experts perform worse.
Models have their assumptions. Linear regression expects that dependent variable (price) is linearly related to independent ones (e.g. property size). By exploring residuals you can observe some non-linearities and cover them with non-linear features. However, some patterns are hard to spot, while still addressable by other models, like non-parametric regressions and neural networks.
So, why people still use (linear) regression?
it is the simplest and fastest model. There are a lot of implications for real-time systems and statistical analysis, so it does matter
often it is used as a baseline model. Before trying a fancy neural network architecture, it would be helpful to know how much we improve comparing to a naive method.
sometimes regressions are used to test certain assumptions, e.g. linearity of effects and relations between variables
To summarize, regression is definitely not the ultimate tool in most cases, but this is usually the cheapest solution to try first
UPD, to illustrate the point about non-linearity.
After building a regression you calculate residuals, i.e. regression error predicted_value - true_value. Then, for each feature you make a scatter plot, where horizontal axis is feature value and vertical axis is the error value. Ideally, residuals have normal distribution and do not depend on the feature value. Basically, errors are more often small than large, and similar across the plot.
This is how it should look:
This is still normal - it only reflects the difference in density of your samples, but errors have the same distribution:
This is an example of nonlinearity (a periodic pattern, add sin(x+b) as a feature):
Another example of non-linearity (adding squared feature should help):
The above two examples can be described as different residuals mean depending on feature value. Other problems include but not limited to:
different variance depending on feature value
non-normal distribution of residuals (error is either +1 or -1, clusters, etc)
Some of the pictures above are taken from here:
http://www.contrib.andrew.cmu.edu/~achoulde/94842/homework/regression_diagnostics.html
This is an great read on regression diagnostics for beginners.
I'll take a stab at this one. Look at my notes/comments embedded in the code. Keep in mind, this is just a few ideas that I tested. There are all kinds of other things you can try (get more data, test different models, etc.)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#%matplotlib inline
import sklearn
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso
from sklearn.datasets import load_boston
#boston = load_boston()
# Predicting Continuous Target Variables with Regression Analysis
df = pd.read_csv('C:\\your_path_here\\AB_NYC_2019.csv')
df
# get only 2 fields and convert non-numerics to numerics
df_new = df[['neighbourhood']]
df_new = pd.get_dummies(df_new)
# print(df_new.columns.values)
# df_new.shape
# df.shape
# let's use a feature selection technique so we can see which features (independent variables) have the highest statistical influence on the target (dependent variable).
from sklearn.ensemble import RandomForestClassifier
features = df_new.columns.values
clf = RandomForestClassifier()
clf.fit(df_new[features], df['price'])
# from the calculated importances, order them from most to least important
# and make a barplot so we can visualize what is/isn't important
importances = clf.feature_importances_
sorted_idx = np.argsort(importances)
# what kind of object is this
# type(sorted_idx)
padding = np.arange(len(features)) + 0.5
plt.barh(padding, importances[sorted_idx], align='center')
plt.yticks(padding, features[sorted_idx])
plt.xlabel("Relative Importance")
plt.title("Variable Importance")
plt.show()
X = df_new[features]
y = df['price']
reg = LassoCV()
reg.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % reg.alpha_)
print("Best score using built-in LassoCV: %f" %reg.score(X,y))
coef = pd.Series(reg.coef_, index = X.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " + str(sum(coef == 0)) + " variables")
Result:
Best alpha using built-in LassoCV: 0.040582
Best score using built-in LassoCV: 0.103947
Lasso picked 78 variables and eliminated the other 146 variables
Next step...
imp_coef = coef.sort_values()
import matplotlib
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
# get the top 25; plotting fewer features so we can actually read the chart
type(imp_coef)
imp_coef = imp_coef.tail(25)
matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)
imp_coef.plot(kind = "barh")
plt.title("Feature importance using Lasso Model")
X = df_new
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
Result:
MAE 1004799260.0756996
MSE 9.87308783180938e+21
RMSE 99363412943.64531
R squared error -2.603867717517002e+17
This is horrible! Well, we know this doesn't work. Let's try something else. We still need to rowk with numeric data so let's try lng and lat coordinates.
X = df[['longitude','latitude']]
y = df['price']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10)
# Training the Model
# We will now train our model using the LinearRegression function from the sklearn library.
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_train, y_train)
# Prediction
# We will now make prediction on the test data using the LinearRegression function and plot a scatterplot between the test data and the predicted value.
prediction = lm.predict(X_test)
plt.scatter(y_test, prediction)
df1 = pd.DataFrame({'Actual': y_test, 'Predicted':prediction})
df2 = df1.head(10)
df2
df2.plot(kind = 'bar')
from sklearn import metrics
from sklearn.metrics import r2_score
print('MAE', metrics.mean_absolute_error(y_test, prediction))
print('MSE', metrics.mean_squared_error(y_test, prediction))
print('RMSE', np.sqrt(metrics.mean_squared_error(y_test, prediction)))
print('R squared error', r2_score(y_test, prediction))
# better but not awesome
Result:
MAE 85.35438165291622
MSE 36552.6244271195
RMSE 191.18740655994972
R squared error 0.03598346983552425
Let's look at OLS:
import statsmodels.api as sm
model = sm.OLS(y, X).fit()
# run the model and interpret the predictions
predictions = model.predict(X)
# Print out the statistics
model.summary()
I would hypothesize the following:
One hot encoding is doing exactly what it is supposed to do, but it is not helping you get the results you want. Using lng/lat, is performing slightly better, but this too, is not helping you achieve the results you want. As you know, you must work with numeric data for a regression problem, but none of the features is helping you to predict price, at least not very well. Of course, I could have made a mistake somewhere. If I did make a mistake, please let me know!
Check out the links below for a good example of using various features to predict housing prices. Notice: all variables are numeric, and the results are pretty decent (just around 70%, give or take, but still much better than what we're seeing with the Air BNB data set).
https://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/
https://towardsdatascience.com/linear-regression-on-boston-housing-dataset-f409b7e4a155

How to weigh data points with sklearn training algorithms

I am looking to train either a random forest or gradient boosting algorithm using sklearn. The data I have is structured in a way that it has a variable weight for each data point that corresponds to the amount of times that data point occurs in the dataset. Is there a way to give sklearn this weight during the training process, or do I need to expand my dataset to a non-weighted version that has duplicate data points each represented individually?
You can definitely specify the weights while training these classifiers in scikit-learn. Specifically, this happens during the fit step. Here is an example using RandomForestClassifier but the same goes also for GradientBoostingClassifier:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np
data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 42)
Here I define some arbitrary weights just for the sake of the example:
weights = np.random.choice([1,2],len(y_train))
And then you can fit your model with these models:
rfc = RandomForestClassifier(n_estimators = 20, random_state = 42)
rfc.fit(X_train,y_train, sample_weight = weights)
You can then evaluate your model on your test data.
Now, to your last point, you could in this example resample your training set according to the weights by duplication. But in most real world examples, this could end up being very tedious because
you would need to make sure all your weights are integers to perform duplication
you would have to uselessly multiply the size of your data, which is memory-consuming and is most likely going to slow down the training procedure

Random forest with unbalanced class (positive is minority class), low precision and weird score distributions

I have a very unbalanced dataset (5000 positive, 300000 negative). I am using sklearn RandomForestClassifier to try and predict the probability of the positive class. I have data for multiple years and one of the features I've engineered is the class in the previous year, so I am withholding the last year of the dataset to test on in addition to my test set from within the years I'm training on.
Here is what I've tried (and the result):
Upsampling with SMOTE and SMOTEENN (weird score distributions, see first pic, predicted probabilities for positive and negative class are both the same, i.e., the model predicts a very low probability for most of the positive class)
Downsampling to a balanced dataset (recall is ~0.80 for the test set, but 0.07 for the out-of-year test set from sheer number of total negatives in the unbalanced out of year test set, see second pic)
Leave it unbalanced (weird scoring distribution again, precision goes up to ~0.60 and recall falls to 0.05 and 0.10 for test and out-of-year test set)
XGBoost (slightly better recall on the out-of-year test set, 0.11)
What should I try next? I'd like to optimize for F1, as both false positives and false negatives are equally bad in my case. I would like to incorporate k-fold cross validation and have read I should do this before upsampling, a) should I do this/is it likely to help and b) how can I incorporate this into a pipeline similar to this:
from imblearn.pipeline import make_pipeline, Pipeline
clf_rf = RandomForestClassifier(n_estimators=25, random_state=1)
smote_enn = SMOTEENN(smote = sm)
kf = StratifiedKFold(n_splits=5)
pipeline = make_pipeline(??)
pipeline.fit(X_train, ytrain)
ypred = pipeline.predict(Xtest)
ypredooy = pipeline.predict(Xtestooy)
Upsampling with SMOTE and SMOTEENN : I am far from being an expert with those but by upsampling your dataset you might amplify existing noise which induce overfitting. This could explain the fact that your algorithm cannot correctly classify, thus giving the results in the first graph.
I found a little bit more info here and maybe how to improve your results:
https://sci2s.ugr.es/sites/default/files/ficherosPublicaciones/1773_ver14_ASOC_SMOTE_FRPS.pdf
When you downsample you seem to encounter the same overfitting problem as I understand it (at least for the target result of the previous year). It is hard to deduce the reason behind it without a view on the data though.
Your overfitting problem might come from the number of features you use that could add unnecessary noise. You might try to reduce the number of features you use and gradually increase it (using a RFE model). More info here:
https://machinelearningmastery.com/feature-selection-in-python-with-scikit-learn/
For the models you used, you mention Random Forest and XGBoost, but you did not mention having used simpler model. You could try simpler model and focus on you data engineering.
If you have not try it yet, maybe you could:
Downsample your data
Normalize all your data with a StandardScaler
Test "brute force" tuning of simple models such as Naive Bayes and Logistic Regression
# Define steps of the pipeline
steps = [('scaler', StandardScaler()),
('log_reg', LogisticRegression())]
pipeline = Pipeline(steps)
# Specify the hyperparameters
parameters = {'C':[1, 10, 100],
'penalty':['l1', 'l2']}
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
random_state=42)
# Instantiate a GridSearchCV object: cv
cv = GridSearchCV(pipeline, param_grid=parameters)
# Fit to the training set
cv.fit(X_train, y_train)
Anyway, for your example the pipeline could be (I made it with Logistic Regression but you can change it with another ML algorithm and change the parameters grid consequently):
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from imblearn.combine import SMOTEENN
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
param_grid = {'C': [1, 10, 100]}
clf = LogisticRegression(solver='lbfgs', multi_class = 'auto')
sme = SMOTEENN(smote = SMOTE(k_neighbors = 2), random_state=42)
grid = GridSearchCV(estimator=clf, param_grid = param_grid, score = "f1")
pipeline = Pipeline([('scale', StandardScaler()),
('SMOTEENN', sme),
('grid', grid)])
cv = StratifiedKFold(n_splits = 4, random_state=42)
score = cross_val_score(pipeline, X, y, cv=cv)
I hope this may help you.
(edit: I added score = "f1" in the GridSearchCV)

Categories

Resources