Classification model on large dataset - python

I would like to implement a classification model on a dataset with n = 3,000,000 rows and 12 columns. The problem is that it is very slow: after hours I still don't get any result. Do you have a recommendation to make it run faster?
Thanks
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

df = pd.DataFrame(np.random.randint(0, 100, size=(3000000, 12)), columns=list('ABCDEFGHIJKL'))
X = df.drop(['L'], axis=1)
y = df['L']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
parameters = {'n_neighbors': np.arange(1, 30)}
grid = GridSearchCV(KNeighborsClassifier(), parameters, cv=5)
grid.fit(X_train, y_train)

Use more cores, i.e. set n_jobs=-1 in both GridSearchCV and KNeighborsClassifier:
parameters = {'n_neighbors': np.arange(1, 30)}
grid = GridSearchCV(KNeighborsClassifier(n_jobs=-1), parameters, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

Another suggestion, besides cutting down the large range of n_neighbors values in the grid: build a model from a smaller sample of the data. If KNeighborsClassifier does not look promising on one million observations, it might not be worth the time and resources to try it on three million.
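For example, one could run the search on a fraction of the rows first. A minimal sketch, reusing df and parameters from the question (the 10% fraction is an arbitrary choice):
df_small = df.sample(frac=0.1, random_state=0)  # 300,000 rows instead of 3,000,000
X_small = df_small.drop(['L'], axis=1)
y_small = df_small['L']
X_train, X_test, y_train, y_test = train_test_split(X_small, y_small, test_size=0.2)
grid = GridSearchCV(KNeighborsClassifier(n_jobs=-1), parameters, cv=5, n_jobs=-1)
grid.fit(X_train, y_train)
If grid.best_score_ looks reasonable here, it may be worth paying for the full run; if not, a different estimator is probably a better use of the compute.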

Related

Evaluate Polynomial regression using cross_val_score

I am trying to use cross_val_score to evaluate my regression model (with PolynomialFeatures(degree=2)). I noted from different blog posts that I should use cross_val_score with the original X, y values, not X_train and y_train.
r_squareds = cross_val_score(pipe, X, y, cv=10)
r_squareds
>>> array([ 0.74285583,  0.78710331, -1.67690578,  0.68890253,  0.63120873,
            0.74753825,  0.13937611,  0.18794756, -0.12916661,  0.29576638])
which indicates my model doesn't perform very well, with a mean R² of only 0.241. Is this the correct interpretation?
However, I came across a Kaggle notebook working on the same data where the author ran cross_val_score on X_train and y_train. I gave this a try and the average R² was better.
r_squareds = cross_val_score(pipe, X_train, y_train, cv=10)
r_squareds.mean()
>>> 0.673
Is this supposed to be a problem?
Here is the code for my model:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X = df[['CHAS', 'RM', 'LSTAT']]
y = df['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
pipe = Pipeline(
    steps=[('poly_feature', PolynomialFeatures(degree=2)),
           ('model', LinearRegression())]
)
## fit the model
pipe.fit(X_train, y_train)
Your first interpretation is correct. The first cross_val_score trains 10 models, each with 90% of your data as the train set and 10% as the validation set. We can see from these results that the R² variance across folds is quite high; sometimes the model performs even worse than a constant prediction of the mean (negative R²).
From this result we can safely say that the model is not performing well on this dataset.
It is possible that the result obtained using only the train set in your cross_val_score is higher, but this score is most likely not representative of your model's performance, as the dataset might be too small to capture all of its variance. (The train set within the second cross_val_score is only 54% of your dataset: 90% of the 60% kept by train_test_split.)
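That spread is easy to quantify from the fold scores themselves; a minimal sketch, reusing the r_squareds array from above:
print('mean R^2:', r_squareds.mean())  # about 0.241, as noted above
print('std R^2: ', r_squareds.std())   # a large spread across folds signals an unstable estimate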

Increase in MSE of cross-validation model with increasing dataset size for regression

I have the following experimental setup for a regression problem.
Using the following routine, a data set of about 1800 entries is separated into three groups: training, validation, and test.
X_train, X_test, y_train, y_test = train_test_split(inputs, targets, test_size=0.2,
                                                    random_state=42, shuffle=True)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25,
                                                  random_state=42, shuffle=True)
So in essence, the training size is ~1100 and the validation and test sizes are ~350 each, and each subset contains a unique set of data points not seen in the other subsets.
With these subsets, I can perform a fit using any of the regression models available in scikit-learn, using the following routine:
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
Doing this, I then calculate the RMSE of the predictions, which in the case of the linear regressor is about 0.948.
Now, I could instead use cross-validation and not worry about splitting the data, using the following routine:
from sklearn.model_selection import KFold, cross_val_predict

model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
predictions2 = cross_val_predict(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=42))
However, when I calculate the RMSE of these predictions, it is about 2.4! To compare, I tried a similar routine, but switched X for X_train and y for y_train, i.e.,
model = LinearRegression()
clf = make_pipeline(StandardScaler(), model)
predictions3 = cross_val_predict(clf, X_train, y_train, cv=KFold(n_splits=10, shuffle=True, random_state=42))
and received an RMSE of about 0.956.
I really do not understand why the cross-validation RMSE is so much higher when using the entire data set, and why the predictions are so much worse than with the reduced data set.
Additional Notes
Additionally, I have tried running the above routine using the reduced subset X_val, y_val as input for the cross-validation, and still receive a small RMSE. And when I simply fit a model on the reduced subset X_val, y_val and then make predictions on X_train, y_train, the RMSE is still better (lower) than the cross-validation RMSE!
This does not happen only with LinearRegression, but also with RandomForestRegressor and others. I have additionally tried changing the random state in the splitting, as well as completely shuffling the data before handing it to train_test_split, but the same outcome occurs.
Edit 1.)
I tested this on a make_regression data set from scikit-learn and did not get the same results; rather, all the RMSEs are small and similar. My guess is that it has to do with my data set.
If anyone could help me out in understanding this, I would greatly appreciate it.
Edit 2.)
Hi, thank you (@desertnaut) for the suggestions. The solution was actually quite easy: in my routine to process the data, I was using (targets, inputs) = (X, y), which is really wrong. I swapped that to (targets, inputs) = (y, X), and now the RMSE is about the same as in the other profiles. I made a histogram profile of the data and found the problem. Thanks! I'll keep the question up for about 1 hour, then delete it.
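In other words, the tuple unpacking had silently swapped the features and the target before the data ever reached train_test_split:
(targets, inputs) = (X, y)  # the bug: the features land in targets, the labels in inputs
(targets, inputs) = (y, X)  # the fix: targets = y, inputs = X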
You're overfitting.
Imagine you had 10 data points and 10 parameters; the RMSE would be zero, because the model could fit the data perfectly. Now increase the data points to 100 and the RMSE will increase (assuming there is some variance in the data you are adding, of course), because your model no longer fits the data perfectly.
A low RMSE (or a high R-squared) more often than not doesn't mean jack; you need to consider the standard errors of your parameter estimates. If you are just increasing the number of parameters (or conversely, in your case, decreasing the number of observations), you are just chewing away your degrees of freedom.
I'd wager that the standard errors of the parameter estimates in your X model are smaller than those in your X_train model, even though the RMSE is "lower" in the X_train model.
Edit: I'll add that your dataset exhibits high multicollinearity.
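A tiny synthetic illustration of the n = p point above (a sketch on made-up data, not the asker's dataset):
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(10, 9))   # 9 coefficients + intercept = 10 parameters
y_toy = rng.normal(size=10)        # pure-noise target
fit = LinearRegression().fit(X_toy, y_toy)
print(mean_squared_error(y_toy, fit.predict(X_toy)) ** 0.5)  # ~0: exact interpolation

X_more = rng.normal(size=(100, 9))  # same model, ten times the data
y_more = rng.normal(size=100)
fit = LinearRegression().fit(X_more, y_more)
print(mean_squared_error(y_more, fit.predict(X_more)) ** 0.5)  # clearly above zero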

Sklearn cross_val_score gives significantly different number than model.score?

I have a binary classification problem.
First I train/test split my data:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
I checked y_train and it has basically a 50/50 split of the two classes (1, 0), which matches the distribution of the full dataset.
When I try a base model such as:
model = RandomForestClassifier()
model.fit(X_train, y_train)
model.score(X_train, y_train)
The output is 0.98, give or take about 1% depending on the random state of the train/test split.
HOWEVER, when I try cross_val_score such as:
cross_val_score(model, X_train, y_train, cv=StratifiedKFold(shuffle=True), scoring='accuracy')
the output is
array([0.65 , 0.78333333, 0.78333333, 0.66666667, 0.76666667])
None of the scores in the array are even close to 0.98!
And when I tried scoring='r2' I got:
>>>cross_val_score(model, X_train, y_train, cv=StratifiedKFold(shuffle=True), scoring='r2')
array([-0.20133482, -0.00111235, -0.2 , -0.2 , -0.13333333])
Does anyone know why this is happening? I have tried shuffle=True and False but it doesn't help.
Thanks in advance
In your base model, you compute the score on the training corpus. While this is a proper way to check that your model has actually learnt from the data you fed it, it doesn't tell you the final accuracy of your model on new and unseen data.
Not 100% sure (I don't know scikit-learn well), but I'd expect cross_val_score to actually split X_train and y_train into a training and a testing set internally.
So, since you compute a score on data unseen during training, the accuracy will be much lower. Try comparing these results with model.score(X_test, y_test); it should be much closer.
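A minimal sketch of that comparison, reusing the names from the question:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

model = RandomForestClassifier()
model.fit(X_train, y_train)
print('train accuracy:', model.score(X_train, y_train))  # optimistic: scored on seen data
print('test accuracy: ', model.score(X_test, y_test))    # honest held-out estimate
cv_scores = cross_val_score(model, X_train, y_train,
                            cv=StratifiedKFold(shuffle=True), scoring='accuracy')
print('CV accuracy:   ', cv_scores.mean())               # should sit near the test score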

Understanding scikit's decision tree - inconsistent learning

I have been using the package tsfresh to find relevant features for time series. It outputs approximately 300 "relevant" features, each of which passes a p-value threshold for predictive power. When I train a classifier using scikit-learn's DecisionTreeClassifier(), I get some odd results: each time I run the training, it returns a tree with only two levels, and every time the features it uses are different. I am befuddled. The tree does a nice job every time, but am I not seeing all the levels?
Using this code:
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_filtered, y, test_size=.2)
cl = DecisionTreeClassifier()
cl.fit(X_train, y_train)
tree.export_graphviz(cl, out_file='tree.dot', feature_names=X.columns)
where len(X.columns) is over 300, it returns a decision tree of two levels every time.
The output of this line is random:
X_train, X_test, y_train, y_test = train_test_split(X_filtered, y, test_size=.2)
That is, every time you split the data into train and test sets, you get different sets. You can use the random_state argument to obtain a reproducible split:
X_train, X_test, y_train, y_test = train_test_split(X_filtered, y, test_size=.2, random_state=4)
Doing so should give you the same split, and therefore a tree trained on the same data, every time.
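Note that DecisionTreeClassifier also takes a random_state argument of its own: scikit-learn randomly permutes the features at each split, so ties between equally good splits are broken randomly even with the default "best" splitter. A sketch that pins down both sources of randomness:
X_train, X_test, y_train, y_test = train_test_split(X_filtered, y, test_size=.2,
                                                    random_state=4)
cl = DecisionTreeClassifier(random_state=4)  # also fixes the tie-breaking among splits
cl.fit(X_train, y_train)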

Stratified Train/Validation/Test-split in scikit-learn

There is already a description of how to do a stratified train/test split in scikit-learn via train_test_split (Stratified Train/Test-split in scikit-learn), and a description of how to do a random train/validation/test split via np.split (How to split data into 3 sets (train, validation and test)?). But what about a stratified train/validation/test split?
The closest approximation that comes to mind for a stratified (on class label) train/validation/test split is as follows, but I suspect there's a better way that can perhaps achieve this in one function call or more accurately:
Let's say we want a 60/20/20 train/validation/test split. My current approach is to first do a 60/40 stratified split, then do a 50/50 stratified split on that 40% to ultimately get a 60/20/20 stratified split.
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older scikit-learn releases
SEED = 2000
x_train, x_validation_and_test, y_train, y_validation_and_test = train_test_split(x, y, test_size=.4, random_state=SEED)
x_validation, x_test, y_validation, y_test = train_test_split(x_validation_and_test, y_validation_and_test, test_size=.5, random_state=SEED)
Please let me know if my approach is correct and/or if you have a better one.
Thank you
The solution is to just use StratifiedShuffleSplit twice, like below:
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
for train_index, test_valid_index in split.split(df, df.target):
    train_set = df.iloc[train_index]
    test_valid_set = df.iloc[test_valid_index]

split2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for test_index, valid_index in split2.split(test_valid_set, test_valid_set.target):
    test_set = test_valid_set.iloc[test_index]
    valid_set = test_valid_set.iloc[valid_index]
Yes, this is exactly how I would do it: running train_test_split() twice. Think of the first call as splitting off your training set, and then that training set may get divided into different folds or holdouts down the line.
In fact, if you end up testing your model with a scikit-learn estimator that includes built-in cross-validation, you may not even have to explicitly run train_test_split() again. The same goes if you use the (very handy!) model_selection.cross_val_score function.
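For reference, the same two-call pattern with explicit stratification (a sketch: the stratify argument of train_test_split keeps the class proportions in every subset; x, y, and SEED are as in the question):
from sklearn.model_selection import train_test_split

x_train, x_rest, y_train, y_rest = train_test_split(
    x, y, test_size=0.4, stratify=y, random_state=SEED)  # 60% train
x_validation, x_test, y_validation, y_test = train_test_split(
    x_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=SEED)  # 20% / 20% validation/test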
