Understanding scikit's decision tree - inconsistent learning

Understanding scikit's decision tree - inconsistent learning - python

I have been using the package tsfresh to find relevant features for time-series. It outputs approximately 300 "relevant" features that pass a p-test threshold for predictability for each feature. When I train a classifier using scikit's DecisionTreeClassifier() I get some odd results. Each time I execute the learning of the tree it returns a tree with only two levels, and every time the features it uses are different. I am befuddled. The tree does a nice job every time but am I not seeing all the levels?
Using this code:
from sklearn import tree
from sklearn.tree import _tree
X_train, X_test, y_train, y_test = train_test_split(X_filtered, y, test_size=.2)
cl = DecisionTreeClassifier()
cl.fit(X_train, y_train)
tree.export_graphviz(cl,out_file='tree.dot',feature_names=X.columns)
where len(X.colums) is over 300 returns a decision tree of two levels every time.

The output of this line is random:
X_train, X_test, y_train, y_test = train_test_split(X_filtered, y, test_size=.2)
That is, every time you split the data in train and test sets, you get different sets. You can use the random_state attribute to obtain a predictable split:
X_train, X_test, y_train, y_test = train_test_split(X_filtered, y, test_size=.2, random_state=4)
Doing so should give you the same split features always for the tree.

Related

How do I properly fit a sci-kit learn model using a pandas dataframe?

I am trying to create a machine learning program in sci-kit learn. I am using a CSV file to store data, and have decided to use Pandas data frame to import and format this data. I cannot figure out how to fit this data frame with the model.
My CSV file has one feature, age, and one target, weight. I am using a linear regression algorithm to predict the weight using the age. I do realize this isn't the best algorithm to use with this data.
When I run this code I get the error "ValueError: Found input variables with inconsistent numbers of samples: [10, 40]"
Here is my code:
# Imports
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Load And Split Data
data = pd.read_csv("awd.csv")
feature_cols = ['Ages']
X = data.loc[:, feature_cols]
y = data.loc[:, "Weights"]
X_train, y_train, X_test, y_test = train_test_split(X, y, random_state=0, train_size=0.2)
# Train Model
lr = LinearRegression()
lr.fit(X_train, y_train)
# Scores
print(f"Test set score: {round(lr.score(X_test, y_test), 3)}")
print(f"Training set score: {round(lr.score(X_train, y_train), 3)}")
The first 5 lines of my CSV file:
Ages,Weights
1,19
1,21
2,26
2,32

You're assigning the return values incorrectly. See below:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, train_size=0.2)

You should correct the order of X_train, X_test, y_train and y_test like this:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
See the relevant documentation for details.

Classification model on large dataset

I would like to implement a classification model on a dataset where n=3000000 and 12 columns. I have a problem because it's very slow after hours I don't get anything, do you have a recommandation to run it faster ?
Thaks
df = pd.DataFrame(np.random.randint(0,100,size=(3000000, 12)), columns=list('ABCDEFGHIJKL'))
X=df.drop(['L'], axis=1)
y=df['L']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
parameters = {'n_neighbors':np.arange(1,30)}
grid=GridSearchCV(KNeighborsClassifier(),parameters,cv=5)
grid.fit(X_train, y_train)

Use more cores i.e. use n_jobs=-1 within GridSearchCV and KNeighborsClassifier.
parameters = {'n_neighbors':np.arange(1,30)}
grid=GridSearchCV(KNeighborsClassifier(n_jobs=-1),parameters,cv=5, n_jobs=-1)
grid.fit(X_train, y_train)

Another answer besides cutting down on the high number of neighbors: build a model from a smaller sample of the data. If KNeighborsClassifier does not look promising on one million observations, it might not be worth the time and resources to try it on three million.

How to stratify the training and testing data in Scikit-Learn?

I am trying to implement Classification algorithm for Iris Dataset (Downloaded from Kaggle). In the Species column the classes (Iris-setosa, Iris-versicolor , Iris-virginica) are in sorted order. How can I stratify the train and test data using Scikit-Learn?

If you want to shuffle and split your data with 0.3 test ratio, you can use
sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True)
where X is your data, y is corresponding labels, test_size is the percentage of the data that should be held over for testing, shuffle=True shuffles the data before splitting
In order to make sure that the data is equally splitted according to a column, you can give it to the stratify parameter.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
shuffle=True,
stratify = X['YOUR_COLUMN_LABEL'])

To make sure that the three classes are represented equally in your train and test, you can use the stratify parameter of the train_test_split function.
from sklearn.model_selection import train_test_split
X_train, y_train, X_test, y_test = train_test_split(X, y, stratify = y)
This will make sure that the ratio of all the classes is maintained equally.

use sklearn.model_selection.train_test_split and play around with Shuffle parameter.
shuffle: boolean, optional (default=True)
Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.

Instead of k-fold cross validation, why not just randomize the random_state?

Before I knew about k-fold cross validation, I would just set a random_state like so:
X_train, X_test, y_train, y_test = train_test_split(X,
y,
random_state=np.random.randint(1000),
test_size=0.25)
Then, I would iterate over this a bunch of times, storing my accuracy measures and then compute a confidence interval on that array.
Is there any downside to doing this?

Unexpected cross-validation scores with scikit-learn LinearRegression

I am trying to learn to use scikit-learn for some basic statistical learning tasks. I thought I had successfully created a LinearRegression model fit to my data:
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
X, y,
test_size=0.2, random_state=0)
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
print model.score(X_test, y_test)
Which yields:
0.797144744766
Then I wanted to do multiple similar 4:1 splits via automatic cross-validation:
model = linear_model.LinearRegression()
scores = cross_validation.cross_val_score(model, X, y, cv=5)
print scores
And I get output like this:
[ 0.04614495 -0.26160081 -3.11299397 -0.7326256 -1.04164369]
How can the cross-validation scores be so different from the score of the single random split? They are both supposed to be using r2 scoring, and the results are the same if I pass the scoring='r2' parameter to cross_val_score.
I've tried a number of different options for the random_state parameter to cross_validation.train_test_split, and they all give similar scores in the 0.7 to 0.9 range.
I am using sklearn version 0.16.1

It turns out that my data was ordered in blocks of different classes, and by default cross_validation.cross_val_score picks consecutive splits rather than random (shuffled) splits. I was able to solve this by specifying that the cross-validation should use shuffled splits:
model = linear_model.LinearRegression()
shuffle = cross_validation.KFold(len(X), n_folds=5, shuffle=True, random_state=0)
scores = cross_validation.cross_val_score(model, X, y, cv=shuffle)
print scores
Which gives:
[ 0.79714474 0.86636341 0.79665689 0.8036737 0.6874571 ]
This is in line with what I would expect.

train_test_split seems to generate random splits of the dataset, while cross_val_score uses consecutive sets, i.e.
"When the cv argument is an integer, cross_val_score uses the KFold or StratifiedKFold strategies by default"
http://scikit-learn.org/stable/modules/cross_validation.html
Depending on the nature of your data set, e.g. data highly correlated over the length of one segment, consecutive sets will give vastly different fits than e.g. random samples from the whole data set.

Folks, thanks for this thread.
The code in the answer above (Schneider) is outdated.
As of scikit-learn==0.19.1, this will work as expected.
from sklearn.model_selection import cross_val_score, KFold
kf = KFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(regressor, X, y, cv=kf)
Best,
M.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Understanding scikit's decision tree - inconsistent learning - python

Related

How do I properly fit a sci-kit learn model using a pandas dataframe?

Classification model on large dataset

How to stratify the training and testing data in Scikit-Learn?

Instead of k-fold cross validation, why not just randomize the random_state?

Unexpected cross-validation scores with scikit-learn LinearRegression

Categories

Resources