Unknown label type: 'continuous' - python

Hello team,
I am having an issue with the following data:
----------------------
   Avg. Session Length  Time on App  Time on Website  Length of Membership  Yearly Amount Spent
0            34.497268    12.655651        39.577668              4.082621           587.951054
1            31.926272    11.109461        37.268959              2.664034           392.204933
2            33.000915    11.330278        37.110597              4.104543           487.547505
3            34.305557    13.717514        36.721283              3.120179           581.852344
4            33.330673    12.795189        37.536653              4.446308           599.406092
5            33.871038    12.026925        34.476878              5.493507           637.102448
6            32.021596    11.366348        36.683776              4.685017           521.572175
I want to apply KNN:
X = df[['Avg. Session Length', 'Time on App','Time on Website', 'Length of Membership']]
y = df['Yearly Amount Spent']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33,
random_state=42)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
ValueError: Unknown label type: 'continuous'

The values in the Yearly Amount Spent column are real numbers, so they cannot serve as labels for a classification problem (see here):
When doing classification in scikit-learn, y is a vector of integers
or strings.
Hence you get the error. If you want to build a classification model, you need to decide how you transform them into a finite set of labels.
Note that if you just want to avoid the error, you could do
import numpy as np
y = np.asarray(df['Yearly Amount Spent'], dtype="|S6")
This will transform the values in y into strings of the required format. Yet every label will then appear in only one sample, so you cannot really build a meaningful model with such a set of labels.

I think you are actually trying to do a regression rather than a classification, since your code pretty much looks like you want to predict
the yearly amount spent as a number. In this case, use
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=1)
instead. If you really have a classification task, for example you want to classify into classes like ('yearly amount spent is low', 'yearly amount spent is high', ...), you should discretize the labels and convert them into strings or integers (as explained by @Miriam Farber), according to thresholds that you need to set manually in this case.
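As a rough illustration of both options, here is a minimal sketch built on the seven sample rows from the question. The bin edges (500 and 600) and the class names are assumptions chosen only for this example, not thresholds suggested by the data, and the models are fitted on all rows without a train/test split to keep the sketch short.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

# The seven sample rows shown in the question.
df = pd.DataFrame({
    "Avg. Session Length": [34.497268, 31.926272, 33.000915, 34.305557,
                            33.330673, 33.871038, 32.021596],
    "Time on App": [12.655651, 11.109461, 11.330278, 13.717514,
                    12.795189, 12.026925, 11.366348],
    "Time on Website": [39.577668, 37.268959, 37.110597, 36.721283,
                        37.536653, 34.476878, 36.683776],
    "Length of Membership": [4.082621, 2.664034, 4.104543, 3.120179,
                             4.446308, 5.493507, 4.685017],
    "Yearly Amount Spent": [587.951054, 392.204933, 487.547505, 581.852344,
                            599.406092, 637.102448, 521.572175],
})
X = df[["Avg. Session Length", "Time on App",
        "Time on Website", "Length of Membership"]]
y = df["Yearly Amount Spent"]

# Option 1: keep the continuous target and fit a regressor.
reg = KNeighborsRegressor(n_neighbors=1).fit(X, y)

# Option 2: discretize the target into classes first.
# The bin edges below are hypothetical and must be chosen for the real task.
y_class = pd.cut(
    y, bins=[0, 500, 600, float("inf")], labels=["low", "medium", "high"]
).astype(str)
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y_class)

print(reg.predict(X.iloc[[0]]))
print(clf.predict(X.iloc[[1]]))
```

With real data you would fit on X_train and y_train only and evaluate on the held-out split, as in the question.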

Related

Why is this accuracy of this Random forest sentiment classification so low?

I want to use RandomForestClassifier for sentiment classification. The x contains data in string text, so I used LabelEncoder to convert strings. Y contains data in numbers. And my code is this:
import pandas as pd
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
data = pd.read_csv('data.csv')
x = data['Reviews']
y = data['Ratings']
le = LabelEncoder()
x_encoded = le.fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(x_encoded,y, test_size = 0.2)
x_train = x_train.reshape(-1,1)
x_test = x_test.reshape(-1,1)
clf = RandomForestClassifier(n_estimators=100)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)
Then I printed out the accuracy like below:
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
And here's the output:
Accuracy: 0.5975
I have read that Random Forests have high accuracy because of the number of decision trees participating in the process. But I think the accuracy here is much lower than it should be. I have looked for similar questions on Stack Overflow, but I couldn't find a solution for my problem.
Is there a problem in my code using the Random Forest library? Or are there exceptional cases to be aware of when using Random Forest?
It is not a problem regarding Random Forests or the library, it is rather a problem how you transform your text input into a feature or feature vector.
What LabelEncoder does is this: given some labels like ["a", "b", "c"], it transforms those labels into numeric values between 0 and n-1, with n being the number of distinct input labels. However, I assume Reviews contains free text and not pure labels, so to say. This means that all your reviews (if not 100% identical) are transformed into different labels. Eventually, this leads to your classifier doing essentially random things given that input. You therefore need something different to transform your textual input into a numeric input that Random Forests can work on.
As a simple start, you can try something like TF-IDF or a simple count vectorizer. Those are available from sklearn: https://scikit-learn.org/stable/modules/feature_extraction.html (section 6.2.3, Text feature extraction). There are more sophisticated ways of transforming texts into numeric vectors, but that should be a good start for you to understand what has to happen conceptually.
A last important note is that you should fit those vectorizers only on the training set and not on the full dataset. Otherwise, you might leak information from the evaluation/test data into training. A good way of doing this is to build a sklearn pipeline that consists of a feature transformation step and the classifier.
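A minimal sketch of such a pipeline is shown below. The toy reviews and ratings are made up for illustration and stand in for data['Reviews'] and data['Ratings']. TfidfVectorizer is one possible choice for the feature step, and CountVectorizer could be swapped in the same way.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data standing in for the Reviews and Ratings columns.
reviews = [
    "great product, works perfectly",
    "terrible quality, broke after a day",
    "really happy with this purchase",
    "waste of money, very disappointed",
    "excellent value and fast delivery",
    "awful experience, would not recommend",
    "love it, exactly as described",
    "poor build and bad support",
]
ratings = [1, 0, 1, 0, 1, 0, 1, 0]

x_train, x_test, y_train, y_test = train_test_split(
    reviews, ratings, test_size=0.25, random_state=0, stratify=ratings
)

# The vectorizer is fitted inside the pipeline, so it only ever sees
# the training texts when model.fit is called.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(x_train, y_train)
y_pred = model.predict(x_test)
print(y_pred)
```

Because the vectorizer lives inside the pipeline, calling model.predict on new text applies the vocabulary learned from the training split and never refits it.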

Multi labeled image classification with imbalanced data, how to split it?

I am working on multi-label image classification. This is my data frame:
[UPDATED]
As you can see, the images are labeled with 26 features. "1" means the feature exists, "0" means it does not.
My problem is that many of the labels are imbalanced. For example:
[1] train_df.value_counts('Eyeglasses')
Output:
Eyeglasses
0 54735
1 1265
dtype: int64
[2] train_df.value_counts('Double_Chin')
Output:
Double_Chin
0 55464
1 536
dtype: int64
How can I split it into training and validation data so that both are balanced?
[UPDATE]
I tried to
from imblearn.over_sampling import SMOTE
smote = SMOTE()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
ValueError: Imbalanced-learn currently supports binary, multiclass and
binarized encoded multiclasss targets. Multilabel and multioutput
targets are not supported.
Your question mixes two concepts: splitting a multi-class, multi-label image dataset into subsets which have proportional representation, and resampling methods to deal with class imbalance. I am going to focus on just the splitting part of the problem, since that's what the title is about.
I would use a stratified shuffle split to make sure that each subset has equal representation. Here's a handy visual for stratified sampling from Wikipedia:
For this I recommend skmultilearn's IterativeStratification method. It supports multi-label datasets.
from skmultilearn.model_selection.iterative_stratification import IterativeStratification
stratifier = IterativeStratification(
n_splits=2, order=2, sample_distribution_per_fold=[1.0 - train_fraction, train_fraction],
)
# this class is a generator that produces k-folds. we just want to iterate it once to make a single static split
# NOTE: needs to be computed on hard labels.
train_indexes, everything_else_indexes = next(stratifier.split(X=img_urls, y=labels))
# s3url array shape (N_samp,)
x_train, x_else = img_urls[train_indexes], img_urls[everything_else_indexes]
# labels array shape (N_samp, n_classes)
Y_train, Y_else = labels[train_indexes, :], labels[everything_else_indexes, :]
I wrote a more complete solution, including unit tests, in a blog post.
One downside with skmultilearn is that it is not very well maintained and has some broken functionality. I documented a few of these sharp corners and gotchas in my blog post. Also note that this stratification procedure is painfully slow when you get to several million images because the stratifier only uses a single CPU.
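Whichever stratifier you end up using, it can be worth checking the result. The sketch below is only a sanity check, not part of skmultilearn: label_prevalence is a hypothetical helper, the label matrix is random, and the two index arrays are placeholders for whatever indexes your split produces.

```python
import numpy as np


def label_prevalence(labels, indexes):
    """Return the fraction of positive samples per label among the selected rows."""
    return labels[indexes].mean(axis=0)


rng = np.random.default_rng(0)

# Hypothetical binary label matrix: 1000 samples, 3 labels of decreasing frequency.
labels = (rng.random((1000, 3)) < [0.5, 0.1, 0.02]).astype(int)

# Placeholder index arrays; in practice these come from the stratifier.
train_indexes = np.arange(0, 800)
everything_else_indexes = np.arange(800, 1000)

train_prev = label_prevalence(labels, train_indexes)
else_prev = label_prevalence(labels, everything_else_indexes)

# Small gaps mean both subsets represent each label in similar proportion.
print(np.abs(train_prev - else_prev))
```

Rare labels such as Double_Chin in the question are the ones most likely to show a large gap when the split is not stratified.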

How to deal with dataset that contains both discrete and continuous data

I was training a model with 8 features to predict the probability of a room being sold.
Region: The region the room belongs to (an integer between 1 and 10)
Date: The date of stay (an integer between 1 and 365; here we consider only one-day requests)
Weekday: Day of the week (an integer between 1 and 7)
Apartment: Whether the room is a whole apartment (1) or just a room (0)
#beds: The number of beds in the room (an integer between 1 and 4)
Review: Average review of the seller (a continuous variable between 1 and 5)
Pic Quality: Quality of the picture of the room (a continuous variable between 0 and 1)
Price: The historic posted price of the room (a continuous variable)
Accept: Whether this post gets accepted (someone took it, 1) or not (0) in the end
Column Accept is the "y". Hence, this is a binary classification.
We plotted the data, and some of the features were skewed, so we applied a power transform.
We tried a neural network, ExtraTrees, XGBoost, Gradient boost, Random forest. They all gave about 0.77 AUC. However, when we tried them on the test set, the AUC dropped to 0.55 with a precision of 27%.
I am not sure what went wrong, but my thinking was that the reason may be the mixing of discrete and continuous data, especially since some of the features are either 0 or 1.
Can anyone help?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
import warnings
warnings.filterwarnings('ignore')
df_train = pd.read_csv('case2_training.csv')
X, y = df_train.iloc[:, 1:-1], df_train.iloc[:, -1]
y = y.astype(np.float32)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer()
transform_list = ['Pic Quality', 'Review', 'Price']
X_train[transform_list] = pt.fit_transform(X_train[transform_list])
X_test[transform_list] = pt.transform(X_test[transform_list])
for i in transform_list:
    df = X_train[i]
    ax = df.plot.hist()
    ax.set_title(i)
    plt.show()
# Normalization
sc = MinMaxScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
X_train = X_train.astype(np.float32)
X_test = X_test.astype(np.float32)
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(random_state=123, n_estimators=50)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
from torch import nn
from skorch import NeuralNetBinaryClassifier
import torch
model = nn.Sequential(
    nn.Linear(8,64),
    nn.BatchNorm1d(64),
    nn.GELU(),
    nn.Linear(64,32),
    nn.BatchNorm1d(32),
    nn.GELU(),
    nn.Linear(32,16),
    nn.BatchNorm1d(16),
    nn.GELU(),
    nn.Linear(16,1),
    # nn.Sigmoid()
)
net = NeuralNetBinaryClassifier(
    model,
    max_epochs=100,
    lr=0.1,
    optimizer=torch.optim.Adam,
    # Shuffle training data on each epoch
    iterator_train__shuffle=True,
)
net.fit(X_train, y_train)
from xgboost.sklearn import XGBClassifier
clf = XGBClassifier(silent=0,
learning_rate=0.01,
min_child_weight=1,
max_depth=6,
objective='binary:logistic',
n_estimators=500,
seed=1000)
clf.fit(X_train,y_train)
yhat = clf.predict_proba(X_test)
# AUC metric
train_accuracy = roc_auc_score(y_test, yhat[:,-1])
print("AUC",train_accuracy)
Here is an attachment of a screenshot of the data.
This is the fundamental first step of Data Analytics. You need to do two things here:
Data understanding - do the data fields in their current format make sense (data types, value ranges, etc.)?
Data preparation - what should you do to update these data fields before passing them to your model? Which inputs do you think will be useful for your model, and which will provide little benefit? Are there outliers you need to consider or handle?
A good book if you're starting in the field of data analytics is Fundamentals of Machine Learning for Predictive Data Analytics (I have no affiliation with this book).
Looking at your dataset, there are a couple of things you could try to see how they influence your prediction results:
Unless the region numbers actually rank the regions by importance or value, I would change Region to a one-hot encoded feature; you can do this in sklearn. Otherwise you run the risk of your model thinking that regions with a higher number (say 10) are more important than regions with a lower number (say 1).
You could attempt to normalise certain fields if their values are much larger than those of your other data fields (see "Why Data Normalization is necessary for Machine Learning models").
Consider looking at the Kaggle competition House Prices: Advanced Regression Techniques. It's doing a similar thing to what you're attempting to do, and it might have some pointers for how you should approach the problem in the Notebooks and Discussion tabs.
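To make the encoding and normalisation suggestions concrete, here is a small sketch using sklearn's ColumnTransformer. The rows are made up, only three of the question's columns are used, and the transformer names "region" and "scale" are arbitrary labels chosen for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical rows that reuse column names from the question.
X = pd.DataFrame({
    "Region": [1, 3, 10, 3],
    "Price": [120.0, 80.0, 200.0, 95.0],
    "Review": [4.5, 3.9, 4.8, 4.1],
})

pre = ColumnTransformer(
    [
        # One indicator column per region, so no ordering is implied.
        ("region", OneHotEncoder(handle_unknown="ignore"), ["Region"]),
        # Rescale the continuous columns to the [0, 1] range.
        ("scale", MinMaxScaler(), ["Price", "Review"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

X_enc = pre.fit_transform(X)
print(X_enc)
```

As with the power transform in the question, the transformer should be fitted on the training split only and then applied to the test split with pre.transform.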
Without deeply exploring all the data you are using it is hard to say for certain what is causing the drop in accuracy (or AUC) when moving from your training set to the testing set. It is unlikely to be caused by the mixed discrete/continuous data.
The drop just suggests that your models are over-fitting to your training data (and therefore not transferring well). This could be caused by too many learned parameters (given the amount of data you have)--more often a problem with neural networks than with some of the other methods you mentioned. Or, the problem could be with the way the data was split into training/testing. If the distribution of the data has a significant difference (that's maybe not obvious) then you wouldn't expect the testing performance to be as good. If it were me, I'd look carefully at how the data was split into training/testing (assuming you have a reasonably large set of data). You may try repeating your experiments with a number of random training/testing splits (search k-fold cross validation if you're not familiar with it).
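A sketch of that repeated-split experiment is below, assuming a synthetic dataset from make_classification in place of case2_training.csv. The sample count and class weights are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real data: 8 features, imbalanced binary target.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=123
)

# Five different train/validation splits, each preserving the class ratio.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)
clf = RandomForestClassifier(n_estimators=50, random_state=123)

scores = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print("AUC per fold:", scores)
print("mean:", scores.mean(), "std:", scores.std())
```

If the per-fold AUC values are stable but the separate test set still scores much lower, that would point toward a difference between the training and test distributions and away from plain over-fitting.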
Your model is overfit. Try to build a simple model first and use smaller hyperparameter values (for example, a lower maximum tree depth). Also note that for tree-based classifiers, scaling does not have any impact on the model.

How to weigh data points with sklearn training algorithms

I am looking to train either a random forest or gradient boosting algorithm using sklearn. The data I have is structured in a way that it has a variable weight for each data point that corresponds to the amount of times that data point occurs in the dataset. Is there a way to give sklearn this weight during the training process, or do I need to expand my dataset to a non-weighted version that has duplicate data points each represented individually?
You can definitely specify the weights while training these classifiers in scikit-learn. Specifically, this happens during the fit step. Here is an example using RandomForestClassifier but the same goes also for GradientBoostingClassifier:
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np
data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 42)
Here I define some arbitrary weights just for the sake of the example:
weights = np.random.choice([1,2],len(y_train))
And then you can fit your model with these weights:
rfc = RandomForestClassifier(n_estimators = 20, random_state = 42)
rfc.fit(X_train,y_train, sample_weight = weights)
You can then evaluate your model on your test data.
Now, to your last point, you could in this example resample your training set according to the weights by duplication. But in most real world examples, this could end up being very tedious because
you would need to make sure all your weights are integers to perform duplication
you would have to uselessly multiply the size of your data, which is memory-consuming and is most likely going to slow down the training procedure
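For the situation described in the question, where each weight is an occurrence count, the two approaches can be sketched side by side. The data and counts below are randomly generated for illustration only.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)

# Hypothetical deduplicated dataset: each row carries an occurrence count.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
counts = rng.integers(1, 5, size=200)  # each row occurs 1 to 4 times

# Weighted fit: pass the counts directly, no duplication needed.
gbc = GradientBoostingClassifier(n_estimators=20, random_state=42)
gbc.fit(X, y, sample_weight=counts)

# The alternative: expand the data by repeating each row count times.
X_expanded = np.repeat(X, counts, axis=0)
y_expanded = np.repeat(y, counts)

print(X.shape, "->", X_expanded.shape)
```

The expanded arrays grow with the sum of the counts, which is the memory cost mentioned above. The weighted fit keeps the original 200 rows.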

python sklearn Logistic regression predicts all 0

I've built a logistic regression for car loans with "is the loan in default, yes or no" as the binary dependent variable. I am using around 20 independent variables, and the data set contains 3327 records.
I split the underlying data into a training set and a test set. However, after I fit the model on the training data and ask it to predict for the test data, I get an output of all "0", when there should be some "1" outputs in there, given that roughly 12% of the training set has a "1" for the binary default variable.
I've looked at the test and training sets, which all look fine before and after splitting (no missing values, categorical variables are dummies, and the training/test subsets correctly pick records at random, so no breakdown there as far as I can see).
Interestingly, "predict_proba" shows that the predicted probability of a "0" is always high for each output element (0.7-0.9). I'm not sure how best to correct this, as I'd rather leave the default threshold at 0.5, but I'm not sure how to clear up this mess.
Is it simply a case of needing more data given the number of independent variables, or am I missing something or doing something wrong?
Thanks!
import pandas as pd
import numpy as np
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.rc("font", size=14)
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import seaborn as sns
sns.set(style="white")
sns.set(style="whitegrid", color_codes=True)
#open the file
data = pd.read_csv(r"log reg test Lending club 2007-2011 and 2014 car only no dummy trap.csv")
print(data.shape)
##print(list(data.columns))
print(data['Distressed'].value_counts()) ## check number of defaulted car loans is binary
sns.countplot(x='Distressed', data=data, palette='hls')
plt.show() ## confirm dependent variable is binary
##basic numerical analysis of variables to check feasibility for model
## we will need to create dummy variables for strings
#print(data.groupby('Distressed').mean()) ##numerical variable means
#print(data.groupby('grade').mean()) ## string variable means
#print(data.groupby('sub_grade').mean())
#print(data.groupby('emp_length').mean())
#print(data.groupby('home_ownership').mean())
##testing for nulls in dataset
print('Table showing cumulative number of missing data points', data.isnull().sum())
scrub_data=data.drop(['mths_since_last_delinq'],axis=1) ## this variable is not statistically significant
print('Here is the sample showing no missing data')
print(scrub_data.isnull().sum()) ## removed records of missing info, sample still sufficiently large
#scrub_data['intercept']=0
print(list(scrub_data.columns))
print(scrub_data.head())
##convert categorical variables to dummies completed in csv file
## Agrade and Own dummies removed to avoid dummy variable trap and are treated as the base case here
X=scrub_data.iloc[:,[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,22]].values
y=scrub_data.iloc[:,0].values
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=.3, random_state=0)
print('Here are the X components', X)
print('Here are the y components', y)
print('Here are the X values of the training', X_train)
print('Here are the y train values', y_train)
print('Here are the y test values', y_test)
model=LogisticRegression()
model.fit(X_train,y_train) ##Model is learning the relationship between X_train and y_train
y1_pred=model.predict(X_train)
print('y predict of train data', y1_pred)
print('Here is the Model Score', model.score(X_train,y_train)) ##check accuracy of training set
print('What percentage defaulted', y_train.mean()) ##what percentage defaulted
print('What percentage of test set defaulted', y_test.mean()) ##what percentage defaulted
print('X test values', X_test) ## check test subset values
y_pred=model.predict(X_test)
probs=model.predict_proba(X_test)
