My dataset consists of 10 features (10 columns) as input, and the last 3 columns are 3 different outputs. If I use one column as the output, for example y = newDf.iloc[:, 10].values, it works; but if I use all 3 columns, pipe_lr.fit raises an error: y should be a 1d array, got an array of shape (852, 3) instead.
How can I pass y?
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X = newDf.iloc[:, 0:10].values
y = newDf.iloc[:, 10:13].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
pipe_lr = make_pipeline(StandardScaler(),
                        PCA(n_components=2),
                        LogisticRegression(random_state=1, solver='lbfgs'))
pipe_lr.fit(X_train, y_train)
The pipeline itself does not care about the format of y; it just hands it over to each step. In your case it is the LogisticRegression that complains, and it indeed is not set up for multi-label classification on its own. You can handle this with the MultiOutputClassifier wrapper:
from sklearn.multioutput import MultiOutputClassifier

pipe_lr = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    MultiOutputClassifier(LogisticRegression(random_state=1, solver='lbfgs'))
)
(There is also a MultiOutputRegressor, and more complicated things like ClassifierChain and RegressorChain; see the User Guide. However, there is not, to my knowledge, a built-in way to mix and match regression and classification tasks.)
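For completeness, a minimal end-to-end sketch of the wrapped pipeline (with synthetic stand-in data, just to show that a 2-D y is accepted; the shapes mirror the question):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# stand-ins for the (852, 10) features and (852, 3) binary outputs
X = np.random.rand(852, 10)
y = np.random.randint(0, 2, size=(852, 3))

pipe_lr = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    MultiOutputClassifier(LogisticRegression(random_state=1, solver='lbfgs'))
)
pipe_lr.fit(X, y)                     # no "y should be a 1d array" error
print(pipe_lr.predict(X[:5]).shape)   # (5, 3): one prediction per output column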
Simply put: no.
What you want is called multi-label learning, and LogisticRegression does not support it directly.
You should train three models, one per label.
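A minimal sketch of that suggestion (reusing X_train, y_train, and X_test from the question, where y has shape (n_samples, 3)); note this is essentially what the MultiOutputClassifier wrapper above does internally:

import numpy as np
from sklearn.linear_model import LogisticRegression

# one independent classifier per output column
models = []
for i in range(y_train.shape[1]):
    clf = LogisticRegression(random_state=1, solver='lbfgs')
    clf.fit(X_train, y_train[:, i])
    models.append(clf)

# stack the column-wise predictions back into an (n_samples, 3) array
y_pred = np.column_stack([m.predict(X_test) for m in models])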
I have trained multiclass classification models on my training and test sets and achieved good results with SVC. Now I want to use the model to make predictions on my entire dataframe, but I get the following error: ValueError: X has 36976 features, but SVC is expecting 8989 features as input.
My dataframe has two columns: one with the categories (which I manually labeled for around 1/5 of the dataframe) and a text column with all the documents (including those that have not been labeled).
import pandas as pd

data = {'categories': ['1', 'NaN', '3', 'NaN'],
        'documents': ['Paragraph 1.\nParagraph 2.\nParagraph 3.',
                      'Paragraph 1.\nParagraph 2.',
                      'Paragraph 1.\nParagraph 2.\nParagraph 3.\nParagraph 4.',
                      'Paragraph 1.\nParagraph 2.']}
df = pd.DataFrame(data)
First, I drop the rows with NaN values in the 'categories' column. Then I create the document-term matrix, define y, and split into training and test sets.
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

tf = CountVectorizer(tokenizer=word_tokenize)
X = tf.fit_transform(df['documents'])
y = df['categories']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Second, I run the SVC model and get good results:
from sklearn import metrics
from sklearn.svm import SVC
svm = SVC(C=0.1, class_weight='balanced', kernel='linear', probability=True)
model = svm.fit(X_train, y_train)
print('accuracy:', model.score(X_test, y_test))
y_pred = model.predict(X_test)
print(metrics.classification_report(y_test, y_pred))
Finally, I try to apply the SVC model to predict the categories of the entire 'documents' column of my dataframe. To do so, I create the document-term matrix of the entire column and then apply the model:
tf_entire_df = CountVectorizer(tokenizer=word_tokenize)
X_entire_df = tf_entire_df.fit_transform(df['documents'])
y_pred_entire_df = model.predict(X_entire_df)
But then I get the error that my X_entire_df has more features than the SVC model expects as input. I imagine this is because I am now applying the model to the whole 'documents' column, but I do not know how to fix this.
I would appreciate your help!
These issues usually come from feeding the model unknown or unseen data (more or fewer features than the ones used for training). Here, the second CountVectorizer is fitted from scratch on the full 'documents' column, so it builds a different (larger) vocabulary than the one the SVC was trained on.
I would strongly suggest using sklearn.pipeline to create a pipeline that includes the preprocessing (CountVectorizer) and the machine learning model (SVC) in a single object.
From experience, this helps a lot to avoid tedious fitting issues with complex preprocessing.
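A sketch of that suggestion (df, word_tokenize, and the SVC settings come from the question; df_labeled is an assumed name for the labeled subset). Fit the pipeline once on the labeled rows, then call predict on raw text so the vectorizer's training vocabulary is reused instead of refitted:

from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

text_clf = Pipeline([
    ('tf', CountVectorizer(tokenizer=word_tokenize)),
    ('svm', SVC(C=0.1, class_weight='balanced', kernel='linear', probability=True)),
])

# fit on the labeled rows only; the vectorizer is fitted exactly once here
df_labeled = df.dropna(subset=['categories'])
text_clf.fit(df_labeled['documents'], df_labeled['categories'])

# predicting on the entire column reuses the training vocabulary, so the
# feature count always matches what the SVC saw during training
y_pred_entire_df = text_clf.predict(df['documents'])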
I am trying to build a distinct classifier for each dataset in a loop. Here's the pseudocode of my program:
for dataset in total_data:
    X_train, X_test, y_train, y_test = Somefunction(dataset)
    model = XGBClassifier()
    model.fit(X_train, y_train)
Here, the labels (y_train) take one of the values [0, 1, 2]. However, I have found that in some datasets y_train only contains 1 and 2. Because XGBoost requires labels to be consecutive integers starting at 0, this yields an error.
Is there any elegant way to tell XGBoost that there are 3 kinds of labels even if only two exist in the train set?
Update: I tried using num_class=3 as an argument to XGBClassifier, but it yields another kind of error: Invalid shape of labels.
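For reference, one commonly suggested workaround (my addition, not from the original post) is to re-encode the labels per dataset and map predictions back afterwards. It does not tell XGBoost about the missing class, but it avoids the error because each model only ever sees consecutive labels starting at 0:

from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)   # e.g. [1, 2] -> [0, 1]

model = XGBClassifier()
model.fit(X_train, y_train_enc)

# map predictions back to the original label space
y_pred = le.inverse_transform(model.predict(X_test))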
Hi everyone.
I am doing binary classification on a huge dataset (190 columns, 500K records). The target values are 0 and 1. However, when I oversample with SMOTE, new target values appear in the y vector (0, 1, 2 for example). I do not know how to avoid that. I would like to know how to limit the oversampled target values to 0 and 1.
This is what I am doing:
from imblearn.over_sampling import SMOTE

over = SMOTE(sampling_strategy='minority')
X, y = over.fit_resample(X, y)
y's vector type is int64.
Also, I am modeling an ANN using Keras. The data available to me is split into three different datasets: train, val and test, for training, validation and testing purposes. I wonder which one should be OVERsampled (or UNDERsampled), or whether I should OVERsample (or UNDERsample) BEFORE splitting.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.20)
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.15)
WARNING: my model will be evaluated later on using a much larger dataset (1000K records) which is the REAL test dataset from the company I am working for, and whose target values are UNKNOWN for me and my model.
Thanx
Johnny
Since you have 190 columns and 500K records, SMOTE will not work well: SMOTE is essentially an interpolation technique, and it may suffer from the curse of dimensionality. Oversampling and undersampling approaches also have some limitations. I would prefer to use class weights. Also, make sure that any resampling or other preprocessing is fitted on the training data only, never on the validation or test data. Here is a paper outlining such a framework: https://www.sciencedirect.com/science/article/pii/S2666827022000585
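A minimal sketch of the class-weight approach with Keras (the split variables come from the question; model stands for your compiled Keras ANN, and epochs/batch_size are placeholder values):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# weights inversely proportional to class frequencies, computed on the
# TRAINING split only
classes = np.unique(y_train)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))

# model is the compiled Keras ANN; no resampling of X_train/y_train needed
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=20, batch_size=256,
          class_weight=class_weight)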
I am trying to use cross_val_score to evaluate my regression model (with PolynomialFeatures(degree=2)). From different blog posts I gathered that I should use cross_val_score with the original X and y values, not with X_train and y_train.
r_squareds = cross_val_score(pipe, X, y, cv=10)
r_squareds
>>> array([ 0.74285583, 0.78710331, -1.67690578, 0.68890253, 0.63120873,
0.74753825, 0.13937611, 0.18794756, -0.12916661, 0.29576638])
which indicates my model doesn't perform really well, with a mean r2 of only 0.241. Is this the correct interpretation?
However, I came across a Kaggle notebook working on the same data where the author performed cross_val_score on X_train and y_train. I gave this a try and the average r2 was better.
r_squareds = cross_val_score(pipe, X_train, y_train, cv=10)
r_squareds.mean()
>>> 0.673
Is this supposed to be a problem?
Here is the code for my model:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X = df[['CHAS', 'RM', 'LSTAT']]
y = df['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

pipe = Pipeline(
    steps=[('poly_feature', PolynomialFeatures(degree=2)),
           ('model', LinearRegression())]
)

## fit the model
pipe.fit(X_train, y_train)
Your first interpretation is correct. The first cross_val_score trains 10 models, each with 90% of your data as train and the remaining 10% as a validation set. We can see from these results that the estimator's r-squared variance is quite high: sometimes the model performs even worse than a straight line.
From this result we can safely say that the model is not performing well on this dataset.
It is possible that the result obtained by running cross_val_score on only the train set is higher, but that score is most likely not representative of your model's performance, as the dataset might be too small to capture all of its variance. (The train set for the second cross_val_score is only 54% of your dataset: 90% of 60% of the original data.)
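To make the "high variance" point concrete, a quick check on the scores reported above:

import numpy as np

scores = np.array([0.74285583, 0.78710331, -1.67690578, 0.68890253, 0.63120873,
                   0.74753825, 0.13937611, 0.18794756, -0.12916661, 0.29576638])
print(scores.mean())  # ~0.241
print(scores.std())   # ~0.71, a spread almost three times the mean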
I am working on multi-label image classification. This is my data frame:
[UPDATED]
As you can see, the images are labeled with 26 features: "1" means the attribute is present, "0" means it is absent.
My problem is that many of the labels are imbalanced. For example:
[1] train_df.value_counts('Eyeglasses')
Output:
Eyeglasses
0 54735
1 1265
dtype: int64
[2] train_df.value_counts('Double_Chin')
Output:
Double_Chin
0 55464
1 536
dtype: int64
How can I split it into balanced training and validation sets?
[UPDATE]
I tried:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

smote = SMOTE()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)
ValueError: Imbalanced-learn currently supports binary, multiclass and
binarized encoded multiclasss targets. Multilabel and multioutput
targets are not supported.
Your question mixes two concepts: splitting a multi-class, multi-label image dataset into subsets which have proportional representation, and resampling methods to deal with class imbalance. I am going to focus on just the splitting part of the problem, since that's what the title is about.
I would use a stratified shuffle split to make sure that each subset has proportional representation. (Wikipedia has a handy visual illustrating stratified sampling.)
For this I recommend skmultilearn's IterativeStratification method. It supports multi-label datasets.
from skmultilearn.model_selection.iterative_stratification import IterativeStratification

train_fraction = 0.8  # assumed value; fraction of samples to put in the train split

stratifier = IterativeStratification(
    n_splits=2, order=2,
    sample_distribution_per_fold=[1.0 - train_fraction, train_fraction],
)

# This class is a generator that produces k folds; we just want to iterate it
# once to make a single static split.
# NOTE: needs to be computed on hard (binary) labels.
train_indexes, everything_else_indexes = next(stratifier.split(X=img_urls, y=labels))

# img_urls array, shape (N_samp,)
x_train, x_else = img_urls[train_indexes], img_urls[everything_else_indexes]
# labels array, shape (N_samp, n_classes)
Y_train, Y_else = labels[train_indexes, :], labels[everything_else_indexes, :]
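A quick sanity check (assuming labels is a binary numpy array, as above): if the stratification worked, the per-label positive rates should come out close to each other in the two splits.

import numpy as np

# fraction of positive samples for each of the n_classes labels, per split
print(np.asarray(Y_train).mean(axis=0))
print(np.asarray(Y_else).mean(axis=0))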
I wrote a more complete solution, including unit tests, in a blog post.
One downside of skmultilearn is that it is not very well maintained and has some broken functionality. I documented a few of these sharp corners and gotchas in my blog post. Also note that this stratification procedure is painfully slow once you get to several million images, because the stratifier only uses a single CPU.