I'm doing a sentiment analysis project on a Twitter dataset. I used TF-IDF feature extraction and a logistic regression model for classification. So far I've trained the model with the following:
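def get_tfidf_features(train_fit, ngrams=(1,1)):
    # ngram_range must be passed by keyword; the first positional argument is `input`
    vector = TfidfVectorizer(ngram_range=ngrams, sublinear_tf=True)
    vector.fit(train_fit)
    return vector
tf_vector = get_tfidf_features(df['text'])
X = tf_vector.transform(df['text'])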
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.01, random_state = 42)
LR_model = LogisticRegression(solver='lbfgs')
LR_model.fit(X_train, y_train)
y_predict_lr = LR_model.predict(X_test)
This logistic regression model was trained on a dataset of about 1.5 million tweets. I have a set of about 18,000 tweets and I want to use this model to predict the sentiment scores for the tweets in this new dataset. I'm at a loss as to how to actually apply this trained model to new data. The head of this new dataframe df_chi looks like this:
which has shape (18393, 7). I want to take the trained model I already have, apply it to the text column, and create a new sentiment column with those predicted scores in the df_chi dataframe. (Note: the image doesn't show cleaned text, but I'll do that.)
I'm an ML noob and I've never taken a trained model and applied it to new data. My confusion starts with extracting features from the df_chi text with TF-IDF. I attempted to do this (total guess):
tf_vector = get_tfidf_features(df_chi['text'])
X = tf_vector.transform(df_chi['text'])
df_chi['sentiment'] = LR_model.predict(X)
which gives the following ValueError:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-188-0cf1a4f34c8b> in <module>
1 tf_vector = get_tfidf_features(df_chi['text'])
2 X = tf_vector.transform(df_chi['text'])
----> 3 df_chi['sentiment'] = LR_model.predict(X)
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/_base.py in predict(self, X)
291 Predicted class label per sample.
292 """
--> 293 scores = self.decision_function(X)
294 if len(scores.shape) == 1:
295 indices = (scores > 0).astype(np.int)
~/opt/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/_base.py in decision_function(self, X)
271 if X.shape[1] != n_features:
272 raise ValueError("X has %d features per sample; expecting %d"
--> 273 % (X.shape[1], n_features))
274
275 scores = safe_sparse_dot(X, self.coef_.T,
ValueError: X has 22806 features per sample; expecting 265054
Pretty sure my whole approach to applying the trained model on the new data is incorrect. What's the right way to do this?
Noodled around with this and came up with the following solution:
tfidf = TfidfVectorizer()
X_chi = tfidf.fit_transform(df_chi['text'])
X1 = pd.DataFrame.sparse.from_spmatrix(X)
X_chi1 = pd.DataFrame.sparse.from_spmatrix(X_chi)
not_existing_cols = [c for c in X1.columns.tolist() if c not in X_chi1]
X_chi1 = X_chi1.reindex(X_chi1.columns.tolist() + not_existing_cols, axis=1)
#X_chi.fillna(0, inplace=True)
X_chi1 = X_chi1[X1.columns.tolist()]
a = LR_model.predict(X_chi1)
df_chi['sentiment'] = a
Solution inspired by the question "Logistic regression: X has 667 features per sample; expecting 74869".
Looks a little clumsy, though. If it works it works, I guess. Though I suspect there might be a better way to do this, no?
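A simpler route, sketched here on made-up toy data, is to keep the vectorizer that was fitted on the training text and only call transform on the new text. The new matrix then has the same columns, in the same order and with the same meaning, as the one the model was trained on. The small df and df_chi frames below are hypothetical stand-ins for the real ones. Note that padding or reindexing the columns of a separately fitted vectorizer can make the shapes match, but column 0 of one vocabulary is generally not the same word as column 0 of the other.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins (assumptions) for the training frame and the new df_chi frame.
df = pd.DataFrame({
    "text": ["love this", "great day", "so happy",
             "hate this", "awful day", "so sad"],
    "sentiment": [1, 1, 1, 0, 0, 0],
})
df_chi = pd.DataFrame({"text": ["happy great day", "sad awful weather"]})

# Fit the vectorizer once, on the training text only.
tf_vector = TfidfVectorizer(ngram_range=(1, 1), sublinear_tf=True)
X = tf_vector.fit_transform(df["text"])

LR_model = LogisticRegression(solver="lbfgs")
LR_model.fit(X, df["sentiment"])

# Reuse the same fitted vectorizer on the new text: transform only, no fit.
X_chi = tf_vector.transform(df_chi["text"])
df_chi["sentiment"] = LR_model.predict(X_chi)
```

Words in df_chi that the training vocabulary never saw (here "weather") are simply ignored by transform.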
I've seen quite a lot of conflicting views on whether one-hot encoding (dummy variable creation) should be done before or after the training/test split.
Responses seem to state that one-hot encoding before leads to "data leakage".
This example states it's industry norm to do one-hot encoding on the entire data before training/test split:
Industry Example
This example from kaggle states that it should be done after the training/test split to avoid data leakage:
kaggle response - after split
My question is the following:
Do we perform one-hot encoding before or after the train/test split?
Where is the data leakage occurring in the following example?
If we take the following example, we have two columns - web_views and website (non-ordinal categorical feature) (assuming we are one-hot encoding across the entire column, not dropping any dummies)
Our dataframe:
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
df = pd.DataFrame({'web_views': [100,200,300,400],
                   'website': ['Youtube','Facebook','Instagram', 'Google']})
Scenario 1: One-Hot Encoding/Dummy Variables before splitting into Train/Test:
np.random.seed(123)
df_before_split = pd.concat([df.drop('website', axis = 1), pd.get_dummies(df['website'])], axis=1)
# create your X and y dataframes
X_before_split = df_before_split.drop('web_views', axis = 1)
y_before_split = df_before_split['web_views']
# perform train test split
X_train_before_split, X_test_before_split, y_train_before_split, y_test_before_split = train_test_split(X_before_split, y_before_split, test_size = 0.20)
Now viewing the dataframes we have:
# view X train dataset (this is encoding before split)
X_train_before_split
and then for test
# view X test dataset (this is encoding before split)
X_test_before_split
Scenario 2: One-Hot Encoding/Dummy Variables AFTER splitting into Train/Test:
# Perform One Hot encoding after the train/test split instead
X = df.drop('web_views', axis = 1)
y = df['web_views']
# perform data split:
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
# perform one hot encoding on the train and test datasets:
X_train = pd.concat([X_train.drop('website', axis = 1), pd.get_dummies(X_train['website'])], axis=1)
X_test = pd.concat([X_test.drop('website', axis = 1), pd.get_dummies(X_test['website'])], axis=1)
Viewing the X_train and X_test dataframes:
# encode after train/test split - train dataframe
X_train
# encode after train/test split - test dataframe
X_test
Performing Linear Regression Modelling
Now that we have split our data to demonstrate we will create a simple linear model:
from sklearn.linear_model import LinearRegression
Before split linear model
regressor_before_split = LinearRegression()
regressor_before_split.fit(X_train_before_split, y_train_before_split)
y_pred_before_split = regressor_before_split.predict(X_test_before_split)
y_pred_before_split
y_pred_before_split returns a predicted value, as we would expect.
After split linear model
regressor_after_split = LinearRegression()
regressor_after_split.fit(X_train, y_train)
y_pred_after_split = regressor_after_split.predict(X_test)
y_pred_after_split
Error message from Scenario 2:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-92-c63978a198c8> in <module>()
2 regressor_after_split.fit(X_train, y_train)
3
----> 4 y_pred_after_split = regressor_after_split.predict(X_test)
5 y_pred_after_split
C:\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in predict(self, X)
254 Returns predicted values.
255 """
--> 256 return self._decision_function(X)
257
258 _preprocess_data = staticmethod(_preprocess_data)
C:\Anaconda3\lib\site-packages\sklearn\linear_model\base.py in _decision_function(self, X)
239 X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
240 return safe_sparse_dot(X, self.coef_.T,
--> 241 dense_output=True) + self.intercept_
242
243 def predict(self, X):
C:\Anaconda3\lib\site-packages\sklearn\utils\extmath.py in safe_sparse_dot(a, b, dense_output)
138 return ret
139 else:
--> 140 return np.dot(a, b)
141
142
<__array_function__ internals> in dot(*args, **kwargs)
ValueError: shapes (1,1) and (3,) not aligned: 1 (dim 1) != 3 (dim 0)
My thoughts:
Encoding with dummies before splitting ensures that the test data we pass in (e.g. X_test) to make predictions has the same shape as the training data the model was trained on, so the model knows how to predict values when it encounters these features. When encoding after splitting, X_test has only one feature to make predictions with, whereas X_train has 3 features.
Maybe I've introduced data leakage?
I'd be happy for someone to correct me if I've got things wrong or misinterpreted anything, but I'm stuck scratching my head over whether to encode before or after splitting!
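One way to get the same columns in train and test while still encoding after the split is to fit an encoder on the training rows and reuse it on the test rows. The sketch below assumes scikit-learn's OneHotEncoder with handle_unknown='ignore', which is a substitute for the pd.get_dummies calls above, not the approach used in the question. A category that appears only in the test set is encoded as an all-zero row rather than a new column.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'web_views': [100, 200, 300, 400],
                   'website': ['Youtube', 'Facebook', 'Instagram', 'Google']})

X = df[['website']]
y = df['web_views']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=123)

# Learn the category list from the training rows only.
encoder = OneHotEncoder(handle_unknown='ignore')
X_train_enc = encoder.fit_transform(X_train).toarray()

# Reuse the fitted encoder: same columns, unseen categories become all zeros.
X_test_enc = encoder.transform(X_test).toarray()

regressor = LinearRegression()
regressor.fit(X_train_enc, y_train)
y_pred = regressor.predict(X_test_enc)
```

In this four-row example every website is unique, so the single test row is necessarily an unseen category and the prediction falls back to the model's intercept. The point of the sketch is only that the shapes line up without the encoder ever seeing the test rows.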
y = df.pitch_name
y = np.array(y)
y = y.reshape(-1, 1)
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder()
y = ord_enc.fit_transform(y.reshape(-1, 1))
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=12345
)
knn_model = KNeighborsRegressor(n_neighbors=3)
knn_model.fit(X_train, y_train)
knn_model.predict([X_test[0]])
X contains only float values and y contains only strings. If I use OrdinalEncoder and predict with the model, it works, but the result I get is sometimes not a whole number (e.g. 6.3333) when I want the exact category.
Whenever I fit the model with the raw categorical values (strings), I see the error message TypeError: cannot perform reduce with flexible type. From the traceback, I suppose the error happens at line 238, where it computes y_pred = np.mean(_y[neigh_ind], axis=1), when it should be something like the median since y is a list of strings? Any help will be appreciated.
237 if weights is None:
--> 238 y_pred = np.mean(_y[neigh_ind], axis=1)
239 else:
240 y_pred = np.empty((X.shape[0], _y.shape[1]), dtype=np.float64)
Forgive me if I misunderstood something, but it sounds like you are trying to perform classification with a regression model, KNeighborsRegressor.
The scikit-learn user guide on nearest neighbors regression says:
Neighbors-based regression can be used in cases where the data labels are continuous rather than discrete variables. The label assigned to a query point is computed based on the mean of the labels of its nearest neighbors.
So, by using the OrdinalEncoder, you just encoded the categories as floats. The regressor then predicted the mean of the encoded labels of the nearest neighbors, which will generally not be an integer and thus not a category.
I suggest reading the scikit-learn documentation on KNeighborsClassifier to learn how to use it.
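As a rough illustration of the suggestion, here is a minimal sketch with KNeighborsClassifier on made-up data. The feature values and the two pitch labels are hypothetical. The classifier takes a majority vote among the neighbors' labels, so it accepts the string labels directly and returns one of them.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical stand-in data: two float features, string pitch labels.
rng = np.random.default_rng(12345)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])
y = np.array(['fastball'] * 20 + ['curveball'] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345)

knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train, y_train)        # string labels are accepted as-is

# Predict the category of the first test row (2-D input is required).
pred = knn_model.predict(X_test[:1])
```

No OrdinalEncoder is needed for y in this setup, and the prediction is always one of the labels seen during fit.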
I am using a program called GALPRO, which implements a random forest regression algorithm, to predict photometric redshift estimates. I input testing and training data: x_train (dimensions = [90,13]), x_test (dimensions = [10,13]), y_train (dimensions = [90,2]) and y_test (dimensions = [10,2]).
The code below shows how GALPRO does the random forest regression calculation:
model = RandomForestRegressor(**self.params)
model.fit(x_train, y_train)
I then make point estimate predictions using:
# Use the model to make predictions on new objects
y_pred = model.predict(x_test)
I am then trying to create error estimates using the forestci package random_forest_error:
y_error = fci.random_forest_error(model, x_train, x_test)
However I get an error:
ValueError Traceback (most recent call last)
/tmp/ipykernel_2626600/1096083143.py in <module>
----> 1 point_estimates = model.point_estimate(save_estimates=True, make_plots=False)
2 print(point_estimates)
/scratch/wiay/lara/galpro/galpro/model.py in point_estimate(self, save_estimates, make_plots)
158 # Use the model to make predictions on new objects
159 y_pred = self.model.predict(self.x_test)
--> 160 y_error = fci.random_forest_error(self.model, self.x_train, self.x_test)
161
162 # Update class variables
~/.local/lib/python3.7/site-packages/forestci/forestci.py in random_forest_error(forest, X_train, X_test, inbag, calibrate, memory_constrained, memory_limit)
279 n_trees = forest.n_estimators
280 V_IJ = _core_computation(
--> 281 X_train, X_test, inbag, pred_centered, n_trees, memory_constrained, memory_limit
282 )
283 V_IJ_unbiased = _bias_correction(V_IJ, inbag, pred_centered, n_trees)
~/.local/lib/python3.7/site-packages/forestci/forestci.py in _core_computation(X_train, X_test, inbag, pred_centered, n_trees, memory_constrained, memory_limit, test_mode)
135 """
136 if not memory_constrained:
--> 137 return np.sum((np.dot(inbag - 1, pred_centered.T) / n_trees) ** 2, 0)
138
139 if not memory_limit:
<__array_function__ internals> in dot(*args, **kwargs)
ValueError: shapes (90,100) and (100,10,2) not aligned: 100 (dim 1) != 10 (dim 1)
I'm not sure what this error means or why my dimensions are wrong as I am following a similar example. If anyone has any ideas please let me know!
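One plausible reading of the traceback, assuming the two columns of y_train make the forest a multi-output regressor: each tree then predicts an array of shape (10, 2), so the stacked per-tree predictions gain a third axis, and the matrix product in _core_computation receives a 3-D operand it was not written for. The NumPy sketch below only reproduces the shapes from the error message with random placeholder values. It does not call forestci, and the variable names simply mirror those in the traceback.

```python
import numpy as np

n_train, n_test, n_trees, n_outputs = 90, 10, 100, 2

# Random placeholders with the shapes seen in the traceback.
rng = np.random.default_rng(0)
inbag = rng.integers(0, 3, size=(n_train, n_trees))              # (90, 100)
pred_centered = rng.normal(size=(n_outputs, n_test, n_trees))   # multi-output

# (90, 100) . (100, 10, 2): np.dot pairs the last axis of the first operand
# with the second-to-last axis of the second one (100 vs 10), so it fails.
try:
    np.dot(inbag - 1, pred_centered.T)
    multi_output_ok = True
except ValueError:
    multi_output_ok = False

# With a single target the operand is 2-D and the product is well defined.
single = pred_centered[0]                                        # (10, 100)
v = np.sum((np.dot(inbag - 1, single.T) / n_trees) ** 2, 0)      # (10,)
```

If that reading is right, a possible workaround is to fit one forest per target column and request the error estimate for each separately. Whether the installed forestci version supports multi-output forests directly is something to check in its documentation.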
I am doing feature selection by first training LogisticRegression with L1 penalty and then using the reduced feature set to re-train the model using L2 penalty. Now, when I try to predict test data, the transform() done on it results in a different dimensional array. I am confused as to how to re-size the test data to be able to predict.
Appreciate any help. Thank you.
vectorizer = CountVectorizer()
output = vectorizer.fit_transform(train_data)
output_test = vectorizer.transform(test_data)
logistic = LogisticRegression(penalty = "l1")
logistic.fit(output, train_labels)
predictions = logistic.predict(output_test)
logistic = LogisticRegression(penalty = "l2", C = i + 1)
output = logistic.fit_transform(output, train_labels)
predictions = logistic.predict(output_test)
The following error message is shown resulting from the last predict line. Original number of features is 26879:
ValueError: X has 26879 features per sample; expecting 7087
There seem to be a couple of things wrong here.
Firstly, I suggest you give different names to the two logistic models, as you need both to make a prediction.
In your code, you never call the transform of the l1 logistic regression, even though that is what you say you want to do.
What you should be doing is
l1_logreg = LogisticRegression(penalty="l1")
l1_logreg.fit(output, train_labels)
# reduce both train and test matrices with the same fitted l1 model
out_reduced = l1_logreg.transform(output)
out_reduced_test = l1_logreg.transform(output_test)
l2_logreg = LogisticRegression(penalty="l2")
l2_logreg.fit(out_reduced, train_labels)
predictions = l2_logreg.predict(out_reduced_test)
or
pipe = make_pipeline(CountVectorizer(), LogisticRegression(penalty="l1"),
                     LogisticRegression(penalty="l2"))
pipe.fit(train_data, train_labels)
predictions = pipe.predict(test_data)
FYI I wouldn't expect that to work better than just doing l2 logreg. Also you could try SGDClassifier(penalty="elasticnet").
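In more recent scikit-learn releases, estimators such as LogisticRegression no longer expose a transform method, so the same idea is usually written with the SelectFromModel wrapper. The sketch below assumes such a release and uses a hypothetical toy corpus in place of train_data and test_data. The liblinear solver is chosen because it supports the L1 penalty.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for train_data / test_data.
train_data = ["good movie", "great film", "good fun",
              "bad movie", "awful film", "bad plot"] * 5
train_labels = [1, 1, 1, 0, 0, 0] * 5
test_data = ["good film", "bad film"]

pipe = make_pipeline(
    CountVectorizer(),
    # L1 model used only to keep features with non-zero coefficients.
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear")),
    # L2 model trained on the reduced feature set.
    LogisticRegression(penalty="l2"),
)
pipe.fit(train_data, train_labels)

# The pipeline applies the same vectorizer and feature mask to the test data.
predictions = pipe.predict(test_data)
```

Because the selection step lives inside the pipeline, the test matrix is reduced to the same columns as the training matrix, which avoids the "X has 26879 features per sample; expecting 7087" mismatch.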
I'm using scikit-learn MultinomialNB and Vectorizer to build a prediction model of whether the review is good or bad.
After training on the labelled data, how do I use it to predict new reviews (or existing review)? I'm getting the error message below.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cross_validation import train_test_split
from sklearn.naive_bayes import MultinomialNB
X = vectorizer.fit_transform(df.quote)
X = X.tocsc()
Y = (df.fresh == 'fresh').values.astype(np.int)
xtrain, xtest, ytrain, ytest = train_test_split(X, Y)
clf = MultinomialNB().fit(xtrain, ytrain)
new_review = ['this is a new review, movie was awesome']
new_review = vectorizer.fit_transform(new_review)
print df.quote[15]
print(clf.predict(df.quote[10])) #predict existing review in dataframe
print(clf.predict(new_review)) #predict new review
Technically, Toy Story is nearly flawless.
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-91-27a0698bbd1f> in <module>()
15
16 print df.quote[15]
---> 17 print(clf.predict(df.quote[10])) #predict existing quote in dataframe
18 print(clf.predict(new_review)) #predict new review
//anaconda/lib/python2.7/site-packages/sklearn/naive_bayes.pyc in predict(self, X)
60 Predicted target values for X
61 """
---> 62 jll = self._joint_log_likelihood(X)
63 return self.classes_[np.argmax(jll, axis=1)]
64
//anaconda/lib/python2.7/site-packages/sklearn/naive_bayes.pyc in _joint_log_likelihood(self, X)
439 """Calculate the posterior log probability of the samples X"""
440 X = atleast2d_or_csr(X)
--> 441 return (safe_sparse_dot(X, self.feature_log_prob_.T)
442 + self.class_log_prior_)
443
//anaconda/lib/python2.7/site-packages/sklearn/utils/extmath.pyc in safe_sparse_dot(a, b, dense_output)
178 return ret
179 else:
--> 180 return fast_dot(a, b)
181
182
TypeError: Cannot cast array data from dtype('float64') to dtype('S32') according to the rule 'safe'
You need to pass a bag-of-words representation to predict, not the text directly. You are doing it almost correctly with new_review; only change that line to new_review = vectorizer.transform(new_review) (see Stergios's comment). Try this:
print(clf.predict(X[10, :]))
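Put together, the flow might look like the following minimal sketch. The six short reviews and their labels are made up and stand in for df.quote and df.fresh. The vectorizer is fitted once, and both the existing and the new review reach predict as rows of the same bag-of-words feature space.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for df.quote and (df.fresh == 'fresh').
quotes = ["movie was awesome", "awesome and flawless", "a flawless film",
          "movie was terrible", "terrible and boring", "a boring film"]
fresh = np.array([1, 1, 1, 0, 0, 0])

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(quotes)    # fit the vocabulary once
clf = MultinomialNB().fit(X, fresh)

# Existing review: pass its row of the feature matrix, not the raw string.
existing_pred = clf.predict(X[0:1])

# New review: transform with the already fitted vectorizer (no fit_transform).
new_review = vectorizer.transform(["this is a new review, movie was awesome"])
new_pred = clf.predict(new_review)
```

Calling fit_transform on the new review would build a fresh vocabulary from that one sentence, so its columns would no longer line up with the ones the classifier was trained on.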