How to join 2 columns of word embeddings in Pandas - python

I have extracted word embeddings of 2 different texts (title and description) and want to train an XGBoost model on both embeddings. The embeddings are 200 in dimension each as can be seen below:
Now I was able to train the model on 1 embedding data and it worked perfectly like this:
x=df['FastText'] #training features
y=df['Category'] # target variable
#Defining Model
model = XGBClassifier(objective='multi:softprob')
#Evaluation metrics
score=['accuracy','precision_macro','recall_macro','f1_macro']
#Model training with 5 Fold Cross Validation
scores = cross_validate(model, np.vstack(x), y, cv=5, scoring=score)
Now I want to use both the features for training but it gives me an error if I pass 2 columns of df like this:
x=df[['FastText_Title','FastText']]
One solution I tried is adding both the embeddings like x1+x2 but it decreases accuracy significantly. How do I use both features in cross_validate function?

In the past for multiple inputs, I've done this:
features = ['FastText_Title', 'FastText']
x = df[features]
y = df['Category']
It is creating an array containing both datasets.
I usually need to scale the data as well using MinMaxScaler once the new array has been made.

According to the error you are getting, it seems that there is something wrong with types. Try this, it will convert your features to numeric and it should work:
df['FastText'] = pd.to_numeric(df['FastText'])
df['FastText_Title'] = pd.to_numeric(df['FastText_Title'])

Related

ValueError: X has 2 features, but SVC is expecting 472082 features as input

I am loading Linear SVM model and then predicting new data using the stored trained SVM Model. I used TFIDF while training such as:
vector = TfidfVectorizer(ngram_range=(1, 3)).fit(data['text'])
**when i apply new data than I am getting error at the time of Prediction.
**
ValueError: X has 2 features, but SVC is expecting 472082 features as input.
Code for the Prediction of new data
Linear_SVC_classifier = joblib.load("/content/drive/MyDrive/dataset/Classifers/Linear_SVC_classifier.sav")
test_data = input("Enter Data for Testing: ")
newly_testing_data = vector.transform(test_data)
SVM_Prediction_NewData = Linear_SVC_classifier.predict(newly_testing_data)
I want to predict new data using stored SVM model without applying TFIDF on training data when I give data to model for prediction. When I use the new data for prediction than the prediction line gives error. Is there any way to remove this error?
The problem is due to your creation of a new TfidfVectorizer by fitting it on the test dataset. As the classifier has been trained on a matrix generated by the TfidfVectorier fitted on the training dataset, it expects the test dataset to have the exact same dimensions.
In order to do so, you need to transform your test dataset with the same vectorizer that was used during training rather than initialize a new one based on the test set.
The vectorizer fitted on the train set can be pickled and stored for later use to avoid any re-fitting at inference time.

Unseen data manipulation for pretrained model with picke dumps

I have several logs that are manipulated by two different TfIdfVectorizer objects.
The first one reads and splits the log in ngrams
with open("my/log/path.txt", "r") as test:
corpus = [test.read()]
tf = TfidfVectorizer(ngram_range=ngr)
corpus_transformed = tf.fit_transform(corpus)
infile.close()
The resulting data is written in a Pandas dataframe that has 4 columns
(score [float], review [ngrams of text], isbad [0/1], kfold [int]).
Initial kfold value is -1.
where I have:
my_df = pd.DataFrame(corpus_transformed.toarray(), index=['score'], columns=tf.get_feature_names()).transpose()
For cross-validation I split the dataset in test and train with StratifiedKFolds by doing a simple:
ngr=(1,2)
for fold_ in range(5):
# 'reviews' column has short sentences expressing an opinion
train_df = df[df.kfold != fold_].reset_index(drop=True)
test_df = df[df.kfold == fold_].reset_index(drop=True)
new_tf = TfidfVectorizer(ngram_range=ngr)
new_tf.fit(train_df.reviews)
xtrain = new_tf.transform(train_df.reviews)
xtest = new_tf.transform(test_df.reviews)
And only after this double tfidf transformation I fit my SVC model with:
model.fit(xtrain, train_df.isbad) # where 'isbad' column is 0 if negative and 1 if positive
preds = model.predict(xtest)
accuracy = metrics.accuracy_score(test_df.isbad, preds)
So at the end of the day I have my model that classifies reviews in both classes (negative-0 or positive-1), I dump my model and both tfidf vectorizers (tf and new_tf) but when it comes to new data, even if I do:
with open("never/seen/data.txt", "r") as unseen: # load REAL SAMPLE data
corpus = [unseen.read()]
# to transform the unseen data I use one of the dumped tfidf's obj
corpus_transformed = tf_dump.transform(corpus)
unseen.close()
my_unseen_df = pd.DataFrame(corpus_transformed.toarray(), index=['score'], columns=tf_dump.get_feature_names()).transpose()
my_unseen_df = my_unseen_df.sample(frac=1).reset_index(drop=True) # randomize rows
# to transform reviews' data that are going to be classified I use the new_tf dump, like before
X = new_tf_dump.transform(my_unseen_df.reviews)
# use the previously loaded model and make predictions
res = model_dump.predict(X)
#print(res)
I got ValueError: X has 604,969 features, but SVC is expecting 605,424 as input, but how is that possibile if I manipulate the data with the same objects? What am I doing wrong here?
I want to use my trained model as a classifier for new, unseen data. Isn't this the right way to go?
Thank you.

adding more data to Support Vector Classifier training

I am using the LinearSVC() available on scikit learn to classify texts into a max of 7 seven labels. So, it is a multilabel classification problem. I am training on a small amount of data and testing it. Now, I want to add more data (retrieved from a pool based on a criteria) to the fitted model and evaluate on the same test set. How can this be done?
Question:
It is necessary to merge the previous data set with the new data set, get everything preprocessed and then retrain to see if the performance improve with the old + new data?
My code so far is below:
def preprocess(data, x, y):
global Xfeatures
global y_train
global labels
porter = PorterStemmer()
multilabel=MultiLabelBinarizer()
y_train=multilabel.fit_transform(data[y])
print("\nLabels are now binarized\n")
data[multilabel.classes_] = y_train
labels = multilabel.classes_
print(labels)
data[x].apply(lambda x:nt.TextFrame(x).noise_scan())
print("\English stop words were extracted\n")
data[x].apply(lambda x:nt.TextExtractor(x).extract_stopwords())
corpus = data[x].apply(nfx.remove_stopwords)
corpus = data[x].apply(lambda x: porter.stem(x))
tfidf = TfidfVectorizer()
Xfeatures = tfidf.fit_transform(corpus).toarray()
print('\nThe text is now vectorized\n')
return Xfeatures, y_train
Xfeatures, y_train = preprocess(df1, 'corpus', 'zero_level_name')
Xfeatures_train=Xfeatures[:300]
y_train_features = y_train[:300]
X_test=Xfeatures[300:400]
y_test=y_train[300:400]
X_pool=Xfeatures[400:]
y_pool=y_train[400:]
def model(modelo, tipo):
svc= modelo
clf = tipo(svc)
clf.fit(Xfeatures_train,y_train_features)
clf_predictions = clf.predict(X_test)
return clf_predictions
preds_pool = model(LinearSVC(class_weight='balanced'), OneVsRestClassifier)
It depends on how your previous dataset was. If your previous dataset was a well representation of your problem at hand, then adding more data will not increase your model performance by a large. So you can just test with the new data.
However, it is also possible that your initial dataset was not representative enough, and therefore with more data your classification accuracy increases. So in that case it is better to include all the data and preprocess it. Because preprocessing generally includes parameters that are computed on the dataset as whole. e.g., I can see you have TFIDF, or mean which is sensitive to the dataset at hand.

Improving classification by using clustering as a feature

I'm trying to improve my classification results by doing clustering and use the clustered data as another feature (or use it alone instead of all other features - not sure yet).
So let's say that I'm using unsupervised algorithm - GMM:
gmm = GaussianMixture(n_components=4, random_state=RSEED)
gmm.fit(X_train)
pred_labels = gmm.predict(X_test)
I trained the model with training data and predicted the clusters by the test data.
Now I want to use a classifier (KNN for example) and use the clustered data within it. So I tried:
#define the model and parameters
knn = KNeighborsClassifier()
parameters = {'n_neighbors':[3,5,7],
'leaf_size':[1,3,5],
'algorithm':['auto', 'kd_tree'],
'n_jobs':[-1]}
#Fit the model
model_gmm_knn = GridSearchCV(knn, param_grid=parameters)
model_gmm_knn.fit(pred_labels.reshape(-1, 1),Y_train)
model_gmm_knn.best_params_
But I'm getting:
ValueError: Found input variables with inconsistent numbers of samples: [418, 891]
Train and Test are not with same dimension.
So how can I implement such approach?
Your method is not correct - you are attempting to use as a single feature the cluster labels of your test data pred_labels, in order to fit a classifier with your training labels Y_train. Even in the huge coincidental case that the dimensions of these datasets were the same (hence not giving a dimension mismatch error, as here), this is conceptually wrong and does not actually make any sense.
What you actually want to do is:
Fit a GMM with your training data
Use this fitted GMM to get cluster labels for both your training and test data.
Append the cluster labels as a new feature in both datasets
Fit your classifier with this "enhanced" training data.
All in all, and assuming that your X_train and X_test are pandas dataframes, here is the procedure:
import pandas as pd
gmm.fit(X_train)
cluster_train = gmm.predict(X_train)
cluster_test = gmm.predict(X_test)
X_train['cluster_label'] = pd.Series(cluster_train, index=X_train.index)
X_test['cluster_label'] = pd.Series(cluster_test, index=X_test.index)
model_gmm_knn.fit(X_train, Y_train)
Notice that you should not fit your clustering model with your test data - only with your training ones, otherwise you have data leakage similar to the one encountered when using the test set for feature selection, and your results will be both invalid and misleading .

Scikit Learn - ValueError: X has 26879 features per sample; expecting 7087

I am doing feature selection by first training LogisticRegression with L1 penalty and then using the reduced feature set to re-train the model using L2 penalty. Now, when I try to predict test data, the transform() done on it results in a different dimensional array. I am confused as to how to re-size the test data to be able to predict.
Appreciate any help. Thank you.
vectorizer = CountVectorizer()
output = vectorizer.fit_transform(train_data)
output_test = vectorizer.transform(test_data)
logistic = LogisticRegression(penalty = "l1")
logistic.fit(output, train_labels)
predictions = logistic.predict(output_test)
logistic = LogisticRegression(penalty = "l2", C = i + 1)
output = logistic.fit_transform(output, train_labels)
predictions = logistic.predict(output_test)
The following error message is shown resulting from the last predict line. Original number of features is 26879:
ValueError: X has 26879 features per sample; expecting 7087
There seem to be a couple of things wrong here.
Firstly, I suggest you give different names to the two logistic models, as you need both to make a prediction.
In you code, you never call the transform of the l1 logistic regression, which is not what you say you want to do.
What you should be doing is
l1_logreg = LogisticRegression(penalty="l1")
l1_logreg.fit(output, train_labels)
out_reduced = l1_logreg.transform(out)
out_reduced_test = l1_logreg.transform(out_test)
l2_logreg = LogisticRegression(penalty="l2")
l2_logreg.fit(out_reduced, train_labels)
pedictions = l2_logreg.predict(out_reduced_test)
or
pipe = make_pipeline(CountVectorizer(), LogisticRegression(penalty="l1"),
LogisticRegression(penalty="l2"))
pipe.fit(train_data, train_labels)
preditions = pipe.predict(test_data)
FYI I wouldn't expect that to work better than just doing l2 logreg. Also you could try SGDClassifier(penalty="elasticnet").

Categories

Resources