I am using the following method to train a linear regressor to predict retweets of tweets. I am using 'text' as the feature and 'retweet_count' as the target to be predicted. However, I have several additional features in my data such as hasMedia, hasHashtag, followers_count, sentiment (which are numerical features). How can I combine these features with 'text' that was converted to tfidf vector?
I already tried concatenating pandas dataframes. And then When I give a new test data the features mismatch. Please check my question in Attributes mismatch between training and testing data in sklearn - linear regression
def predict_retweets(dataset):
tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
keyword_response = tfidf.fit_transform(dataset['text']).toarray()
X = keyword_response
y = dataset['retweet_count']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(df)
return None
Sample of data
Related
I'm new in this field and I'm currently working with gene expression data. I have to do a classification where my data are Counts under matrix form. The features are the genes and the Samples to classify are the patients (7 types of cancer and healthy donors). The book from which I'm replicating the experiment says the following :
For the multi-class SVM classification algorithm, a One-Versus-One (OVO) approach was used. To cross validate the algorithm for all samples in the training cohort, the SVM algorithm was trained by all samples in the training cohort minus one, while the remaining sample was used for (blind) classification. This process was repeated for all samples until each sample was predicted once (leave-one-out cross-validation [LOOCV] procedure).
Now I actually know how to use Loocv on Python as I know how to use OVO by looking online. But I dont get what is mneant to be done here. I tried an attempt and results came out quite similar but im pretty sure I'm doing a horrible mistake somewhere. Please dont flame me I need help , here down below my interpretation (I copied this from internet and added Ovo instead of only svm):
#Function for training
def loocv(train_X,train_y):
# define X and y
X = train_X
y = train_y
# define LOOCV
loo = LeaveOneOut()
loo.get_n_splits(X)
# define true and predict list
y_true,y_pred = [],[]
# run
for train_index, test_index in loo.split(X):
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
model = SVC(kernel='linear',random_state=0)
ovo_classifier = OneVsOneClassifier(model)
ovo_classifier.fit(X_train,y_train)
yhat = ovo_classifier.predict(X_test)
y_true.append(y_test[0])
y_pred.append(yhat[0])
return y_true,y_pred,ovo_classifier
Validation :
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,random_state=0)
y_true,y_pred,model = loocv(X_train,y_train)
pred_y = model.predict(X_test)
training_accuracy = accuracy_score(y_true,y_pred)
accuracy = accuracy_score(y_test,pred_y)
print(accuracy)
print(training_accuracy)
Results :
0.6918604651162791
0.6658291457286432
I've built a machine learning model from 2 data frames called df_test and df_train for naive Bayes, I run it with this code in my pycharm, but when I run it with this model, it returns:
ValueError: Found input variables with inconsistent numbers of samples: [164309, 109541].
from sklearn.model_selection import train_test_split
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(df_train.drop(columns = ['Interest_Rate']), df_test, test_size=1.0,random_state=109) # 70% training and 30% test
from sklearn.naive_bayes import GaussianNB
#Create a Gaussian Classifier
gnb = GaussianNB()
#Train the model using the training sets
gnb.fit(X_train, y_train)
#Predict the response for test dataset
y_pred = gnb.predict(X_test)
Where have I gone wrong?
You want 70-30 split of your data but here you created 100% test data. Change test_size to 0.3(30%) instead of 1.0(100%).
X_train, X_test, y_train, y_test = train_test_split(df_train.drop(columns = ['Interest_Rate']), df_test, test_size=0.3,random_state=109) # 70% training and 30% test
I want to apply StandardScaler only to the numerical parts of my dataset using the function sklearn.compose.ColumnTransformer, (the rest is already one-hot encoded). I would like to see .scale_ and .mean_ parameters fitted to the training data, but the function scaler.mean_ and scaler.scale_ obviously does not work when using a column transformer. Is there a way to do so?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)
scaler = StandardScaler()
data_pipeline = ColumnTransformer([
('numerical', scaler, numerical_variables)], remainder='passthrough')
X_train = data_pipeline.fit_transform(X_train)
The fitted transformers are available in the attributes transformers_ (a list) and named_transformers_ (a dict-like with keys the names you provided). So, for example,
data_pipeline.named_transformers_['numerical'].mean_
i have trained my ML model with linear regression using these
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)
how do i get a prediction for a single row in the data set?
You can easily do something like this-
y_pred = regressor.predict(X_test)
So if you want to do inference on only a single row in the dataset let's say the row at 2nd index. You would do-
y_pred = regressor.predict(X_test[2])
Scikit-learn models all have a predict method you can use.
Just pass it your row as an array and you'll be fine:
regressor.predict(x_val)
Since you only want the first row, you could do
first_row = X_test[0] //assuming X_test is where your test data is at
y_pred = regressor.predict(first_row)
I want to test a specific row from my dataset and to see the result, but I don't know how to do it. For example I want to test row number 100 and then to see the accuracy.
feature_cols = [0,1,2,3,4,5]
X = df[feature_cols] # Features
y = df[6] # Target variable
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1,
random_state=1)
#Create Decision Tree classifer object
clf = DecisionTreeClassifier(max_depth=5)
#Train Decision Tree Classifer
clf = clf.fit(X_train,y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
I recommend excluding the row you want to test from the dataset.
test_row=100
train_idx=np.arange(X.shape[0])!=test_row
test_idx=np.arange(X.shape[0])==test_row
X_train=X[train_idx]
y_train=y[train_idx]
X_test=X[test_idx]
y_test=y[test_idx]
Now X_test will contain a single row. However, the accuracy will now be either 0 or 1 since you are only testing one sample.