If I have a training set trainX, trainY, I know that I can run PCA with
pca = PCA(n_components=5)
Xred = pca.fit(trainX).transform(trainX)
If I want to run a model, say Linear Regression, do I then run PCA on the testX?
Like this:
clf = linear_model.LinearRegression()
clf.fit(trainX, trainY)
testXred = pca.fit(testX).transform(testX)
predictions = clf.predict(testXred)
Or do I only run PCA on the training set, so the Linear Regression prediction should be this instead?
predictions = clf.predict(testX)
or this?
testXred = pca.fit(trainX).transform(testX)
predictions = clf.predict(testXred)
If you want to reduce noise with PCA before running the linear regression, here is an example that might help:
Using PCA on linear regression
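To make the options from the question concrete, here is a minimal sketch of the usual pattern: fit PCA on the training set only, reuse that fitted projection on the test set, and train the regression on the reduced training data. The synthetic arrays are stand-ins (an assumption), and the pipeline variant at the end is one possible way to keep the two steps together, not the only one.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption: the real trainX/testX are numeric arrays).
rng = np.random.default_rng(0)
trainX = rng.normal(size=(100, 10))
trainY = trainX @ rng.normal(size=10) + rng.normal(scale=0.1, size=100)
testX = rng.normal(size=(20, 10))

# Fit PCA on the training set only.
pca = PCA(n_components=5)
trainXred = pca.fit_transform(trainX)
# Reuse the fitted projection on the test set; do not refit here.
testXred = pca.transform(testX)

# Train on the reduced training data, predict on the reduced test data.
clf = LinearRegression()
clf.fit(trainXred, trainY)
predictions = clf.predict(testXred)

# The same steps expressed as a pipeline, which handles fit/transform ordering.
pipe = make_pipeline(PCA(n_components=5), LinearRegression())
pipe.fit(trainX, trainY)
pipe_predictions = pipe.predict(testX)
```

Refitting PCA on testX would produce components that do not line up with the ones the model was trained on, which is why only transform is called on the test set.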
I am trying to train a linear regression model with scikit-learn. However, the resulting regression does not seem to fit the data very well. I would have expected the regression line to have a slope of (approximately) 30°, whereas this regression shows no correlation at all (a horizontal line). Does anyone have an idea how I can modify my model to get a more appropriate prediction?
Data Plot
This is the corresponding code:
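import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# Define x & y (df is the already-loaded patient DataFrame):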
x = pd.DataFrame({'patientweight': df['patientweight']})
y = pd.DataFrame({'rate': df['rate']})
# Split Data into Train-/Test-Set:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=0)
# Train the Linear Regression Model:
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)
# Predict on the training set (these predictions are plotted below):
y_pred = lin_reg.predict(x_train)
# Plot the data:
plt.figure(figsize=(15,10))
plt.xlabel('Patientweight')
plt.ylabel('Rate')
plt.title('Cohort 1: Relation between Patientweight and Rate')
plt.xlim(0,200)
plt.ylim(0,4000)
plt.scatter(df['patientweight'], df['rate'], alpha=0.25)
plt.scatter(x_train, y_pred)
I have done binary classification on a dataset to determine whether there is a leak or no leak. I applied three ML algorithms separately to compare performance: naive Bayes, random forest and decision tree. For the decision tree classifier I wrote the following code, where s1 to s20 are sensor values. How can I plot an error analysis graph, given that the predicted output is either 0 or 1?
#creating features and labels
n_features = list(zip(s1,s2,s3,s4,s5,s6,s7,s8,s9,s10,s11,s12,s13,s14,s15,s16,s17,s18,s19,s20))
n_samples = status
#Decision tree classifier (the target is a binary leak / no-leak label)
clf = tree.DecisionTreeClassifier()
#splitting the data
X_train, X_test, y_train, y_test = train_test_split(n_features,n_samples, test_size=0.5,random_state=0)
sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
#reuse the scaler fitted on the training data; do not refit on the test data
X_test_std = sc.transform(X_test)
#train model
clf.fit(X_train_std,y_train)
#prediction
y_pred = clf.predict(X_test_std)
print('percentage Accuracy:',100*metrics.accuracy_score(y_test,y_pred))
Create a dataframe called model_performance_df. Use the machine learning algorithms you tried (Naive Bayes, RandomForest and DecisionTree) as the column names, and add the performance metrics for each algorithm as rows.
Then use a visualization library such as matplotlib or seaborn to plot the comparison however you like. For example, try a histogram or a distribution plot.
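A rough sketch of that suggestion follows. The data is synthetic (an assumption standing in for the 20 sensor columns and the leak status), the choice of metrics is illustrative, and a grouped bar chart is used here only because it is a simple way to compare a handful of scores; a histogram or distribution plot could be swapped in.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the sensor features and the binary leak label.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

models = {
    "Naive Bayes": GaussianNB(),
    "RandomForest": RandomForestClassifier(random_state=0),
    "DecisionTree": DecisionTreeClassifier(random_state=0),
}

# Collect one dict of metrics per algorithm.
results = {}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }

# Outer keys become columns (algorithms), inner keys become rows (metrics).
model_performance_df = pd.DataFrame(results)

# Grouped bar chart: one group per metric, one bar per algorithm.
ax = model_performance_df.plot(kind="bar", figsize=(8, 5), rot=0)
ax.set_ylabel("Score")
ax.set_title("Model comparison on the test split")
plt.tight_layout()
# Call plt.show() or plt.savefig(...) here to view or save the figure.
```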
I am trying to compute the accuracy and ROC for LinearSVM, but I'm not sure about getting probabilities for calculating ROC.
I have this for calculating the accuracy. y_pred gives me hard predictions.
svm = LinearSVC()
y_pred = cross_val_predict(svm, X, y, cv=5)
For calculating the probabilities, I have this:
clf = CalibratedClassifierCV(svm, cv=5)
scores = cross_val_predict(clf, X, y, cv=5, method='predict_proba')[:,1]
I am not sure about the above two lines because I feel like there is some repetition with the cv=5 parameter. Any ideas on how to combine cross_val_predict and CalibratedClassifierCV? I don't have a separate test set. SVC with a linear kernel gives me different results, and I only want to use LinearSVM.
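For what it's worth, here is a hedged sketch of how the two pieces can fit together, using synthetic data as a stand-in. The two cv=5 arguments play different roles: the one in cross_val_predict produces out-of-fold predictions, while the one in CalibratedClassifierCV splits each training fold again to fit the calibration. Since a ROC curve only needs a ranking score, the uncalibrated decision_function output is another option.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

# Synthetic stand-in data (assumption: a binary classification problem).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

svm = LinearSVC()

# Hard out-of-fold predictions for accuracy.
y_pred = cross_val_predict(svm, X, y, cv=5)
acc = accuracy_score(y, y_pred)

# Option 1: out-of-fold signed margins; enough to rank samples for ROC.
margins = cross_val_predict(svm, X, y, cv=5, method="decision_function")
auc_margin = roc_auc_score(y, margins)

# Option 2: out-of-fold calibrated probabilities. The outer cv yields the
# held-out predictions; the inner cv calibrates within each training fold.
clf = CalibratedClassifierCV(svm, cv=5)
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
auc_proba = roc_auc_score(y, proba)
```

Option 1 avoids the nested split entirely; option 2 is the one to use if actual probability estimates are needed and not just a ROC curve.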
I have two classifiers in Python: an SVM and a logistic regression.
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn.metrics import accuracy_score
scaler = preprocessing.StandardScaler()
scaler.fit(synthetic_data)
synthetic_data = scaler.transform(synthetic_data)
test_data = scaler.transform(test_data)
svc = svm.SVC(tol=0.0001, C=100.0).fit(synthetic_data, synthetic_label)
predictedSVM = svc.predict(test_data)
print(accuracy_score(test_label, predictedSVM))
LRmodel = LogisticRegression(penalty='l2', tol=0.0001, C=100.0, random_state=1,max_iter=1000, n_jobs=-1)
predictedLR = LRmodel.fit(synthetic_data, synthetic_label).predict(test_data)
print(accuracy_score(test_label, predictedLR))
I use the same input for both, but their accuracies are very different. The SVM sometimes predicts every sample as 1. The accuracy of the SVM is 0.45 and the accuracy of the logistic regression is 0.75. I have tried changing the C parameter in different ways, but I still have the problem.
This is because SVC uses a radial basis function (RBF) kernel by default (http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html), which is not a linear classifier.
If you want a linear kernel, pass kernel='linear' to SVC.
If you want to keep the RBF kernel, I suggest also tuning the gamma parameter.
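As an illustration of both suggestions, the sketch below trains SVC with a linear kernel and with an RBF kernel at two gamma settings. The data is synthetic and the specific gamma values are arbitrary examples (assumptions), so the scores say nothing about the asker's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for synthetic_data / test_data from the question.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# kernel and gamma are real SVC parameters; the values are illustrative.
candidates = {
    "linear": SVC(kernel="linear", C=100.0, tol=0.0001),
    "rbf, gamma='scale'": SVC(kernel="rbf", gamma="scale", C=100.0, tol=0.0001),
    "rbf, gamma=0.01": SVC(kernel="rbf", gamma=0.01, C=100.0, tol=0.0001),
}

scores = {}
for name, model in candidates.items():
    predicted = model.fit(X_train, y_train).predict(X_test)
    scores[name] = accuracy_score(y_test, predicted)

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

In practice gamma and C are usually tuned together, for example with a grid search over a few orders of magnitude.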
I am trying to build a predictive model using Python. The training and test data sets have over 400 variables. Applying feature selection to the training data set reduces the number of variables to 180:
from sklearn.feature_selection import VarianceThreshold
sel = VarianceThreshold(threshold = .9)
and then I train a model using gradient boosting, achieving .84 AUC in cross-validation.
from sklearn import ensemble
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import roc_auc_score as auc
df_fit, df_eval, y_fit, y_eval= train_test_split( df, y, test_size=0.2, random_state=1 )
boosting_model = ensemble.GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                                     min_samples_leaf=100, learning_rate=0.1,
                                                     subsample=0.5, random_state=1)
boosting_model.fit(df_fit, y_fit)
But when I try to use this model to predict on the prediction data set, it gives me an error:
predict_target = boosting_model.predict(df_prediction)
Error: Number of variables in prediction data set 'df_prediction' does not match the number of variables in the model
Which makes sense, because the testing data still has over 400 variables.
My question: is there any way to get around this problem and keep using feature selection for predictive modeling? If I remove it, the accuracy of the model drops to .5, which is very poor.
Thanks!
You should pass your prediction matrix through the same fitted feature selector too. So somewhere in your code you do
df = sel.fit_transform(X)
and before predicting
df_prediction = sel.transform(X_prediction)
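Put together, the flow might look like the sketch below. The arrays are synthetic stand-ins (an assumption): half the columns are generated with a variance well above the 0.9 threshold and half well below it, so the selector has something to drop. The model settings are copied from the question.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import VarianceThreshold

# Synthetic stand-in data: even columns have variance ~4, odd columns ~0.01.
rng = np.random.default_rng(1)
scales = np.where(np.arange(40) % 2 == 0, 2.0, 0.1)
X = rng.normal(size=(500, 40)) * scales
y = (X[:, 0] + X[:, 2] > 0).astype(int)
X_prediction = rng.normal(size=(50, 40)) * scales

# Fit the selector on the training matrix only.
sel = VarianceThreshold(threshold=0.9)
df = sel.fit_transform(X)

# Apply the same fitted selector to the prediction matrix, so both
# matrices end up with the same columns in the same order.
df_prediction = sel.transform(X_prediction)

boosting_model = GradientBoostingClassifier(
    n_estimators=100, max_depth=3, min_samples_leaf=100,
    learning_rate=0.1, subsample=0.5, random_state=1,
)
boosting_model.fit(df, y)
predict_target = boosting_model.predict(df_prediction)
```

Calling fit_transform on the prediction set instead would select columns based on that set's own variances, which may not match the columns the model was trained on.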