My regression model is returning coefficients but I cannot figure out how to get it to tell me the R squared and the adjusted R Squared. I need those numbers as well as the coefficient for the intercept. Any insight is appreciated
# import the needed libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
# Import the data
data = pd.read_csv("NBA.csv")
# Specify the features and the target
target = 'Margin'
features = list(data.columns) # This is the column names of your data as a list
features.remove(target) # We remove the target from the list of features
# Train the model
model = LinearRegression() # Instantiate the model
model.fit(data[features].values, data[target].values) # fit the model to the data
print(features) # Returns the name of each feature
print(model.coef_) # Returns the coefficients for each feature (in the same order of your features)
Figured it out.
# import the needed libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
# Import the data
data = pd.read_csv("NBA.csv")
# Specify the features and the target
target = 'Margin'
features = list(data.columns) # This is the column names of your data as a list
features.remove(target) # We remove the target from the list of features
# Train the model
model = LinearRegression() # Instantiate the model
model.fit(data[features].values, data[target].values) # fit the model to the data
model.score(data[features].values, data[target].values)
print(features) # Returns the name of each feature
print(model.coef_) # Returns the coefficients for each feature (in the same order of your features)
print(model.score(data[features].values, data[target].values))
Related
I'm new to python but I'm trying to run a regression with a bunch of different variables. So far I've got it down to Scikit. I've been searching for hours but can't seem to find a way to import the data and then run a linear regression on it while returning the coefficients of each variable. Any help is much appreciated. I have 15 columns that I want to run against the X.
X = Margin
Ys = A1, B1, C1, D1 etc.
Example set below:
Margin,A1
-8,110.7
-10,112
-1,106.7
9,109
-9,107.5
1,108.1
-19,109.2
Here's what I've got so far I know it's not much
import pandas as pd
data = pd.read_csv("NBA.csv")
As a convention in machine learning we consider X as the features and Y as the target.
If you want to run a linear regression and extract the coefficients, you can do the following :
# import the needed libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
# Import the data
data = pd.read_csv("NBA.csv")
# Specify the features and the target
target = 'Margin'
features = data.columns.tolist() # This is the column names of your data as a list
features.remove(target) # We remove the target from the list of features
# Train the model
model = LinearRegression() # Instantiate the model
model.fit(data[features].values, data[target].values) # fit the model to the data
print(features) # Returns the name of each feature
print(model.coef_) # Returns the coefficients for each feature (in the same order of your features)
It does not appear that Pyspark Onv-vs-Rest classifier provides probabilities. Is there a way to do this?
I am appending code below. I am adding the standard multiclass classifier for comparison.
from pyspark.ml.classification import LogisticRegression, OneVsRest
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
# load data file.
inputData = spark.read.format("libsvm") \
.load("/data/mllib/sample_multiclass_classification_data.txt")
(train, test) = inputData.randomSplit([0.8, 0.2])
# instantiate the base classifier.
lr = LogisticRegression(maxIter=10, tol=1E-6, fitIntercept=True)
# instantiate the One Vs Rest Classifier.
ovr = OneVsRest(classifier=lr)
# train the multiclass model.
ovrModel = ovr.fit(train)
lrm = lr.fit(train)
# score the model on test data.
predictions = ovrModel.transform(test)
predictions2 = lrm.transform(test)
predictions.show(6)
predictions2.show(6)
I don't think you can access the probabilities(confidence) vector because it takes the max of the confidence and drops the confidence vector. To test, you can make a copy of the class and modify it and remove the .drop(accColName)
http://spark.apache.org/docs/2.0.1/api/python/_modules/pyspark/ml/classification.html
# output the index of the classifier with highest confidence as prediction
labelUDF = udf(
lambda predictions: float(max(enumerate(predictions), key=operator.itemgetter(1))[0]),
DoubleType())
# output label and label metadata as prediction
return aggregatedDataset.withColumn(
self.getPredictionCol(), labelUDF(aggregatedDataset[accColName])).drop(accColName)
I have a set of data that I have used scikit learn PCA. I scaled the data before performing PCA with StandardScaler().
variance_to_retain = 0.99
np_scaled = StandardScaler().fit_transform(df_data)
pca = PCA(n_components=variance_to_retain)
np_pca = pca.fit_transform(np_scaled)
# make dataframe of scaled data
# put column names on scaled data for use later
df_scaled = pd.DataFrame(np_scaled, columns=df_data.columns)
num_components = len(pca.explained_variance_ratio_)
cum_variance_explained = np.cumsum(pca.explained_variance_ratio_)
eigenvalues = pca.explained_variance_
eigenvectors = pca.components_
I then ran K-Means clustering on the scaled dataset. I can plot the cluster centers just fine in scaled space.
My question is: how do I transform the locations of the centers back into the original data space. I know that StandardScaler.fit_transform() make the data have zero mean and unit variance. But with the new points of shape (num_clusters, num_features), can I use inverse_transform(centers) to get the centers transformed back into the range and offset of the original data?
Thanks, David
you can get cluster_centers on a kmeans, and just push that into your pca.inverse_transform
here's an example
import numpy as np
from sklearn import decomposition
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
iris = datasets.load_iris()
X = iris.data
y = iris.target
scal = StandardScaler()
X_t = scal.fit_transform(X)
pca = decomposition.PCA(n_components=3)
pca.fit(X_t)
X_t = pca.transform(X_t)
clf = KMeans(n_clusters=3)
clf.fit(X_t)
scal.inverse_transform(pca.inverse_transform(clf.cluster_centers_))
Note that sklearn has multiple ways to do the fit/transform. You can do StandardScaler().fit_transform(X) but you lose the scaler, and can't reuse it; nor can you use it to create an inverse.
Alternatively, you can do scal = StandardScaler() followed by scal.fit(X) and then by scal.transform(X)
OR you can do scal.fit_transform(X) which combines the fit/transform step
Here I am using SVR to Fit the data before that I am using scaling technique to scale the values and to get the prediction I am using the Inverse transform function
from sklearn.preprocessing import StandardScaler
#Creating two objects for dependent and independent variable
ss_X = StandardScaler()
ss_y = StandardScaler()
X = ss_X.fit_transform(X)
y = ss_y.fit_transform(y.reshape(-1,1))
#Creating a model object and fiting the data
reg = SVR(kernel='rbf')
reg.fit(X,y)
#To make a prediction
#First we have transform the value into scalar level
#Second inverse tranform the value to see the original value
ss_y.inverse_transform(reg.predict(ss_X.transform(np.array([[6.5]]))))
I am using the various mechanisms in scikit-learn to create a tf-idf representation of a training data set and a test set consisting of text features. Both data sets are preprocessed to use the same vocabulary so the features and the number of features are the same. I can create a model on the training data and assess its performance on the test data. I am wondering if I use SelectPercentile to reduce the number of features in the training set after transformation, how can identify the same features in the test set to utilise in prediction?
trainDenseData = trainTransformedData.toarray()
testDenseData = testTransformedData.toarray()
if ( useFeatureReduction== True):
reducedTrainData = SelectPercentile(f_regression,percentile=10).fit_transform(trainDenseData,trainYarray)
clf.fit(reducedTrainData, trainYarray)
# apply feature reduction to the test data
See code and comments below.
import numpy as np
from sklearn.datasets import make_classification
from sklearn import feature_selection
# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000,
n_features=10,
n_informative=3,
n_redundant=0,
n_repeated=0,
n_classes=2,
random_state=0,
shuffle=False)
sp = feature_selection.SelectPercentile(feature_selection.f_regression, percentile=30)
sp.fit_transform(X[:-1], y[:-1]) #here, training are the first 9 data vectors, and the last one is the test set
idx = np.arange(0, X.shape[1]) #create an index array
features_to_keep = idx[sp.get_support() == True] #get index positions of kept features
x_fs = X[:,features_to_keep] #prune X data vectors
x_test_fs = x_fs[-1] #take your last data vector (the test set) pruned values
print x_test_fs #these are your pruned test set values
You should store the SelectPercentile object, and use it to transform the test data:
select = SelectPercentile(f_regression,percentile=10)
reducedTrainData = select.fit_transform(trainDenseData,trainYarray)
reducedTestData = select.transform(testDenseData)
With R, one can ignore a variable (column) while building a model with the following syntax:
model = lm(dependant.variable ~ . - ignored.variable, data=my.training,set)
It's very handy when your data set contains indexes or ID.
How would you do that with SKlearn in python, assuming your data are Pandas data frames ?
So this is from my own code I used to do some prediction on StackOverflow last year:
from __future__ import division
from pandas import *
from sklearn import cross_validation
from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier
basic_feature_names = [ 'BodyLength'
, 'NumTags'
, 'OwnerUndeletedAnswerCountAtPostTime'
, 'ReputationAtPostCreation'
, 'TitleLength'
, 'UserAge' ]
fea = # extract the features - removed for brevity
# construct our classifier
clf = GradientBoostingClassifier(n_estimators=num_estimators, random_state=0)
# now fit
clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)
# now
priv_fea = # this was my test dataset
# now calculate the predicted classes
pred = clf.predict(priv_fea[basic_feature_names])
So if we wanted a subset of the features for classification I could have done this:
# want to train using fewer features so remove 'BodyLength'
basic_feature_names.remove('BodyLength')
clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)
So the idea here is that a list can be used to select a subset of the columns in the pandas dataframe, as such we can construct a new list or remove a value and use this for selection
I'm not sure how you could do this easily using numpy arrays as indexing is done differently.