I'm new to python but I'm trying to run a regression with a bunch of different variables. So far I've got it down to Scikit. I've been searching for hours but can't seem to find a way to import the data and then run a linear regression on it while returning the coefficients of each variable. Any help is much appreciated. I have 15 columns that I want to run against the X.
X = Margin
Ys = A1, B1, C1, D1 etc.
Example set below:
Margin,A1
-8,110.7
-10,112
-1,106.7
9,109
-9,107.5
1,108.1
-19,109.2
Here's what I've got so far I know it's not much
import pandas as pd
data = pd.read_csv("NBA.csv")
As a convention in machine learning we consider X as the features and Y as the target.
If you want to run a linear regression and extract the coefficients, you can do the following :
# import the needed libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
# Import the data
data = pd.read_csv("NBA.csv")
# Specify the features and the target
target = 'Margin'
features = data.columns.tolist() # This is the column names of your data as a list
features.remove(target) # We remove the target from the list of features
# Train the model
model = LinearRegression() # Instantiate the model
model.fit(data[features].values, data[target].values) # fit the model to the data
print(features) # Returns the name of each feature
print(model.coef_) # Returns the coefficients for each feature (in the same order of your features)
Related
I am learning Linear regression, I wrote this Linear Regression code using scikit-learn , after making the prediction, how to do prediction for new data points which are not there in my original data set.
In this data set you are given the salaries of people according to their work experience.
For example , The predicted salary for a person with work experience of 15 years should be [167005.32889087]
Here is image of data set
Here is my code ,
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
data = pd.read_csv('project_1_dataset.csv')
X = data.iloc[:,0].values.reshape(-1,1)
Y = data.iloc[:,1].values.reshape(-1,1)
linear_regressor = LinearRegression()
linear_regressor.fit(X,Y)
Y_pred = linear_regressor.predict(X)
plt.scatter(X,Y)
plt.plot(X, Y_pred, color = 'red')
plt.show()
After fitting and training your model with your existed dataset (i.e. after linear_regressor.fit(X,Y)), you could make predictions in new instances in the same way:
new_prediction = linear_regressor.predict(new_data)
print(new_prediction)
where new_data is your new data point.
If you want to make predictions on particular random new data points, the above way should be enough. If your new data points belong to another dataframe, then you could replace new_data with the respective dataframe containing the new instances to be predicted.
My regression model is returning coefficients but I cannot figure out how to get it to tell me the R squared and the adjusted R Squared. I need those numbers as well as the coefficient for the intercept. Any insight is appreciated
# import the needed libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
# Import the data
data = pd.read_csv("NBA.csv")
# Specify the features and the target
target = 'Margin'
features = list(data.columns) # This is the column names of your data as a list
features.remove(target) # We remove the target from the list of features
# Train the model
model = LinearRegression() # Instantiate the model
model.fit(data[features].values, data[target].values) # fit the model to the data
print(features) # Returns the name of each feature
print(model.coef_) # Returns the coefficients for each feature (in the same order of your features)
Figured it out.
# import the needed libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
# Import the data
data = pd.read_csv("NBA.csv")
# Specify the features and the target
target = 'Margin'
features = list(data.columns) # This is the column names of your data as a list
features.remove(target) # We remove the target from the list of features
# Train the model
model = LinearRegression() # Instantiate the model
model.fit(data[features].values, data[target].values) # fit the model to the data
model.score(data[features].values, data[target].values)
print(features) # Returns the name of each feature
print(model.coef_) # Returns the coefficients for each feature (in the same order of your features)
print(model.score(data[features].values, data[target].values))
I have data in the following form
Class Feature set list
classlabel1 - [size,time] example:[6780.3,350.00]
classlabel2 - [size,time]
classlabel3 - [size,time]
classlabel4 - [size,time]
How do I save this data in excel sheet and how can I train the model using this feature set? Currently I am working on SVM classifier.
I have tried saving the feature set list in a dataframe and saving this dataframe to a csv file. But the size and time are getting split into two different columns.
The dataframe is getting saved in csv file in the following way:
col 0 col1 col2
62309 396.5099154 label1
I would like to train and test on the feature vector [size,time] combined. Is it possible and is this a right way? If it is possible, how can I do it?
Firstly responding to your question:
I would like to train and test on the feature vector [size,time]
combined. Is it possible and is this a right way? If it is possible,
how can I do it?
Combining the two is not the right thing to do because both are in two different scales (if they are actually what there name suggests) and also combining them will result in loss of information which they will provide, so they are two totally independent features for any ML supervised algorithm. So I would suggest to treat these two features separately rather than combining into one.
Now let's move onto to next section:
How do I save this data in excel sheet and how can I train the model
using this feature set? Currently I am working on SVM classifier.
Storing data : In my opinion, you can store data in whichever format you want but I would prefer storing data in csv format as it is convenient and loading of data file is faster.
sample_data.csv
size,time,class_label
100,150,label1
200,250,label2
240,180,label1
Below is the code for reading the data from csv and training SVM :
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# loading data
data = pd.read_csv("sample_data.csv", error_bad_lines=True,
warn_bad_lines=True)
# Dividing into dependent and independent features
Y = data.class_label_col.values
X = data.drop("class_label_col", axis=1).values
# encode the class column values
label_encoded_Y = preprocessing.LabelEncoder().fit_transform(list(Y))
# split training and testing data
x_train,x_test,y_train,y_test=train_test_split(X,label_encoded_Y,
train_size=0.8,
test_size=0.2)
# Now use the whichever trainig algo you want
clf = SVC(gamma='auto')
clf.fit(x_train, y_train)
# Using the predictor
y_pred = clf.predict(x_test)
Since size and time are different features, you should separate them into 2 different columns so your model could set separate weight to each of them, i.e.
# data.csv
size time label
6780.3 3,350.00 classLabel1
...
If you want to transform the data you have into the format above you could use pandas.read_excel and use ast to transform the string list into python list object.
import pandas as pd
import ast
df = pd.read_excel("data.xlsx")
size_time = [(ast.literal_eval(x)[0], ast.literal_eval(x)[1]) for x in df["Feature set list"]]
size = [x[0] for x in size_time]
time = [x[1] for x in size_time]
label = df["Class"]
new_df = pd.DataFrame({"size":size, "time":time, "label":label})
# This will result in the DataFrame below.
# size time label
# 6780.3 350.0 classlabel1
# Save DataFrame to csv
new_df.to_csv("data_fix.csv")
# Use it
x = new_df.drop("label", axis=1)
y = new_df.label
# Further data preparation, such as split the dataset
# into train and test set, etc.
...
Hope this helps
I'm using the LightGBM Package.
I have successfully created a new tree using "create_tree_digraph" but I face some trouble understanding the result.
There is "leaf_value" in a leaf node. I don't know what it means. Please, somebody help me understand this. Thanks. :)
I used this example code from here: https://www.analyticsvidhya.com/blog/2017/06/which-algorithm-takes-the-crown-light-gbm-vs-xgboost/
#importing standard libraries
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import graphviz
import lightgbm as lgb
#loading our training dataset 'adult.csv' with name 'data' using pandas
data=pd.read_csv('./adult.csv',header=None)
#Assigning names to the columns
data.columns=['age','workclass','fnlwgt','education','education-num','marital_Status','occupation','relationship','race','sex','capital_gain','capital_loss','hours_per_week','native_country','Income']
# Label Encoding our target variable
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
l=LabelEncoder()
l.fit(data.Income)
data.Income=Series(l.transform(data.Income)) #label encoding our target variable
#One Hot Encoding of the Categorical features
one_hot_workclass=pd.get_dummies(data.workclass)
one_hot_education=pd.get_dummies(data.education)
#removing categorical features
data.drop(['workclass','education','marital_Status','occupation','relationship','race','sex','native_country'],axis=1,inplace=True)
#Merging one hot encoded features with our dataset 'data'
data=pd.concat([data,one_hot_workclass,one_hot_education],axis=1)
#Here our target variable is 'Income' with values as 1 or 0.
#Separating our data into features dataset x and our target dataset y
x=data.drop('Income',axis=1)
y=data.Income
#Imputing missing values in our target variable
y.fillna(y.mode()[0],inplace=True)
#Now splitting our dataset into test and train
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=.3)
train_data=lgb.Dataset(x_train,label=y_train)
#setting parameters for lightgbm
param = {'num_leaves':150, 'objective':'binary','max_depth':3,'learning_rate':.05,'max_bin':200}
param['metric'] = ['auc', 'binary_logloss']
#training our model using light gbm
num_round=50
lgbm=lgb.train(param,train_data,num_round)
graph = lgb.create_tree_digraph(lgbm)
graph.render(view=True)
Then I applied 'create_tree_digraph' function.
Pics
These are the raw predicted probabilities before the sigmoid function is applied. However, one thing to be aware of is your image is only showing 1 tree out of the entire model so it will not be the same as the actual outcome (unless your model is just this 1 tree).
This Image is showing what it would look like if you applied the sigmoid to the leaf values prior to creating the plots.
With R, one can ignore a variable (column) while building a model with the following syntax:
model = lm(dependant.variable ~ . - ignored.variable, data=my.training,set)
It's very handy when your data set contains indexes or ID.
How would you do that with SKlearn in python, assuming your data are Pandas data frames ?
So this is from my own code I used to do some prediction on StackOverflow last year:
from __future__ import division
from pandas import *
from sklearn import cross_validation
from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier
basic_feature_names = [ 'BodyLength'
, 'NumTags'
, 'OwnerUndeletedAnswerCountAtPostTime'
, 'ReputationAtPostCreation'
, 'TitleLength'
, 'UserAge' ]
fea = # extract the features - removed for brevity
# construct our classifier
clf = GradientBoostingClassifier(n_estimators=num_estimators, random_state=0)
# now fit
clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)
# now
priv_fea = # this was my test dataset
# now calculate the predicted classes
pred = clf.predict(priv_fea[basic_feature_names])
So if we wanted a subset of the features for classification I could have done this:
# want to train using fewer features so remove 'BodyLength'
basic_feature_names.remove('BodyLength')
clf.fit(fea[basic_feature_names], orig_data['OpenStatusMod'].values)
So the idea here is that a list can be used to select a subset of the columns in the pandas dataframe, as such we can construct a new list or remove a value and use this for selection
I'm not sure how you could do this easily using numpy arrays as indexing is done differently.