Fit dataframe into linear regression sklearn - python

I am making a project for a class, and I am trying to predict NFL game scores using the linear regression and predict functions from sklearn. My problem comes when I try to pass the training data to the fit function. Here is my code:
onehotdata_x1 = pd.get_dummies(goal_model_data,columns=['team','opponent'])
# Create the linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(onehotdata_x1[['home','team','opponent']], onehotdata_x1['goals'])
This is the structure of dataframe(goal_model_data):
team opponent goals home
NE KC 27 1
BUF NYJ 21 1
CHI ATL 17 1
CIN BAL 0 1
CLE PIT 18 1
DET ARI 35 1
HOU JAX 7 1
TEN OAK 16 1
And this is the error that I get when I run the program:
Traceback (most recent call last):
File "predictnflgames.py", line 76, in <module>
regr.fit(onehotdata_x1[['home','team','opponent']], onehotdata_x1['goals'])
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2133, in __getitem__
return self._getitem_array(key)
File "C:\Python27\lib\site-packages\pandas\core\frame.py", line 2177, in _getitem_array
indexer = self.loc._convert_to_indexer(key, axis=1)
File "C:\Python27\lib\site-packages\pandas\core\indexing.py", line 1269, in _convert_to_indexer
.format(mask=objarr[mask]))
KeyError: "['team' 'opponent'] not in index"

The problem is that after pd.get_dummies there are no team and opponent columns.
I use this data in txt format for my example: https://ufile.io/e2vtv (same as yours).
Try this and see:
import pandas as pd
from sklearn.linear_model import LinearRegression
goal_model_data = pd.read_table('goal_model_data.txt', delim_whitespace=True)
onehotdata_x1 = pd.get_dummies(goal_model_data,columns=['team','opponent'])
regr = LinearRegression()
#see the columns in onehotdata_x1
onehotdata_x1.columns
#see the data (only 2 rows of the data for the example)
onehotdata_x1.head(2)
Results:
Index([u'goals', u'home', u'team_BUF', u'team_CHI', u'team_CIN', u'team_CLE',
u'team_DET', u'team_HOU', u'team_NE', u'team_TEN', u'opponent_ARI',
u'opponent_ATL', u'opponent_BAL', u'opponent_JAX', u'opponent_KC',
u'opponent_NYJ', u'opponent_OAK', u'opponent_PIT'],
dtype='object')
goals home team_BUF team_CHI team_CIN team_CLE team_DET team_HOU \
0 27 1 0 0 0 0 0 0
1 21 1 1 0 0 0 0 0
team_NE team_TEN opponent_ARI opponent_ATL opponent_BAL opponent_JAX \
0 1 0 0 0 0 0
1 0 0 0 0 0 0
opponent_KC opponent_NYJ opponent_OAK opponent_PIT
0 1 0 0 0
1 0 1 0 0
EDIT 1
Based on the original code, you might want to do something like the following:
import pandas as pd
from sklearn.linear_model import LinearRegression
data = pd.read_table('data.txt', delim_whitespace=True)
onehotdata = pd.get_dummies(data,columns=['team','opponent'])
regr = LinearRegression()
#in x get all columns except goals column
x = onehotdata.loc[:, onehotdata.columns != 'goals']
#use goals column as target variable
y= onehotdata['goals']
regr.fit(x,y)
regr.predict(x)
Hope this helps.

When you use pd.get_dummies(goal_model_data,columns=['team','opponent']) the team and opponent column will be dropped from your dataframe and onehotdata_x1 won't contain these two columns.
Then, when you do onehotdata_x1[['home','team','opponent']] you get a KeyError simply because team and opponent do not exist as columns in the onehotdata_x1 dataframe.
Using a toy dataframe, here's what happens:
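A minimal sketch of that behaviour, assuming a three-row toy dataframe with the same column names as the question (the values are made up for illustration):

```python
import pandas as pd

# Toy data with the same column names as the question (values are made up).
toy = pd.DataFrame({
    "team": ["NE", "BUF", "CHI"],
    "opponent": ["KC", "NYJ", "ATL"],
    "goals": [27, 21, 17],
    "home": [1, 1, 1],
})

encoded = pd.get_dummies(toy, columns=["team", "opponent"])

# The original 'team' and 'opponent' columns are replaced by indicator columns.
print(list(encoded.columns))

# Selecting the dropped columns reproduces the KeyError from the question.
try:
    encoded[["home", "team", "opponent"]]
    raised = False
except KeyError as exc:
    raised = True
    print("KeyError:", exc)

# Selecting everything except the target works.
features = encoded.drop(columns="goals")
```

The last line is one way to build the feature matrix: take every column except goals, so the indicator columns are picked up whatever their names are.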

Related

Polynomial regression predicted values as dataframe (Python)

Many questions on this topic have been answered, but I could not figure out one thing.
I have a dataframe and I am performing regression; the results are then stored in new columns of the Test dataframe. To compare methods I need to do both linear and polynomial regression.
I have found a clean way to do this with linear regression, where the predicted values end up in a new column of Test. But I cannot make this work within the same loop using polynomial regression: the final Test dataframe has multiple null values, because in the polynomial_features.fit_transform(X) step the values somehow lose the corresponding Test index.
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import PolynomialFeatures
Test = pd.read_csv(r'D:\myfile.csv')
df_coef =[]
value = list(set(Test['Value']))
for value in value:
    df_redux = Test[Test['Value'] == value]
    Y = df_redux['Y']
    X = df_redux[['X1', 'A', 'B', 'B']]
    X = sm.add_constant(X)
    # linear
    model_1 = sm.OLS(Y, X).fit()
    predictions_1 = model_1.predict(X)
    # polynomial
    polynomial_features = PolynomialFeatures(degree=2)
    xp = polynomial_features.fit_transform(X)
    model_2 = sm.OLS(Y, xp).fit()
    predictions_2 = model_2.predict(xp)
    stats_1 = pd.read_html(model_1.summary().tables[1].as_html(), header=0, index_col=0)[0]
    stats_2 = pd.read_html(model_2.summary().tables[1].as_html(), header=0, index_col=0)[0]
    predictions_1 = pd.DataFrame(predictions_1, columns=['lin'])
    predictions_2 = pd.DataFrame(predictions_2, columns=['poly'])
    # ??? how to concat and append both predictions_1 and predictions_2 to the same df_coef list?
    gf = pd.concat([predictions_1, df_redux], axis=1)
    df_coef.append(gf)
all_coef = pd.concat(df_coef)
type(all_coef)
Out[234]: pandas.core.frame.DataFrame
The problem is that the transformed xp is of type <class 'numpy.ndarray'>, while X is of type <class 'pandas.core.frame.DataFrame'>. The question is how I can get the polynomial regression predicted values into a new column of Test, next to the linear regression results. This is probably really simple, but I could not figure it out.
print(type(X))
print(type(xp))
print(X.sample(2))
print()
print(xp)
<class 'pandas.core.frame.DataFrame'>
<class 'numpy.ndarray'>
X1 A B G1
962 4.334912 1.945910 3.135494 3.258097
1365 4.197888 2.197225 3.135494 3.332205
[[ 1. 4.77041663 1.94591015 ... 35.74106743 34.52550933
33.35129251]
[ 1. 4.43240629 1.94591015 ... 33.28387641 32.03140262
30.82605947]
[ 1. 3.21669428 1.94591015 ... 29.95821572 30.38903979
30.82605947]
The result I get with the polynomial regression predicted values appended to the original Test dataframe:
0 6.178542 3.0 692 ... 2.079442 4.783216 6.146329
1 6.156108 11.0 692 ... 2.197225 4.842126 6.113682
2 6.071453 12.0 692 ... 2.197225 4.814595 6.052089
3 5.842053 NaN NaN ... NaN NaN NaN
4 4.625762 30.0 692 ... 1.945910 5.018201 5.828946
This is the correct result I obtained using only linear regression, without NaN and with a value in each row, as it is supposed to be:
0 6.151675 3 692 5 ... 3.433987 2.079442 4.783216 6.146329
1 6.132077 11 692 5 ... 3.401197 2.197225 4.842126 6.113682
2 6.068450 12 692 5 ... 3.332205 2.197225 4.814595 6.052089
4 5.819535 30 692 5 ... 3.258097 1.945910 5.018201 5.828946
8 4.761362 61 692 5 ... 2.564949 1.945910 3.889585 4.624973
I solved this by adding a line that converts the NumPy array to a Series carrying the original index. For the model statistics, use the statsmodels summary:
import pandas as pd
from statsmodels.api import OLS
predictions_2 = model_2.predict(xp)
predictions_2_series = pd.Series(predictions_2, index=df_redux.index.values)
print(OLS(Y, xp).fit().summary())
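To see why reusing the index matters, here is a small sketch that needs only pandas and NumPy. The frame, its index, and the prediction values are made up; the array simply stands in for the output of model_2.predict(xp).

```python
import numpy as np
import pandas as pd

# Hypothetical subset of a larger frame: note the non-contiguous index.
df_redux = pd.DataFrame({"X1": [1.0, 2.0, 3.0]}, index=[0, 2, 5])

# Stand-in for model_2.predict(xp): a plain array carries no index.
predictions_2 = np.array([0.1, 0.2, 0.3])

# Wrapping the array in a DataFrame gives it a fresh 0..n-1 index,
# so concat(axis=1) aligns on labels and leaves NaN where they differ.
misaligned = pd.concat(
    [pd.DataFrame(predictions_2, columns=["poly"]), df_redux], axis=1
)

# Reusing the original index keeps every prediction on its own row.
aligned = df_redux.assign(poly=pd.Series(predictions_2, index=df_redux.index))

print(misaligned)
print(aligned)
```

The linear model did not hit this problem because statsmodels returns a Series with the input's index when it is given a DataFrame, whereas fit_transform returns a bare array.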

How to do Multi label classification or Multi class classification of the below problem? Pandas Python

My original data looks like this.
id season home_team away_team home_goals away_goals result winner
0 0 2006-07 Shu Liv 1 1 D NaN
1 1 2006-07 Ars Avl 1 1 D NaN
2 2 2006-07 Eve Wat 2 1 H Eve
3 3 2006-07 New Wig 2 1 H New
4 4 2006-07 Por Bla 3 0 H Por
The purpose is to build a model that predicts the probability of each outcome, i.e.
Home Team Win 55%
Draw 13%
Away Team Win 32%
I selected these 3 columns and label encoded them:
home_team, away_team, winner
Then I created these new classes/labels.
df.loc[df["winner"]==df["home_team"],"home_team_win"]=1
df.loc[df["winner"]!=df["home_team"],"home_team_win"]=0
df.loc[df["result"]=='D',"draw"]=1
df.loc[df["result"]!='D',"draw"]=0
df.loc[df["winner"]==df["away_team"],"away_team_win"]=1
df.loc[df["winner"]!=df["away_team"],"away_team_win"]=0
Now the encoded data is looking like this,
home_team away_team home_team_win away_team_win draw
0 28 19 0 0 1
1 1 2 0 0 1
2 14 34 1 0 0
3 23 37 1 0 0
4 25 4 1 0 0
Initially, I used the code below for a single label 'home_team_win' and it worked fine, but it doesn't support multi classes/labels.
X = prediction_df.drop(['home_team_win'] ,axis=1)
y = prediction_df['home_team_win']
logReg=LogisticRegression(solver='lbfgs')
rfe = RFE(logReg, 20)
rfe = rfe.fit(X, y.values.ravel())
How to do Multi label classification or Multi class classification of this problem?
The target binary variables home_team_win, away_team_win, and draw are mutually exclusive. It does not seem to be a good idea to use multi-label methods in this problem, since, in general, they are designed to exploit dependencies among labels, which are nonexistent in this dataset.
I suggest modelling it as a multi-class problem in its most common form, where there is a single column with three classes: 0, 1, and 2 (representing home_team_win, draw, and away_team_win).
Many implementations of classifiers in scikit-learn can work directly in this manner. Logistic Regression is one of them:
from sklearn.linear_model import LogisticRegression
logReg=LogisticRegression(solver='lbfgs', multi_class='ovr')
logReg.fit(X,Y)
logReg.predict_proba(X)
This code will output the desired probabilities for each class of each row of X.
In particular, this code trains one Logistic Regression for each class separately (this is what the multi_class='ovr' parameter does).
Take a look at https://scikit-learn.org/stable/supervised_learning.html for other classifiers that directly work in this multi-class dataset form that I suggested.
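As a rough sketch of the single-column target, assuming the result column uses 'H', 'D' and 'A' (the 'A' code for an away win is an assumption, since the sample rows only show 'H' and 'D') and using made-up rows:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Made-up rows in the shape of the question's data (team codes are arbitrary).
df = pd.DataFrame({
    "home_team": [28, 1, 14, 23, 25, 4, 19, 2],
    "away_team": [19, 2, 34, 37, 4, 25, 28, 1],
    "result":    ["D", "D", "H", "H", "H", "A", "A", "D"],
})

# One target column with three mutually exclusive classes.
y = df["result"].map({"H": 0, "D": 1, "A": 2})
X = df[["home_team", "away_team"]]

# multi_class is left at its default here.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# One row per match, one column per class, in the order of clf.classes_.
proba = pd.DataFrame(
    clf.predict_proba(X), columns=["home_win", "draw", "away_win"]
)
print(proba.round(2))
```

Each row of proba sums to 1, which gives the home win / draw / away win percentages the question asks for.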

De-mean the data and convert to numpy array

I am trying to implement a basic matrix factorization movie recommender system on the MovieLens 1M dataset, but I am stuck. I need to de-mean the data (normalize by each user's mean) and convert it from a dataframe to a NumPy array.
Code Snippet:
import pandas as pd
import numpy as np
ratings_list = [i.strip().split("::") for i in open('S:/TIP/ml-1m/ratings.dat', 'r').readlines()]
#users_list = [i.strip().split("::") for i in open('/users/nickbecker/Downloads/ml-1m/users.dat', 'r').readlines()]
movies_list = [i.strip().split("::") for i in open('S:/TIP/ml-1m/movies.dat', 'r').readlines()]
ratings_df = pd.DataFrame(ratings_list, columns = ['UserID', 'MovieID', 'Rating', 'Timestamp'], dtype = int)
movies_df = pd.DataFrame(movies_list, columns = ['MovieID', 'Title', 'Genres'])
movies_df['MovieID'] = movies_df['MovieID'].apply(pd.to_numeric)
R_df = ratings_df.pivot(index = 'UserID', columns ='MovieID', values = 'Rating').fillna(0)
R_df.head()
R = R_df.to_numpy()
user_ratings_mean = np.mean(R, axis = 1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)
error:
Traceback (most recent call last):
File "S:\TIP\Code\MF_orig.py", line 17, in <module>
user_ratings_mean = np.mean(R, axis = 1)
File "<__array_function__ internals>", line 6, in mean
File "C:\Users\sarda\AppData\Local\Programs\Python\Python37\lib\site-packages\numpy\core\fromnumeric.py", line 3257, in mean
out=out, **kwargs)
File "C:\Users\sarda\AppData\Local\Programs\Python\Python37\lib\site-packages\numpy\core\_methods.py", line 151, in _mean
ret = umr_sum(arr, axis, dtype, out, keepdims)
TypeError: can only concatenate str (not "int") to str
Edit:
Value of R is:
[['5' 0 0 ... 0 0 0]
['5' 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
...
['4' 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 0]]
ratings_df:
UserID MovieID Rating Timestamp
0 1 1193 5 978300760
1 1 661 3 978302109
2 1 914 3 978301968
3 1 3408 4 978300275
4 1 2355 5 978824291
... ... ... ... ...
1000204 6040 1091 1 956716541
1000205 6040 1094 5 956704887
1000206 6040 562 5 956704746
1000207 6040 1096 4 956715648
1000208 6040 1097 4 956715569
movies_df:
MovieID Title Genres
0 1 Toy Story (1995) Animation|Children's|Comedy
1 2 Jumanji (1995) Adventure|Children's|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama
4 5 Father of the Bride Part II (1995) Comedy
... ... ... ...
3878 3948 Meet the Parents (2000) Comedy
3879 3949 Requiem for a Dream (2000) Drama
3880 3950 Tigerland (2000) Drama
3881 3951 Two Family House (2000) Drama
3882 3952 Contender, The (2000) Drama|Thriller
[3883 rows x 3 columns]
Dataset link:
http://files.grouplens.org/datasets/movielens/ml-1m.zip
The values are stored with object dtype (as strings), and even passing the dtype argument to the pandas DataFrame constructor does not convert them to integers.
You have to convert it to int explicitly:
ratings_list = [[int(j) for j in i.strip().split("::") if j] for i in open('ratings.txt', 'r').readlines()]
Then proceed as before. I tried this and it works.
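An alternative sketch is to convert after building the dataframe with astype(int). The four input lines below are made up, but follow the UserID::MovieID::Rating::Timestamp layout of ratings.dat:

```python
import numpy as np
import pandas as pd

# A few made-up lines in the "UserID::MovieID::Rating::Timestamp" layout.
lines = [
    "1::1193::5::978300760",
    "1::661::3::978302109",
    "2::1193::4::978298413",
    "2::914::2::978301968",
]

ratings_list = [line.strip().split("::") for line in lines]
ratings_df = pd.DataFrame(
    ratings_list, columns=["UserID", "MovieID", "Rating", "Timestamp"]
).astype(int)  # explicit conversion: every column becomes an integer dtype

R_df = ratings_df.pivot(index="UserID", columns="MovieID", values="Rating").fillna(0)
R = R_df.to_numpy()

# Subtract each user's mean rating from that user's row.
user_ratings_mean = R.mean(axis=1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)
print(R_demeaned)
```

With numeric values in R, np.mean no longer tries to add strings and integers, and each row of R_demeaned sums to zero.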

Online Logistic Regression by Month with Sklearn

I would like to train a Logistic Regression classifier in online fashion with Sklearn. I know about the 'SAG' or 'SAGA' but I am not sure how to implement this.
Specifically, my goal is to get the algorithm to train on the last t-x months (e.g. x=3) at time t where t is a month in the year. I would want to make a prediction over the set of examples for the following month (time t+1).
Here is my df:
X.head()
year month age job marital
0 2008 5 56 3 1
1 2008 5 57 7 1
2 2008 5 37 7 1
3 2008 5 40 0 1
4 2008 5 56 7 1
y.head()
0 0
1 1
2 0
3 0
4 0
Name: y, dtype: int8
Say I have my clf as in the code below (in this example I have trained it on the entire dataset in batch):
clf = LogisticRegression(C=1, max_iter=100, class_weight = 'balanced')
clf.fit(X, y)
y_pred = clf.predict(X)
cmx = pd.DataFrame(confusion_matrix(y, y_pred),
                   index = ['No', 'Yes'],
                   columns = ['No', 'Yes'])
Notice that I am not just looking to create a separate model for each month in the dataset, but to have a model train itself in an online (technically minibatch) fashion throughout the entire dataset.

model.score in scikit returning NaN

I was trying to build a linear regression model to predict the mileage of a car given a few values. When I call model.score, I get an error saying the input contains NaN. My dataset doesn't contain any empty values. Please find the code and its output below. Any help is appreciated. Thanks in advance.
import pandas as p
import numpy as np
d=p.read_csv('cars.csv')
d=d.drop('Make', axis=1)
d=d.drop('Model', axis=1)
d=d.drop('Engine_Fuel_Type', axis=1)
d=d.drop('Number_of_Doors', axis=1)
d=d.drop('Market_Category', axis=1)
d=d.drop('Vehicle_Size', axis=1)
d=d.drop('Vehicle_Style', axis=1)
d['Transmission_Type']=d['Transmission_Type'].replace({1: 'MANUAL', 2: 'AUTOMATIC', 3: 'AUTOMATED_MANUAL', 4: 'DIRECT_DRIVE', 5: 'UNKNOWN'})
d=p.get_dummies(d, columns=['Transmission_Type'])
print(d.head())
X=d.drop('city_mpg', axis=1)
y=d[['city_mpg']]
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
for idx, col_name in enumerate(X_train.columns):
    print("Co efficient for", col_name, " is ", model.coef_[0][idx])
intercept = model.intercept_[0]
print("Intercept is", intercept)
print(model.score(X_test, y_test))
output I got was:
Year Engine_HP Engine_Cylinders highway_MPG city_mpg Popularity \
0 2011 335 6.0 26 19 3916
1 2011 300 6.0 28 19 3916
2 2011 300 6.0 28 20 3916
3 2011 230 6.0 28 18 3916
4 2011 230 6.0 28 18 3916
MSRP Transmission_Type_AUTOMATED_MANUAL Transmission_Type_AUTOMATIC \
0 46135 0 0
1 40650 0 0
2 36350 0 0
3 29450 0 0
4 34500 0 0
Transmission_Type_DIRECT_DRIVE Transmission_Type_MANUAL \
0 0 1
1 0 1
2 0 1
3 0 1
4 0 1
Transmission_Type_UNKNOWN
0 0
1 0
2 0
3 0
4 0
Co efficient for Year is 0.12795086619034354
Co efficient for Engine_HP is -0.015142081226758822
Co efficient for Engine_Cylinders is -0.4874611334108649
Co efficient for highway_MPG is 0.4410894679555171
Co efficient for Popularity is 3.102779517592471e-05
Co efficient for MSRP is 8.390933189373478e-06
Co efficient for Transmission_Type_AUTOMATED_MANUAL is -10.972343474157594
Co efficient for Transmission_Type_AUTOMATIC is -11.256303676369456
Co efficient for Transmission_Type_DIRECT_DRIVE is 45.812234118377674
Co efficient for Transmission_Type_MANUAL is -11.211947388437244
Co efficient for Transmission_Type_UNKNOWN is -12.371639579413452
Intercept is -232.22076881867463
Traceback (most recent call last):
File "C:\Users\mridu\OneDrive\Desktop\Pesu_io_ai\Week_2\Prog.py", line 38, in <module>
print(model.score(X_test, y_test))
File "C:\Users\mridu\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\base.py", line 386, in score
return r2_score(y, self.predict(X), sample_weight=sample_weight,
File "C:\Users\mridu\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\linear_model\base.py", line 256, in predict
return self._decision_function(X)
File "C:\Users\mridu\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\linear_model\base.py", line 239, in _decision_function
X = check_array(X, accept_sparse=['csr', 'csc', 'coo'])
File "C:\Users\mridu\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\utils\validation.py", line 453, in check_array
_assert_all_finite(array)
File "C:\Users\mridu\AppData\Local\Programs\Python\Python36\lib\site-packages\sklearn\utils\validation.py", line 44, in _assert_all_finite
" or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Edit: The dataset that I'm using
Also note that I'm not using a few of the columns, even though they are in the dataset, and I couldn't send the source file because I've changed the dataset a bit to remove null rows.
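The traceback fails inside check_array on the X passed to predict, so the features in X_test are the first place to look. A small diagnostic sketch follows; the frame and its missing value are made up to stand in for the real X_test:

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value, standing in for the question's X_test.
X_test = pd.DataFrame({
    "Engine_HP": [335.0, 300.0, np.nan],
    "Engine_Cylinders": [6.0, 6.0, 4.0],
    "MSRP": [46135, 40650, 36350],
})

# Count missing and infinite entries per column.
nan_counts = X_test.isna().sum()
inf_counts = pd.Series(
    np.isinf(X_test.to_numpy(dtype=float)).sum(axis=0), index=X_test.columns
)
print(nan_counts[nan_counts > 0])
print(inf_counts[inf_counts > 0])

# Rows that would make check_array fail.
bad_rows = X_test[X_test.isna().any(axis=1)]
print(bad_rows)
```

Running the same checks on the real X_test (and on d before the split) should show which column still holds NaN or infinite values.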
