How to use cross validation with sm.GLM (sm = statsmodels.api)?
I am trying to fill the model parameter of cross_val_score. However, since I need to use sm.GLM, I don't know what to pass there.
from math import sqrt

import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
n = 5
df_cut, bins = pd.cut(X_train, n, retbins=True, right=True)
df_steps = pd.concat([X_train, df_cut, y_train], keys=['age','age_cuts','wage'], axis=1)
# Create dummy variables for the age groups
df_steps_dummies = pd.get_dummies(df_cut)
# Fitting Generalised linear models
fit3 = sm.GLM(df_steps.wage, df_steps_dummies).fit()
# Binning validation set into same 5 bins
bin_mapping = np.digitize(X_test, bins, right=True)
X_valid = pd.get_dummies(bin_mapping)
# Removing any outliers
# X_valid = pd.get_dummies(bin_mapping).drop([6], axis=1)

# Prediction
pred2 = fit3.predict(X_valid)
# Calculating RMSE
rms = sqrt(mean_squared_error(y_test, pred2))
print(rms)
scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error',
                         cv=cv, n_jobs=-1)
Therefore, instead of computing the RMSE as above, I would like to use cross_val_score. For the model parameter, if I wanted to use the lasso I would pass model = Lasso(). However, here I cannot pass model = sm.GLM().
I hope it is clear...
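For reference, a common workaround is to wrap the statsmodels model in a minimal scikit-learn-style estimator, so that cross_val_score can clone, fit, and score it. A minimal sketch, assuming a Gaussian family (SMWrapper is a made-up name, not part of either library):

import statsmodels.api as sm
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import cross_val_score

class SMWrapper(BaseEstimator, RegressorMixin):
    """Hypothetical adapter: exposes fit/predict so sklearn can drive sm.GLM."""
    def __init__(self, family=None):
        self.family = family

    def fit(self, X, y):
        # Default to a Gaussian family (an assumption; pick the one you need)
        family = self.family if self.family is not None else sm.families.Gaussian()
        self.results_ = sm.GLM(y, X, family=family).fit()
        return self

    def predict(self, X):
        return self.results_.predict(X)

scores = cross_val_score(SMWrapper(), X, y, scoring='neg_mean_squared_error',
                         cv=5, n_jobs=-1)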
Related
I made a random forest regression with filters for the y and x variables, and I also wanted to report Shapley values by creating a graph plus a table with one column for the variable and one for its Shapley value. The code plots the graph, but the table is not showing (a possible fix is sketched after the code).
So far my code looks like this:
x = widgets.SelectMultiple(
    options=list(dataset.select_dtypes('number').columns),
    disabled=False,
    value=("NUMBER_SPOTS",)
)
def randomforest(y, x):
    x = dataset[list(x)]
    y = dataset[y]
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
    shap.initjs()
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
    y_predict = model.predict(X_test)
    print('Root Mean Squared Error:', mean_squared_error(y_test, y_predict)**0.5)
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_train)
    shap.summary_plot(shap_values, features=X_train, feature_names=X_train.columns, plot_size=[15, 8])
    shap_vals = shap_values[0, :]
    feature_importance = pd.DataFrame(list(zip(X_train.columns, shap_vals)), columns=['X_train', 'shap_vals'])
    feature_importance.sort_values(by=['shap_vals'], ascending=False, inplace=True)
    feature_importance

interact(randomforest, y=list(dataset.select_dtypes('number').columns), x=x)
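A likely reason the table is not showing: inside a function, a bare feature_importance on the last line is not echoed the way the last expression of a notebook cell is, so it has to be displayed (or returned) explicitly. A minimal sketch of the tail of the function, assuming a Jupyter environment and numpy/pandas imported as np/pd; note also that shap_values[0, :] is the SHAP row for a single sample, so mean absolute SHAP per feature (an assumption about what the table should show) is used instead:

from IPython.display import display  # assumes a Jupyter/IPython environment

# ... inside randomforest(), after shap_values has been computed:
# mean absolute SHAP across samples gives one importance value per feature
shap_vals = np.abs(shap_values).mean(axis=0)
feature_importance = pd.DataFrame(list(zip(X_train.columns, shap_vals)),
                                  columns=['feature', 'shap_vals'])
feature_importance.sort_values(by=['shap_vals'], ascending=False, inplace=True)
display(feature_importance)  # explicit display instead of a bare expression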
So I've made a model for value prediction using linear regression, and now I need it to predict for the years 2022-2024. How can I do it? Maybe add rows for 2023-2024 to the dataframe? But would that be correct?
Data
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

data['Year'] = pd.to_datetime(data['Year'])
data.index = data['Year']
data.drop(['Year'], axis=1, inplace=True)
data = data.bfill().ffill()
y = data['x4']
X = data[['x1','x3','x5','x6','x7','x8','x9','x10','x11','x14','x15','x17']]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))
If you want to predict, you just have to pass in the new data:
X_new = new_data[['x1','x3','x5','x6','x7','x8','x9','x10','x11','x14','x15','x17']]
y_new = model.predict(X_new)
model is your trained linear regression; now you predict with the new data, in the same order and format as your X_train/X_test, and that's it.
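One caveat on top of that (a sketch, not a prescription): LinearRegression has no notion of time, so to predict 2022-2024 you must supply feature values for those years, either known in advance or forecast separately. A naive placeholder that carries the last observed feature row forward could look like:

import pandas as pd

# Hypothetical future feature values -- these must come from somewhere
# (known plans, separate forecasts, or last observed values carried forward).
future_index = pd.to_datetime(['2022-01-01', '2023-01-01', '2024-01-01'])

# Naive placeholder: repeat the last observed feature row for each future year
new_data = pd.DataFrame([X.iloc[-1]] * len(future_index), index=future_index)

y_new = model.predict(new_data)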
I have built this model for the Concrete Compressive Strength dataset.
# Build polynomial features: X, X^2, ..., X^n (n is the degree, set earlier)
new_X = pd.concat((X**(i+1) for i in range(n)), axis=1)
new_X.columns = list(range(8 * n))

X_train_new, X_test_new, Y_train_new, Y_test_new = train_test_split(new_X, Y, test_size=0.3, shuffle=False)

X_train_new_scaled = scaler.fit_transform(X_train_new)
X_test_new_scaled = scaler.transform(X_test_new)

ols.fit(X_train_new_scaled, Y_train_new)
poly_train_pred = ols.predict(X_train_new_scaled)
poly_test_pred = ols.predict(X_test_new_scaled)

poly_mae_train_error = metrics.mean_absolute_error(Y_train_new, poly_train_pred)
poly_mae_test_error = metrics.mean_absolute_error(Y_test_new, poly_test_pred)
print(poly_mae_train_error, "\n", poly_mae_test_error)
8 is the number of features.
ols is a LinearRegression instance that I created earlier.
Is it OK that the test error of this model is higher than that of the simple linear model?
Also, how do I implement the function test_poly_regression(X_train, y_train, X_test, y_test, n=2) to run this model for multiple n and check the errors (see the sketch below)?
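On the first question: a higher test error for a higher-degree model is not unusual and typically signals overfitting, so comparing train vs test MAE across degrees is exactly the right check. On the second, a minimal sketch of test_poly_regression, assuming it receives the raw (unexpanded) splits and builds the powers itself:

import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def test_poly_regression(X_train, y_train, X_test, y_test, n=2):
    """Fit OLS on powers 1..n of the features and return train/test MAE."""
    # Expand each split into [X, X^2, ..., X^n]
    Xtr = pd.concat((X_train**(i+1) for i in range(n)), axis=1)
    Xte = pd.concat((X_test**(i+1) for i in range(n)), axis=1)
    Xtr.columns = Xte.columns = range(Xtr.shape[1])
    # Fresh scaler and model per call so state does not leak between runs
    scaler = StandardScaler()
    Xtr = scaler.fit_transform(Xtr)
    Xte = scaler.transform(Xte)
    ols = LinearRegression().fit(Xtr, y_train)
    train_mae = metrics.mean_absolute_error(y_train, ols.predict(Xtr))
    test_mae = metrics.mean_absolute_error(y_test, ols.predict(Xte))
    return train_mae, test_mae

for n in range(1, 5):
    print(n, test_poly_regression(X_train, y_train, X_test, y_test, n=n))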
I am training a model to predict true or false based on some data. I drop the product number from the list of features when training and testing the model.
X = df.drop(columns='Product Number')
y = df['result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

svc = LinearSVC(max_iter=1200)
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
Is there any way for me to recover the product number and its features for an item that has passed or failed? How do I relate the results of y_pred back to the product number each prediction corresponds to?
I also plan on using cross validation, so the data gets shuffled; would there still be a way for me to recover the product number for each test item?
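A minimal sketch of how the mapping can work, relying on the fact that train_test_split preserves the pandas index (assuming df has a unique index with one row per product):

# X_test keeps its original index, so it maps straight back to df.
results = df.loc[X_test.index, ['Product Number']].copy()
results['actual'] = y_test.values
results['predicted'] = y_pred

# Misclassified items, with their product numbers
failed = results[results['actual'] != results['predicted']]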
I realised I was using cross validation only to evaluate my model's performance, so I decided to just run my code without shuffling the data to see the results for each datapoint.
Edit: For evaluation without cross validation, I drop the irrelevant columns only when passing the data to the classifier, as shown below:
cols = ['id', 'label']
X = train_data.copy()
y = train_data['label']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=2)
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=10))
y_val_pred = knn.fit(X_train.drop(columns=cols), y_train).predict(X_val.drop(columns=cols))
X_val['y_val_pred'] = y_val_pred
I join y_val_pred back onto X_val after prediction to check which datapoints have been misclassified.
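If you do want to keep cross validation, a minimal sketch using cross_val_predict (standard scikit-learn): it returns one out-of-fold prediction per row, in the original order of X, so the id column still lines up even though each fold's data is shuffled internally:

from sklearn.model_selection import cross_val_predict

# One out-of-fold prediction per row, aligned with X's original order
y_cv_pred = cross_val_predict(knn, X.drop(columns=cols), y, cv=5)
cv_results = X[['id']].assign(actual=y.values, predicted=y_cv_pred)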
I've got a dataset containing a lot of missing values (NaN). I want to use linear or multilinear regression in Python to fill in all the missing values. You can find the dataset here: Dataset
I have used f_regression(X_train, Y_train) to select which features I should use.
First of all I converted df['country'] to dummy variables, then used the important features in a regression, but the results are not good.
I have defined the following functions to select features and fill in missing values. First, the feature-selection function:
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.feature_selection import f_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

def select_features(target, df):
    '''Take the dataset and target, and print which features are important.'''
    df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', drop_first=True)
    df_nonan = df_dummies.dropna()
    X = df_nonan.drop([target], axis=1)
    Y = df_nonan[target]
    X = pd.get_dummies(X)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=40)
    f, pval = f_regression(X_train, Y_train)
    inds = np.argsort(pval)
    results = pd.DataFrame(np.vstack((f[inds], pval[inds])),
                           columns=X_train.columns[inds],
                           index=['f_values', 'p_values']).iloc[:, :15]
    print(results)
And I have defined the following function to predict the missing values:
def train(target, features, df, deg=1):
    '''Take the dataset, target, and features, and predict NaNs in the target column.'''
    df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', drop_first=True)
    df_nonan = df_dummies[[*features, target]].dropna()
    X = df_nonan.drop([target], axis=1)
    Y = df_nonan[target]
    pol = PolynomialFeatures(degree=deg)
    X = X[features]
    X = pd.get_dummies(X)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.40, random_state=40)
    X_test, X_val, Y_test, Y_val = train_test_split(X_test, Y_test, test_size=0.50, random_state=40)
    X_train_n = pol.fit_transform(X_train)
    reg = linear_model.Lasso()
    reg.fit(X_train_n, Y_train)
    Y_predtrain = reg.predict(X_train_n)
    print('train', r2_score(Y_train, Y_predtrain))
    # Use transform (not fit_transform) on anything that is not the training set
    X_test_n = pol.transform(X_test)
    Y_pred = reg.predict(X_test_n)
    print('test', r2_score(Y_test, Y_pred))
    X_val_n = pol.transform(X_val)
    Y_valpred = reg.predict(X_val_n)
    print('val', r2_score(Y_val, Y_valpred))
    # Predict the target for rows where it is missing
    X_names = X.columns.values
    X_new = df_dummies[X_names].dropna()
    X_new = X_new[df_dummies[target].isna()]
    X_new_n = pol.transform(X_new)
    Y_new = pd.Series(reg.predict(X_new_n), index=X_new.index)
    return Y_new, X_names, X_new.index
Then I am using these functions to fill the NaNs for features with p_values < 0.05.
But I am not sure whether this is a good approach.
This way, many missing values remain unpredicted.
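For comparison, scikit-learn packages this fit-a-model-per-column imputation loop as IterativeImputer; a minimal sketch, assuming df_dummies is fully numeric and no column is entirely NaN:

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Fits a regressor per column with NaNs, using the other columns as
# features, and iterates until the imputed values stabilise.
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=40)
df_filled = pd.DataFrame(imputer.fit_transform(df_dummies),
                         columns=df_dummies.columns, index=df_dummies.index)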