I have built this model for the Concrete Compressive Strength dataset.
new_X = pd.concat((X**(i+1) for i in range(n)), axis=1)
colnames = []
for j in range(8*n):  # 8 original features, one block of columns per degree
    colnames.append(j)
new_X.columns = colnames
X_train_new, X_test_new, Y_train_new, Y_test_new = train_test_split(new_X, Y, test_size=0.3, shuffle=False)
X_train_new_scaled = scaler.fit_transform(X_train_new)
X_test_new_scaled = scaler.transform(X_test_new)
ols.fit(X_train_new_scaled, Y_train_new)
poly_train_pred = ols.predict(X_train_new_scaled)
poly_test_pred = ols.predict(X_test_new_scaled)
poly_mae_train_error = metrics.mean_absolute_error(Y_train_new, poly_train_pred)
poly_mae_test_error = metrics.mean_absolute_error(Y_test_new, poly_test_pred)
print(poly_mae_train_error, "\n", poly_mae_test_error)
Here 8 is the number of features, and ols is a LinearRegression instance I created earlier.
Is it OK that the test error of this model is higher than that of the simple linear model?
Also, how do I implement the function test_poly_regression(X_train, y_train, X_test, y_test, n=2) to run this model for several values of n and compare the errors?
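A minimal sketch of such a function, assuming the same pipeline as above (per-split power features, a StandardScaler, an OLS fit, and the raw 8-feature frame X); the trailing loop is just illustrative:

```
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics

def test_poly_regression(X_train, y_train, X_test, y_test, n=2):
    # Build degree 1..n power features separately for each split,
    # so the test set never influences the training features.
    train_poly = pd.concat((X_train**(i+1) for i in range(n)), axis=1)
    test_poly = pd.concat((X_test**(i+1) for i in range(n)), axis=1)
    train_poly.columns = range(train_poly.shape[1])
    test_poly.columns = range(test_poly.shape[1])
    scaler = StandardScaler()
    train_scaled = scaler.fit_transform(train_poly)
    test_scaled = scaler.transform(test_poly)
    ols = LinearRegression().fit(train_scaled, y_train)
    train_mae = metrics.mean_absolute_error(y_train, ols.predict(train_scaled))
    test_mae = metrics.mean_absolute_error(y_test, ols.predict(test_scaled))
    return train_mae, test_mae

# Split the raw features once, then compare errors across degrees:
X_tr, X_te, y_tr, y_te = train_test_split(X, Y, test_size=0.3, shuffle=False)
for n in range(1, 5):
    print(n, test_poly_regression(X_tr, y_tr, X_te, y_te, n=n))
```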
So I've made a model for value prediction using linear regression, and now I need it to predict the years 2022-2024 into the future. How can I do it? Maybe add rows for 2023-2024 to the dataframe, but would that be correct?
Data
data['Year'] = pd.to_datetime(data['Year'])
data.index = data['Year']
data.drop(['Year'], axis=1, inplace=True)
data = data.bfill().ffill()
y = data['x4']
X = data[['x1','x3','x5','x6','x7','x8','x9','x10','x11','x14','x15','x17']]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))
If you want to predict, you just have to supply the new data:
X_new = new_data[['x1','x3','x5','x6','x7','x8','x9','x10','x11','x14','x15','x17']]
y_new = model.predict(X_new)
model is your trained linear regression; now you predict with the new data, in the same order and same format as your X_train/X_test, and that's it.
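One caveat, with a naive sketch: the model needs feature values for 2022-2024 before it can predict anything. The example below just carries the last observed row forward, purely for illustration (X_future and the dates are my assumptions); in practice you must supply measured or forecast values for the x-columns:

```
import pandas as pd

# Hypothetical future rows: reuse the last observed feature row for each year.
# Replace these with real or forecast values of the x-columns for actual use.
future_index = pd.to_datetime(['2022-01-01', '2023-01-01', '2024-01-01'])
X_future = pd.DataFrame([X.iloc[-1]] * len(future_index), index=future_index)
y_future = model.predict(X_future)
print(pd.Series(y_future, index=future_index, name='predicted x4'))
```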
I am running a RandomSearch cross validation on a KNN classifier with k-fold cross validation. Without using any libraries, I wish to split my dataset into 3 folds, giving 3 groups of data. I want to use (group 1 + group 2), (group 1 + group 3), and (group 2 + group 3) as training data, with group 3, group 2, and group 1 as the respective test data.
x,y = make_classification(n_samples=10000, n_features=2, n_informative=2, n_redundant= 0, n_clusters_per_class=1, random_state=60)
X_train, X_test, y_train, y_test = train_test_split(x,y,stratify=y,random_state=42)
def RandomSearchCV(x_train, y_train, classifier, param_range, folds):
    # Index-based (non-random) split into 3 consecutive groups
    group1_x = x_train[0:int(len(x_train)/folds)]
    group2_x = x_train[int(len(x_train)/folds):2*int(len(x_train)/folds)]
    group3_x = x_train[2*int(len(x_train)/folds):]
    group1_y = y_train[0:int(len(y_train)/folds)]
    group2_y = y_train[int(len(y_train)/folds):2*int(len(y_train)/folds)]
    group3_y = y_train[2*int(len(y_train)/folds):]
    cv1_train_x = np.concatenate((group1_x, group2_x))
    cv1_test_x = group3_x
    cv1_train_y = np.concatenate((group1_y, group2_y))
    cv1_test_y = group3_y
    cv2_train_x = np.concatenate((group1_x, group3_x))
    cv2_test_x = group2_x
    cv2_train_y = np.concatenate((group1_y, group3_y))
    cv2_test_y = group2_y
    cv3_train_x = np.concatenate((group2_x, group3_x))
    cv3_test_x = group1_x
    cv3_train_y = np.concatenate((group2_y, group3_y))
    cv3_test_y = group1_y
    trainscores = []
    testscores = []
    for k in tqdm(params['n_neighbors']):  # params is defined outside the function
        trainscores_folds = []
        testscores_folds = []
        classifier.n_neighbors = k
        classifier.fit(cv1_train_x, cv1_train_y)
        testscores_folds.append(accuracy_score(cv1_test_y, classifier.predict(cv1_test_x)))
        trainscores_folds.append(accuracy_score(cv1_train_y, classifier.predict(cv1_train_x)))
        #
        classifier.fit(cv2_train_x, cv2_train_y)
        testscores_folds.append(accuracy_score(cv2_test_y, classifier.predict(cv2_test_x)))
        trainscores_folds.append(accuracy_score(cv2_train_y, classifier.predict(cv2_train_x)))
        #
        classifier.fit(cv3_train_x, cv3_train_y)
        testscores_folds.append(accuracy_score(cv3_test_y, classifier.predict(cv3_test_x)))
        trainscores_folds.append(accuracy_score(cv3_train_y, classifier.predict(cv3_train_x)))
        trainscores.append(np.mean(trainscores_folds))
        testscores.append(np.mean(testscores_folds))
    return trainscores, testscores
My code is not clean and I want a cleaner version of it.
Note that (1) the splitting should not be random; it should be based on the index, and (2) the n_neighbors values are defined outside the function. Please help me out with this.
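A cleaner sketch that generalizes to any number of folds while keeping the index-based (non-random) split; the structure below is my suggestion, not the only way to do it:

```
import numpy as np
from tqdm import tqdm
from sklearn.metrics import accuracy_score

def RandomSearchCV(x_train, y_train, classifier, param_range, folds):
    # Index-based (non-random) split into `folds` consecutive groups;
    # the last group absorbs any remainder rows.
    fold_size = len(x_train) // folds
    bounds = [(i * fold_size, (i + 1) * fold_size if i < folds - 1 else len(x_train))
              for i in range(folds)]
    trainscores, testscores = [], []
    for k in tqdm(param_range):  # n_neighbors values passed in from outside
        classifier.n_neighbors = k
        tr_folds, te_folds = [], []
        for i, (start, stop) in enumerate(bounds):
            # Hold out group i, train on the concatenation of the others.
            test_idx = np.arange(start, stop)
            train_idx = np.concatenate([np.arange(s, e)
                                        for j, (s, e) in enumerate(bounds) if j != i])
            classifier.fit(x_train[train_idx], y_train[train_idx])
            te_folds.append(accuracy_score(y_train[test_idx],
                                           classifier.predict(x_train[test_idx])))
            tr_folds.append(accuracy_score(y_train[train_idx],
                                           classifier.predict(x_train[train_idx])))
        trainscores.append(np.mean(tr_folds))
        testscores.append(np.mean(te_folds))
    return trainscores, testscores

# Usage sketch:
# trainscores, testscores = RandomSearchCV(X_train, y_train,
#     KNeighborsClassifier(), param_range=[3, 5, 7, 9], folds=3)
```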
How to use cross validation with sm.GLM (sm = statsmodels.api)?
I am trying to fill in the model parameter of cross_val_score. However, since I need to use sm.GLM, I don't know how to do it.
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.20, random_state = 1)
n = 5
df_cut, bins = pd.cut(X_train, n, retbins=True, right=True)
df_steps = pd.concat([X_train, df_cut, y_train], keys=['age','age_cuts','wage'], axis=1)
# Create dummy variables for the age groups
df_steps_dummies = pd.get_dummies(df_cut)
# Fitting Generalised linear models
fit3 = sm.GLM(df_steps.wage, df_steps_dummies).fit()
# Binning validation set into same 5 bins
bin_mapping = np.digitize(X_test, bins, right=True)
X_valid = pd.get_dummies(bin_mapping)
# Removing any outliers
# X_valid = pd.get_dummies(bin_mapping).drop([6], axis=1)
# Prediction
pred2 = fit3.predict(X_valid)
# Calculating RMSE
rms = sqrt(mean_squared_error(y_test, pred2))
print(rms)
scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error',
                         cv=cv, n_jobs=-1)
Therefore, instead of computing the RMSE as above, I would like to use cross_val_score. For example, for lasso I would pass model = Lasso(), but here I cannot pass model = sm.GLM().
I hope it is clear...
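One common workaround, sketched here under the assumption that the features are already dummy-encoded (the SMWrapper name is mine, not a statsmodels API): wrap sm.GLM in a minimal sklearn-style estimator so cross_val_score can clone and fit it.

```
import statsmodels.api as sm
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.model_selection import cross_val_score, KFold

class SMWrapper(BaseEstimator, RegressorMixin):
    """Minimal sklearn-compatible wrapper around sm.GLM."""
    def __init__(self, family=None):
        self.family = family
    def fit(self, X, y):
        # Gaussian family by default, matching the plain GLM fit above.
        family = self.family if self.family is not None else sm.families.Gaussian()
        self.results_ = sm.GLM(y, X, family=family).fit()
        return self
    def predict(self, X):
        return self.results_.predict(X)

# cv = KFold(n_splits=5, shuffle=True, random_state=1)
# scores = cross_val_score(SMWrapper(), df_steps_dummies, df_steps.wage,
#                          scoring='neg_mean_squared_error', cv=cv, n_jobs=-1)
```

One caveat: in this setup the bins from pd.cut are computed before the split, so strictly the binning should be redone inside each fold to avoid leakage across folds.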
I am training a model with the logistic regression algorithm on the white-wine dataset (https://archive.ics.uci.edu/ml/datasets/wine). After training the model in Python, I printed model.coef_ to see how important each feature is, and noticed that "residual sugar" is assigned a fairly large weight (1.3). However, in the correlation matrix (image below), the correlation coefficient between this independent feature (residual sugar) and the dependent feature is quite low compared to the other independent features. So I wonder whether the assigned weights are not the right factors for judging how important a feature is, and if not, how I should evaluate feature importance. Below is my code; if anything is wrong, please help me correct it, as I am new to this area.
from sqlalchemy import create_engine
import pandas as pd
import seaborn
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

engine = create_engine("mysql+mysqlconnector://root:21041996@localhost/mydatabase")
con = engine.connect()
dataframe= pd.read_sql('select * from wine_quality',con)
df = dataframe[dataframe['type']=='white']
seaborn.heatmap(df.corr(),annot= True)
plt.show()
y = df['"quality"']
x = df.drop(columns=['type','"quality"'])
x = x.to_numpy()
y=y.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(x,y,test_size=0.2)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
y_train = y_train>6
y_test = y_test>6
model = LogisticRegression(solver='liblinear',max_iter=2000)
model.fit(X_train,y_train)
print(model.score(X_train,y_train))
print(model.score(X_test,y_test))
print(model.coef_)
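To reason about this, it helps to see which weight belongs to which column; a small sketch (feature_names is derived from the frame before .to_numpy(), so the order matches the coefficients):

```
# Pair each learned weight with its feature name for inspection.
feature_names = df.drop(columns=['type', '"quality"']).columns
for name, coef in zip(feature_names, model.coef_[0]):
    print(f'{name}: {coef:.3f}')
```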
I am training a model to predict true or false based on some data. I drop the product number from the list of features when training and testing the model.
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X = df.drop(columns='Product Number')
y = df['result']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
svc = LinearSVC(max_iter=1200)  # lower-case name avoids shadowing sklearn's SVC class
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
Is there any way for me to recover the product number and its features for each item that passed or failed? How do I relate the results in y_pred to the product number each prediction corresponds to?
I also plan on using cross validation, so the data gets shuffled; would there still be a way to recover the product number for each test item?
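A sketch of one common approach: train_test_split preserves the DataFrame index, so the predictions can be joined back to df through X_test.index (the results name is mine):

```
# X_test keeps the original row index, so the product numbers can be
# looked up directly in df.
results = df.loc[X_test.index, ['Product Number']].copy()
results['actual'] = y_test.values
results['predicted'] = y_pred
print(results.head())

# With cross validation, cross_val_predict returns one out-of-fold
# prediction per row, aligned with df's original row order:
# from sklearn.model_selection import cross_val_predict
# y_cv_pred = cross_val_predict(svc, X, y, cv=5)
```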
I realised I'm using cross validation only to evaluate my model's performance, so I decided to just run my code without shuffling the data to see the results for each datapoint.
Edit: for evaluation without cross validation, I drop the irrelevant columns only when passing the data to the classifier, as shown below:
cols = ['id', 'label']
X = train_data.copy()
y = train_data['label']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=2)
knn = make_pipeline(StandardScaler(),KNeighborsClassifier(n_neighbors=10))
y_val_pred = knn.fit(X_train.drop(columns=cols), y_train).predict(X_val.drop(columns=cols))
X_val['y_val_pred'] = y_val_pred
I join y_val_pred back on after prediction to check which datapoints have been misclassified.
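For instance, the misclassified rows can then be pulled out directly, since X_val still carries the label column:

```
# Rows where the prediction disagrees with the true label.
misclassified = X_val[X_val['y_val_pred'] != X_val['label']]
print(misclassified[['id', 'label', 'y_val_pred']])
```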