I built a random forest regression with selectable y and x variables, and I also want to show the Shapley values as both a plot and a table, with one column for the variable and one column for its Shapley value. The code draws the plot, but the table is not displayed.
So far my code looks like this:
def randomforest(y, x):
    x = dataset[list(x)]
    y = dataset[y]
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
    y_predict = model.predict(X_test)
    # the square root of the MSE is the RMSE
    print('Root Mean Squared Error:', mean_squared_error(y_test, y_predict)**(0.5))
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_train)
    shap.summary_plot(shap_values, features=X_train, feature_names=X_train.columns, plot_size=[15,8])
    shap_vals = shap_values[0, :]
    feature_importance = pd.DataFrame(list(zip(X_train.columns, shap_vals)), columns=['X_train', 'shap_vals'])
    feature_importance.sort_values(by=['shap_vals'], ascending=False, inplace=True)

interact(randomforest, y = list(dataset.select_dtypes('number').columns), x = x)
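One likely reason the table never appears is that the function builds feature_importance but never displays, prints, or returns it. Note also that shap_values[0, :] holds the values for the first training row only, so a per-feature summary usually averages the absolute values over all rows instead. A minimal sketch, assuming shap_values is the 2-D array (samples by features) that TreeExplainer gives for a regressor; the helper name shap_table and the small stand-in array are hypothetical:

```python
import numpy as np
import pandas as pd


def shap_table(shap_values, feature_names):
    """Return one row per feature with its mean absolute SHAP value, largest first."""
    shap_values = np.asarray(shap_values)
    mean_abs = np.abs(shap_values).mean(axis=0)  # average over samples
    table = pd.DataFrame({"feature": list(feature_names), "mean_abs_shap": mean_abs})
    return table.sort_values("mean_abs_shap", ascending=False).reset_index(drop=True)


# Stand-in for explainer.shap_values(X_train): 4 samples, 3 features (made-up numbers).
demo_values = np.array([[0.5, -0.1, 0.0],
                        [-0.3, 0.2, 0.1],
                        [0.4, -0.3, 0.0],
                        [-0.4, 0.2, 0.1]])
table = shap_table(demo_values, ["x1", "x2", "x3"])
print(table)
```

Inside randomforest, something like display(shap_table(shap_values, X_train.columns)) after the plot (with display imported from IPython.display), or a plain print, would be one way to make the table show up.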
So I've built a model that predicts values using linear regression. Now I need it to predict for the years 2022-2024. How can I do that? Should I add rows for 2023-2024 to the dataframe, and would that be correct?
data['Year'] = pd.to_datetime(data['Year'])
data.index = data['Year']
data.drop(['Year'], axis=1, inplace=True)
data = data.bfill().ffill()
y = data['x4']
X = data[['x1','x3','x5','x6','x7','x8','x9','x10','x11','x14','x15','x17']]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))
If you want to predict, you just need to pass in the new data:
X_new = new_data[['x1','x3','x5','x6','x7','x8','x9','x10','x11','x14','x15','x17']]
y_new = model.predict(X_new)
Here model is your trained linear regression. Predict on the new data, with the same columns in the same order and format as your X_train/X_test, and that's it.
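Keep in mind that a model trained on feature columns needs values for those same features in the future years; adding empty rows for those years is not enough on its own. A minimal sketch with made-up data and only two of the features (all numbers, including the future feature values, are placeholders for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up history: the target x4 depends linearly on two features.
years = pd.to_datetime([f"{y}-01-01" for y in range(2010, 2022)])
rng = np.random.default_rng(0)
data = pd.DataFrame({"x1": rng.normal(size=12), "x3": rng.normal(size=12)}, index=years)
data["x4"] = 2.0 * data["x1"] - 1.0 * data["x3"] + 5.0

features = ["x1", "x3"]
model = LinearRegression().fit(data[features], data["x4"])

# Future rows must carry feature values (forecast, planned, or assumed);
# these particular numbers are placeholders.
future_index = pd.to_datetime(["2022-01-01", "2023-01-01", "2024-01-01"])
new_data = pd.DataFrame({"x1": [0.1, 0.2, 0.3], "x3": [1.0, 1.1, 1.2]}, index=future_index)

# Select the columns in the same order used for training, then predict.
y_new = pd.Series(model.predict(new_data[features]), index=future_index, name="x4_pred")
print(y_new)
```

If the future feature values are unknown, they have to be forecast first, or the model has to be rebuilt on something that is known in advance, such as the year itself.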
How to use cross validation with sm.GLM (sm = statsmodels.api)?
I am trying to fill in the "model" parameter of cross_val_score. However, since I need to use sm.GLM, I don't know how to do it.
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.20, random_state = 1)
n = 5
df_cut, bins = pd.cut(X_train, n, retbins=True, right=True)
df_steps = pd.concat([X_train, df_cut, y_train], keys=['age','age_cuts','wage'], axis=1)
# Create dummy variables for the age groups
df_steps_dummies = pd.get_dummies(df_cut)
# Fitting Generalised linear models
fit3 = sm.GLM(df_steps.wage, df_steps_dummies).fit()
# Binning validation set into same 5 bins
bin_mapping = np.digitize(X_test, bins, right=True)
X_valid = pd.get_dummies(bin_mapping)
# Removing any outliers
# X_valid = pd.get_dummies(bin_mapping).drop([6], axis=1)
# Prediction
pred2 = fit3.predict(X_valid)
# Calculating RMSE
rms = sqrt(mean_squared_error(y_test, pred2))
scores = cross_val_score(model, X, y, scoring='neg_mean_squared_error',
                         cv=cv, n_jobs=-1)
Therefore, instead of computing the RMSE as above, I would like to use cross_val_score. For example, if I wanted to use lasso, I would set model = lasso(). Here, however, I cannot set model = sm.GLM().
I hope it is clear...
So I was able to output my predicted numerical values to an Excel file, but I was wondering whether it is possible to export the actual string instead of the numerical value.
Currently it looks like this,
Column 1
Answer Key
Something Something
Instead of returning 3, I would like 3 to be replaced by the actual string it is associated with (for example, Truck).
Here is my code so far. I know I have to change the exporting part of the code, but I cannot figure it out.
texts = df['without_Tags'].astype('str')
vector = TfidfVectorizer(ngram_range=(1, 2), min_df = 2, max_df = .95)
X = vector.fit_transform(texts) #features
LE = LabelEncoder()
df['tower_values'] = LE.fit_transform(df['Tower'])
y = df['tower_values'].values
lsa = TruncatedSVD (n_components=100, n_iter=10, random_state=3)
X = lsa.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .3, shuffle = True, stratify = y, random_state = 3)
SG = SGDClassifier(random_state=3, loss='log_loss')  # this loss was named 'log' in older scikit-learn releases
SG.fit(X_train, y_train)
y_pred = SG.predict(X_test)
print("SG model accuracy:", accuracy_score(y_test, y_pred))
print("SG model Recall:", recall_score(y_test, y_pred, average="macro"))
print("SG model Precision:", precision_score(y_test, y_pred, average="macro"))
print("SG model F1 Score:", f1_score(y_test, y_pred, average="macro"))
# to_csv returns None when given a path, so do not assign its result back to y_pred
pd.DataFrame(y_pred, columns=['predictions']).to_csv('prediction.csv')
final = pd.read_csv('prediction.csv')
final['pre'] = y_pred
Try using the inverse_transform method of LabelEncoder:
y_pred = LE.inverse_transform(y_pred)
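As a small self-contained illustration of that call, here is a sketch with made-up labels; the label values, the stand-in predictions, and the in-memory buffer used in place of a file are assumptions for the example:

```python
import io

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

towers = pd.Series(["Car", "Truck", "Bike", "Truck"])  # made-up labels
LE = LabelEncoder()
encoded = LE.fit_transform(towers)  # classes are sorted: Bike=0, Car=1, Truck=2

y_pred = np.array([2, 0, 1])  # stand-in for SG.predict(X_test)
labels = LE.inverse_transform(y_pred)  # map the integers back to the original strings

out = pd.DataFrame({"predictions": labels})
buffer = io.StringIO()  # swap for a file path such as 'prediction.csv'
out.to_csv(buffer, index=False)
print(buffer.getvalue())
```

The same fitted LabelEncoder instance has to be used for both directions, since it stores the mapping between strings and integers.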
I've got a dataset containing a lot of missing values (NaN). I want to use linear or multiple linear regression in Python to fill in all the missing values. You can find the dataset here: Dataset
I used f_regression(X_train, Y_train) to select which features to use.
First I converted df['country'] to dummy variables, then I selected the important features and ran the regression, but the results are not good.
I have defined the following functions to select features and predict missing values:
def select_features(target, df):
    '''Take a dataset and a target and report which features are important.'''
    df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', drop_first=True)
    df_nonan = df_dummies.dropna()
    X = df_nonan.drop([target], axis=1)
    Y = df_nonan[target]
    X = pd.get_dummies(X)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.30, random_state=40)
    f, pval = f_regression(X_train, Y_train)
    inds = np.argsort(pval)  # ascending p-value: most significant features first
    results = pd.DataFrame(np.vstack((f[inds], pval[inds])), columns=X_train.columns[inds], index=['f_values', 'p_values']).iloc[:, :15]
    return results
And I have defined the following function to predict missing values.
def train(target, features, df, deg=1):
    '''Take a dataset, a target and features, and predict the NaNs in the target column.'''
    df_dummies = pd.get_dummies(df, prefix='', prefix_sep='', drop_first=True)
    df_nonan = df_dummies[[*features, target]].dropna()
    X = df_nonan.drop([target], axis=1)
    Y = df_nonan[target]
    pol = PolynomialFeatures(degree=deg)
    X = pd.get_dummies(X)
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.40, random_state=40)
    X_test, X_val, Y_test, Y_val = train_test_split(X_test, Y_test, test_size=0.50, random_state=40)
    # X_train.shape, X_test.shape, Y_train.shape, Y_test.shape
    X_train_n = pol.fit_transform(X_train)
    reg = linear_model.Lasso()
    reg.fit(X_train_n, Y_train)  # the model must be fitted before predicting
    X_test_n = pol.transform(X_test)  # reuse the transformer fitted on the training set
    Y_predtrain = reg.predict(X_train_n)
    print('train', r2_score(Y_train, Y_predtrain))
    Y_pred = reg.predict(X_test_n)
    print('test', r2_score(Y_test, Y_pred))
    # val
    X_val_n = pol.transform(X_val)
    Y_valpred = reg.predict(X_val_n)
    print('val', r2_score(Y_val, Y_valpred))
    X_names = X.columns.values
    # rows whose target is missing but whose features are all present
    X_new = df_dummies.loc[df_dummies[target].isna(), X_names].dropna()
    X_new_n = pol.transform(X_new)
    Y_new = reg.predict(X_new_n)
    Y_new = pd.Series(Y_new, index=X_new.index)
    return Y_new, X_names, X_new.index
I then use these functions to fill the NaNs, using the features with p_values < 0.05.
But I am not sure whether this is a good approach.
With this approach, many missing values remain unpredicted.
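Values remain unpredicted here because any row that also has a NaN in one of the chosen features is dropped before prediction. One alternative, which is a different technique from the functions above rather than a fix of them, is scikit-learn's IterativeImputer: it regresses each column with missing values on the other columns in rounds, so rows with several gaps are still filled. A minimal sketch on synthetic numeric data (the columns and the missingness pattern are made up; categorical columns would need to be encoded first):

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the import below)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(40)
a = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=100),  # strongly related to "a"
    "c": rng.normal(size=100),                     # unrelated, fully observed
})

# Knock out some values in two columns to simulate missing data.
df.loc[rng.choice(100, 15, replace=False), "a"] = np.nan
df.loc[rng.choice(100, 15, replace=False), "b"] = np.nan

# The default estimator is a Bayesian ridge regression fitted column by column.
imputer = IterativeImputer(max_iter=10, random_state=40)
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)
print(filled.isna().sum())
```

Observed values are left unchanged and only the gaps are replaced, so the quality of the filled values can be checked by masking some known entries and comparing.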
I have a dataset that I have run a K-means algorithm on (scikit-learn), and I want to build a decision tree on each cluster. I can recover the values from each cluster, but not the "class" values (I'm doing supervised learning: each element can belong to one of two classes, and I need the class associated with each data point to build my trees).
Ex: unfiltered data set:
[val1 val2 class]
X_train=[val1 val2]
The clustering code is this:
X = clusterDF[clusterDF.columns[clusterDF.columns.str.contains('\'AB\'')]]
y = clusterDF['Class']
(X_train, X_test, y_train, y_test) = train_test_split(X, y,
kmeans = KMeans(n_clusters=3, n_init=5, max_iter=3000, random_state=1)
kmeans.fit(X_train, y_train)
y_pred = kmeans.predict(X_test)
And this is my (unbelievably clunky!) code for extracting the values to build the tree. The issue is the Y values; they aren't consistent with the X values
cl = {i: np.where(kmeans.labels_ == i)[0] for i in range(kmeans.n_clusters)}
for j in range(0, len(k_means_labels_unique)):
    # for i in range(0,len(k_means_labels_unique)):
    indexes = cl.get(j, 0)
    for i, row in X.iterrows():
        if i in indexes:
            if Xc is not None:
                Xc = np.vstack([Xc, [row['first occurrence of \'AB\''], row['similarity to \'AB\'']]])
            else:
                Xc = np.array([row['first occurrence of \'AB\''], row['similarity to \'AB\'']])
            if Y is not None:
                Y = np.vstack([Y, y[i]])
            else:
                Y = np.array(y[i])
    Xc = pd.DataFrame(data=Xc, index=range(0, len(X)),
                      columns=['first occurrence of \'AB\'',
                               'similarity to \'AB\''])  # 1st row as the column names
    Y = pd.DataFrame(data=Y, index=range(0, len(Y)), columns=['Class'])
    print("\n\t-----Classifier ", j + 1, "----")
    (X_train, X_test, y_train, y_test) = train_test_split(X, Y,
    classifier = DecisionTreeClassifier(criterion='entropy',max_depth = 2)
    classifier = getResults(
    filename='classif'+str(3 + i),
Any ideas (or downright more efficient ways) of taking the clustered data to make a decision tree from?
I did not read all the code, but my guess is that passing an index vector to train_test_split would help you keep track of the samples.
X = clusterDF[clusterDF.columns[clusterDF.columns.str.contains('\'AB\'')]]
y = clusterDF['Class']
indices = clusterDF.index
X_train, X_test, y_train, y_test, indices_train, indices_test = train_test_split(X, y, indices)
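Another way to keep the classes aligned with the clustered rows is to apply the same boolean mask, built from kmeans.labels_, to both X_train and y_train, since labels_ follows the row order of the data passed to fit. A minimal sketch on synthetic data; the column names val1 and val2, the class rule, and the sizes dictionary are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for clusterDF: two features and a binary class.
rng = np.random.default_rng(1)
clusterDF = pd.DataFrame({"val1": rng.normal(size=300), "val2": rng.normal(size=300)})
clusterDF["Class"] = (clusterDF["val1"] + clusterDF["val2"] > 0).astype(int)

X = clusterDF[["val1", "val2"]]
y = clusterDF["Class"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

kmeans = KMeans(n_clusters=3, n_init=5, random_state=1).fit(X_train)

trees, sizes = {}, {}
for k in range(kmeans.n_clusters):
    mask = kmeans.labels_ == k  # positional mask over the rows of X_train
    # Using the same mask on both objects keeps each row paired with its class.
    Xk, yk = X_train.loc[mask], y_train.loc[mask]
    trees[k] = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(Xk, yk)
    sizes[k] = len(Xk)
    print(k, sizes[k], trees[k].score(Xk, yk))
```

To classify a new sample, kmeans.predict would pick the cluster and the tree stored under that cluster number would then make the class prediction.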